A compact, highly efficient solution for the Bank OCR Kata

The Bank OCR Kata (link in the first comment) is a simple but non-trivial programming exercise. The problem statement is rich enough that I will first show you how I solved the first half, and address the second half in my next post.

The basic idea is straightforward: you are given a string split across 3 lines [Note 1], visually representing a 9-digit account number through a combination of pipes, underscores, and spaces. It resembles what a 7-segment display would show, and in the fictional story of the kata it has been obtained through a scanner:

The scanning process introduces a possibility of errors, so you might encounter glyphs that are not numbers, like:

Even when all glyphs are digits, errors cannot be excluded: therefore, the account number has to be validated using a simple checksum. The second part of the kata deals with error correction: for instance, the glyph above can be patched to an 8, but as mentioned I’ll leave that part for my next post.


Enough with strings

The few solutions I’ve seen around frame the problem as one of string transformation and matching. A common approach is to transpose the input string first, to get glyphs organized in a column instead of a row, like:

Then, each glyph is extracted as a string (it’s easy now: just take 3 rows) and matched against a reference representation of 0...9; of course, that means they have, somewhere in their code, a string representing each number. I’ve seen one case where underscores and pipes were first translated to “true” and spaces to “false”, and the reference representation was a pattern of booleans, but the underlying approach was essentially the same.

The solution I’m presenting here is quite different:

  • It does everything in a single scan of the input, left-to-right, bottom-to-top.
  • It doesn’t contain any explicit representation of the digits 0…9.
  • In fact, even while glyphs are being scanned and recognized, the exact shape of the individual glyph being scanned is never present in memory.
  • It does not try to match the input by comparison; it does use a look-up table, but it does not scan through it looking for a match.
  • The code is written in a way that can exploit fast CPU operations.
  • I’ll show you a C# / OO implementation first (as that’s where I started) organized into a handful of small classes, followed by a C version with everything squeezed in a single function. There are a few surprising (or maybe not) observations on portability to be made there.
  • The algorithm itself would be well-suited even for a hardware-only implementation, say using an FPGA without embedding a microprocessor.


Glyphs, Bit Patterns, and Digits

Using underscores, pipes, and spaces is just an aesthetic choice in the kata, so to speak: at each designated position, we find either a space or a non-space (pipe or underscore). The real underlying alphabet is binary, and therefore every glyph can be converted to a bit sequence (or “bit pattern”) by transforming underscores and pipes into 1 and spaces into 0. As an example, a glyph like


becomes 010 110 111, distributed over 3 rows. A straightforward option (already mentioned above) is therefore to encode each correct glyph (the 10 digits 0...9) as a bit pattern and use those as reference representations, but as I’ve anticipated I took a different approach.

I built an index that, given a row index and 3-bit pattern, returns all the “compatible” digits, that is, digits which have that exact 3-bit pattern in that row. Each row has 3 bits, so there are only 8 possible combinations per row [Note 2], making the index quite small (a 3x8 matrix). We have 10 digits, so we can encode the compatible digits using a 16-bit short, by turning on the bits corresponding to compatible numbers (with 0 being the least significant bit). The index therefore requires just 3x8x2 bytes.

For instance, the combination 010 in the first row is compatible with (that is, appears in) digits 0, 2, 3, 5, 6, 7, 8, 9. Therefore, a 010 in the first row maps to 1111101101. Similarly, the combination 110 in the second row is compatible with 5 and 6, so a 110 in the second row maps to 0001100000. The combination 111 on the third row is compatible with 0, 6, 8, so a 111 in the third row maps to 0101000001.

Now, here is the trick (see also the following picture): if we just “and” together the 3 compatibility masks 1111101101, 0001100000, 0101000001, we get 0001000000. The 6th bit is on because the glyph is a 6.


Clearly, we need to fill that matrix, which in my C# code is named “patternToDigits”. Many entries are just zero (no compatible digits), so the initialization is itself quite compact (the first index is the row, the second the bit pattern):

            patternToDigits[0, 0b000] = 0b0000010010;
            patternToDigits[0, 0b010] = 0b1111101101;
            patternToDigits[1, 0b101] = 0b0000000001;
            patternToDigits[1, 0b001] = 0b0010000010;
            patternToDigits[1, 0b011] = 0b0000001100;
            patternToDigits[1, 0b111] = 0b1100010000;
            patternToDigits[1, 0b110] = 0b0001100000;
            patternToDigits[2, 0b001] = 0b0010010010;
            patternToDigits[2, 0b110] = 0b0000000100;
            patternToDigits[2, 0b011] = 0b1000101000;
            patternToDigits[2, 0b111] = 0b0101000001;        

That’s the entire declarative knowledge we need to recognize all 10 digits.

There is more: since (bool, and) is a monoid, we don’t even need to store the (partial) pattern we find for each glyph while we scan the input: we can just keep “and-ing” the compatibility masks as the 3-bit rows come in:

internal void AddSlice(int row, byte pattern)
    {
    and &= Digits.Candidates(row, pattern);
    }        

So, as I’ve mentioned, I don’t need to store the (partial) shape of the glyphs while I’m scanning the input: I can use a short integer per glyph and progressively “and” the compatibility mask of each row.

Validating the glyph is again straightforward: if everything went well (that is, the glyph is indeed a digit), after “and-ing” the 3 compatibility masks we end up with a short integer having a single bit turned on, and the position of that bit is the value of the digit. A binary number with a single bit turned on is a power of 2, which we can test very efficiently; I’m using a library function in C#, but I’ll explain how to do it otherwise in C. Also, we don’t need a loop to find the index of that bit: we can ask for the number of trailing zeros; again, I’m using a library function here, but it boils down to a single CPU instruction, as we’ll see later.

internal void Recognize()
    {
    if(BitOperations.IsPow2(and))
        Value = (byte)(BitOperations.TrailingZeroCount(and));
     }        

The kata also requires that we calculate a checksum, which is trivial, so I won’t discuss it here (just see the code). After that, whatever is left is no longer about what the code should do, but about how we structure the code itself. That depends on the paradigm and the language.


A sprinkle of classes

I organized my C# around a handful of classes:

  • Digits (note the plural) knows about all the digits; that’s where you’ll find, among a few other things, the patternToDigits matrix above. It also knows about the geometry of things, like the fact that a digit is spread over 3 rows and 3 columns.
  • BinaryImage: takes the input string and provides access to the bit pattern of a given row of a given glyph. In my initial conception I actually had two classes (a TextImage and a BinaryImage), but TextImage turned out to have only one method, one line long. The code seemed easier to understand without that class; to be honest, I could have kept it, the advantage being that I would have had a smaller unit to test (I deleted a couple of tests when I inlined the class; I know this might hurt your feelings, sorry for that).
  • Glyph is actually the only class that is instantiated more than once: we have 9 Glyph instances, which are incrementally built by scanning the BinaryImage. As we scan, we call AddSlice on the relevant glyph, which does the and-ing explained above. The Glyph can also tell whether it’s a digit, in which case we also get its value.
  • AccountNumber deals with checksum validation, given all the glyphs. In this version, the presentation aspect is so trivial that I’ve just collapsed it into the ToString method; it got more complicated in the next version (where you can get ambiguities while patching glyphs into digits) and at that point it was largely moved out, but for now I didn’t want to complicate the design without a real need (I know some of you are screaming “single responsibility principle”, but I don’t feel your pain).
  • OCR is the actual engine: it loops over the binary image, adds slices to the glyphs, gives the glyphs to the account number for validation.

Here is a UML diagram for those who can still read it:

Overall, the code is very short (which later inspired me to collapse everything in a single C function) and very efficient, as it never looks for “matches” by iterating over reference representations, reads the input only once, and all the bit trickery resolves to intrinsic CPU functions. You can go take a look on GitHub; the link is in the first comment.

Like my code for the Harry Potter kata, I designed and “coded” all this in my mind while walking in the woods, with a few readjustments (like removing the TextImage class) after typing it on a computer. By now you know I use tests mostly as safeguards (mainly for refactoring), so you’ll find fewer tests than you would for equivalent code written by someone adopting by-the-book TDD; I privileged end-to-end tests because, realistically, they were the most useful for a small piece of code like this.


Bit trickery and portability, or lack thereof

I used a couple of library functions in the C# version, namely BitOperations.IsPow2 and BitOperations.TrailingZeroCount. Both resolve to fast assembly code on modern CPUs. On an Intel processor, TrailingZeroCount usually resolves to the almost homonymous TZCNT instruction. IsPow2 is a higher-level (so to speak) function that could be implemented in different ways. One option is to use an instruction like POPCNT (short for Population Count) which returns the number of bits set to 1: for a power of two, that should equal one. There are also ways to express that logic using some bit trickery with integers, including the one I’ve chosen for my C implementation:

bool isPow2(unsigned short val)
    {
    return (val != 0) && ((val & (-val)) == val);
    }
        

The trick is: val & (-val) returns the largest power of 2 which divides val without a remainder. If val is in fact a power of 2, it follows that (val & (-val)) == val; zero is an exception to the rule, which we have to cater for. There is a catch, however: the trick only works if the internal representation of integers is two’s complement. This is where things get a bit weird.

You might think that languages targeting virtual machines, like Java or C#, would not mention the internal representation of integers at all. That would be wrong: both specify that integers are in two’s complement.

C# gets that from the Common Language Infrastructure specification: in section I.12.1 (Supported data types) of the ECMA-335 6th edition, June 2012, Common Language Infrastructure (CLI), Partitions I–VI, we find (among other things): “int32 32-bit two’s-complement signed value”.

In Java, you just have to look in The Java Virtual Machine Specification, Java SE 22 Edition, 2024-02-09, section 2.3 (Primitive Types and Values), where you find “int, whose values are 32-bit signed two's-complement integers”.

So, the trick above is portable in Java and C# (although I don’t need it there because I have a standard library function for that). C and C++ are different: they have a long history of not wanting to specify the format of signed integers, and have only recently given up on that.

C++ did so in the C++20 specification: in section 6.8.1 (Fundamental types) of the ISO International Standard ISO/IEC 14882:2020(E) – Programming Language C++, you will find the relevant text relegated to note 40: “This is also known as two’s complement representation”.

C resisted standardizing the format up to (and including) C18. Only in the latest revision (the ISO/IEC 9899:2024 working draft N3220, Information technology — Programming languages — C) do you find, under section 6.2.6.2 (Integer types), NOTE 2: “The sign representation defined in this document is called two’s complement. Previous editions of this document (specifically ISO/IEC 9899:2018 and prior editions) additionally allowed other sign representations”.

One might think that not specifying an internal representation would result in “more portable” code, but in practice it’s exactly the opposite. Only after standardizing the representation did C++20 get standard functions in the <bit> header, where you will find, among other things, has_single_bit and countr_zero, which correspond to IsPow2 and TrailingZeroCount. In C, as far as I know, there are no standard headers where similar functions can be found (some compilers come with non-standard headers). In my C code, I’ve defined isPow2 as above (still not portable up to C18) and used an intrinsic for TZCNT, making the code definitely not portable.


C version

Although I’ve used C extensively over many years, I’m not overly fond of C as a creative material. Still, I occasionally use its compact notation, mirroring a simplified CPU model, as a sort of contrast against more articulated solutions. When I do, I try to avoid dynamic memory allocation (which is a drag in C anyway) and aim for some sort of imperative-minimalistic-yet-readable programming style.

The bit-fiddling nature of the strategy explained above was just asking to be implemented in C; perhaps I could have used some AI assistant to do the translation, but what I did was just to inline all the C# code in one function, remove all the dynamic allocations in favor of a few global arrays, and then convert the resulting code to C. While doing so I made it a little more idiomatic: while in C# I use Convert.ToByte to parse a binary string, in C I eschewed strtol and just used shifts and ORs. In the end, the only library function I used was strcat_s, which I could have easily avoided. A link to the GitHub repo is again in the first comment.

Paradoxically enough for a “low level” language, C refused to standardize binary literals until C23 (some compilers had them earlier as a non-standard extension), so to initialize the table I had to convert all the literals to decimal; many C programmers would have used hexadecimal, but it’s hardly better in this specific case.

Overall, the code is quite compact: the entire recognition algorithm (so, excluding the initialization of the compatibility matrix) ends up being under 50 LOC. That’s not too bad, especially considering that I’m not leveraging any library function and using only primitive data types. By replacing the _tzcnt intrinsic, that code would sit easily on a cheap microcontroller.


Coming up next

While I like the overall approach presented here, including the compactness of the representation, the intrinsic efficiency, and the small memory requirements, it should also be noted that by and-ing together the slices we lose information: the exact shape of the glyph is lost, and that doesn’t fare well when you have to “fix” that shape as part of error correction. I’ll deal with that in my next post.


Notes

[Note 1] The kata actually speaks of a file, containing multiple account numbers. Since reading a file is hardly interesting, I’ve skipped that part.

[Note 2] To be precise, row 0 theoretically encodes only one bit of information, since the first and third bits are always supposed to be 0. However, on the one hand it’s easier to treat all the rows uniformly, and on the other the kata later evolves to consider spurious characters, so I didn’t try to optimize the first row.

Carlo Pescio

Links:
  • The Bank OCR kata: https://codingdojo.org/kata/BankOCR/
  • Github repo for the C# version: https://github.com/carlopescio/BankOcrKataV1cs
  • Github repo for the C version: https://github.com/carlopescio/BankOcrKataV1c
  • My previous post on the Harry Potter Kata: https://www.dhirubhai.net/pulse/harry-potter-magic-common-good-carlo-pescio-tnuwf
