A compact, highly efficient solution for the Bank OCR Kata
Carlo Pescio
Nerd. Super Saiyan. Per aspera ad scientiam. Also, that #PhysicsOfSoftware guy.
The Bank OCR Kata (link in the first comment) is a simple but non-trivial programming exercise. The problem statement is elaborate enough that I will first show you how I solved the first half, and address the second half in my next post.
The basic idea is straightforward: you are given a string split across 3 lines [Note 1], visually representing a 9-digit account number through a combination of pipes, underscores, and spaces. It resembles what a 7-segment display would show, and in the fictional story of the kata it has been obtained through a scanner.
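For reference, this is how the kata statement renders the account number 123456789:

    _  _     _  _  _  _  _ 
  | _| _||_||_ |_   ||_||_|
  ||_  _|  | _||_|  ||_| _|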
The scanning process introduces the possibility of errors, so you might encounter glyphs that are not digits at all, for instance a digit with a missing segment.
Even when all glyphs are digits, errors cannot be excluded: therefore, the account number has to be validated using a simple checksum. The second part of the kata deals with error correction: for instance, a glyph that is one segment away from an 8 can be patched into an 8, but as mentioned I’ll leave that part for my next post.
Enough with strings
The few solutions I’ve seen around frame the problem as one of string transformation and matching. A common approach is to transpose the input string first, so that the glyphs end up stacked in a column instead of laid out in a row.
Then, each glyph is extracted as a string (it’s easy now: just take 3 rows) and matched against a reference representation of 0...9; of course, that means they have, somewhere in their code, a string representing each digit. I’ve seen one case where underscores and pipes were first translated to “true” and spaces to “false”, and the reference representation was a pattern of booleans, but the underlying approach was essentially the same.
The solution I’m presenting here is quite different:
Glyphs, Bit Patterns, and Digits
Using underscores, pipes, and spaces is just an aesthetic choice in the kata, so to speak: at each designated position, we find either a space or a non-space (a pipe or an underscore). The real underlying alphabet is binary, and therefore every glyph can be converted to a bit sequence (or “bit pattern”) by transforming underscores and pipes into 1 and spaces into 0. As an example, the glyph for 6, that is:
 _ 
|_ 
|_|
becomes 010 110 111, distributed over 3 rows. A straightforward option (already mentioned above) is therefore to encode each correct glyph (the 10 digits 0...9) as a bit pattern and use those as reference representations, but as I’ve anticipated I took a different approach.
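As an illustration (this helper is mine, not code from the repository), mapping the 3 characters of one glyph on one input row to such a 3-bit pattern takes just a few lines of C#:

// Hypothetical helper: returns the 3-bit pattern of glyph number glyphIndex on one
// 27-character row, leftmost character in the most significant bit,
// any non-space counted as 1.
static byte ToPattern(string row, int glyphIndex)
{
    byte pattern = 0;
    for (int i = 0; i < 3; i++)
    {
        pattern <<= 1;
        if (row[glyphIndex * 3 + i] != ' ')
            pattern |= 1;
    }
    return pattern;
}

For instance, called with glyphIndex 0 on the top row of the 6 above, it returns 0b010.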
I built an index that, given a row index and a 3-bit pattern, returns all the “compatible” digits, that is, digits which have that exact 3-bit pattern in that row. Each row has 3 bits, so there are only 8 possible combinations per row [Note 2], making the index quite small (a 3x8 matrix). We have 10 digits, so we can encode the compatible digits in a 16-bit short, by turning on the bits corresponding to the compatible digits (with digit 0 mapped to the least significant bit). The index therefore requires just 3x8x2 = 48 bytes.
For instance, the combination 010 in the first row is compatible with (that is, appears in) digits 0, 2, 3, 5, 6, 7, 8, 9. Therefore, a 010 in the first row maps to 1111101101. Similarly, the combination 110 in the second row is compatible with 5 and 6, so a 110 in the second row maps to 0001100000. The combination 111 on the third row is compatible with 0, 6, 8, so a 111 in the third row maps to 0101000001.
Now, here is the trick (see also the short example below): if we just “and” together the 3 compatibility masks 1111101101, 0001100000, 0101000001, we get 0001000000. Only bit 6 is left on, because the glyph is a 6.
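Just to see the numbers at work (this is not code from the repository, only the three masks quoted above):

int topMask    = 0b1111101101; // digits whose top row is 010
int middleMask = 0b0001100000; // digits whose middle row is 110
int bottomMask = 0b0101000001; // digits whose bottom row is 111
int result = topMask & middleMask & bottomMask;
// result == 0b0001000000: only bit 6 survives, so the digit is a 6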
Clearly, we need to fill that matrix, which in my C# code is named “patternToDigits”. Many entries are just zero (no compatible digits), so the initialization is itself quite compact (the first index is the row, the second the bit pattern):
patternToDigits[0, 0b000] = 0b0000010010;
patternToDigits[0, 0b010] = 0b1111101101;
patternToDigits[1, 0b101] = 0b0000000001;
patternToDigits[1, 0b001] = 0b0010000010;
patternToDigits[1, 0b011] = 0b0000001100;
patternToDigits[1, 0b111] = 0b1100010000;
patternToDigits[1, 0b110] = 0b0001100000;
patternToDigits[2, 0b001] = 0b0010010010;
patternToDigits[2, 0b110] = 0b0000000100;
patternToDigits[2, 0b011] = 0b1000101000;
patternToDigits[2, 0b111] = 0b0101000001;
That’s the entire declarative knowledge we need to recognize all 10 digits.
There is more: since (bool, and) is a monoid, we don’t even need to store the (partial) pattern we find for each glyph while we scan the input: we can just keep “and-ing” the compatibility masks of the 3-bit rows as they’re added:
internal void AddSlice(int row, byte pattern)
{
    // keep only the digits whose glyph has this 3-bit pattern in this row
    and &= Digits.Candidates(row, pattern);
}
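For the and-ing to work, the running mask has to start out with all 10 digit bits set, which is the identity element of that monoid; a minimal sketch, assuming a 16-bit field named as in the snippet above:

private short and = 0b1111111111; // before any slice is seen, every digit is a candidate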
So, as I’ve mentioned, I don’t need to store the (partial) shape of the glyphs while I’m scanning the input: I can use a short integer per glyph and progressively “and” the compatibility mask of each row.
Validating the glyph is again straightforward: if everything went well (that is, the glyph is indeed a digit), after “and-ing” the 3 compatibility masks we end up with a short integer having a single bit turned on, and the position of that bit is the value of the digit. A binary number with a single bit turned on is a power of 2, which we can test very efficiently; I’m using a library function in C#, but I’ll explain later how to do it in C. Also, we don’t need a loop to find the index of that bit: we can ask for the number of trailing zeros; again, I’m using a library function here, but it boils down to a single CPU instruction, as we’ll see later.
internal void Recognize()
{
    // if exactly one candidate bit is left, its index is the digit's value
    if (BitOperations.IsPow2(and))
        Value = (byte)BitOperations.TrailingZeroCount(and);
}
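Putting the two methods together on the 6 from the earlier example (the class name Glyph is only illustrative; the actual repository may structure this differently):

var glyph = new Glyph();  // hypothetical wrapper around the running mask
glyph.AddSlice(0, 0b010); // top row
glyph.AddSlice(1, 0b110); // middle row
glyph.AddSlice(2, 0b111); // bottom row
glyph.Recognize();        // Value is now 6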
The kata also requires that we calculate a checksum, which is trivial, so I won’t dwell on it here (see the code, or the quick sketch below). After that, whatever is left is no longer about what the code should do, but about how we structure the code itself. That depends on the paradigm and the language.
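For completeness, the checksum in the kata statement is just a weighted sum modulo 11; a minimal sketch (this method is mine, not lifted from the repository):

// kata rule: naming the digits d9 d8 ... d1 from left to right,
// the account number is valid when (1*d1 + 2*d2 + ... + 9*d9) mod 11 == 0
static bool ChecksumIsValid(byte[] digits) // digits[0] is the leftmost digit (d9)
{
    int sum = 0;
    for (int i = 0; i < 9; i++)
        sum += (9 - i) * digits[i];
    return sum % 11 == 0;
}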
A sprinkle of classes
I organized my C# code around a handful of classes. Here is a UML diagram, for those who can still read one:
Overall, the code is very short (which later inspired me to collapse everything into a single C function) and very efficient: it never looks for “matches” by iterating over reference representations, it reads the input only once, and all the bit trickery resolves to single CPU instructions. You can take a look on github; the links are in the first comment.
As with my code for the Harry Potter kata, I designed and “coded” all of this in my mind while walking in the woods, with a few readjustments (like removing the TextImage class) after typing it into a computer. By now you know I use tests mostly as safeguards (mainly for refactoring), so you’ll find fewer tests than you would for equivalent code written by someone following by-the-book TDD; I favored end-to-end tests because, realistically, they were the most useful for a small piece of code like this.
Bit trickery and portability, or lack thereof
I used a couple of library functions in the C# version, namely BitOperations.IsPow2 and BitOperations.TrailingZeroCount. Both resolve to fast assembly code on modern CPUs. On an Intel processor, TrailingZeroCount usually compiles down to the almost identically named TZCNT instruction. IsPow2 is a (so to speak) higher-level function that could be implemented in different ways. One option is to use an instruction like POPCNT (short for Population Count), which returns the number of bits set to 1: for a power of two, that count is exactly one. There are also ways to express the same logic with plain integer bit trickery, including the one I’ve chosen for my C implementation:
#include <stdbool.h> /* needed for bool before C23 */

bool isPow2(unsigned short val)
{
    return (val != 0) && ((val & (-val)) == val);
}
The trick is: val & (-val) returns the largest power of 2 which divides val without a remainder. If val is in fact a power of 2, it follows that (val & (-val)) == val; zero is an exception to the rule, which we have to cater for. There is a catch, however: the trick only works if the internal representation of integers is two’s complement. This is where things get a bit weird.
You might think that languages targeting virtual machines, like Java or C#, would not mention the internal representation of integers at all. That would be wrong: both specify that integers are in two’s complement.
C# gets that from the Common Language Infrastructure specification: in section I.12.1 (Supported data types) of the ECMA-335 6th edition, June 2012, Common Language Infrastructure (CLI), Partitions I–VI, we find (among other things): “int32 32-bit two’s-complement signed value”.
In Java, you just have to look in The Java Virtual Machine Specification, Java SE 22 Edition, 2024-02-09, section 2.3 (Primitive Types and Values), where you find “int, whose values are 32-bit signed two's-complement integers”.
So, the trick above is portable in Java and C# (although I don’t need it there because I have a standard library function for that). C and C++ are different: they have a long history of not wanting to specify the format of signed integers, and have only recently given up on that.
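Just to show that it carries over, here is the same check written in C# (I don’t actually need it there, since the library function does the job):

static bool IsPow2Manual(int val)
{
    // relies on two's complement, which the CLI specification guarantees
    return val != 0 && (val & -val) == val;
}
// agrees with BitOperations.IsPow2(val) for the small non-negative masks used here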
C++ did so in the C++20 specification; looking in section 6.8.1 (Fundamental types) of the ISO International Standard ISO/IEC 14882:2020(E) – Programming Language C++, you will find the relevant text relegated to note 40 as "This is also known as two’s complement representation".
C resisted standardizing the format up to (and including) C18. Only in the latest edition of the standard (ISO/IEC 9899:2024 Information technology — Programming languages — C, N3220 working draft) will you find, under section 6.2.6.2 (Integer types), NOTE 2: “The sign representation defined in this document is called two’s complement. Previous editions of this document (specifically ISO/IEC 9899:2018 and prior editions) additionally allowed other sign representations”.
One might think that not specifying an internal representation would result in “more portable” code, but in practice it’s exactly the opposite. Only after standardizing the representation did C++ (with C++20) get standard functions in the <bit> header, where you will find, among other things, has_single_bit and countr_zero, which correspond to IsPow2 and TrailingZeroCount. In C, comparable functions only arrive with the same 2024 edition, in the new <stdbit.h> header; up to C18 there are no standard headers where they can be found, although some compilers come with non-standard ones. In my C code, I’ve defined isPow2 as above (still not portable up to C18) and I used an intrinsic for TZCNT, making the code definitely not portable.
C version
Although I’ve used C extensively over many years, I’m not overly fond of C as a creative material. Still, I occasionally use its compact notation, mirroring a simplified CPU model, as a sort of contrast against more articulated solutions. When I do, I try to avoid dynamic memory allocation (which is a drag in C anyway) and aim for some sort of imperative-minimalistic-yet-readable programming style.
The bit-fiddling nature of the strategy explained above was just asking to be implemented in C; perhaps I could have used some AI assistant to do the translation, but what I did was just to inline all the C# code into one function, remove all the dynamic allocations in favor of a few global arrays, and then convert the resulting code to C. While doing so I made it a little more idiomatic: where in C# I use Convert.ToByte to parse a binary string, in C I eschew strtol and just use shifts and ors. In the end, the only library function I used was strcat_s, which I could have easily avoided. A link to the github repo is again in the first comment.
Paradoxically enough for a “low level” language, C only standardized binary literals in its latest (2024) edition; before that, some compilers offered them as a non-standard extension. So, to initialize the table I had to convert all the literals to decimal (0b1111101101, for instance, becomes 1005); many C programmers would have used hexadecimal, but it’s hardly better in this specific case.
Overall, the code is quite compact: the entire recognition algorithm (so, excluding the initialization of the compatibility matrix) ends up being under 50 LOC. That’s not too bad, especially considering that I’m not leveraging any library function and using only primitive data types. By replacing the _tzcnt intrinsic, that code would sit easily on a cheap microcontroller.
Coming up next
While I like the overall approach presented here, including the compactness of the representation, the intrinsic efficiency, and the small memory requirements, it should also be noted that by and-ing together the slices we lose information: the exact shape of the glyph is lost, and that doesn’t fare well when you have to “fix” that shape as part of error correction. I’ll deal with that in my next post.
Notes
[Note 1] The kata actually speaks of a file, containing multiple account numbers. Since reading a file is hardly interesting, I’ve skipped that part.
[Note 2] To be precise, row 0 theoretically encodes only one bit of information, since the first and third bits are always supposed to be 0. However, on the one hand it’s easier to treat all the rows uniformly, and on the other hand the kata later evolves to consider spurious characters, so I didn’t try to optimize the first row.
Links:
- The Bank OCR kata: https://codingdojo.org/kata/BankOCR/
- Github repo for the C# version: https://github.com/carlopescio/BankOcrKataV1cs
- Github repo for the C version: https://github.com/carlopescio/BankOcrKataV1c
- My previous post on the Harry Potter kata: https://www.dhirubhai.net/pulse/harry-potter-magic-common-good-carlo-pescio-tnuwf