The Bank OCR Kata, part 2: Hamming distance to the rescue
Carlo Pescio
Nerd. Super Saiyan. Per aspera ad scientiam. Also, that #PhysicsOfSoftware guy.
In my previous post (link in the first comment as usual) I addressed the first part of the kata: character recognition and checksum validation (user stories 1-2-3). We’ve seen that account numbers may turn out to be invalid (“ILL”) if they contain glyphs which are not digits, or erroneous (“ERR”) if all the glyphs are digits but the checksum fails. User story 4 is about trying to fix invalid or erroneous account numbers. I suggest that you take a few minutes and read the original text; I found it a little unclear at first, but checking the suggested test cases helped clarify the intended meaning [Note 1].
Basically: we must consider the possibility that any glyph (even one recognized as a valid digit) may have one extra pipe or one extra underscore, or may be missing one. Only one error is allowed in a glyph to consider the glyph “fixable”; actually, only one error in one glyph is allowed in an entire account number. The error may transform a valid digit into another valid digit (e.g. a 5 into a 6) or into an invalid glyph. The former error is caught by the checksum (as “ERR”), the latter is clearly caught as an “ILL” number. Of course, when we have an “ERR” we don’t know which digit is wrong, as apparently all glyphs are valid. Either way, we have to find the possible alternatives with a valid checksum, with a possible ambiguity if there is more than one way to fix the account number within the constraints above.
Can we say that better?
Instead of jumping into coding, or even formulating a strategy to solve the problem, I’m going to do something that these days is rarely even whispered in software engineering: I’m going to make the requirements more formal. So, it’s not about fixing account numbers yet, but about the properties those alternative account numbers must satisfy.
In information theory, there is a notion of Edit Distance, and a relatively well-known type of edit distance is the so-called Hamming Distance, named after Richard Hamming. Given two sequences of equal length S1 and S2, the Hamming distance between S1 and S2 is the minimum number of substitutions required to transform one into the other. For instance, given S1 = [1,2,3,4,5] and S2 = [1,0,3,7,5], the Hamming distance between S1 and S2 is 2.
With that in mind, our requirements could be formally stated as:
- Given a sequence S of 9 glyphs forming a candidate account number:
- If some of the glyphs are not digits, or if they’re all digits but the checksum fails, consider all the alternative glyph sequences that are at Hamming distance 1 from S, where the only substitutions allowed [Note 2] are with glyphs which:
  - are valid digits,
  - are at Hamming distance 1 from the glyph they’re replacing,
  and then accept those with a valid checksum (if any).
The Hamming distance appears twice in the statement above: first applied to account numbers and then to glyphs. The first occurrence will help to identify test cases, the second to find a simple and fast technique to patch the glyphs.
Test cases
Given a sequence S of 9 glyphs [G1…G9], it follows from the statement above that:
a) If the 9 glyphs are all digits, and the checksum is valid, we’re done.
b) If there are two or more glyphs which are not digits, there is nothing we can do and we should report the scan as invalid (“ILL”). This follows from the fact that any sequence at Hamming distance 1 from S will still have at least one residual invalid glyph. This case was not covered in the data set provided in the original kata text (see links).
c) If there is exactly one glyph Gi that is not a digit, we should try to substitute only Gi with all the digits at Hamming distance 1 from Gi (if any).
d) If all the glyphs are digits but the checksum is invalid (otherwise we would be in case a), then we should try all the sequences that can be obtained by replacing G1…G9, one at a time, with all the digits at Hamming distance 1 from the chosen glyph (note the difference with the previous case, where only the alternatives to one glyph should be tried out).
In both cases c) and d), we may find:
- No alternative with a valid checksum. Then the account should be reported as invalid (“ILL”) for case c) and as an error (“ERR”) for case d).
- Exactly one alternative with a valid checksum. Then this is accepted as the correct account number.
- More than one alternative with a valid checksum; then we report them all as ambiguous (“AMB”) cases.
This can be translated more or less directly into code, and of course needs to be covered by test cases as well. Coding the entire procedure is more or less trivial, except for the part where we have to patch the glyphs.
Patching glyphs
Remember that a glyph can be represented as a bit pattern by replacing underscores and pipes with 1 and spaces with 0; for instance, a 6 becomes 010110111 and a 5 becomes 010110011.
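As a quick refresher, that encoding boils down to something like this (ToPattern is just an illustrative helper, not part of the actual code):

// Sketch: turn the three 3-character rows of a glyph into a 9-bit pattern,
// mapping '_' and '|' to 1 and ' ' to 0, top-left bit first.
static ushort ToPattern(string row1, string row2, string row3)
{
    ushort bits = 0;
    foreach(char c in row1 + row2 + row3)
        bits = (ushort)((bits << 1) | (c == ' ' ? 0 : 1));
    return bits;
}
// ToPattern(" _ ", "|_ ", "|_|") == 0b010110111   // a 6
// ToPattern(" _ ", "|_ ", " _|") == 0b010110011   // a 5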
Here is something cool about the Hamming distance between two bit patterns B1 and B2: you just have to count the 1s in B1 XOR B2 [Note 3]. For instance, 010110111 XOR 010110011 gives 000000100, because the digits 5 and 6 can be turned one into another by changing just one segment from space to pipe (or vice versa).
The procedure also works backward: if you start with a bit pattern (say, a valid digit) you can pre-compute all the patterns at Hamming distance one by XOR-ing that bit pattern with all the powers of two in the pattern range. This can be done using just shift and XOR instructions.
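The first observation is a one-liner in C# (a sketch; BitOperations.PopCount is available from .NET Core 3.0 onwards), and the second is exactly what the HammingOneNeighbours method further below does:

using System.Numerics;

// Hamming distance between two glyph bit patterns: count the 1s in their XOR.
static int HammingDistance(ushort b1, ushort b2)
    => BitOperations.PopCount((uint)(b1 ^ b2));

// HammingDistance(0b010110111, 0b010110011) == 1   // 6 vs 5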
That gives us two possible strategies for patching glyphs:
a) keep a reference representation of 0…9 as bit patterns. Given a glyph G, just XOR its bit pattern with each reference representation and see if there is only one bit left on (as we’ve seen in part one, that means the XOR is a power of two). If so, the corresponding digit is at Hamming distance one from G, and that digit is a candidate patch.
b) given a reference representation of 0…9 as bit patterns, pre-compute all the glyphs at distance one from each digit. Since representing all the glyphs requires only 2^9 = 512 patterns, we can pre-compute and store the entire “patch table” for all the glyphs, so that, given a glyph, we can immediately obtain (using the bit pattern of the glyph itself as an index into an array) the list of “candidate” digits.
I chose to go with option b), as it is more in line with what I did in my previous post (pre-computing a reverse index). In practice, option b) would be especially advantageous if you have to recognize and patch several account numbers (as you can amortize the cost of pre-computing the index), but it’s also possible to go with option a) and then maybe cache the results, so there is no clear winner (or clear loser, meaning: they’re both ok).
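For comparison, option a) would be a simple scan over the ten reference digits; a sketch (the method itself is hypothetical, while Glyph.digits and Pattern are the same names used in the code below):

// Sketch of option a): keep the reference digits whose pattern differs from the
// scanned glyph by exactly one bit, i.e. the XOR has exactly one bit set.
static IEnumerable<Glyph> CandidatesByScan(ushort scannedPattern)
{
    foreach(Glyph digit in Glyph.digits)
    {
        int xor = scannedPattern ^ digit.Pattern;
        if(xor != 0 && (xor & (xor - 1)) == 0)   // power of two: exactly one bit on
            yield return digit;
    }
}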
Just a few lines of code
Computing the fixup table according to option b) above takes just about 30 lines of code, most of which are { }:
internal static class FixupTable
{
    static FixupTable()
    {
        for(int i = 0; i < neighbourDigits.Length; ++i)
            neighbourDigits[i] = [];
        foreach(Glyph g in Glyph.digits)
        {
            IEnumerable<ushort> neighbours =
                HammingOneNeighbours(Glyph.BitsPerGlyph, g.Pattern);
            foreach(ushort n in neighbours)
            {
                neighbourDigits[n].Add(g);
            }
        }
    }

    internal static IEnumerable<Glyph> ReplacementsFor(Glyph g)
    {
        return neighbourDigits[g.Pattern];
    }

    private static IEnumerable<ushort>
        HammingOneNeighbours(int bits, ushort pattern)
    {
        ushort mask = 1;
        for(int i = 0; i < bits; ++i)
        {
            yield return (ushort)(pattern ^ mask);
            mask <<= 1;
        }
    }

    private static readonly List<Glyph>[] neighbourDigits =
        new List<Glyph>[1 << Glyph.BitsPerGlyph];
}
After the table is filled, obtaining all the candidate replacements for a glyph is straightforward: we just use the bit pattern of the glyph itself as an index into the array (see the ReplacementsFor method above). Building the table is quite fast: we only have to iterate over 10 glyphs and find the 9 neighbors at Hamming distance one, so it’s just 90 operations. Since the original text of the kata mentioned a file with a few hundred account numbers, building the index at startup is a reasonable strategy.
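For example (a sketch; ‘scanned’ stands for whatever Glyph came out of the scan):

// All the digits at Hamming distance one from a scanned glyph:
IEnumerable<Glyph> candidates = FixupTable.ReplacementsFor(scanned);
// If 'scanned' is a 5 with one flipped segment, 5 itself is among the candidates,
// possibly together with other digits at distance one from the broken pattern.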
Merge or replace?
The idea I’ve sketched so far is a compact and theoretically sound way to address the new requirements. However, it is somewhat at odds with the code that we’ve seen in part 1. More specifically, as we try to deal with error correction, some properties of the previous code are getting in the way. Quoting my post (“it” being the code in what follows):
- It doesn’t contain any explicit representation of the digits 0…9
- In fact, even while glyphs are being scanned and recognized, the exact shape of each individual glyph being scanned is not even present in memory at all.
- It does not try to match the input by comparison; it does use a look-up table, but it does not scan through it looking for a match.
Whatever strategy I choose now for error correction, I need a reference representation of 0…9 (as bit patterns). Also, while scanning the input, it’s no longer enough to keep the AND of the compatibility masks: that transformation loses information, and I need the full 9-bit input pattern to perform error correction.
Note that “being at odds” does not mean “being incompatible”. I could add the fixup table above to the previous code, and clearly nothing bad would happen. I could also transform the old code:
internal void AddSlice(int row, byte pattern)
{
    and &= Digits.Candidates(row, pattern);
}
to (also) keep the entire bit pattern: I just need to shift and OR as I add the rows. So, one option would be to simply merge the new code with the old code. In a real-world project, of an entirely different size and with an entirely different life cycle, I would probably do so. Still, on a small program like this, and having to introduce a reference representation for 0…9 anyway, it is very tempting to throw away the previous code, forget all the claims above, make the entire thing slightly slower and just match the incoming glyphs (as bit patterns) against the reference representation of each digit, looking for a perfect match.
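For the record, the merge option would indeed be a tiny change; here is a sketch, with a hypothetical fullPattern field accumulating 3 bits per row:

// Sketch: same AddSlice as above, but also keeping the full 9-bit pattern
// ('fullPattern' is a hypothetical extra field; rows arrive top to bottom).
internal void AddSlice(int row, byte pattern)
{
    and &= Digits.Candidates(row, pattern);
    fullPattern = (ushort)((fullPattern << 3) | pattern);
}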
In the end, I chose to rewrite the code entirely, so even though I reused some names (like BinaryImage) and a few lines here and there, there is little in common between version 1 and version 2. I’ll come back to ponder on this decision at the end.
Roles and Responsibilities
As before, you can find all the code on GitHub, but here is a UML diagram:
Instead of doing the usual thing, that is, explaining what every class is doing, I’ll try a new approach and explain where each responsibility is allocated:
- Input string parsing: BinaryImage
- Accumulating bits while parsing: BitPattern
- Digit recognition, reference representation of digits: Glyph
- Fixing glyphs: FixupTable
- Checksum validation: AccountNumber
- Output formatting: Report
- Account number recognition: OCR
- Fixing account numbers: ErrorCorrectingOCR
With that in mind, a few notes:
- BitPattern is basically a shift register, used to incrementally build the pattern for each glyph; I’m still doing a single scan of the input, so each glyph is built in 3 stages, one per row, as I scan left-to-right, top-to-bottom. The scan is performed by BinaryImage.
- Glyph transforms a BitPattern into a domain-specific concept; the class knows the reference representation of 0...9 and how to recognize digits. Glyphs are immutable, so when a bit pattern is recognized as a digit, Glyph returns a shared instance of the digit (see the sketch right after this list); it follows that, if you only scan valid digits, there is no need for any dynamic allocation of Glyphs (no big deal, it’s just that I naturally tend toward ecological / energy efficient programming).
- Report is now responsible for the “presentation” aspect, which in the previous version was simple enough to be squeezed into AccountNumber. Now we also have to deal with a list of ambiguous account numbers, so a separate class was a better option. Of course, I could have introduced that class from the very beginning: when you publish some code, you’re always tempted to show “your best self”, so to speak. However, in many practical cases we under-engineer some parts, ideally where we know it won’t be hard to come back and add some structure if / when needed: I just wanted to use this as a small-scale example.
- OCR builds AccountNumbers out of Glyphs and by itself will just invoke checksum validation.
- ErrorCorrectingOCR contains the logic to correct errors in account numbers (I’ll show some details in a moment). Basically, it’s a mirror of what we discussed under “test cases”.
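The shared-instance idea mentioned above for Glyph can be sketched like this (names and constructor are purely illustrative; the actual class is organized differently):

// Sketch: recognition returns a pre-built, immutable Glyph when the pattern is a
// digit, and only allocates for unrecognized patterns.
static Glyph FromPattern(ushort pattern)
{
    foreach(Glyph digit in Glyph.digits)
        if(digit.Pattern == pattern)
            return digit;              // shared instance, no allocation
    return new Glyph(pattern);         // not a digit: allocate
}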
Overall, the code is still quite short, but this time I would find it unpleasant to rewrite in C; also, unlike my previous version, it would not be trivial to turn this into a hardware-only solution.
Essential vs. Accidental Symmetries
When I “formalized” the requirements, a similarity (symmetry) seemed to pop up: account numbers have to be fixed by trying out alternatives at Hamming distance one, and glyphs too have to be fixed by trying out alternatives at Hamming distance one. Yet there is no trace of this symmetry in the code: the part about fixing account numbers became (mirroring the test cases) a series of conditions inside ErrorCorrectingOCR:
public Report Recognize(string[] rows)
{
    Glyph[] digits = RecognizeDigits(rows);
    AccountNumber an = new(digits);
    if(an.HasInvalidDigits)
        return TryFixingInvalidDigits(digits, an);   // cases b) and c): some glyphs are not digits
    an.Validate();
    if(an.HasInvalidChecksum)
        return TryFixingInvalidChecksum(digits, an); // case d): all digits, checksum fails
    return new Report(an);                           // case a): nothing to fix
}
With a few more conditions inside TryFixingInvalidDigits and TryFixingInvalidChecksum. The part about fixing glyphs became a fixup table instead, built once and used inside those TryFixing… functions above.
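As an illustration, case d) boils down to trying every position in turn; a sketch (the helper name is hypothetical and the real TryFixingInvalidChecksum differs in details, such as building the Report):

// Sketch of case d): all glyphs are digits but the checksum fails; collect every
// alternative with a valid checksum. 0 results -> "ERR", 1 -> accept it, more -> "AMB".
static List<AccountNumber> AllValidAlternatives(Glyph[] digits)
{
    var valid = new List<AccountNumber>();
    for(int i = 0; i < digits.Length; ++i)
    {
        foreach(Glyph candidate in FixupTable.ReplacementsFor(digits[i]))
        {
            Glyph[] attempt = (Glyph[])digits.Clone();
            attempt[i] = candidate;
            AccountNumber an = new(attempt);
            an.Validate();
            if(!an.HasInvalidChecksum)
                valid.Add(an);
        }
    }
    return valid;
}

Case c) is essentially the same inner loop, restricted to the single invalid position.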
Isn’t that sort of bad? Shouldn’t the code look similar? Shouldn’t we somehow share some (or all) of the logic? No, for two different reasons:
- That symmetry is entirely accidental. In the fictional story of the kata, the rules have been chosen in a way that could (accidentally) be formalized using the notion of Hamming distance twice. That’s not an inherent property of the domain of optical character recognition or account numbers or glyphs. It’s easy to come up with reasons to change the two rules (about fixing account numbers and fixing glyphs) independently.
- Even if it were an essential (as opposed to accidental) kind of symmetry, the context is different: we only have 512 glyphs, of which only 10 are valid digits. It makes sense to precompute a table. However, we have 512^9 account numbers (many of which have several invalid glyphs). We can’t approach the problem in the same way, because the context is an integral part of the problem.
The perils of minimalism
Perhaps some of you were disappointed when I chose to replace the old code, instead of evolving it into a new shape. Of course I could have done that; in fact, as a seasoned consultant / trainer / author, I could have faked [Note 4] the “right” design from the very beginning: after all, I had all the code (version 1 and version 2) ready well before writing my first post. I could have shoehorned version 1 code inside version 2 classes from the beginning, and then pretended to evolve everything into version 2 with ease (of course, I would have kept the old algorithm for character recognition, on the basis of efficiency). Many papers on software design are written just that way.
I did not, because even within the limits of a small problem, I wanted to put the spotlight on the perils of minimalism, and nothing looks as dramatic as scrapping all the code and starting from scratch. The code for version 1 was, in my opinion, quite decent: short, efficient, could be implemented in C without dynamic memory allocation, and the underlying idea of using a compatibility mask had that sort of “beauty” (at least to my eyes) that often comes with overfitted solutions. Still, it took relatively little (one more requirement) to make all those promises untenable. There is a reason why we put things in software, and why even microcontrollers are getting more and more powerful every day: complexity grows nonlinearly with requirements.
Isn’t the new code still overfitted? Isn’t the idea of using the Hamming distance just as fragile as and-ing the compatibility masks? Yes and no; yes, a small change in requirements might prompt us to forget about the Hamming distance and use something else; and no, the code is not really overfitted because the structure of version 2 hides (in the sense of Information Hiding) those choices pretty well. It doesn’t really matter how we find the candidates to replace a glyph: it’s hidden behind FixupTable.ReplacementsFor(). It doesn’t really matter how we fix errors at the account number level: we can just replace ErrorCorrectingOCR with a different class; we could also make a small change and use a Strategy pattern instead of plain inheritance. No information is lost as we read the input; no overly restrictive promises are made about the inner workings of the code.
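The Strategy variant would amount to extracting something like this hypothetical interface, and having OCR delegate to it instead of being subclassed:

// Sketch: error correction as a pluggable strategy (interface and names are hypothetical).
internal interface IErrorCorrection
{
    Report TryFix(Glyph[] digits, AccountNumber an);
}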
In essence: although I like the code of version 1, I like the structure of version 2 a bit more.
Notes
[Note 1] when the original text says “So your next task is to look at numbers that have come back as ERR or ILL, and try to guess what they should be, by adding or removing just one pipe or underscore” it means “So your next task is to look at account numbers that have come back as ERR or ILL, and try to guess what they should be, by adding or removing just one pipe or underscore”. So, you can add/remove one pipe or underscore from the entire account number, not from each figure (figure vs. number is important here). It’s obvious in hindsight but it took me a minute to convince myself I was reading it the “right” way (also by checking the test cases).
[Note 2] strictly speaking, when we constrain the substitutions, we’re defining a new type of edit distance; I still use the term Hamming Distance for the sake of simplicity.
[Note 3] from my previous post, it follows that if your bit pattern fits in a register, in assembly you just need two instructions to compute the Hamming distance between two patterns: you XOR the registers, then use POPCNT to count the bits set to 1.
[Note 4] I’m using this term w.r.t. Parnas and Clements’ paper “A rational design process: how and why to fake it”, first published in IEEE Transactions on Software Engineering, February 1986.
Links
- Part 1 of my solution to the Bank OCR Kata: https://www.dhirubhai.net/pulse/compact-highly-efficient-solution-bank-ocr-kata-carlo-pescio-22zdf
- The Bank OCR kata original text: https://codingdojo.org/kata/BankOCR/
- GitHub repo for version 2: https://github.com/carlopescio/BankOcrKataV2cs
- Wikipedia page on the Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance