Reflection on failure
For the past several months, I have been grinding away at the seemingly impossible problem of deciphering ancient scrolls buried by Mount Vesuvius's eruption 2,000 years ago. I won the Kaggle competition originally associated with this challenge, but that was only the first step in unlocking the information hidden in the scrolls. A million-dollar prize pool was allocated for finding the first words and, eventually, entire passages of text within the scroll. After the Kaggle competition, I dipped my toes into applying my models to the scrolls, but after not finding much success out of the box, I moved on to other things. I simply gave up, concluding our models were useless because they couldn't bridge the domain-shift gap.
I watched as others found the first words by really getting their hands dirty and looking deeply at the data. Disappointed in myself for not showing better grit and resolve, I committed to attacking the broader problem and building systems that could help decipher the whole scroll. I prepped terabytes of data and built model after model, probing the problem from different angles and trying to understand where things could be improved. It was a sophisticated multistage pipeline, so it was difficult to quantify where things were going wrong when there was no ground truth for any of the stages. Everything was exploration.
I started by retraining my state-of-the-art models on the newly found words. This showed great signs of life and found additional text in new regions. I continued to iterate on this, adding pseudolabeling and various augmentations. Critically, I figured out that some of the segments were not flattened in the same orientation, and training a model on both orientations immediately uncovered characters that hadn't been found before. It was like magic seeing a sheet I thought had no text suddenly light up with characters. After a while, I felt there was simply a limit to what the model could extract from the data, so I looked upstream at earlier parts of the pipeline; information had to be getting lost somewhere.
One of the crucial elements of this problem was finding where the sheets of papyrus were wrapped so that they could later be flattened and read as a 3D surface volume. It seems simple on the surface, you're just tracing along a line, but it becomes massively complicated when you have to do it consistently across 14,000 layers of a CT scan. It could take over a million clicks to trace the line representing a single row of letters' worth of space. The community developed tools to speed this up, primarily by letting you trace a single line and then broadcast it to the next layers of the scan with a little extra intelligence so it would stick to the sheet. We knew this process was no substitute for drawing all the lines by hand, but no one had ever quantified how well broadcasting actually worked. I formed a simple heuristic to check how often the points landed on the papyrus and found an obvious pattern in which layers had been done by hand. A layer drawn by hand usually had over 85% of its points land on papyrus, but broadcast 25 layers away that number could fall under 60%.
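A minimal sketch of a heuristic like that, assuming papyrus voxels are brighter than the surrounding air so a simple intensity threshold can stand in for a real papyrus mask (the threshold value and normalization are illustrative, not the exact check I used):

```python
import numpy as np

def on_papyrus_fraction(points, ct_slice, threshold=0.3):
    """Fraction of traced points that land on papyrus in one CT layer.

    points:   (N, 2) array of (row, col) coordinates on this layer
    ct_slice: 2D array of intensities, assumed normalized to [0, 1]
    """
    rows = np.clip(points[:, 0].astype(int), 0, ct_slice.shape[0] - 1)
    cols = np.clip(points[:, 1].astype(int), 0, ct_slice.shape[1] - 1)
    hits = ct_slice[rows, cols] > threshold  # bright enough to be papyrus
    return hits.mean()

# Comparing a hand-drawn layer against one broadcast 25 layers away would
# look roughly like the numbers above: ~0.85 by hand vs. ~0.60 broadcast.
```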
I tried to push in many directions to both speed this process up and improve its quality. I realized I could take the hand-drawn points and use them to train an algorithm to mimic that behavior. I tried various formulations of the problem, completely redoing the pipeline and the representation of the data. Segmentation, skeletonization, connected component analysis, and other analytical approaches all failed. Eventually I found something I thought was quite useful and simple: given a set of input coordinates and a small crop of the region, try to guess the next point on the line. This eventually got nicknamed "ants" because of the black dots I drew as they marched along.
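A minimal sketch of that formulation, assuming a small CNN over the local crop plus the recent point history, predicting the offset to the next point (the architecture, crop size, and history length here are illustrative, not the exact model):

```python
import torch
import torch.nn as nn

class AntStep(nn.Module):
    """Given a crop around the current point and the last few points,
    predict the (dy, dx) offset to the next point on the sheet."""

    def __init__(self, history=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 32)
        )
        self.head = nn.Sequential(
            nn.Linear(32 + history * 2, 64), nn.ReLU(),
            nn.Linear(64, 2),                        # offset to next point
        )

    def forward(self, crop, prev_points):
        # crop:        (B, 1, H, W) patch centred on the current point
        # prev_points: (B, history, 2) recent coordinates, crop-relative
        feat = self.encoder(crop)
        return self.head(torch.cat([feat, prev_points.flatten(1)], dim=1))
```

Trained on the hand-drawn layers, a model like this can be rolled out point by point to extend a partially traced line.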
I built a small prototype around this just to see what it was like using the model as an assistant. I could click a couple of points and then let it follow the line for another 10-100 points. It was semi-useful, but it got left on the shelf because integration into the existing annotation tools was a strong friction point: they were mostly written in C++ while my ants were locked in Python.
Instead, what I realized I could use my ants for was refining the existing points to stick to the papyrus better. It was a simple but compute-expensive process: for every point on every layer of the scan, march the ants forward one step and then march them backward in the opposite direction. Voila, the ants have jittered themselves just enough to stick to the sheets better. Instead of falling from 85% to 60%, the on-papyrus rate might only fall to 70-75%.
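A sketch of that forward-then-backward jitter, assuming an `ant(crop, direction)` callable that returns a (dy, dx) offset like the model above; `crop_at` is a hypothetical helper for cutting a patch around a coordinate:

```python
import numpy as np

def crop_at(layer, point, size=32):
    """Hypothetical helper: cut a size x size patch centred on `point`."""
    r, c = int(point[0]), int(point[1])
    h = size // 2
    return layer[r - h:r + h, c - h:c + h]

def refine_points(points, layer, ant):
    """March each point one step forward with the ant model, then one step
    back, so it settles onto the papyrus sheet."""
    refined = []
    for p in points:
        q = np.asarray(p, dtype=float)
        q = q + ant(crop_at(layer, q), direction=+1)   # one step forward
        q = q + ant(crop_at(layer, q), direction=-1)   # one step back
        refined.append(q)
    return np.stack(refined)
```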
Initial results with these mildly corrected points did recover some additional text, but even poring over a large area it was difficult to find spots where the correction made a visible difference. Refining the points for a segment and then running the unrolling process could take an entire day, or fail under certain conditions I could never fully mitigate.
Disappointed by this, I moved on to what I viewed as the holy grail: eliminating the tracing of these sheets entirely. What I realized was that we didn't need to unroll things in order to read them. In 3D space you can still make out letters. If we could build a 3D ink detector, the result would be a bunch of letters floating in space with all the papyrus dissolved away.
I worked on building out this pipeline, taking my highest-quality 2D predictions and mapping them back to the original 3D volume. It was a heavy process, with my server grinding every pixel back to its original spot in 3D space, but within a few days I had it worked out and was able to start modeling against this new volume. There was a lot of trial and error figuring out what exactly to do with the data once I had it. In 3D the labels were very sparse: I had some ribbons in space I needed to model and lots of nothingness that hadn't been explored or labeled yet. I could not treat this nothingness as lack of ink; it just meant no one had ventured into that area of the scroll yet. I built out a solution that took 128x128x128 cubes out of the volume and tried to learn in 3D the same patterns we'd found in 2D.
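A minimal sketch of how training on those sparse 3D labels can work, assuming a separate label mask marks which voxels were ever annotated so that unlabeled "nothingness" is excluded from the loss rather than treated as no-ink (the sampling strategy and loss are illustrative):

```python
import random
import torch
import torch.nn.functional as F

def masked_bce_loss(pred, ink_labels, label_mask):
    """Binary cross-entropy over labeled voxels only."""
    loss = F.binary_cross_entropy_with_logits(pred, ink_labels, reduction="none")
    return (loss * label_mask).sum() / label_mask.sum().clamp(min=1)

def sample_cube(volume, labels, mask, size=128):
    """Pull a random size^3 cube that contains at least some labeled voxels."""
    for _ in range(100):
        z, y, x = (random.randint(0, s - size) for s in volume.shape)
        sl = (slice(z, z + size), slice(y, y + size), slice(x, x + size))
        if mask[sl].any():
            return volume[sl], labels[sl], mask[sl]
    raise RuntimeError("no labeled cube found")
```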
Initial results weren't great, and I still needed to figure out a couple of things. How do I run inference on the entire 2 TB volume of the scan? How do I read the thing once I do? I optimized the model until I finally got convincing results, built out inference code so I could do a coarse pass of the whole volume in under a day, and could download subvolumes and inspect them in 3D Slicer. I also built a process to unroll my 3D predictions back to 2D based on the original segments in only a few minutes per segment.
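A rough sketch of what a coarse pass over a huge scan can look like, assuming the volume is a memory-mapped or chunked array (e.g. np.memmap or zarr) so the full 2 TB is never loaded at once, and stepping with a stride larger than the cube so most voxels are skipped on the first pass; the model, cube size, and stride are assumptions:

```python
import numpy as np
import torch

def coarse_inference(volume, model, cube=128, stride=256, device="cuda"):
    """Low-resolution ink-likelihood map: one score per stride^3 block."""
    out = np.zeros([s // stride for s in volume.shape], dtype=np.float16)
    model.eval()
    with torch.no_grad():
        for zi in range(out.shape[0]):
            for yi in range(out.shape[1]):
                for xi in range(out.shape[2]):
                    z, y, x = zi * stride, yi * stride, xi * stride
                    chunk = np.asarray(
                        volume[z:z + cube, y:y + cube, x:x + cube],
                        dtype=np.float32)
                    if chunk.shape != (cube, cube, cube):
                        continue  # skip partial cubes at the edges
                    t = torch.from_numpy(chunk)[None, None].to(device)
                    out[zi, yi, xi] = torch.sigmoid(model(t)).mean().item()
    return out  # inspect promising blocks at full resolution afterwards
```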
With this pipeline in place, I figured out a cool trick to significantly boost performance. Pulling inspiration from LLMs, I made something I started calling an LSM (large scroll model). I trained a model on the task of filling in the blank: the input was the original scroll volume with a bunch of holes punched into it, and the output was the original volume, so the model was trained to fill in the blanks the same way BERT was trained on masked language modeling. With this pretraining, I saw a significant boost in performance and generalization. This is roughly where I stand today. I could not get these innovations all the way to the finish line before time was up, but I still strongly believe in what I've done.
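A minimal sketch of that fill-in-the-blank setup, assuming the pretraining data is made by punching random blocks out of an input cube and reconstructing the untouched cube (hole size, hole count, and the reconstruction loss are illustrative):

```python
import torch
import torch.nn.functional as F

def mask_volume(cube, hole=16, n_holes=8):
    """Punch n_holes random hole^3 blocks out of a (D, H, W) cube; the
    target is the original cube, analogous to masked language modeling."""
    masked = cube.clone()
    d, h, w = cube.shape[-3:]
    for _ in range(n_holes):
        z = torch.randint(0, d - hole, (1,)).item()
        y = torch.randint(0, h - hole, (1,)).item()
        x = torch.randint(0, w - hole, (1,)).item()
        masked[..., z:z + hole, y:y + hole, x:x + hole] = 0
    return masked, cube

# One hypothetical pretraining step (model and optimizer assumed):
# inp, target = mask_volume(cube)
# loss = F.mse_loss(model(inp[None, None]), target[None, None])
```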
I think a lot about my errors and what I could have done differently. One of the things I walk away with is a deeper appreciation for the reality that the wrong metric can mislead you. I measured my performance with the Dice score and kept optimizing it to accuracy levels I never thought possible, but in hindsight this didn't yield more readable text. A very clean and strong model could score 0.77 while a different model scored only 0.65, but the 0.77 model might lose its 23% by turning an "F" into a "P", while the 0.65 model correctly formed the "F" but had salt-and-pepper noise all over it. This is simply a reality of data, especially complex data: it is difficult to summarize its characteristics in a single number.
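For reference, the metric itself is simple, which is exactly why it can't distinguish how the error is distributed; the scores in the example above are from the text, not from this toy code:

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient: 2|A∩B| / (|A| + |B|) over binary ink masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

# A single localized structural error (a stroke that turns an F into a P)
# and scattered salt-and-pepper noise can cost a similar number of pixels,
# so the metric cannot tell a readable letter from an unreadable one.
```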
I knew what I submitted wasn't the best I had to offer; the model I submitted finished training in October, and everything after that didn't make the December 31st cut. I landed on the wrong side of explore vs. exploit in the end. It's very disappointing I couldn't walk away with something to show for it all, but I think about what someone truly elite would do in response to a setback. There's no benefit in moping. It's time to learn, move on, and execute, treating this not as a failure but as just another step in the discovery process.
Congratulations to all of the winners. The community is a large part of why I've stuck around and spent so much time on this problem. It would be hard to keep motivated if it didn't feel like everyone was inching closer to the finish line together.
Project Lead, Vesuvius Challenge
7 months ago: We are so grateful for your contributions, Ryan Chesler! Your knack for focusing in on the data and visualizing it has led to multiple "aha" moments for me after years of working on this problem. I love what you write about everyone inching closer to the finish line together. That inching forward is what it's all about to me. You're playing a huge role in that!
Member of Parliament, City Councillor, Expert in financial & insurance risk and venture capital, Physicist, Entrepreneur, Author and Researcher.
7 months ago: Thanks for your report! It has been very useful for me to see what has been tried so far. The key issue seems to be that, unlike with open fragments with visible ink, we do not have clear and strong "ground truth data" with which we can label areas as ink and train the algos. The winners had to rely on the cracks from dried ink patches and patiently and diligently work and guess from there.
Senior Software Engineer at PlayStation
1 year ago: Your process was really cool to read, thanks for writing this. I haven't heard of anyone who hit it big without failure anyway.
User Experience Engineering and Design Leader
1 year ago: Amazing effort Ryan. Inspirational!
JPMorgan Data Science | Kaggle Competitions Grandmaster
1 year ago: I love "day after" writeups like this. Thank you for sharing an interesting and detailed post with photos and videos to boot. Don't be too hard on yourself, because your curiosity and persistence to innovate are very clear.