5 AI Studies Every Builder Must Know (but probably doesn’t)
Devansh
Chocolate Milk Cult Leader | Machine Learning Engineer | Writer | AI Researcher | Computational Math, Data Science, Software Engineering, Computer Science
Thank you Tampa for all the love. I couldn't find the time to meet all of you who reached out, but I really appreciate the cultists pulling up to make this a fantastic time. SF next (15th night to 21st early morning).
When people think about AI products, the focus is often on the best models or algorithms to solve problems. However, I think this approach is short-sighted, and it misses how tech products (and I would argue solutions anywhere) are deployed in practice. In a nutshell, this approach emphasizes individual aspects of the system while ignoring the larger system within which the solution operates. Such an approach leads to problems like-
all b/c you didn’t think about what you were building.
Put another way, building great AI isn't just about technical brilliance or chasing improvements in specific aspects of the product. It’s about recognizing crucial underlying systems, subtle patterns, and human factors that determine whether your AI thrives—or fails. Building solutions with those larger underlying currents in mind will ensure that the smaller pieces fall into place without effort.
Building based on techniques is like a counter-striker learning to win by focusing on their pull counter or their slip + liver shot. Building by thinking about systems is the counter-striker learning about foot placement, baits, and setups- getting their opponent to throw the shot that they most want. The former gets highlight reels; the latter puts opponents away.
Anime- How Heavy are the Dumbbells you Lift? Not great, but not a bad time kill.
In this article, I’ve personally picked out four powerful studies that anyone involved in AI should know. We’re not going to talk about specific models, techniques, or technical concepts, but instead about studies highlighting deeper principles and truths about building AI and software systems that will always be true, regardless of what technical developments happen.
Specifically, we will cover research that helps us answer:
If you want to know the answer to these questions, keep reading. One of the publications here is my favorite AI read EVER, so you absolutely should not miss that.
This article was originally published on my AI newsletter, AI Made Simple, on Substack over here. If you want to ensure that you get such high-quality analysis delivered to your inbox at no cost, sign up here. If you like my work and think it's valuable, please consider becoming a premium subscriber to the newsletter over here, for access to special articles, "inside" information from my network, and lots of style points.
I provide various consulting and advisory services. If you'd like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.
Executive Highlights (TL;DR of the article)
We will focus on the following papers-
“Accounting for Variance in ML Benchmarks”: This study investigates how reliable machine learning benchmarks really are, given the many sources of random variation in model training. The researchers modeled the entire benchmarking process and found that variance from factors like data sampling, model weight initialization, and hyperparameter choices can dramatically affect reported performance. Teams had a strong propensity to underestimate this high variance and were often picking the wrong models due to incomplete evaluations. This paper is a must-read in the age of Generative AI, where the evaluation protocols are often a lot flimsier than those of even traditional ML pipelines (most people who end up building AI apps are not true AI people, but simply people who call AI APIs with Cursor, and thus overlook these foundational but non-obvious aspects of ML Engineering).
Another interesting bit from this study is their recommendation for how to run AI evaluations better. Their approach was counter-intuitive: adding more randomness to your evals gives you better estimates, “We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that adding more sources of variation to an imperfect estimator approaches better the ideal estimator at a 51 times reduction in compute cost.” Here’s my working hypothesis on why this happens: most AI optimizations happen within a set of narrow/similar configurations (this is the theory behind Bayesian Optimization, which uses Bayes’ theorem to better estimate hyperparameters). So we can reasonably assume that if a model does well on Task A, it will do well on tasks very similar to A (Adversarial Perturbation is an exception, but it isn’t as much of a concern in most commercial apps since users aren’t normally trying to provide AP-affected inputs). So instead of exhaustively searching the neighborhood of A, we can spend those resources zipping around the search space and looking at other possible input values (this is what a lot of very good human testers do as well). It’s kind of the inverse of a principle we discuss at length- ensemble models do better than big singular models, even normalizing for computing resources, since ensembles sample from a more diverse search space, ensuring broader coverage of possible behaviors.
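To make that concrete, here is a minimal sketch (my own toy setup with scikit-learn, not the paper's protocol) of the difference between the single fixed-seed run most leaderboards report and a handful of runs where the train/test split and the model's internal randomness are re-randomized each time:

```python
# A minimal sketch of "randomize the eval" vs. a single fixed run.
# Dataset, model, and split sizes are placeholders, not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def one_benchmark_run(seed: int) -> float:
    # Each run re-randomizes two sources of variance the paper highlights:
    # the train/test split and the model's internal randomness.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

single_run = one_benchmark_run(seed=42)  # what a single leaderboard entry reports
scores = np.array([one_benchmark_run(s) for s in range(10)])

print(f"single run: {single_run:.3f}")
print(f"10 runs:    {scores.mean():.3f} +/- {scores.std():.3f} "
      f"(min {scores.min():.3f}, max {scores.max():.3f})")
```

The spread across runs is the thing a single score hides, and it is often wide enough to swallow the small deltas teams use to declare a "winner".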
“System design and the cost of architectural complexity” looks at how expensive architectural complexity is. And it’s not pretty. “Measures of architectural complexity were taken from eight versions of their product using techniques recently developed by MacCormack, Baldwin, and Rusnak. Significant cost drivers including defect density, developer productivity, and staff turnover were measured as well. The link between cost and complexity was explored using a variety of statistical techniques. Within this research setting, we found that differences in architectural complexity could account for 50% drops in productivity, three-fold increases in defect density, and order-of-magnitude increases in staff turnover.”
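For context on what those “measures of architectural complexity” look like: MacCormack, Baldwin, and Rusnak's metrics are built on dependency matrices, and one of their headline metrics, propagation cost, asks how much of the system a change to one module can reach. Here is a rough sketch of that idea on a made-up four-module dependency matrix (an illustration of the concept, not their exact tooling):

```python
# A rough sketch of MacCormack/Baldwin/Rusnak-style "propagation cost":
# the share of the system that a change to one module can reach through
# the dependency graph. The 4-module dependency matrix below is made up.
import numpy as np

# deps[i, j] = 1 means module i depends directly on module j
deps = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

def propagation_cost(deps: np.ndarray) -> float:
    n = deps.shape[0]
    # Visibility matrix: which modules each module can reach, directly or
    # indirectly (transitive closure of the dependency graph, incl. itself).
    visibility = (np.eye(n, dtype=int) + deps) > 0
    for _ in range(n):  # repeated squaring covers paths of any length
        visibility = (visibility.astype(int) @ visibility.astype(int)) > 0
    # Density of the visibility matrix = average reach of a single change.
    return visibility.sum() / (n * n)

print(f"propagation cost: {propagation_cost(deps):.3f}")  # 0.625 for this chain
```

The higher that number, the more of the codebase a "small" change can ripple through, which is what shows up later as defects, lost productivity, and turnover.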
The study was conducted within a successful software firm, which adds to its practicality. Another important point: I think this study underestimates the long-term impacts of complex architectures. Based on my own experiences and conversations, complex architectures tend to demotivate and push away the more enterprising engineers, much more than they push away engineers who don’t take personal ownership of things. This means that the long-term damage compounds and the culture can become incredibly stale, where nothing gets done b/c the “tenured engineers” on that team do nothing to reduce complexity and new hires either adopt that mentality or leave (creating what we call the “babu mentality” in India). All the while, the code base slowly gets worse and worse.
The above gets worse when we have out-of-touch upper management that pushes change recklessly, leaving unresolved threads and logic pieces in the code base, as the core team is constantly forced to accommodate the whims of overpaid executives who think they suddenly understand technology b/c they have ChatGPT explain very surface-level ideas to them and code basic demos.
Yes, I’m thinking of a specific company as I write this. Yes, you know the company, and even the product. I won’t say which one, but that company will probably see a high profile exit due to these tensions. Fun times. We may revisit this when it happens, depending on what other topics I will have in my pipeline then.
Lastly, I think this study becomes even more important with coding-based tools. None of the coding tools I’ve played with are great at generating code within very large existing code bases (Augment is by far the best at understanding code bases, so it’s my go-to, but even that falls apart very quickly if not guided). Applying AI to complex code bases will likely lead to a lot of conflicts, especially when we consider that AI-generated code tends to be more verbose, adding to the complexity already there. Teams should be vigilant about cutting down architectural complexity so that they don’t have to deal with this nightmare fuel.
“What Distinguishes Great Software Engineers?”: Through a large survey of 1,926 senior engineers and 77 follow-up interviews, the researchers pinpointed the five attributes most essential to engineering “greatness”. These top five traits were: writing good code, adjusting behaviors to account for future value and costs, practicing informed decision-making, avoiding making others’ jobs harder, and learning continuously. In plain terms, great engineers excel technically (code quality), think ahead about long-term implications, make thoughtful decisions, collaborate without creating friction, and constantly upgrade their skills.
Interestingly, the highest-ranked dimensions of greatness center on the ability to work with the team, demonstrating holistic and team-oriented skills (as opposed to the myth of the “genius hacker” who interacts with no one and does whatever they want).
The study also identifies two habits of bad coders. Both were very surprising, but made a ton of sense once I read them. The main section will cover them.
The next study is one of the richest studies on AI that I’ve ever come across. I won’t be able to add a deep dive here b/c that would make this article very long. But this paper has so much wisdom that I would hate for you to miss out on, so I’m going to give you a tl;dr of the most important ideas for now. We will do a separate deep dive on this piece in the future. I think everyone in AI should be familiar with this paper.
“Operationalizing Machine Learning: An Interview Study”: This is one of the most goated studies in AI, and it makes me sad that people don’t speak of it with the same reverence as the “Attention is All You Need” paper. Casuals doing Casual tings I guess. In fact, this is the only study I’ve ever covered where I had to do two different (very comprehensive) breakdowns just to do it some justice (this and this).
This interview study examined how organizations operationalize machine learning (ML), i.e. how they deploy and maintain ML pipelines in production settings. Through in-depth interviews with 18 ML engineers spanning domains like chatbots, autonomous vehicles, and finance, the researchers mapped out the end-to-end MLOps process and its pain points. They found that successful production ML systems revolve around three core capabilities, dubbed the “three Vs”: velocity, validation, and versioning.
The interviews revealed that many common issues in ML deployment arise from tension between these goals. For instance, pressure for fast iteration (velocity) can conflict with thorough testing (validation), leading to mismatches between development and production environments or bugs that slip through. Likewise, without good versioning, data errors can creep in or it becomes hard to reproduce results, undermining validation. Here is a list of the aforementioned errors-
Overall, the research paints a picture of MLOps as a continuous loop of data collection, experimentation, deployment, and monitoring – one that needs infrastructure and practices that address speed, quality assurance, and reproducibility in unison.
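To make the three Vs slightly more tangible, here is a minimal sketch of how they might show up at a single promotion step: versioning as a reproducible fingerprint tying model, data, and code together, validation as a gate, and velocity as the pressure that gate has to withstand. All names and thresholds below are hypothetical, not from the study:

```python
# A minimal sketch of the "three Vs" at a promotion step.
# All names, paths, and thresholds are hypothetical illustrations.
import hashlib
import json
from dataclasses import dataclass

@dataclass
class CandidateModel:
    model_path: str
    data_snapshot: str   # versioning: which data snapshot trained the model
    code_commit: str     # versioning: which code produced it
    offline_accuracy: float

def fingerprint(candidate: CandidateModel) -> str:
    # Versioning: a reproducible ID tying model, data, and code together,
    # so any production result can be traced back and reproduced.
    payload = json.dumps(candidate.__dict__, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def promote(candidate: CandidateModel, baseline_accuracy: float) -> bool:
    # Validation: a gate between fast iteration (velocity) and production.
    # Only promote if the candidate clearly beats the current baseline.
    if candidate.offline_accuracy <= baseline_accuracy + 0.01:
        print(f"rejected {fingerprint(candidate)}: no clear win over baseline")
        return False
    print(f"promoted {fingerprint(candidate)}")
    return True

promote(CandidateModel("models/v2.pkl", "data/2024-06-01", "abc1234", 0.91),
        baseline_accuracy=0.88)
```

The point isn't this specific gate; it's that velocity, validation, and versioning all have to be handled at the same step, which is exactly where the paper finds teams struggling.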
Let’s break each of these down in more detail.
1. Accounting for Variance in ML Benchmarks – How teams pick the wrong models
We begin by questioning a cornerstone of AI development: benchmarks. Benchmarks are far more variable and less reliable than we often assume. This has some very important implications for how we approach AI. For best results, it’s very important to change how we view benchmarks and the results of our AI evaluations.
The core idea, often overlooked in the daily rush of AI development, is that every benchmark score is not a fixed point, but rather a sample from a distribution influenced by numerous random factors. These factors spawn everywhere- from the seemingly innocuous act of splitting your data into training and test sets, to the random initialization of model weights during the learning process, to the “intelligent” choices we make during hyperparameter tuning (we’ll pretend there is a lot of thought required here so that you can keep playing ping pong while the model finds the best configurations) – randomness is deeply ingrained in the very fabric of evaluating AI models. This inherent randomness introduces variance, a significant factor that can dramatically skew our perception of model performance and lead us down the wrong path.
The various ways we could induce variance into our learning agents. The numbers can’t be ignored. The variance can literally change the results of your comparison.
This variance is not a two-bit statistical phenomenon that gets ignored like the “+C” during integration; it's a critical signal, telling us that a single benchmark score is merely one snapshot from a spectrum of possible outcomes. It speaks to the range of performance we can realistically expect from a model. And ignoring this can cause some major problems-
In case you missed that, let me reiterate- “Properly accounting for these factors may go as far as changing the conclusions for the comparison”. In other words, trained pros were picking the wrong models.
This can get worse when you start accounting for more nuanced eval protocols- accuracy + stability is a big one. Some models might have very high peaks, but also fail more spectacularly, while others are more stable. In this case, the latter model will be easier to build around, which creates a strong reason to pick it, one that can be overlooked if your evals don’t account for this angle. This isn't just a theoretical concern; it's a practical, costly error in AI development that I've witnessed time and again in the field.
The problem is amplified even further in Generative AI. Here, evaluation moves beyond the relatively structured world of classification accuracy and into the murkier waters of subjective assessments and less standardized metrics. Evaluation protocols for generative models are often far less rigorous and even more prone to variance than traditional ML pipelines. If benchmark variance is a problem in traditional ML, it's bordering on a crisis in Generative AI if left unaddressed. It’s no coincidence that multiple non-technical AI experts (investors looking at AI, policy makers, etc.) who have studied LLMs have all independently come to the conclusion that benchmarks aren’t very useful.
The paper proposes a solution that, at first glance, seems paradoxical: fight randomness with randomness; embrace more variation in evaluation. Instead of clinging to the illusion of precision offered by single benchmark runs, we need to intentionally inject more variation into our evaluation process. Run benchmarks multiple times – not just once or twice, but five, ten, twenty times, systematically varying random seeds, data splits, and even configurations. Think of it as stress-testing your model under a multitude of conditions, probing its performance limits, and mapping its true capabilities.
This "noisy" evaluation approach might feel less efficient, less precise on the surface. But it's precisely this seemingly counter-intuitive strategy that provides a far more robust and reliable understanding of a model's true potential. By testing across a wide set of configurations, we can get a strong understanding of the models behaviors and metrics w/o needing to run on every possible configuration. This is where the 51x reduction in cost that the paper described comes from- randomized sources of variance allow the evals to jump through the search space very easily, creating a good understanding of the kinds of performance your model might have.
Now because I love you, my dearest reader, I’ve studied the literature, thought back through my own experiences, and spoken to a lot of people to give you some steps that you can apply-
Even Deep Research (which is a great product that I’m paying for, FYI) asked very different follow-up questions for identical input prompts, leading to different outputs. Any AI builder should always have strong cross-validation in their evals.
Let’s move onto the next section. Once you have strong evals, you need to take a step back and think about your system. Specifically, how easy it is to deal with.
How Complexity comes with a Death Sentence
I don’t think I need to come in and tell you about why complexity is a bad thing. So I’m going to skip the repetitions and the stats and go straight to the solutions for this. This way we both save time-
To really make this work, you need to hire (or become) a great software engineer. This is where Microsoft’s research can be extremely insightful. Let’s see what we can learn from it.
What Defines Great Software Engineers?
We already know the traits from the tl;dr, but let’s redefine them here to keep things easy-
Nothing too shocking, but good to see them above other traits. Helps us focus our attention.
How can you develop these traits? Let’s do that next.
Cultivating these traits
Read more. YT videos on informative topics also count.
That’s really it. I could spin 20 stories, but this is what it boils down to.
We on the same page? Let’s cover the next part- what should you read?
There is a strong base of three types of reading material that will help you a ton. I share different examples of these 3 all the time. Any guesses for what they are?
Are you sure you have no guesses?
I’ll give you a cookie if you get it correct.
Our holy Trinity is the following-
Our self-guided learning plans have focused on the last 2 (reach out if you want a specific one), since I have the expertise to tell you how to learn from them. The first kind of source isn’t something I’ve studied formally, so I wouldn’t be the right person to tell you how to approach it beyond the general advice- be curious, ask questions about the source, and talk to people to see how the principles apply IRL. That’s what I’ve done. If you have something to add, please do share.
To this base, add whatever interests you. Depending on your interests, goals, and inclinations you will spend your time on each of these differently. That’s fine. In the end, that will help you develop skills and ideas that are unique to you. And that is where you will make amazing career leaps.
Reading more will help you make better decisions. It will help you foresee some challenges you might face, know how different people solved them in related (or different) domains, and see which engineering decisions will give you the highest ROI. It will also expose you to various best practices/design decisions that will ultimately help you create code/solutions that are functional, performant, and easy to work with/modify.
Now let’s get into the deadly programming sins that you should avoid.
The 2 Deadly Sins for Developers
We’ve already covered the 2 terrible sins. Let’s talk about why they’re bad and what you can do instead.
Hard work
This might come as a shock to most of you. Hustle Culture really loves Sigmas that love to grind. And it is important to put in the work. If you’re trying to run a marathon, you have to put in the miles. No way around it. So why is this considered a bad thing-
However, if a developer is consistently finding themselves working 8-hour+ days, they’re probably doing something wrong. As we’ve talked about many times, it’s not about doing more but doing the right things. This is why I emphasize the need to take a step back at times and analyze things as a whole. 1 High Impact Decision > 10 Medium Impact decisions.
…workload for a developer is a function of management and planning happening above that developer. Usually long working hours are needed, because the planning was not good, the decisions made during the project lifecycle were bad, the change management wasn’t “agile” enough
-From the paper
The key is in chilling out. Instead of rushing into a problem, spend your time thinking of the details (forecasting the future). Pick the most impactful, simple areas to tackle. Keep the end goal in mind as you proceed. Less is more. Great results can be attained by doing very little (comparatively).
This is not to say that you will never have to do these long days. Challenges crop up all the time. Just don’t make those long days the norm. Most of your time should be spent thinking, planning, and considering details. The grind should be a very rare event. Not something that happens on a monthly basis.
Moving on to the second sin.
Trading Favors
Imagine you helped someone fix something, and then you called upon them to help you. What’s wrong with this?
Nothing. Absolutely nothing.
The problem arises when people start forming clusters. You and your crew go to each other for help, and help mostly each other. Not because of drama or anything in particular. Humans are tribal creatures, so this is only natural. We will naturally gravitate to the people we are familiar with. So why is this a bad thing?
I covered Conway’s Law a while back. It shows that biases in team structures tend to propagate throughout the system. Trading favors amongst a group will add a layer of bias that will show up in the solutions.
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
-Melvin E. Conway
There is a simple tweak to solve this. Instead of helping/going to only a select few, reach out to more people. Actively pursue more and more people who might be able to help you. Post your challenges on the company boards/comm channels and work with a greater variety of people. This will allow you to avoid this problem while also allowing you to meet more people. Win-win.
I selected these papers b/c I believe that they form a strong foundation for every team, irrespective of what they’re doing. I’ve found that everyone I’ve ever worked with has benefitted from the ideas discussed in these papers, and I often find myself quoting these studies.
Are there any studies you would add to this list for a part 2? Would love to hear them.
Like this post? Please consider sharing it with someone you think will benefit from it. And if you want more of such posts, sign up for AI Made Simple over here and never miss an important update on AI.
Thank you for being here, and I hope you have a wonderful day.
Dev <3