Closing the Data Loop: Genentech’s Lab-in-the-Loop Model in Drug R&D
Thibault GEOUI
Science CDO - Head of AI/ML for Drug R&D - Bridging Science, Data, and Technology (AI) to Help Life Sciences Companies Bring Better Products to Market Faster - LinkedIn Pharma Top 1%
For years, pharmaceutical companies have invested heavily in technologies that churn out enormous volumes of data: from combinatorial chemistry to next-generation sequencing, high-throughput screening, and automation. Yet paradoxically, while the scale of data production has skyrocketed, productivity in drug R&D has continued to decline. This trend is encapsulated in Eroom’s law, a term coined by Jack Scannell, which observes that:
While Moore’s law predicts that transistor counts double roughly every 18 months, drug R&D followed the opposite pattern: from 1950 to around 2010, the number of new drugs developed per billion dollars was effectively halved every 9 years, before the trend stabilized. Adjusted for inflation, it now costs about 100 times more to develop a drug than it did in the 1950s (halving every 9 years over six decades compounds to roughly 2^(60/9), or about 100-fold).
One explanation for this inefficiency is the “brute force approach”: pouring more money into R&D with the expectation that more data will yield more breakthroughs, despite the fact that data generation has remained siloed and rarely integrated into a cohesive decision-making framework. As we enter the era of AI, it’s clear that for technology to truly transform pharma, organizations must be re-engineered from the ground up to harness integrated, iterative, and AI-driven processes, rather than simply generating isolated data points.
What Is “Lab-in-the-Loop” and What Problem Does It Solve?
The paper “Lab-in-the-loop Therapeutic Antibody Design with Deep Learning” introduces a paradigm-shifting system known as “Lab-in-the-loop” (LitL).
This approach directly addresses the disconnect between massive data production and actionable insights in pharmaceutical R&D by creating a closed-loop, active learning framework that continuously integrates computational predictions with experimental validation.
Key components of LitL include:
The system employs multiple generative models, both unguided methods (like discrete Walk-Jump Sampling and SeqVDM) and guided approaches (such as LaMBO-2, DyAb, and PropEn), to create a diverse library of candidate antibody sequences. These models are trained on large-scale protein sequence data and are designed to explore vast regions of sequence space that traditional methods might miss.
Instead of treating each experimental dataset as a one-off artifact, LitL uses property prediction oracles that evaluate key therapeutic attributes such as binding affinity, expression yield, and developability. An active learning framework, using techniques like Noisy Expected Hypervolume Improvement (NEHVI), then ranks these candidates to select the most promising designs for laboratory testing. This step ensures that only those designs which are likely to exceed a preset performance threshold (e.g., a 3× improvement in binding affinity) are moved forward.
The selected candidates are expressed and tested using a streamlined, automated pipeline. Techniques such as surface plasmon resonance (SPR) provide high-resolution binding affinity data, while crystallography offers structural insights into how specific mutations enhance binding. The experimental feedback is then fed back into the machine learning models to refine future predictions, effectively “closing the loop” (a minimal sketch of one such cycle follows this list).
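To make the cycle concrete, here is a minimal, illustrative Python sketch of one LitL-style design round. Every name in it (propose_candidates, predict_properties, run_wetlab_assays, and so on) is a hypothetical stand-in rather than code from the paper, and the selection step uses a simple threshold-and-rank rule in place of NEHVI; it is meant only to show the generate → score → select → test → retrain shape of the loop.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_candidates(lead_seq, n=256):
    """Hypothetical generator: random point mutants of the lead sequence
    (a stand-in for dWJS / SeqVDM / LaMBO-2-style generative models)."""
    out = []
    for _ in range(n):
        seq = list(lead_seq)
        seq[random.randrange(len(seq))] = random.choice(AMINO_ACIDS)
        out.append("".join(seq))
    return out

def predict_properties(seq):
    """Hypothetical property oracles: (predicted affinity fold-change vs. lead,
    predicted expression yield). In the real system these are learned models."""
    return random.lognormvariate(0, 1), random.gauss(50, 10)

def select_batch(candidates, fold_threshold=3.0, batch_size=8):
    """Keep designs predicted to beat the lead by >= fold_threshold in affinity,
    then rank by predicted expression (a simplified stand-in for NEHVI)."""
    scored = [(seq, *predict_properties(seq)) for seq in candidates]
    passing = [s for s in scored if s[1] >= fold_threshold]
    return sorted(passing, key=lambda s: s[2], reverse=True)[:batch_size]

def run_wetlab_assays(batch):
    """Hypothetical stand-in for SPR and expression measurements."""
    return [(seq, random.lognormvariate(0, 1)) for seq, _, _ in batch]

lead = "EVQLVESGGGLVQPGGSLRLSCAAS"  # toy heavy-chain fragment
for design_round in range(4):  # four rounds, mirroring the paper
    batch = select_batch(propose_candidates(lead))
    results = run_wetlab_assays(batch)
    # In a real system, assay results would be appended to the training data
    # and the oracles retrained here -- this is what closes the loop.
    lead = max(results, key=lambda r: r[1])[0]  # promote best measured binder
```

In practice each of these stand-ins is a trained model or an automated assay, and the retraining step inside the loop is what produces the flywheel effect discussed below.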
Addressing the Challenges of Siloed, Brute Force Data Generation
The traditional model in pharma R&D has often been linear and siloed: an experiment is conducted, data is produced and analyzed in isolation, and the results are archived with little opportunity for reuse. LitL breaks this mold by integrating all stages, from in silico design to in vitro validation, into a continuously evolving feedback loop.
This holistic approach not only accelerates the pace of discovery but also mitigates the risk of developing therapeutics that meet one criterion at the expense of others (e.g., improving binding affinity while compromising expression or increasing non-specificity).
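A common way to formalize that multi-criteria guardrail is a non-dominated (Pareto) filter over the competing properties. The sketch below is illustrative only, assuming three objectives framed so that higher is better (affinity fold-change, expression yield, and negated non-specificity); the candidate values are made up.

```python
def dominates(a, b):
    """True if design a is at least as good as b on every objective and
    strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(designs):
    """Keep only designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(o["scores"], d["scores"])
                       for o in designs if o is not d)]

# Illustrative scores: (affinity fold-change, expression yield, -non-specificity)
designs = [
    {"id": "v1", "scores": (10.0, 40.0, -0.2)},  # strong binder, decent yield
    {"id": "v2", "scores": (3.0, 80.0, -0.1)},   # weaker binder, high yield
    {"id": "v3", "scores": (9.0, 35.0, -0.9)},   # worse than v1 on every axis
]
print([d["id"] for d in pareto_front(designs)])  # ['v1', 'v2']
```

A filter like this never trades one property for another silently: v3 is dropped only because v1 beats it on every axis, while v1 and v2 both survive as different, defensible trade-offs.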
Specific Findings and Conclusions
The paper demonstrates the efficacy of the LitL system across four therapeutically relevant targets: EGFR, IL-6, HER2, and OSM. Key findings include:
Over four iterative design rounds, the system produced antibody variants that exhibited 3× to 100× improvements in binding affinity compared to their lead candidates. For example, some designs achieved a 10× improvement, with the best binders reaching therapeutically relevant sub-nanomolar (100 pM) affinities.
In addition to enhancing binding, the system simultaneously improved expression yields and maintained acceptable developability profiles. In silico filters and surrogate models (such as those predicting non-specific binding via BV ELISA scores) ensured that no design carried undue risk of poor pharmacokinetics or manufacturability.
The integration of structural analyses, via SPR sensorgrams and crystal structure determination, provided mechanistic insights into how specific mutations (including insertions, deletions, and substitutions) stabilize the antibody structure and improve binding. These insights affirm that the sequence-based design methods not only predict functional improvements but also capture critical biophysical interactions (a worked affinity calculation follows this list).
The closed-loop nature of LitL, with continuous refinement of both the generative models and property predictors based on experimental data, creates a “flywheel effect” where each round of testing informs and improves subsequent rounds. This is a marked departure from the one-off experiments that characterize traditional R&D processes.
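For readers less familiar with SPR numbers: under the standard 1:1 binding model, affinity is the dissociation constant KD = koff/kon (lower means tighter binding), and a “fold improvement” is simply the ratio of the lead’s KD to the variant’s. The rate constants below are made up for illustration, not taken from the paper.

```python
def kd(k_on, k_off):
    """1:1 binding model: dissociation constant in molar units (lower = tighter)."""
    return k_off / k_on

# Illustrative rate constants (k_on in 1/(M*s), k_off in 1/s), not paper values.
lead_kd    = kd(k_on=1e5, k_off=1e-3)  # 1e-08 M, i.e. a 10 nM lead antibody
variant_kd = kd(k_on=2e5, k_off=2e-5)  # 1e-10 M, i.e. a 100 pM optimized variant

print(f"fold improvement = {lead_kd / variant_kd:.0f}x")  # 100x
```

With these made-up numbers the variant lands exactly at the 100 pM and 100× figures quoted above, which is why KD ratios are the natural way to express the paper’s thresholds and results.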
Implications for the Pharmaceutical Industry
The Lab-in-the-loop approach embodies a fundamental shift in how drug discovery can be conducted in the age of AI. By breaking down traditional silos and enabling an iterative, data-driven decision-making process, LitL directly confronts the inefficiencies highlighted by Eroom’s law. For the industry, this means:
Integrated, AI-driven systems have the potential to dramatically lower the cost per new drug by streamlining the optimization process and reducing redundancy in experiments.
Moving away from isolated data silos, the LitL system fosters a culture where data is continuously recycled and built upon, transforming raw data into actionable insights that inform every step of the development process.
As Marc Andreessen recently highlighted, AI-driven processes cannot simply mirror traditional organizational charts.
Instead, pharma companies must redesign their workflows to be inherently data-centric, ensuring that every decision is informed by a holistic view of the R&D pipeline.
In conclusion, the Lab-in-the-loop system represents a significant leap forward in therapeutic antibody design. It showcases how the integration of deep learning, active learning, and rapid in vitro experimentation can overcome longstanding challenges in drug discovery. For an industry burdened by escalating costs and diminishing returns, this approach offers a promising blueprint for the future: a future where data isn’t just produced in silos, but is actively harnessed to drive meaningful, cost-effective innovation.
Bioinformatics | (meta) genomics | microbiome | data science
6 days ago: This is a very good model, but not new, right? This is basically DBTL (design-build-test-learn), used in biotechnology for many years.
CEO, Etheros Pharmaceuticals Corp.
1 week ago: Thanks Thibault GEOUI. Eroom's law was a great meme. But R&D expenses per drug approval doubled more like every 9 years (1950 to 2010), not every 18 months, before stabilizing around 2010.
Digitalization Leader | Life Science Specialist | Digital Transformation
1 week ago: Thanks for sharing, very interesting read!
Founder & CSO at Iktos
1 week ago: This is exactly the method we use at Iktos for small molecule drug design: https://chemrxiv.org/engage/chemrxiv/article-details/675fee0ffa469535b9cd0d41
Helping accelerate drug R&D & precision medicine using genomics, NLP & AI | Scientific Director | Creator of DISGENET
1 week ago: Super interesting. What kind of data infrastructure is required to support such a model efficiently? Can this paradigm be applied to the later phases of drug development?