Lost in translation

Usman — Fri, 10 Apr 2026 06:10:50 GMT

As an outsider to AI for biology, I was excited about the potential of what AI could do for drug discovery. I spent a good part of this last year realising why biology matters and why it might be the most important thing for me to work on. Abhishaike (The Owl Man) has a much better piece on why anyone should be working on biology. As an ML researcher coming to this field, and chronically online on Twitter, I had somewhat higher expectations for where things stand. The claims from insiders and new startups sometimes left the impression that we will get much better drugs, much faster, and for much cheaper. Longer lives for all and maybe even freedom from aging, from disease, from biological fragility altogether, and no pain of losing a loved one. After digging through the literature and understanding the computational landscape of AI in biology, I came away with a more nuanced view. Let’s start with the story of a drug called DSP-1181 (one can make a case that biologists are even worse at naming things than software developers).

In January 2020, Exscientia and Sumitomo Dainippon Pharma announced that a molecule called DSP-1181 had entered clinical trials. After a drug is tested computationally and in animals, it needs to go through several phases of testing in humans to make sure it is safe and effective. The drug discovery pipeline for DSP-1181 used machine learning to design a serotonin receptor agonist for obsessive-compulsive disorder, and the time it took was the main headline: twelve months from initial screening to a drug candidate ready for human testing, versus an industry average of 4-6 years. Machine learning was used here to narrow down the search space of possible molecules and to make sure that DSP-1181 bound its target with high potency and selectivity. In the preclinical stages, the models performed well at their intended tasks. DSP-1181 had the solubility, stability, and safety profile that typically takes years of trial and error to achieve.

However, the program was later discontinued after the study failed to meet its expected criteria. In other words, in initial human testing, the drug was safe and it reached the bloodstream, but that never translated into treating the disease properly.

The story of DSP-1181, like that of other AI-discovered drugs, is a good way to tether ourselves to reality. It forces us to look at what success actually means once we leave the computational realm, and why we do not have zero-shot drug discovery and probably will not for a very long time.

Eroom’s Law

Drug discovery is a scientifically and economically complex process. Insiders casually talk about how it takes more than $2.5 billion and timelines of ten or more years to bring a new drug to market. The process of finding a cure burns through cash, and most attempts fail. In 2012, Jack Scannell and his colleagues called this pattern Eroom’s law, Moore’s law spelled backwards. They argued that the number of new drugs approved per billion dollars of R&D has halved roughly every nine years since 1950.
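To get a feel for what a nine-year halving time means, here is a minimal sketch. It assumes a clean exponential decline normalised to 1.0 in 1950, which is a simplification of the trend; the function name and parameters are mine, not something from Scannell's paper.

```python
def relative_productivity(year, base_year=1950, halving_years=9.0):
    """New drugs per billion dollars of R&D, relative to base_year.

    Assumes a smooth exponential decline with a fixed halving time,
    which is an idealisation of the pattern described above.
    """
    return 0.5 ** ((year - base_year) / halving_years)


# Each step of nine years cuts the relative figure in half.
for year in (1950, 1959, 1986, 2013):
    print(year, relative_productivity(year))
```

Under that assumption, the seven halvings between 1950 and 2013 leave each R&D dollar buying 1/128 of the approvals it once did.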

Despite all the improvements in technology over the past few decades, including high-throughput screening, combinatorial chemistry, genomics, proteomics, and computational modeling, none of them has translated into higher overall productivity. Researchers keep getting better at the early stages of discovery: identifying targets, designing molecules, testing compounds in cells and animals. But the late-stage failure rate has not budged. Most drugs that enter clinical trials still fail, and most of those failures come in Phase II or Phase III, after the bulk of the money has already been spent. Ruxandra Teslo has written some very good primers on this topic recently.

This is a very strange thing to wrap your head around if you are like me, coming from the world of software. We are used to tools getting better and productivity going up, but in drug discovery, the tools have gotten better, and productivity and output are still going down.

AI In Biology

Statistical models in biology go back decades. The earliest ones relied on simple linear regression, trying to correlate genetic variations with observable traits or disease risks, things like how fast you metabolize a drug or whether it will have toxic effects in the body. As computational power increased and machine learning techniques advanced, the models grew more sophisticated, but the basic approach stayed the same for a long time.

In the early 2010s, the field was mostly Quantitative Structure-Activity Relationship (QSAR) modeling and property prediction. QSAR relates a quantitative description of a molecule’s chemical structure to a physical property or biological activity. Models were trained to predict properties like solubility, binding affinity, and permeability from a molecule’s chemical structure. The models were narrow: each property of a drug got its own model, trained on its own dataset, where the input was the chemical structure combined with other features scientists cared about and the output was a prediction for one very specific property.
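As a rough illustration of how narrow these models were, here is a toy QSAR-style sketch in plain Python: one descriptor in, one property out, fit by ordinary least squares. The numbers are made up for illustration and the descriptor and property names are hypothetical; real QSAR models use many descriptors and measured data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Made-up toy values, NOT real measurements: one structural descriptor
# per molecule and one measured property per molecule.
descriptor = [0.0, 1.0, 2.0, 3.0]
log_solubility = [1.0, 0.0, -1.0, -2.0]

slope, intercept = fit_line(descriptor, log_solubility)


def predict_log_solubility(x):
    # One narrow model per property: this one knows nothing about
    # binding affinity or permeability, which would need their own fits.
    return slope * x + intercept


print(predict_log_solubility(4.0))
```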

The progression from here rhymes with other fields where deep learning started gaining traction. Around 2016, deep learning’s success in vision and language processing started translating into drug discovery, and new architectures emerged that were better suited to chemical and biological data. Graph neural networks (GNNs) were one key development during this time period. A chemical already has a natural graph structure where the atoms are nodes, and the bonds connecting them are edges. Prior approaches had to convert this structure into a flat list of features, which meant losing information about how atoms were connected to each other. GNNs could take the molecular graph as input directly, reasoning about structure in a way that matched how chemists actually think about molecules.
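The atoms-as-nodes, bonds-as-edges idea can be sketched in a few lines. This is a hand-rolled toy, not a real GNN: the atomic-number feature and the "own value plus mean of neighbours" update are assumptions I picked for illustration, where an actual network would use learned weights.

```python
# Ethanol's heavy atoms (hydrogens left out for brevity): C-C-O.
atoms = ["C", "C", "O"]      # nodes
bonds = [(0, 1), (1, 2)]     # edges, as pairs of atom indices

# Adjacency list: which atoms is each atom bonded to?
neighbours = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbours[a].append(b)
    neighbours[b].append(a)

# Hypothetical starting feature per atom: its atomic number.
atomic_number = {"C": 6, "O": 8}
features = [float(atomic_number[symbol]) for symbol in atoms]


def message_passing_step(features, neighbours):
    """Each atom's new feature is its own plus the mean of its neighbours'."""
    updated = []
    for i, own in enumerate(features):
        incoming = [features[j] for j in neighbours[i]]
        updated.append(own + sum(incoming) / len(incoming))
    return updated


print(message_passing_step(features, neighbours))
```

After one step the two carbons no longer look identical, because one of them sits next to the oxygen. That connectivity information is exactly what a flat feature list tends to throw away.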

Being able to represent proteins and genomes as sequences led to an increasing use of transformers (the architecture powering LLMs) in biology. A protein, which is a chain of amino acids, can be written as a string of letters where each letter stands for one of the 20 standard amino acids. For genomes, the letters A, T, C, and G written in a specific order encode the information in our genes, with each group of three letters specifying one amino acid.
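In practice, "a string of letters" becomes a list of integers before a transformer ever sees it. Here is a minimal sketch using the common one-letter amino acid codes; the alphabet ordering and the `tokenize` helper are my own choices, not any particular model's vocabulary.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes
NUCLEOTIDES = "ACGT"


def tokenize(sequence, alphabet):
    """Map each letter to its index in the alphabet."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    return [index[letter] for letter in sequence]


# A short hypothetical protein fragment and a short DNA fragment.
print(tokenize("MKT", AMINO_ACIDS))
print(tokenize("ATGC", NUCLEOTIDES))
```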

One of the key examples of this paradigm is DeepMind’s AlphaFold 2. Its release was a significant step for AI-driven drug discovery, as most drugs work by binding to specific proteins in the body, and to design a drug that binds well, you need to know what the protein looks like in 3D. The amino acid sequence alone doesn’t tell you that, and experimentally determining a protein’s structure used to take months or years. Structure prediction models like AlphaFold shortened those timelines by predicting a protein’s 3D structure from just its sequence in minutes or hours.

In the last few years, with the success of LLMs, foundation models have become popular in computational biology. These models are defined by scale and self-supervision. They use much bigger architectures and train on far more data (hundreds of millions of protein sequences, tens of millions of chemical structures), and they learn general patterns from unlabeled data before being fine-tuned for specific tasks. We now have protein language models like ESM for structure prediction, chemical models for molecular generation, and diffusion models that can design new proteins with specified properties.

If you zoom back out and look at the whole timeline, there is a funnel effect to be observed. We started with hyper-specific objectives and very small datasets. Over time, two things happened at once: the scope of problems we could attack computationally kept expanding, and the architecture moved from narrow task-specific models to foundation models that handle multiple tasks at once. The problem space expanded because ML proved useful, and the model architecture broadened because we realized that pretraining on everything helps with specific tasks.

The Generalisation vs Translational Gap

If our models are getting better and we are collecting more data over time, then why can’t we turn around Eroom’s law? Surely, if models like AlphaFold are reducing the timelines of protein folding from months to days or minutes, then we should be getting closer to this promised future world where all the diseases will be cured by AI?

The reality, however, has a surprising amount of detail to it, and getting a drug to work in humans is not as simple as asking a model to generate the right compound with the properties and characteristics we want. There are a few gaps that indicate we aren’t zero-shotting drugs anytime soon. The first is the generalization gap, the standard failure mode that anyone working in ML will know. The generalization gap is when a model does well on its training data, but the results don’t carry over to situations it never encountered in training. While models like AlphaFold have made structure prediction much easier than before, they still have a long way to go before they generalize well. Claus Wilke has written some well-thought-out pieces about this here. To sum it up, biology is very difficult, and there are still far too many edge cases left to cover.

But building models for drug discovery introduces another gap, which I call the translation gap. The translation gap is arguably harder than the generalisation gap. This is when your model’s predictions might generalize well within the computational domain, but still fail when you actually test the compounds in a lab or in humans. A common failure mode here is that there is no practical way for a chemist to build the compounds generated by the models, i.e., the compounds can’t be synthesised in the real world. The other, and most important, failure mode is what we saw earlier with clinical trials and Eroom’s law, where compounds that work in the lab don’t work in humans.

Why is the translation gap uniquely difficult in biology? One of the issues is incomplete biological representation. In silico (computational) methods can’t capture full biological complexity. They’re typically used alongside in vitro (test tube) data, both to create the model and to test it, but a disease in a human involves many interacting molecular pathways, which makes it very difficult for computational models to replicate.

In most other AI domains, if your model generalizes, you’re mostly done. In drug discovery, even a perfectly generalizing model might fail in the lab because biology has context-dependent behavior that’s hard to capture in the training data.

This matters because it means the standard ML playbook (bigger models, more data, better benchmarks) doesn’t automatically translate to better drugs. You can have SOTA performance on every benchmark and still not move the needle on actual drug discovery.

Where are we?

The pattern across the industry is clear. AI has widened the top of the drug development pipeline and sped up the early stages. Generative models can now propose structures that satisfy multiple constraints at once. Property prediction, docking, ADMET models, and toxicity flags are all meaningfully better than they were ten years ago and can go through millions of candidates very quickly.

We can see the impact in the real world (albeit with a small sample size): AI-discovered drugs have a much higher success rate in the first phase of clinical trials. However, the late stages, which are also the expensive stages, remain as brutal as they have always been. Most of the money in drug development goes into clinical trials. Most of the failures happen in Phase II and Phase III, after all the early computational and experimental work is already done, and the models we have built so far have not made a huge dent in the success rate here.

If you widen the top of the pipeline by a factor of ten but do not change the success rate in the late stages, you have not actually changed the economics of drug development. You have just given yourself more candidates to fail expensively.
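The arithmetic behind this is simple enough to sketch. All the numbers below are hypothetical and in arbitrary units; the point is only what happens when the candidate count goes up tenfold and the late-stage success rate stays where it is.

```python
def pipeline_economics(candidates, success_rate, cost_per_trial):
    """Expected approvals, failures, total spend, and spend per approval."""
    approvals = candidates * success_rate
    failures = candidates - approvals
    total_cost = candidates * cost_per_trial
    return approvals, failures, total_cost, total_cost / approvals


# Hypothetical inputs: a 10% success rate and a cost of 100 units per
# candidate taken into trials. Only the candidate count changes.
before = pipeline_economics(candidates=10, success_rate=0.1, cost_per_trial=100)
after = pipeline_economics(candidates=100, success_rate=0.1, cost_per_trial=100)

print("before:", before)
print("after: ", after)
```

In this toy setup the spend per approved drug is identical in both cases, while the total bill and the number of failed candidates both grow tenfold.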

The honest reading of where we are, from an outsider’s perspective, is that AI has helped us make great advances on one part of the problem: the part where you need molecules to be good molecules, to have the right properties, to be safe and stable. The part that hasn’t been solved is the part where good molecules need to become good medicines. That gap, from computational prediction to clinical reality, is still as wide as it ever was.

In Derek Lowe’s words: “Compounds found (wholly or partly) by such AI methods are still going to be subject to the same white-knuckle dice-rolling as all the others when they get into human trials, because we have (as yet) no computational tools that really help us predict whether we have picked the right target, the right disease, the right biochemical pathway, or the right compound to affect it without doing anything unexpected along the way.”

Eroom’s law is still intact, and the curve hasn’t bent. Bending it is going to require something more than better molecular optimization, something closer to actually understanding disease at a depth that our current models don’t come close to.

The gap remains.

Coming soon

Usman — Sun, 01 Mar 2026 04:31:46 GMT

This is Tinkering Tokens.
