AI-first biology

Published by Nathan Benaich. This post was originally delivered as a talk hosted at the Francis Crick Institute on 2nd September 2019 as part of the Image and Data forum.


This summer I was invited to give a talk at the Image and Data conference held at the Francis Crick Institute in London. Organized and hosted by my friend and former lab colleague, Davide Danovi, the event took place today. It brought together a critical mass of 150 scientists and researchers from across the UK who are involved in creating, analyzing and developing methods for biological image data. This is an area of significant importance because (epi)genetic changes that occur in living cells can very often be observed through imaging (e.g. microscopy). This is particularly relevant to AI because of the sheer volume of imaging data that is being created through large scale biological experiments. Lastly, we need not preach the value of AI (especially deep learning) to the analysis of imaging data :-) Indeed, it was quite telling to see the event's first speaker ask the audience if they were confident that their image analysis work extracted the most out of their data. Only three people rather sheepishly raised their hands.

Without further ado, I'd like to share the slides from my talk today to encourage discussion around the topic. If you're working on translational research and would like to discuss building an AI-first life science company, do reach out.

From my vantage point as an investor in AI-first technology and life science companies with Air Street Capital, I will use this post to explain why biology is experiencing its "AI moment". That is to say, many areas of biology, and in particular, imaging, are being significantly transformed by the use of AI. I walk through several examples of AI-first imaging analysis in biology through the lens of technical developments in computer vision. Finally, I make the case for building full-stack solutions in biology using the pharmaceutical industry as an example. I present an analogy for reasoning about where value accrues in AI-first biology. Hint: the money is in owning the full stack. Drop me a tweet with any feedback and/or critique.

Biology is experiencing its "AI moment"

Traditional scientific exploration is built on the premise that a system under study is defined by intrinsic rules. Our job as scientists is to design experiments that test a set of hypotheses that propose what these rules might be. The data we generate helps us evaluate if these hypotheses hold true or if they should be reformulated instead. This creates a cycle of trial and error-based exploration through solution space. At the end of the day, we're able to distill the knowledge we find into a set of system rules that can be shared with others in our field. This is the knowledge that ultimately is taught in biology textbooks.

By contrast, AI-first scientific exploration aims at using software-defined means of solution space exploration. Here, we build AI models that learn system dynamics from data. We learn about the system by asking the model to make predictions about its biology.
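To make the contrast concrete, here is a minimal sketch of the "learn from data, then ask the model" loop. Everything in it is an assumption for illustration: the `hidden_system` function stands in for unknown biology, the dose-response framing is hypothetical, and a straight-line fit stands in for what would in practice be a far richer model.

```python
import random

random.seed(0)

# Hypothetical "system under study": the rule is hidden from the scientist.
def hidden_system(dose):
    return 2.0 * dose + 1.0

# Simulated noisy measurements at doses 0.0, 0.2, ..., 10.0.
doses = [i * 0.2 for i in range(51)]
responses = [hidden_system(d) + random.gauss(0.0, 0.1) for d in doses]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Learn system dynamics from data."
slope, intercept = fit_line(doses, responses)

# "Ask the model to make predictions": query a dose that was never measured.
def predict(dose):
    return slope * dose + intercept

print(f"learned rule: response = {slope:.2f} * dose + {intercept:.2f}")
print(f"predicted response at dose 12.0: {predict(12.0):.2f}")
```

The point is not the model itself but the workflow: the rule is never written down by hand, it is recovered from measurements and then interrogated.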

Indeed, this is the era of empirical computation whereby the world of bits and the world of atoms come together. This means supercharging the pace and power of scientific problem-solving by using AI techniques to a) detect complex patterns in biological data that evade human scientists, b) process large volumes of data that cannot be done in human time, and c) guide experimental exploration to fruitful solution spaces. It is important to note that empirical computation is rate limited by different factors:

Don't believe me that biology is experiencing its "AI moment"?

Deep learning-based computer vision is retooling methods in biology

While the impact of ImageNet on R&D in deep learning is well known to AI practitioners, many biologists have never heard of the project. Having said that, private AI-first life science companies like Recursion are leading the charge by making large (relatively speaking) datasets of biological images available to the public to explore biological phenotypes. What's more, research organizations such as The European Bioinformatics Institute (EMBL-EBI) are hosting large public resources such as the Electron Microscopy Public Image Archive.

One of the virtues of AI-first image analysis in biology that I'm most fond of is the following. Wet lab assays require extensive assay-specific manipulation to generate data. If you're interested in testing something new, then you'll have to redo your experiment. By contrast, computational analysis generates data from existing wet lab experimental data. Today's "traditional" experimental analysis work of labeling images (e.g. enumeration of cellular features) serves as training data for tomorrow's computational analysis. Thus, in many ways, you get extra information for free. No extra wet lab assays were harmed. Here are some compelling examples.

The protocol below should hopefully mean no messing around with 7 different fluorescent-conjugated antibodies to get everything to play together right under the scope. Instead, just learn a model that predicts your labels :-)

This work shows early examples of creating 2D images and using a deep learning model to predict what that same image looks like in 3D. It's not perfect yet, but shows you how you can get more from less:

Last but not least, this work shows how our favorite celeb-rendering generative adversarial networks can be used to super-resolve fluorescent microscopy data. Again, more from less! Note how the fuzzy fibers on the bottom left (input) become more detailed in the middle (AI-based output) and bottom right panels (ground truth):

How to create value as an AI-first life science startup

As I've argued elsewhere, I'm a proponent of AI-first startups building full-stack products. In short, the reason is that you a) directly monetize your predictions, b) have control over the interface through which you serve model outputs, c) have access to supervision signals, and d) can abstract away loads more complexity from your customer, which allows them to take more of a risk on you.

Here I cast the opportunity for full-stack, AI-first life science startups in the context of the pharmaceutical value chain. I present several major steps in drug discovery and development. For each step, I reference a key paper that shows how AI can be used to automate or significantly speed up that step. When integrated into one full-stack organization, I argue that the result is a pretty killer startup that I would be prepared to invest in. If that's you, call me :-). Note that I am not advocating for startups to build all of this from scratch if open source or third-party solutions can be procured and integrated in-house. I'm advocating for being in control of these steps.

Successful AI-first drug discovery startups are like game studios

On the topic of drug discovery, I've been using the following analogy to describe how value accrues to these companies. The analogy compares these companies to successful game studios vs. game engines.

Briefly, the biggest companies in the gaming industry are those that have a) creative talent to design a killer story, b) developed or leveraged game engines to c) release and scale hit games that players love and pay for. Such a company would be Supercell. It's the creative and execution genius of their team making games that is the real money maker, not the tooling or platform that they use to create these games.

The same could be said in drug discovery. This business is predicated upon finding and gaining approval for drug assets that treat human conditions and save lives. For an AI-first drug discovery startup, significant value accrues when they develop their biological hypotheses, run and analyze experiments using a combination of wet lab, software, and robotic automation, and then take those resulting assets through subsequent clinical studies to gain approval. The further they go with those drug assets (often in collaboration with pharma), the more valuable they become. Examples here would be Recursion and LabGenius. This is because (essentially):

value = sum(#drug assets of phase n)*(technology platform that gives rise to said assets).
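Read literally, the heuristic above can be sketched in a few lines of Python. This is an illustration only, not a valuation model: the function name, the example asset counts, the platform multiplier and the optional per-phase weights are all hypothetical. The weights are my addition to reflect the point that assets become more valuable the further they progress; with the default weight of 1 the function reduces to the formula exactly as written.

```python
def startup_value(assets_by_phase, platform_multiplier, phase_weights=None):
    """Sum the drug assets across phases, then scale by the platform factor.

    assets_by_phase: maps clinical phase n to the number of assets in that phase.
    platform_multiplier: stand-in for the technology platform term.
    phase_weights: optional (assumed) weight per phase; defaults to 1 for every phase.
    """
    phase_weights = phase_weights or {}
    weighted_assets = sum(
        count * phase_weights.get(phase, 1.0)
        for phase, count in assets_by_phase.items()
    )
    return weighted_assets * platform_multiplier

# Hypothetical pipeline: three Phase 1 assets and one Phase 2 asset.
pipeline = {1: 3, 2: 1}

# Literal reading of the formula: (3 + 1) * 2.0
unweighted = startup_value(pipeline, platform_multiplier=2.0)

# Assumed variant in which a Phase 2 asset counts three times as much: (3 + 3) * 2.0
weighted = startup_value(pipeline, platform_multiplier=2.0, phase_weights={2: 3.0})

print(unweighted, weighted)
```

Either way, the multiplicative structure is what matters: a great platform with no assets, or assets without a platform that keeps producing them, both score poorly.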

Made it this far? Learn more here

Thanks for reading! Continue the conversation :-)

Here is a link to the Google slides.

Thanks to Zavain Dar, Davide Danovi, Ben Sklaroff, and James Field for critical feedback.