In a recent study led by Christina Theodoris, MD, PhD, assistant investigator at Gladstone Institutes, researchers created an AI model named Geneformer to understand how genes interact and how those interactions influence cellular function and disease. Trained on data from about 30 million cells across a broad range of human tissues, Geneformer leveraged transfer learning, allowing it to apply core knowledge to new problems without retraining from scratch. This approach was especially valuable for diseases where data are scarce. In their study, the team demonstrated Geneformer’s effectiveness by identifying genes linked to heart disease, including TEAD4, whose removal in lab cardiomyocytes resulted in reduced cell function. Additionally, Geneformer pinpointed genes that, when targeted, could revert diseased cardiomyocytes to a healthy state. This novel AI application promised to enhance the design of therapies by understanding gene networks and had vast potential across biology for drug target discovery. The open-source model was expected to propel further advancements in understanding complex gene interactions.
Christina Theodoris discussed her work with SCINQ.
Geneformer is a major breakthrough in using artificial intelligence for biological research. Could you explain how this system is different from older methods in simple terms?
Standard approaches to computational modeling in biology involve training a model to accomplish a single task by showing it data relevant to that specific task. If you’d like that model to make predictions in a new task, you’d have to retrain it from scratch using data specific to the new task.
In our method, we use a machine learning approach called “transfer learning”, where we train what is called a “foundation model” on massive amounts of general gene expression data from a broad range of human tissues to gain a fundamental understanding of how genes interact to control cells’ functions. This foundation model, called Geneformer, can then be used to make predictions in a vast range of downstream tasks, transferring its knowledge to accomplish new tasks.
The system seems to be very helpful for studying networks of genes, especially when there isn’t much data available. Can you explain in simple terms how Geneformer’s learning method helps in these situations and why it matters?
Mapping the networks of interactions between genes, known as gene networks, allows us to understand which genes are central within the network and how disruption of those central genes leads to disease. By restoring the levels of those central genes to the normal state, we can treat the underlying disease process. However, mapping those gene networks requires large amounts of gene expression data to learn the connections between genes, and such data are often not available, especially for rare diseases and diseases that affect tissues that are hard to sample in the clinic. Although data remain limited in these settings, advances in sequencing technologies have driven a rapid expansion in the amount of gene expression data available from human tissues more broadly. By “pretraining” our model, Geneformer, on this massive amount of general gene expression data, the model gains a fundamental understanding of gene network interactions. Geneformer can then significantly boost predictions in downstream applications after seeing just a limited amount of data relevant to that specific disease or other application. That would not be possible if the model were asked to make predictions from the limited data alone, without the baseline knowledge gained during the pretraining phase.
Geneformer has been trained on a huge amount of data from about 30 million individual cells. How does this extensive training help the model make accurate predictions in different tasks?
By pretraining the model on a massive general dataset from a broad range of human tissues, the model gains a fundamental understanding of how genes interact to control cells’ functions. Researchers can then fine-tune the model for a specific downstream task by training it further on a limited amount of task-specific data. Used in isolation, those data would be insufficient to yield meaningful predictions. The baseline knowledge gained during the pretraining phase is what makes them usable.
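The pretrain-then-fine-tune workflow described here can be illustrated with a deliberately small toy. The sketch below is not Geneformer's architecture or training procedure: it substitutes principal component analysis for pretraining and a nearest-centroid classifier for fine-tuning, runs on simulated data, and every name and number in it (`simulate_cells`, `N_PROGRAMS`, the size of the disease shift) is a hypothetical choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, N_PROGRAMS = 200, 5  # hypothetical sizes for the toy data

# Hidden "gene programs" shared by every simulated cell (an assumption of this toy).
programs = rng.normal(size=(N_PROGRAMS, N_GENES))

def simulate_cells(n_cells, program_shift=0.0):
    """Simulate expression as program activity mixed into genes, plus noise."""
    activity = rng.normal(size=(n_cells, N_PROGRAMS)) + program_shift
    noise = rng.normal(scale=2.0, size=(n_cells, N_GENES))
    return activity @ programs + noise

def pretrain(unlabeled_cells):
    """Stand-in for pretraining: learn the main axes of variation without labels."""
    mean = unlabeled_cells.mean(axis=0)
    _, _, axes = np.linalg.svd(unlabeled_cells - mean, full_matrices=False)
    return mean, axes[:N_PROGRAMS]

def embed(cells, pretrained):
    """Project cells onto the axes learned during pretraining."""
    mean, axes = pretrained
    return (cells - mean) @ axes.T

def fit_centroids(features, labels):
    """Stand-in for fine-tuning: one centroid per class from a few labeled cells."""
    return np.stack([features[labels == k].mean(axis=0) for k in (0, 1)])

def accuracy(features, labels, centroids):
    """Classify each cell by its nearest centroid and score against the labels."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == labels).mean())

# Disease shifts the activity of the first program only (hypothetical effect).
shift = np.zeros(N_PROGRAMS)
shift[0] = 1.5

def labeled_set(n_per_class):
    """Build a labeled set: label 0 is healthy, label 1 is diseased."""
    cells = np.vstack([simulate_cells(n_per_class), simulate_cells(n_per_class, shift)])
    labels = np.repeat([0, 1], n_per_class)
    return cells, labels

pretrained = pretrain(simulate_cells(5000))  # large, unlabeled dataset
train_x, train_y = labeled_set(5)            # scarce labeled data for the new task
test_x, test_y = labeled_set(500)

# Same classifier and same 10 labeled cells, with and without the pretrained embedding.
acc_scratch = accuracy(test_x, test_y, fit_centroids(train_x, train_y))
acc_transfer = accuracy(
    embed(test_x, pretrained),
    test_y,
    fit_centroids(embed(train_x, pretrained), train_y),
)
print(f"from scratch: {acc_scratch:.2f}  with pretraining: {acc_transfer:.2f}")
```

The two accuracy figures depend on the random seed and on the simulated effect size, so they should be read as an illustration of the workflow rather than as evidence about any real model. What carries over is the structure: a large unlabeled dataset shapes the representation, and only a handful of labeled cells are needed for the task-specific step.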
Your study mentions that Geneformer learns about how genes interact and arranges this knowledge by itself. Could you explain how this works and why it’s a good thing in easy-to-understand language?
During the pretraining, we show the model examples of gene expression data from individual cells without giving it any prior knowledge about which genes are central to controlling cells’ functions. By seeing a vast number of examples of individual cells, the model learns on its own which genes are key regulators within cells. The remarkable ability of the model to learn this information on its own indicates that the model is able to learn directly from the data itself, without being biased by prior knowledge. This allows the model to make new discoveries rather than just finding out what we already know.
The system has been successful in predicting what goes wrong in heart disease and suggesting possible treatments. Could you expand on this in simpler terms, and also talk about any other diseases where you think Geneformer could be very useful?
After the pretraining phase, in which Geneformer gained a fundamental understanding of how genes interact to control cell function, we fine-tuned the model to distinguish between cells from healthy hearts and those affected by a type of heart muscle disease called cardiomyopathy. We then asked the model whether decreasing the expression level of particular genes in the disease cells would shift them back toward the healthy state. Finally, we tested these predictions in the lab, in cells in a dish that model the heart muscle disease. Indeed, decreasing the level of two of the Geneformer-predicted genes improved the ability of the heart muscle cells to contract.
This approach can now be applied to additional human diseases to predict the genes where we could intervene to restore cells to the healthy state. It could be especially impactful for progressive diseases, where an effective medical therapy would give us the opportunity to halt or reverse the progression.
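The idea of asking a model whether lowering a gene's expression moves a diseased cell back toward the healthy state can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the gene names and expression values are invented, and `embed` is a placeholder (the identity function) standing in for the cell representation that a trained model would provide.

```python
import numpy as np

GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # hypothetical gene names

# Hypothetical mean expression profiles for the two cell states.
healthy = np.array([5.0, 2.0, 3.0, 1.0])
diseased = np.array([5.0, 6.0, 3.0, 4.0])

def embed(expression):
    """Placeholder for a trained model's cell embedding (identity, an assumption)."""
    return np.asarray(expression, dtype=float)

def shift_toward_healthy(cell, gene_index, factor=0.5):
    """Lower one gene's expression and measure how much closer the cell gets to healthy."""
    perturbed = cell.copy()
    perturbed[gene_index] *= factor
    before = np.linalg.norm(embed(cell) - embed(healthy))
    after = np.linalg.norm(embed(perturbed) - embed(healthy))
    return before - after  # positive: the perturbation moved the cell toward healthy

# Score every gene, then rank from most to least helpful.
scores = {gene: shift_toward_healthy(diseased, i) for i, gene in enumerate(GENES)}
ranked = sorted(scores, key=scores.get, reverse=True)

for gene in ranked:
    print(f"{gene}: {scores[gene]:+.3f}")
```

In this toy, the genes that are elevated in the diseased profile (`GENE_B` and `GENE_D`) receive positive scores, while lowering genes that already sit at healthy levels moves the cell further away. Top-ranked genes from such a screen are candidates only. As the answer above notes, the predictions still have to be tested in the lab.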
You’ve said that Geneformer can use its knowledge about gene networks to answer many biology-related questions. What are some specific questions or areas of study that you are excited for it to help with?
I am most excited for researchers to use Geneformer to identify candidate therapeutic targets for a wide range of diseases where progress has been obstructed by limited data. I am also excited about the potential for using Geneformer to investigate interactions between combinations of genes, which would also help us understand how combinations of gene variants lead to disease of differing severity in different individuals.
Geneformer has been made available for any researcher to use. What impact do you think this will have on the wider scientific community, and what advice do you have for researchers on how to get the most out of this tool?
We are excited about the potential for Geneformer to accelerate the discovery of key regulator genes in both normal development and disease. To get the most out of the tool, it is best to fine-tune the model to the researchers’ specific scientific questions. Even if only a small amount of data relevant to their application is available, the power of transfer learning is that the model can make meaningful predictions from just a small number of examples, because it is starting from a strong foundation of knowledge gained during the pretraining phase. We are excited to see how other scientists apply Geneformer to their research to gain new biological insights and discover candidate therapies for human disease.