We all know that computers are pretty incredible. From personalized ads to human-like robots, artificial intelligence seems to be able to do pretty much everything. But that’s not all it can do. There’s another huge application that could change the world: biology! One of the biggest recent breakthroughs in the field is AlphaFold 2, an artificial intelligence program that can predict the structure of proteins. As molecular biologist Andrei Lupas concluded, “This will change everything.”
Why Do We Need to Know Protein Structure?
First of all, why is it important for us to know the structure of proteins? While it might seem like protein structures are a given, there is essentially an unlimited number of possible proteins. That’s because proteins are chains of amino acids, and with 20 different amino acids and no fixed limit on the length of the chain, there’s a huge variety of proteins in the world, doing anything from making up hair to regulating the expression of DNA! In short, proteins do nearly everything in life, and in order to understand life, we need to understand proteins. The function of a protein is determined by its structure, which is why it’s so important for us to know the structure of proteins! Working out that structure is the heart of the protein folding problem: understanding how a protein folds spontaneously from its amino acid sequence into a 3-D structure.
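To get a feel for how quickly the number of possible proteins grows, here is a minimal Python sketch of the arithmetic. It assumes every position in the chain can be any of the 20 amino acids, so a chain of length n has 20^n possible sequences. The function name count_sequences is made up for this illustration.

```python
# Illustrative arithmetic only: assumes each position in the chain
# can independently be any one of the 20 standard amino acids.
AMINO_ACIDS = 20


def count_sequences(length: int) -> int:
    """Return how many distinct chains of exactly `length` residues exist."""
    return AMINO_ACIDS ** length


for n in (2, 10, 100):
    total = count_sequences(n)
    # Print the number of digits, since the totals get too long to read.
    print(f"length {n}: {len(str(total))}-digit number of possible sequences")
```

Even a short chain of 100 amino acids has a 131-digit number of possible sequences, which is why nobody can solve every structure by hand.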
How Scientists Solved the Protein Folding Problem
This was a huge problem that stumped scientists for decades. While we can find the structure of proteins experimentally, these methods are often time-consuming and expensive, so we can’t use them for every single protein that exists. That’s where AlphaFold comes in. In 2020, the 14th Critical Assessment of Protein Structure Prediction (CASP) competition was held. The contestants, around a hundred or so teams each round, were asked to predict the 3-D structures of about 100 proteins from their amino acid sequences; the structures had been solved experimentally but weren’t yet publicly available. In the competition, AlphaFold 2, created by the DeepMind team, emerged with astonishing accuracy. Its median score was 92.4 out of 100 on the Global Distance Test (GDT), which roughly means that for more than half of the proteins, over 92.4% of the residues were in pretty much the same position as in the experimental structures. That is comparable to experimental techniques. Since then, AlphaFold has predicted structures for over 200 million proteins, almost all of the proteins that we know exist.
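The 92.4 figure comes from a score called the Global Distance Test (GDT), which CASP uses to compare a prediction with the experimental structure. Here is a simplified Python sketch of the idea: average the fraction of residues that land within 1, 2, 4, and 8 ångströms of their experimental positions. It assumes the two structures have already been superimposed and that each residue is represented by one 3-D point. The real CASP scoring also searches for the best superposition, which this sketch leaves out. The coordinates below are made up for illustration.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT score (0-100) for two pre-aligned coordinate lists.

    Assumption: both structures are already superimposed, one (x, y, z)
    point per residue. The full CASP procedure also optimizes the overlay.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance between each predicted residue and its experimental position.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each distance cutoff (in angstroms).
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)


# Made-up toy structure: four residues on a line, 3.8 angstroms apart.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Toy prediction: residues are off by 0.5, 1.5, 3.0 and 10.0 angstroms.
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(gdt_ts(predicted, experimental))     # 56.25
print(gdt_ts(experimental, experimental))  # 100.0
```

A perfect prediction scores 100, so a median above 90 means most residues sit within a couple of ångströms of where experiments say they should be.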
How Does AlphaFold Work?
So how does it work? At the fundamental level, machine learning works by training an algorithm to learn patterns from data and then make new predictions. As a machine learning program, AlphaFold uses this concept too. Machine learning is highly dependent on the quality of the data used to train the algorithm, and here AlphaFold benefited from lots of publicly available protein data collected in databases such as the Protein Data Bank (PDB), which has the structures of around 180,000 experimentally determined proteins, and UniProt, which has the amino acid sequences of around 200 million proteins. These large, public data sets made it possible to create a model using machine learning.
Given the data, the next step is constructing the model. First, the protein sequence goes through multiple sequence alignment to find other proteins with similar sequences, which are presumed to have similar structures. This part isn’t new, since other protein prediction programs also use multiple sequence alignment. But here’s where the novelty really comes in. In AlphaFold, the team used a powerful type of model called a transformer, which is a deep neural network that is often used in natural language processing and computer vision. The DeepMind team also created a transformer-style mechanism called invariant point attention that is designed specifically to process 3-D structures. The model uses the multiple sequence alignment to extract all the information it can to form a hypothesis about the structure, and it repeats this again, and again, and again, until it settles on a solid structure. This iteration is another distinctive feature of AlphaFold: it takes its predicted structure and feeds it back into the network to refine it.
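To make the word “transformer” a bit more concrete, here is a minimal Python sketch of scaled dot-product attention, the basic operation that transformers are built on. This is the generic textbook version, not AlphaFold’s invariant point attention, which also takes 3-D geometry into account. The function names and the tiny input vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Generic scaled dot-product attention on plain lists of vectors.

    Each query is compared against every key. The resulting weights decide
    how much of each value vector goes into that query's output.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one coordinate at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy input: both keys match the query equally well, so the output
# is the plain average of the two value vectors.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
print(attention(queries, keys, values))  # [[2.0, 4.0]]
```

In a real model, the queries, keys, and values are learned from data, and stacking many of these attention steps lets every part of the sequence “look at” every other part.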
AlphaFold Is Open-Source – Technically, You Can Run AlphaFold on Your Laptop!
Since then, the team has published a paper in Nature detailing the engineering and architecture of their model, and the code for AlphaFold has been released as open source as well. So technically, you could download and run AlphaFold on your laptop if you wanted to, which is a huge deal for scientists who need structures for their projects, no matter their resources. Ultimately, this makes science more accessible. I’ve actually used AlphaFold in my own research, and I’m really glad that open-source computational biology gives us students a chance to be a part of research!
Of course, AlphaFold still has limitations. For example, AlphaFold can’t predict protein-protein complexes very well, since it was trained to predict single protein chains, called monomers. The team has since been working on this with a version trained on protein-protein complexes, called AlphaFold-Multimer. There are also proteins that can fold into several different forms, and AlphaFold only outputs a single conformation. And finally, another big limitation is that AlphaFold can’t predict proteins that lack a fixed structure, called disordered proteins, and it also struggles with out-of-the-ordinary proteins.
How AlphaFold Has Already Changed Science
While AlphaFold’s predictions still aren’t perfect for every protein out there, for now, many consider the protein folding problem solved. That’s enabling huge strides in many different fields of research, particularly computational drug development, where it was previously almost a requirement to have an experimentally solved protein structure before beginning the drug development process. It has already made a difference in COVID-19 research as well, as AlphaFold was used to predict the structures of several SARS-CoV-2 proteins. There have also been multiple teams that spent years and millions of dollars trying to solve a protein structure experimentally, only to have AlphaFold find the structure in a few hours. This is a huge step forward for the field, and it’s exciting to see what’s to come!
Bibliography
Jumper, John, et al. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature, vol. 596, no. 7873, Aug. 2021, pp. 583–89, https://doi.org/10.1038/s41586-021-03819-2.
Rubiera, Carlos Outeiral. “AlphaFold 2 Is Here: What’s behind the Structure Prediction Miracle.” Oxford Protein Informatics Group, https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/. Accessed 5 Sept. 2022.
Toews, Rob. “AlphaFold Is The Most Important Achievement In AI—Ever.” Forbes, https://www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/. Accessed 5 Sept. 2022.