Predicting Protein Structures using Quantum Computers - Part 1
Predicting protein structures from polypeptide chains is a 60-year old problem. It is one of the grand research challenges in computational biology even today. Proteins are synthesized in living cells and are essentially long chains of amino acids bonded together. Once synthesized these long chains fold up into 3 dimensional (3D) structures and work like micro-machines or devices inside the cells. They perform a variety of functions. They act as enzymes for chemical reactions within our cells, transporters for molecules through membranes, act as receptors for cell signaling, and regulate hormones, to name a few. The 3D structures that proteins take up is what gives them their ability to work as a micro-machine. Misfolding of proteins is associated with a variety of degenerative diseases. Understanding protein-folding mechanisms and predicting the folded structures can thus help us in devising novel therapeutic strategies to treat these diseases. It is therefore important that we understand how proteins take up a particular configuration in the first place, and eventually predict those configurations when given with the initial chain.
Predicting protein structures is a difficult problem primarily because of two main challenges. The first challenge is in accounting for the variety of factors that decide the final structure like amino acid sequence, atomic and molecular force fields, the external medium, etc. The second challenge is in predicting that one particular 3D configuration, out of the large number (~10 raised to power 30) of possible configurations. While we have made remarkable progress in the journey of solving or meeting the challenge in the last 60 years, there is still some distance to go.
This is where quantum computing comes into the picture. Quantum computers, because of their property of superposition, gives us the ability to sample from a large solution space. In order to arrive at a desired solution, one has to navigate this space using algorithms that can creatively use the properties of superposition and entanglement. The algorithms can, in theory, give us a large speedup over classical computers. While quantum computing hardwares and technologies are in their infancy at the moment, the time is right for not only building new algorithms but also in trying out existing algorithms in current quantum hardwares (with error correction and mitigation subroutines) for predicting structures of smaller protein chains. This will lay the foundations for quantum computers to play a bigger role when the time comes.
In this two-part blog we will discuss how quantum computers could help in solving the above challenge. In part 1 here, we will give a brief introduction to protein structures, how the structures are studied, the importance of knowing the structures and predicting them. We will also discuss different methods used to predict the structures of proteins. In part 2, we will discuss how quantum computers could help.
A short introduction to protein structures
Proteins are large chains of amino acids. Amino acids are molecules made up of the elements carbon, nitrogen, oxygen and hydrogen. The structure of an amino acid is given in Fig. 1.
The molecule essentially consists of two groups at the ends, the carboxyl group and the amino group. The carbon atom at the centre, also known as the Cα atom, is connected to another group called the side chain. The side chain is what gives the amino acid its uniqueness. The simplest amino acid, known as Glycine, has the hydrogen atom as the side chain. There are 22 naturally occurring amino acids found in all life forms. Some amino acids are hydrophilic in nature (love water molecules) and others have hydrophobic nature (stay away from water molecules). These properties play a fundamental role in the formation of the native state.
Two amino acids link up through the condensation reaction and a water molecule is given out as a by-product of the reaction. Fig. 2 gives the schematic of the condensation reaction.
The hydroxyl (OH) group inside the carboxyl group combines with the hydrogen atom of the amino group (from an adjacent amino acid) to give out water. As a result, an amide bond between a carbon atom and a nitrogen atom is formed between two adjacent amino acids and they form a dipeptide. This reaction can happen for any length resulting in long chains of amino acids forming what are called polypeptides. These polypeptides are what makes up a protein.
The functional state of a protein in a 3 dimensional (3D) folded form is reached by a series of steps. As mentioned before, proteins are formed when we have long chains of amino acids. These linear chains are the primary structures of proteins. Thereafter, because of interactions between the many different molecules inside the chain, secondary structures in the form of helices (known as α-helix) and sheets (known as β-pleated sheets) take shape. The interactions are mainly due to hydrogen bonds between the many different amino acids. These secondary structures lead to tertiary and quaternary structures, which are basically the folded form of several secondary structures. At this stage, folding happens mainly because of the nature of the amino acid complex like hydrophobicity, and because of complex interactions like hydrogen bonds, Van der Waals forces, etc. Fig. 3 gives the different stages in the formation of the native state.
The point to note here is, that the folded structure (also known as a conformation) is dependent upon the sequence of amino acids in the original polypeptide chain. Any disturbance in the sequence would lead to a fold that is different from the original one. Apart from this, other causes like thermal and oxidative stresses also lead to folds that are different. This phenomenon of the occurrence of a fold which is different from the expected one is known as a misfold. Protein misfolds are considered to be the primary causes of diseases like sickle cell disease, cystic fibrosis, Parkinson’s disease and many others.
Studying and predicting protein structures
Now that we have a basic understanding of the structure of a protein, let us understand how the folded structures are studied. Proteins are formed inside the cells of living organisms. The sequence of amino acids is determined by the sequence of base pairs (A-Adenine, C-Cytosine, T-Thymine, G-Guanine) in our DNA (deoxyribonucleic acid). There are several articles and YouTube videos on protein synthesis that one can find in the internet, that are targeted at a general audience (ref. 1).
The genesis of the protein folding problem was the pioneering work by Max Perutz and John Kendrew in finding the structure of globular proteins (ref. 2). They were awarded the Nobel Prize in Chemistry in 1962 for their work (ref. 3). This work laid the foundations of research in the area of understanding protein structures and other bio-molecules. In their work, they found that the structures lack symmetry and are more complicated than any theory can explain. While we have made a lot of progress since then in understanding some of the mechanisms and forces at play, proper and accurate methods for predicting structures of proteins that are not there in our database (PDB) has remained elusive. Out of a total of 250 million protein sequence entries in the UniProt (Universal Protein Resource) database, only close to about 200 thousand structures are available in the RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Database) (ref. 4 and 5). It is apparent that we have barely scratched the surface in terms of knowing structures of naturally occurring proteins. Experimental methods like X-ray crystallography, NMR spectroscopy and electron microscopy exist currently to determine the structures of proteins (ref. 6). However, these methods are time consuming and expensive. As a result, researchers turned their focus on computational methods to determine the structure of proteins.
Physics and knowledge-based models
Computational models for predicting protein structures can be broadly classified into two types, i) physics-based and ii) statistical approach (knowledge-based or template-based). In the physics-based modelling approach, all the force fields (electrostatic, Van der Waals, hydrogen bonds, hydrophobic interactions, etc.) are taken into account. Using models of these force fields, and starting from an initial position, Newton’s laws of motion are solved for the atoms and the solvent molecules (medium in which the protein molecules are present) in a resourceful computer. These simulations are known as Molecular Dynamics (MD) simulations. They explore pathways by which the conformations move from one state to the other, lowering its free energy (thermodynamically more stable) in the process. These simulations operate at the level of atoms and hence require enormous computing resources to model the protein as well as the solvent. These models have been very successful for smaller proteins but are not practical for larger proteins.
领英推荐
One question that cropped up was, given the large number of conformations possible, how were the protein molecules able to fold to one particular structure in a matter of milliseconds? This phenomenon was referred to as the Levinthal’s paradox, as this question was first brought forward by Cyrus Levinthal in 1968 (ref. 7). A lot of experiments were performed to understand the process of folding, by taking snapshots of the protein structure over a period of time (in the range of microseconds), as the native state took shape. It appeared that the energy landscapes of the evolving structure has the shape of a funnel, with the bottom of the funnel being very narrow. Fig. 4 gives a representation of the probable landscape.
With a narrow bottom, there are only a few possible conformations (having the lowest energy level) that the protein chain can take. The path to the bottom could be different, however, all roads lead to the same conformation. It was also theorized that folding happens in units of secondary structures, whereby, secondary structures once formed do not fold further but participate as a global unit in the folding mechanism of the entire protein. This is akin to a divide-and-conquer principle of tackling a tough problem. The problem with modelling the protein at the level of an atom results in an energy landscape that is very rough. Exploring pathways using MD simulations along rough landscapes results in sub-optimal solutions where the simulation can get stuck in local minima.
These inefficiencies led to the rise of coarse-grained modelling, where, instead of modelling the protein at the level of an atom, modelling is done at the level of a group of atoms or a functional group of the protein. Coarse-grained modelling could be of two types. A physics-based modelling would model multiple atoms as a single unit and derive force fields based on the aggregation. We also have knowledge-based modelling where interactions between the different aggregated units of the protein are modelled based on statistical observations. The geometry considered for coarse-grained models could either have a continuous representation or have a lattice representation (ref. 8). With a lattice representation, it is important to have a dense packing with lots of degrees of freedom (ref. 9). Lattice representations do offer significant computational speedup.
CASP competition
Ever since protein structure prediction methods started pouring in, there has been a world-wide effort to develop computer algorithms that can predict structures, with or without the structural knowledge of similar known proteins. To advance the science of prediction using computer algorithms, John Moult and his colleagues started a competition on structure prediction in 1994. The competition was called Critical Assessment of Structure Prediction, or CASP (ref. 10) and is held once every two years. Participants are given a sequences of amino acids and asked to predict the 3D folded structure. The native structures of these sequences are known to the organizers based on experimental results. In every competition, more than a hundred participants from all around the world participate and submit the results of their algorithms. Every submission is given a score called the Global Distance Test-Total Score (GDT-TS). This score is a measure of the structural similarity between the experimentally determined structure and the predicted structure, and ranges from 0 to 100 (100 - absolute match). The score is primarily based on the location of the Cα atom. Thus far, we have had 15 CASP competitions.
Machine learning methods in structure prediction
Several techniques from the field of machine learning have been used to predict structures, and several have participated in CASP. One technique employed using the Rosetta platform is to select local sequences of the given protein, and search for template structures with similar sequences in the PDB and then combine several such local templates to obtain the global structure. The most successful submission to CASP (2018 and 2020) has been the models used by Google DeepMind’s AlphaFold (ref. 11). AlphaFold uses a neural network-based approach to predict structures. AlphaFold had a median GDT-TS score of above 90 in CASP-14 (2020). This was a remarkable achievement given that the success rates of the other submissions were much lower. However, some doubts persist on their efficacy on sequences that are very different from the training data. It was also found that machine learning methods do not necessarily add anything to our knowledge with regards to the trajectory of folding mechanisms. While they may be efficient in predicting the final structure in some cases, we do not have much idea about the intermediate structures and are still far away from unravelling the path nature takes to reach the native state (ref. 12).
We end the first part here. The second part can be found in the link given below.
Acknowledgement
The author would like to thank Prof. Sanjib Senapati and his PhD. students Avinash, Vasavi and Ashwini from the Biotechnology Department of IIT Madras, Chennai, and Soham Bopardikar from the College of Engineering, Pune. I would also like to thank Anuradha Gupta and Praveen Jayachandran for their valuable suggestions in improving the writeup.
References