Exploring Structural Biology with Computed Structure Models (CSMs)
Computational methods like AlphaFold2 and RoseTTAFold2 use structures in the PDB archive to predict the folding of proteins.
Ever since the first structures of proteins were determined, scientists have been searching for ways to predict the folding pattern of protein chains. After many years of study, several approaches have been successful. Homology modeling starts with a protein of known 3D structure and predicts the 3D structure of similar proteins based on the 1D sequence alignment. Newer methods, like AlphaFold2 and RoseTTAFold2, expand on this approach, using artificial intelligence/machine learning (AI/ML) to predict the structure based on a large database of known structures. Physics-based methods start from first principles and simulate the folding of proteins. Currently, homology modeling is highly effective for many well-folded proteins, AI/ML-based methods expand this to predict structures across entire proteomes, and physics-based methods are effective mostly for small proteins.
The RCSB Protein Data Bank (RCSB PDB) research-focused web portal (RCSB.org) currently hosts a collection of more than one million computed structure models (CSMs) from the AlphaFold Database and the Model Archive. These data are delivered alongside more than 220,000 experimentally determined PDB structures. To search PDB structures and CSMs together at RCSB.org, turn on the toggle located at the upper right corner of each RCSB.org web page.
This short article discusses some features of CSMs.
1. AlphaFold2 models include a measure of reliability
CSM of integrin beta, a cell-surface protein involved in cellular adhesion (AF_AFL7RT22F1). The extracellular domains and transmembrane segments are predicted with reasonable reliability (colored shades of blue), owing to the availability of experimental structures of similar proteins. The intracellular domain is intrinsically disordered and interacts with many different proteins; consequently, its structure is predicted with low confidence (orange and yellow). The signal peptide would be clipped off from the N-terminus of the functional protein, but is included in the AlphaFold2 predicted structure with low confidence.
In general, CSMs are only as good as the collection of experimental structures that are used to train the prediction method. For this reason, the computationally predicted atomic coordinates of proteins must be treated with healthy skepticism. Fortunately, the AlphaFold2 method reports a per-residue measure of confidence, termed the pLDDT (predicted local distance difference test), to assess the reliability of each part of the predicted model. A low pLDDT can arise for several reasons. First, the region may be intrinsically disordered and would not be expected to have a defined structure. Second, the region may be difficult for AlphaFold2 to predict, for instance because the input multiple sequence alignment does not contain enough information to infer a structure.
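To make the pLDDT concrete, the following is a minimal sketch of how per-residue confidence could be read from a model file and grouped into bands. It assumes a PDB-format coordinate file in which the pLDDT is stored in the B-factor column, which is the convention AlphaFold-derived files generally follow. The band thresholds (90, 70, 50) and their labels follow the coloring scheme commonly shown for AlphaFold models and are stated here as an assumption; the function names are illustrative only.

```python
from collections import Counter

# Assumed confidence bands (lower bound, label), highest first.
# Anything below the last bound is reported as "very low".
BANDS = [
    (90.0, "very high"),
    (70.0, "confident"),
    (50.0, "low"),
]


def classify_plddt(score):
    """Map a pLDDT value (0-100) to a confidence label."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "very low"


def plddt_per_residue(lines):
    """Return {(chain, residue_number): pLDDT} from C-alpha ATOM records.

    Uses the fixed-column PDB layout: atom name in columns 13-16,
    chain in column 22, residue number in columns 23-26, and the
    B-factor (here assumed to hold the pLDDT) in columns 61-66.
    """
    scores = {}
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def summarize(scores):
    """Count how many residues fall into each confidence band."""
    return Counter(classify_plddt(value) for value in scores.values())
```

For a downloaded model, something like summarize(plddt_per_residue(open(path))) would give a quick count of residues per band. A large share of "low" or "very low" residues suggests that parts of the model, such as disordered tails or signal peptides, should be interpreted with caution.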
2. AlphaFold2 models have limitations
The archive of AlphaFold2 structures, AlphaFoldDB, currently includes only predictions of single chains. These structures are effective for well-folded proteins that are functional as single chains, such as serine proteases and GPCRs. However, many proteins act as larger assemblies, with multiple chains and/or with small molecule cofactors. For example, the computed structure model of human myoglobin (AF_AFP02144F1) does not include the heme cofactor, so this must be modeled separately to understand the functional complex in 3D. Protein structure prediction researchers are working to extend prediction methods to these larger complexes and assemblies.
The experimental structure of chloroplast ATP synthase from spinach is shown on the left (PDB ID 6fkf). No experimental structure is currently available for the model organism Arabidopsis thaliana, but computed structure models of the individual protein subunits have been predicted using AlphaFold2, as shown on the right (AF_AFP56757F1, AF_AFP19366F1, AF_AFQ01908F1, AF_AFQ9SSS9F1, AF_AFP09468F1, AF_AFP56758F1, AF_AFP56759F1, AF_AFP56760F1).
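Because each AlphaFold2 CSM covers a single chain, working with an assembly such as the ATP synthase above means tracking one identifier per subunit. The sketch below shows one way to split those identifiers into a UniProt-style accession and a fragment number. The layout "AF_AF" + accession + "F" + fragment number is inferred from the identifiers quoted in this article, and the hyphenated "AF-accession-Fn" form used for AlphaFoldDB entries is an assumption; the function names are illustrative.

```python
import re

# Assumed layout: "AF_AF" + accession (6-10 uppercase letters/digits)
# + "F" + fragment number, e.g. AF_AFP02144F1.
_CSM_ID = re.compile(r"AF_AF([A-Z0-9]{6,10})F(\d+)")


def parse_csm_id(csm_id):
    """Split a CSM identifier into (accession, fragment_number)."""
    match = _CSM_ID.fullmatch(csm_id)
    if match is None:
        raise ValueError(f"not a recognized AlphaFold CSM identifier: {csm_id!r}")
    return match.group(1), int(match.group(2))


def to_alphafold_db_id(csm_id):
    """Rewrite a CSM identifier in the hyphenated AlphaFoldDB style (assumed)."""
    accession, fragment = parse_csm_id(csm_id)
    return f"AF-{accession}-F{fragment}"


# Subunit identifiers listed in the ATP synthase figure caption.
ATP_SYNTHASE_SUBUNITS = [
    "AF_AFP56757F1", "AF_AFP19366F1", "AF_AFQ01908F1", "AF_AFQ9SSS9F1",
    "AF_AFP09468F1", "AF_AFP56758F1", "AF_AFP56759F1", "AF_AFP56760F1",
]

if __name__ == "__main__":
    for csm_id in ATP_SYNTHASE_SUBUNITS:
        accession, fragment = parse_csm_id(csm_id)
        print(csm_id, accession, fragment)
```

The extracted accessions can then be used to look up each subunit's sequence and annotations before attempting to assemble the individual models into a complex.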
3. AlphaFold2 results can help determine experimental structures
Computed structure models are useful tools for a variety of applications. Several survey studies and decades of work with homology models have shown that computed structure models are moderately effective for computer-aided drug design, although the details of side chain conformation and domain flexibility can introduce problems. They may also be used for hypothesis generation, for example, for predicting structures of proteins with folds that haven't been seen in experimental structures, such as the uncharacterized bacterial protein in entry A0A849ZK06.
Computed structure models have been particularly useful for the interpretation of experimental data, especially when the data are not of sufficient resolution to allow determination of an atomic-level 3D structure. This is an example of integrative structural biology, wherein several different experimental and computational techniques are combined to determine a plausible atomic coordinate model consistent with the entire data set. For example, the cryoEM map shown in the figure was created using a relatively small number of observations of oxoglutarate dehydrogenase inside of cells. The map was interpreted by generating atomic-level structures of trimers of the protein chain using AlphaFold2, and then fitting these trimers into the experimentally observed 3DEM map obtained by imaging the larger assembly. The PDB archive now includes many additional examples of structures determined using CSMs to aid structure determination.
(Left) Experimental 3DEM density map for oxoglutarate dehydrogenase observed inside cells using cryo-electron tomography (EMDataResource emd_13844). (Right) Atomic-level 3D structure determined with the help of AlphaFold2 predictions (PDB ID 7q5q, with one trimer in green).