EGAD Library Manual

Physical Model

Proteins

Once the base of the physical model is set up (atom types and the rotamer library), we are ready to build a Protein. The Protein class is a child of Macromolecule, the centerpiece of the physical model for EGAD Library. It is through this class that information about the design problem is set and queried.

Reading a PDB file

We will create a protein by reading coordinate data from a PDB file (the most common file format for protein structural data). As with all other input files, the data in a PDB file is first stored in an intermediate object before being fed to the Protein.

A PDB file can contain data for multiple proteins, called chains. This complicates our example, since we only want to consider a single chain for now. Since multi-chain data is important, the file parsing object PDBFile stores each chain it reads from the file. We can access each chain via iterators very similar to those you would encounter using the STL container classes such as vector.
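To make the notion of chains concrete, here is a minimal sketch, independent of EGAD Library, of how ATOM records in PDB text might be grouped by chain identifier. The names ChainMap and GroupResiduesByChain are hypothetical and say nothing about how PDBFile works internally. The column positions follow the standard fixed-width PDB layout (residue name in columns 18-20, chain identifier in column 22, residue sequence number plus insertion code in columns 23-27).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type; not part of EGAD Library.
// Maps a chain identifier to the ordered residue names found in that chain.
typedef std::map<char, std::vector<std::string> > ChainMap;

// Group the residues named in ATOM records by their chain identifier.
ChainMap GroupResiduesByChain(const std::string& pdbText)
{
    ChainMap chains;
    std::map<char, std::string> lastSerial;  // last residue serial seen per chain
    std::istringstream in(pdbText);
    std::string line;
    while (std::getline(in, line)) {
        // Skip anything that is not an ATOM record covering columns 1-27.
        if (line.size() < 27 || line.compare(0, 6, "ATOM  ") != 0) continue;

        const std::string resName = line.substr(17, 3);  // columns 18-20
        const char chainId = line[21];                   // column 22
        const std::string serial = line.substr(22, 5);   // columns 23-27
        assert(resName.size() == 3);  // guaranteed by the length check above

        // Many atoms belong to one residue; record each residue only once.
        std::map<char, std::string>::iterator it = lastSerial.find(chainId);
        if (it == lastSerial.end() || it->second != serial) {
            chains[chainId].push_back(resName);
            lastSerial[chainId] = serial;
        }
    }
    return chains;
}
```

Iterating over the resulting map visits one chain at a time, which is the same idea as iterating over the chains stored in a PDBFile object.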

Since Protein represents only a single chain, its ReadPDB function expects a single chain from the PDBFile object. The second parameter to ReadPDB indicates whether positions should be fixed or floating by default. This distinction is explained later in this section, so ignore it for now:

Example 2.7: Reading a protein from a PDB file

// Continue from example 2.6

// Open an input stream for the PDB file
std::ifstream finPDB;          // defined in <fstream>
finPDB.open("protein.pdb");

// Read PDB data into an EGAD Library object
PDBFile pdb;                   // defined in "EGAD_PDBFile.h"
pdb.Load(finPDB);

// Close the input stream for the PDB file
finPDB.close();

// Create the Protein object
Protein myProtein;             // defined in "EGAD_Protein.h"

// Set the residue library so that the protein will have access
// to prototypes for building residues.
myProtein.SetResidueLibrary(residueLibrary);

// Read the first chain of the PDB file
myProtein.ReadPDB(*pdb.begin(), false);

Protein Structure

The end result of this procedure is a Protein with coordinate data from a PDB file. Proteins are divided into an ordered collection of positions, representing the locations of amino acid residues along the protein backbone. If the PDB file contained a protein chain with 50 amino acids, the resulting Protein object would have 50 positions.

Each position, at this point, is fixed. That means there is only one possible conformation at this position: the one that was just read from the PDB file. This conformation is often referred to as the wild-type conformation.

You can retrieve the residue at a position by using the BuildResidue function, which returns a pointer to the requested residue. Ignore the second parameter for now:

Example 2.8: Retrieving residue information from a protein

// Continue from example 2.7

// Check how many positions are in the protein
unsigned int iProtSize = myProtein.Size();

// For each position, retrieve the residue and print its name
for (unsigned int i = 0; i < iProtSize; ++i){
    std::cout << "Position " << i << " is named " <<
        myProtein.BuildResidue(i,0)->Name() << std::endl;
}

Accessing positions by serial number

In a PDB file, positions in a protein are identified by serial numbers. Even though the first position in a Protein object always has index 0, serial numbers in a PDB file do not have to start at zero or be contiguous. Still, it is useful to be able to access positions via their PDB serial number.

Since some PDB serial numbers may contain character components, these serial numbers are represented by strings in EGAD Library. The Protein function SerialToPosIndex converts a serial string to a position index. The GetSerial function can do the inverse operation, retrieving the serial string for a given index.
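The two-way lookup described above can be pictured with a small sketch. SerialTable and its member functions are hypothetical stand-ins written for this illustration. They are not the EGAD Library implementation of SerialToPosIndex or GetSerial, and the error handling shown (throwing exceptions) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bookkeeping class; not part of EGAD Library.
class SerialTable {
public:
    // Append a position with the given PDB serial string; returns its index.
    std::size_t Add(const std::string& serial) {
        const std::size_t index = serials_.size();
        if (!indexOf_.insert(std::make_pair(serial, index)).second)
            throw std::invalid_argument("duplicate serial: " + serial);
        serials_.push_back(serial);
        assert(serials_.size() == indexOf_.size());  // both views stay in sync
        return index;
    }

    // Serial string -> zero-based position index.
    std::size_t SerialToIndex(const std::string& serial) const {
        std::map<std::string, std::size_t>::const_iterator it = indexOf_.find(serial);
        if (it == indexOf_.end())
            throw std::out_of_range("unknown serial: " + serial);
        return it->second;
    }

    // Zero-based position index -> serial string.
    const std::string& IndexToSerial(std::size_t index) const {
        return serials_.at(index);
    }

private:
    std::vector<std::string> serials_;            // index -> serial
    std::map<std::string, std::size_t> indexOf_;  // serial -> index
};
```

Storing serials as strings lets a serial such as "11A" sit between "11" and "15" without any special handling, and gaps in the numbering do not affect the zero-based indices.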

Allowing positions to mutate

Finally, the moment you've been waiting for. We want to design our protein, so we should be allowed to mutate the residues at some positions (or all positions, if we're daring). A position where there are multiple residue/rotamer possibilities is called a floating position. From EGAD Library's standpoint, there is no difference between mutating the identity of a residue (as in, from arginine to alanine) and mutating the conformation of a residue. Both require the position to be floating. From now on, when we use the word rotamer, we will be referring to either of these cases.

There is more than one way to make a position float. As you might expect, they differ based on how many choices we want to make available at that position. The simplest case is when we want to consider all the allowed residues in the residue library:

Example 2.9: Floating a position

// Continue from example 2.7

// Let's assume our protein has 50 positions and we are floating position 4 
// to consider all allowed mutations.
myProtein.FloatPosition(4);

Recall example 2.6, where we disallowed mutations to cysteine, proline and glycine. Even though these residues exist in the rotamer library, the FloatPosition function will ignore them when setting which residues are allowed at this position. Also recall that we set the residue library earlier via the SetResidueLibrary function. This function creates an internal copy of the residue library. A potential pitfall is to assume that further changes made to the library will be reflected by the protein:

PITFALL: Modifying the residue library

// Continue from example 2.7

// Set the residue library for the protein
myProtein.SetResidueLibrary(residueLibrary);

// Change the residue library so that arginine is not allowed
residueLibrary.AllowMutation("ARG", false);

// Float position 4, hoping that arginine will not be allowed
myProtein.FloatPosition(4);         // WRONG!

While the above code might look correct, it ignores the fact that the protein makes a copy of the residue library. Every time you change the residue library and expect the protein to know about it, you have to call SetResidueLibrary again. Do so before floating any positions: this function also fixes all of the positions, since it does not know if there is a relationship between the old and new residue libraries.
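The pitfall comes down to ordinary C++ copy semantics, which the following self-contained sketch illustrates. ToyLibrary and ToyProtein are hypothetical stand-ins invented for this example. They only mimic the copy-on-set behavior described above and are not EGAD Library classes.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy residue library (hypothetical): a set of allowed residue names.
struct ToyLibrary {
    std::set<std::string> allowed;

    void AllowMutation(const std::string& name, bool allow) {
        assert(!name.empty());  // a residue needs a name
        if (allow) allowed.insert(name);
        else       allowed.erase(name);
    }
};

// Toy protein (hypothetical) that keeps its own snapshot of the library.
class ToyProtein {
public:
    // Stores a copy: later edits to 'lib' do not reach this object.
    void SetLibrary(const ToyLibrary& lib) { library_ = lib; }

    bool IsAllowed(const std::string& name) const {
        return library_.allowed.count(name) != 0;
    }

private:
    ToyLibrary library_;  // held by value, i.e. a copy
};
```

In this toy model, disallowing "ARG" in the library after SetLibrary leaves the protein's copy unchanged until SetLibrary is called again. In EGAD Library terms, the corrected order for the pitfall is therefore AllowMutation first, then SetResidueLibrary, then FloatPosition.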

Once a position is floating, there are multiple rotamers available at that position. They can be accessed via the second parameter to the BuildResidue function we used earlier. In fixed positions, the second parameter should always be zero because there is only one available choice. In floating positions, however, this value can range anywhere between zero and one less than the number of rotamers present at that position.

Floating Indexes

Some functions only make sense when they are called on floating positions. Also, floating positions are the positions most energy functions care about. Because of this, it is useful to be able to index only floating positions, ignoring the other positions in the protein. This sort of index is called a floating position index, as opposed to an absolute position index, which can be used to access any position in the protein.

Floating to Absolute position index mapping

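The function FloatToPosIndex() can be used to convert floating position indices into absolute position indices, as depicted in the diagram. For example, if positions 4 and 9 are the only floating positions in a protein, floating position index 0 maps to absolute position index 4 and floating position index 1 maps to absolute position index 9.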
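One simple way to picture this mapping is a lookup table built from a per-position "is floating" flag. The sketch below is only an illustration under that assumption. BuildFloatToPosTable is a hypothetical helper, and nothing here describes how Protein stores the mapping internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper; not part of EGAD Library.
// Given one flag per absolute position, return a table whose element f
// is the absolute position index of the f-th floating position.
std::vector<std::size_t> BuildFloatToPosTable(const std::vector<bool>& isFloating)
{
    std::vector<std::size_t> table;
    for (std::size_t pos = 0; pos < isFloating.size(); ++pos) {
        if (isFloating[pos]) table.push_back(pos);
    }
    assert(table.size() <= isFloating.size());  // never more floats than positions
    return table;
}
```

With ten positions of which only positions 4 and 9 float, the table holds the two entries 4 and 9, so floating index 0 resolves to absolute index 4.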

Wild-type rotamers

By default, the first rotamer at any position is called the wild-type rotamer. This is usually the conformation that was read from the PDB file and does not necessarily exist in the residue library. Some design calculations ignore the wild-type rotamer because it can bias the results. It is possible to force a floating position to ignore the wild-type rotamer if it has other rotamers available.

The function that includes or excludes the wild-type rotamer, IncludeWildType, only makes sense when called on floating positions, so it takes a floating position index:

Example 2.10: Disallowing a wild-type rotamer

// Continue from example 2.9

// Currently, position 4 is a floating position, and all other positions
// are fixed. We can retrieve the number of floating positions.
unsigned int iFloats = myProtein.Floats();      // value is 1

// A floating position is still a position, and the return value of
// Protein::Size() is unchanged.

// We want to exclude the wild-type rotamer at the first floating position.
// Before doing so, we check how many rotamers that position has, which
// requires its absolute position index. Of course, we know the value will
// be 4, but this is an example.
unsigned int iAbsIndex = myProtein.FloatToPosIndex(0);  // value is 4

// Check if there is more than the wild-type rotamer at this position.
// Since we floated it using the entire residue library, there should be.
if (myProtein.Rotamers(iAbsIndex) > 1){
    // Exclude wild-type rotamer. Notice that we use a floating position
    // index instead of an absolute position index. Position 4 is the first
    // (and only) floating position, and so it has an index of 0.
    myProtein.IncludeWildType(0,false);
}
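The guard in example 2.10 exists because a position must always keep at least one rotamer. The following toy model sketches that rule. ToyFloatingPosition is hypothetical, and refusing the request by returning false is an assumption made for illustration. The manual does not say what IncludeWildType does when the wild-type rotamer is the only one available.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model of one floating position; not an EGAD Library class.
class ToyFloatingPosition {
public:
    // libraryRotamers: number of rotamers taken from the residue library,
    // not counting the wild-type rotamer read from the PDB file.
    explicit ToyFloatingPosition(std::size_t libraryRotamers)
        : libraryRotamers_(libraryRotamers), includeWildType_(true) {}

    // Include or exclude the wild-type rotamer. Exclusion is refused
    // (returns false) when it would leave the position with no rotamers.
    bool IncludeWildType(bool include) {
        if (!include && libraryRotamers_ == 0) return false;
        includeWildType_ = include;
        return true;
    }

    // Number of rotamers currently available at this position.
    std::size_t Rotamers() const {
        const std::size_t count = libraryRotamers_ + (includeWildType_ ? 1 : 0);
        assert(count >= 1);  // a position never ends up empty
        return count;
    }

private:
    std::size_t libraryRotamers_;
    bool includeWildType_;
};
```

A position built with five library rotamers reports six rotamers, then five once the wild-type rotamer is excluded. A position with no library rotamers keeps its single wild-type rotamer.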

Writing a PDB File

Let's say you've read through this manual already. Let's say you've set up a protein with a number of floating positions and you've done the whole design thing and come up with some kind of "answer." Or maybe you haven't and you just want to know what your protein would look like if you changed from rotamer 4 to rotamer 5 at position 3.

Whatever the case, it is a common task to have to write PDB files. That way you can play with your protein using any number of other fancy tools. We do this by creating a PDBFile object to write to, then using the Protein to dump information into that object.

Example 2.11: Writing a PDB file

// Continue from example 2.9

// Output PDBFile object
PDBFile outputPDB;

// Write Protein data to PDBFile
myProtein.WritePDB(outputPDB);

// Write PDBFile data to disk
std::ofstream foutPDB("output.pdb");
outputPDB.Write(foutPDB);
foutPDB.close();

Notice that we did not specify any choices when writing the Protein data. What rotamers were used for the floating positions? By default, Protein simply selects the first rotamer available at any position with multiple rotamers. Of course, this is rarely the correct behavior, so we want to specify what choices to use at each floating position:

Example 2.12: Writing a PDB file (with choices!)

// Continue from example 2.9

// Output PDBFile object
PDBFile outputPDB;

// This example assumes two floating positions (for instance, after
// floating a second position in addition to position 4). Let's say we
// want rotamer 3 at the first and rotamer 7 at the second. Of course,
// normally these choices aren't arbitrary, but the result of a minimization.
std::vector<unsigned int> vecChoices;
vecChoices.push_back(3);
vecChoices.push_back(7);

// Write Protein data to PDBFile
myProtein.WritePDB(outputPDB, vecChoices);

// Write PDBFile data to disk
std::ofstream foutPDB("output.pdb");
outputPDB.Write(foutPDB);
foutPDB.close();
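Before handing a choice vector to WritePDB, it can be worth checking that it holds one entry per floating position and that each entry is a valid rotamer index, that is, between zero and one less than the number of rotamers at that position. The sketch below shows such a check. ChoicesAreValid is a hypothetical helper, and the rotamer counts are passed in as a plain vector rather than queried from a Protein.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper; not part of EGAD Library.
// choices[f]       : rotamer chosen at floating position index f
// rotamerCounts[f] : number of rotamers available at floating position f
bool ChoicesAreValid(const std::vector<unsigned int>& choices,
                     const std::vector<unsigned int>& rotamerCounts)
{
    // Exactly one choice per floating position.
    if (choices.size() != rotamerCounts.size()) return false;

    for (std::size_t f = 0; f < choices.size(); ++f) {
        assert(rotamerCounts[f] > 0);  // every position has at least one rotamer
        // Valid rotamer indices run from 0 to rotamerCounts[f] - 1.
        if (choices[f] >= rotamerCounts[f]) return false;
    }
    return true;
}
```

For the choices in example 2.12, the check passes only if the first floating position has at least 4 rotamers and the second has at least 8.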