back to Table of Contents

4. pdb file input, modifiers, and related jobs

The template pdb file is defined in the inputfile by:
TEMPLATE_PDB     file.pdb

All "ATOM" containing lines in the file are parsed; these must follow the standard pdb format. "HETATM", "REMARK", and other such entries are ignored. See below for information on including ligands. Waters are also ignored.

4.1 Requirements for template pdb files, special cases, and modifiers

Only a small fraction (1-2%) of the ~750 pdb files downloaded from the PDB as a test set gave problems; most of these exited gracefully with ERROR messages explaining the problem.

Selenomethionine residues (MSE) are automatically substituted with methionine (MET).

4.1.1 Missing atoms

Completely missing residues are ignored with a WARNING message. If a sidechain atom is missing for a residue, a WARNING message is printed, and the missing atoms built from the existing atoms, assuming ideal bond geometry and all chi=180 .

If any backbone atom for a given sequence ID is given, then it is required that N, C, O, and CA are all defined for that position. For jobs that require building sidechain coordinates for a given position, backbone atoms must be defined for that position. Failure to comply results in exit with ERROR.

4.1.2 User-defined rotamers

A rotamer at any non-Pro/Gly/Cyd position can be substituted with a user-defined rotamer:
ROTAMER residuecode chi(1) chi(2) .... chi(n)

The substituted rotamer does not have to be the same residuetype as in the template structure. For example, if we wish to substitute Lys B 15 with a Leu with chi1 = 180, chi2 = -80 :
ROTAMER 15B L     180.0      -80.0

User-defined rotamers may also be listed in a file:
ROTAMER_FILE filename

The ROTAMER_FILE is distinct from the ROTAMERLIB file. Do not confuse these!

In the rotamer file, the rotamers are listed as described for individual rotamers. The file should end with a newline.

"ROTAMER" entries from .out or .pdb files generated by EGAD may be used for these. The position_type designations in these entries are ignored.

4.1.3 TRANSFORM_MATRIX

A structure may be transformed by a transform matrix defined as follows:
TRANSFORM_MATRIX x1 y1 z1 w1 x2 y2 z2 w2 x3 y3 z3 w3
where
xtransformed = xo*x1 + yo*y1 + zo*z1 + w1
y
transformed = xo*x2 + yo*y2 + zo*z2 + w2
z
transformed = xo*x3 + yo*y3 + zo*z3 + w3

For example,
         TRANSFORM_MATRIX 1 0 0 0 0 1 0 1 0 0 1 0
translates the structure by 1Å along the y-axis.

The transform matrix must be orthogonal and normalized to avoid scaling. A non-orthogonal or non-normalized matrix will result in an appropriate error message and exit the program.

4.1.4 Rebuilding structures from extracted dihedral angles (REBUILD_FLAG)

For jobs involving atom movements in dihedral space, dihedral angles are extracted from the inputted structure, and the sidechain atoms rebuilt (including hydrogens), using ideal geometry. By default, backbone atoms are not rebuilt (except for hydrogens). Rebuilding also attaches hydrogen pseudoatoms that carry partial charges to mimic the pH-dependent average charge for ionizable groups.

However, for jobs that involve the energy evaluation of static structures, such as POINT_ENERGY or COMPLEX_FORMATION_ENERGY, the default is to use the inputted structure as is. This way, the user can be assured that outputted energies are indeed the energy of the inputted coordinates. However, if hydrogens are not present, the born radii values will be incorrect, since part of the energy evaluation involves pre-calculating the effect of covalently bonded atoms, based on the description in the residue parameter file. In addition, the energies involving ionizable groups will not be of the appropriate pH-averaged state.

These facts are printed as a WARNING to stderr. Including the following modifier will rebuild the sidechains and hydrogens for these jobs:
REBUILD_FLAG 1

4.1.5 Flipping of Asn, Gln, and His rotamers

The orientation of Asn, Gln, and His rotamers in experimental electron density maps are often ambiguous (Word 1999). By default, the terminal torsions (chi2 for Asn and His, and chi3 for Gln) are permitted to flip 180 and optimized to minimize the forcefield (vdW, coulombic, torsion) energy. This may be turned off with:
FLIP_FLAG 0 # default 1
If REBUILD FLAG 0, flipping does not occur.

4.1.6 Sequence numbering and atom order

Sequences positions within a given chain may start with any number >0; sometimes, N-terminal affinity tags are given sequence positions ≤0. Failure to comply results in exit with ERROR.

All the atoms for a given sequence ID (combination of chain and sequence number) must be within a contiguous block; however, there is no required order for the atoms within that block. Contiguous residues are not permitted to have the same sequence ID. Failure to comply results in exit with ERROR.

Sequence positions within a given chain must not decrease when going from the N-terminus to the C-terminus. Failure to comply results in exit with ERROR. However, non-standard sequence numberings in which the sequence number stays the same, but has an appended letter (as is common for serine protease structures), are accepted, and are flagged with a "wacky numbering" message.

Alternate conformers in high resolution structures are flagged with the "WARNING alternate conformer detected" message; conformation 'A' is used by default.

4.1.7 Chain IDs and SUPER_CHAIN

All single-letter chain ID's are accepted, with the exception of 'l' (lower-case L), which is reserved for the internal representation of free-moving ligands. The order of the chain ID's in the file is not important.

All the atoms for a given chain must be within a contiguous block. Failure to comply results in exit with ERROR.

For the purpose of pulling apart protein complexes in order to calculate complex formation energy, SUPER_CHAIN can be defined. A super-chain is a collection of one or more chains that can be treated together for rigid body movement. A SUPER_CHAIN can be defined in the template pdb file directly, or in the inputfile. In either case, the usage is the same:
         SUPER_CHAIN chain1 chain2 ......
For example, if we have an antibody-antigen complex, and the antibody is composed of chains L and H, and the antigen is chain Y, we can define in the template or inputfile:
         SUPER_CHAIN L H
This ensures that chains L and H will be moved together as a rigid unit.
        
If SUPER_CHAIN is not defined, the first chain listed in a multi-chain structure is assumed to be a superchain. Regardless of the source of the definition, a SUPER_CHAIN entry is listed in the output pdb file for multi-chain structures.

The CHANGE_CHAIN_ID modifier may be used to change the chain ID for a given chain.
Format:
         CHANGE_CHAIN_ID old_ID new_ID
For example, if chain A should be changed to chain Q:
         CHANGE_CHAIN_ID A Q
If only one chain ID is given as an argument, chains without a chain ID (ie: chain ID = " ") will be modified.

4.1.8 Disulfide bonds

Disulfide bonds are automatically detected (CYS residues with SG ≤ 2.5Å apart), and treated as immobile disulfide-bonded cysteines (CYD); a message indicating this is sent to stderr. Disulfide-bonded CYD sidechains are not allowed to move, and are not allowed to mutate. Interactions between the bonded SG atoms are detected by the various energy calculation functions, and are set to zero.
Disulfides may be ignored by including the following modifier in the input file:
IGNORE_DISULFIDE_FLAG 1
If this is the case, the CYS sidechains will be treated like any other sidechain, and will be permitted to move and mutate.

Disulfide-bonds between cysteines on different chains (antibodies, for example) are automatically detected and grouped into the same SUPER_CHAIN for the purposes of rigid body movement.

4.1.9 Termini

By default, termini are not treated differently than internal backbone atoms. If structures are rebuilt, only one H is attached to the N-terminal-most residue. Similarly, an extra O is not attached to the C-terminal-most residue. If 1H or OXT atoms are included in the pdb file, they are treated as neutral; this is flagged with a WARNING message.

There are a few reasons for this default behavior. For one thing, the presence of charged termini appear to make little difference for rotamer optimization and energy prediction, so their omission is not of great consequence. Secondly, crystal structures used as inputs often have missing residues at the termini; automatically converting these to charged termini could be misleading. Related to this issue is the fact that designed proteins, when expressed, may have N or C terminal tags.

However, charged termini may be included if the user explicitly requests them. For a charged N-terminus, the N-terminal residue in the template file must include 1H for the amide proton. The inputfile must also include the following line:
         NTE_FLAG 1
Similarly, to include a charged C-terminus, the C-terminal residue in the template file must include an "OT" or "OXT" oxygen. The inputfile must also include the following line:
         CTE_FLAG 1
If these atoms and flags are set, the remaining atoms for the defined termini will be rebuilt using ideal bond geometry, and their partial charges set as defined in the residue parameter file.

Charged termini are currently not supported for pK calculations (JOBTYPE pK). For these jobs, it is best to use the default of no charged termini.

4.1.10 Ligands

If free-moving ligands are included in the template pdb file, each ligand must have a ligand parameter file defined for it in the main inputfile (see Ligand section). Ligand lines must be listed with an "ATOM" or "LIG " at the start of the line; "HETATM" entries are filtered out.

4.2 JOBTYPE HBUILD

This job takes the inputted pdb file, extracts sidechain dihedral angles, and rebuilds the them, assuming ideal bond geometry; hydrogens are added during this rebuilding process. Hydrogens are also added to backbone atoms. The rebuilt structure is saved to a pdb file.

For example (examples/hbuild/hbuild.input):

START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE HBUILD
OUTPUT_PREFIX gb1.hbuilt
END

The re-built structure is saved to gb1.hbuilt.pdb.

Structures resulting from an HBUILD can almost always serve as valid TEMPLATE files for other EGAD jobtypes. While it should not be necessary to HBUILD a structure of interest (this is done automatically internally), doing so ensures that the structure that is inputted is the same structure that is actually treated internally as the template structure. HBUILD outputs do not include pseudoatoms, such as virtual hydrogens attached to deprotonated ionizable groups.

4.3 JOBTYPE ALIGN_STRUCTURES

This job aligns the TEMPLATE structure to the ALIGN_TARGET structure.
START
...
TEMPLATE pdb file to be transformed
ALIGN_TARGET pdb file for alignment target
JOBTYPE ALIGN_STRUCTURES
...
END
ALIGN_DEFINITION
seq_position residue atom -> seq_position residue atom
...
END

The alignment is defined by the atoms listed under ALIGN_DEFINITION. For example (examples/align_structures/atom_align.input):
...
END
ALIGN_DEFINITION
...
151B     CYS      SG       ->       195A     SER      OG
...
END
SG of CYS 151B in the TEMPLATE will be mapped to OG of SER 195A in the ALIGN_TARGET. This example aligns the catalytically important atoms from a cysteine protease (TEV protease) to a serine protease (trypsin).

A minimum of four atom pairs must be defined. The program will transform the template structure to minimize the rmsd of the distances between the ALIGN_DEFINITION atom pairs. The output structure is the template structure transformed. The rmsd for the defined pairs, as well as the final transform matrix, are listed in the output pdb file examples/align_structures/align_catalytic_atoms.pdb.

If two structures have the same residue numbering, BACKBONE or TRACE can also be used (examples/align_structures/bkbn_align.input):
...
END
ALIGN_DEFINITION

BACKBONE 5 -> 100
END

The backbone atoms for residues 5-100 are used for the alignment. In this example, the N-terminal lobe of ligand-free maltose binding protein (MBP) is aligned to the same lobe in the ligand-bound structure. An overlay of the aligned structure (examples/align_structures/align_apo_bound_mbp_bkbn.pdb) with the bound form reveals a large hinge motion accompanying ligand-binding (Figure 4.3.1).

Alternatively, one could use TRACE, which aligns only the CA atoms of the defined residue range.

The output pdb file contains the TRANSFORM_MATRIX used for the transformation.

4.4 Description of the readpdbfile.cpp: readpdbfile function

readpdbfile is called by input_stuff.cpp: input_stuff, passing the filename, protein, and a re-arrangement flag. As discussed in the input_stuff section, this function is called twice. The first time, the size variables MAX_ATOMS and MAX_RESIDUES are defined based on the number of residues in the file:
MAX_RESIDUES = num_residues + MAX_LIGAND + 4;
MAX_ATOMS = MAX_PROT_RES_SIZE*MAX_RESIDUES + 10;

This yields data-structures that are always large enough for the template under consideration.

If these values are too small, or if the forcefield information has not been read into protein, 0 is returned to the calling function, signaling a problem. Otherwise the pdb file is opened and parsed into the pdbATOM array protein->Template. Ligand atoms are read into separate temp files; these are loaded later by ligand_stuff.cpp: initialize_stuff_for_ligands (see the ligand section).

Disulfide bonds are detected by readpdbfile.cpp: cys_to_cyd_disulfides if IGNORE_DISULFIDE_FLAG 0 (default). A disulfide is defined if a pair of CYS SG atoms are within MAX_DISULFIDE_BOND_LENGTH (2.5Å; defined in structure_types.h) of each other. Both CYS are converted to CYD. In energy calculation functions, interactions between CYD CG within MAX_DISULFIDE_BOND_LENGTH are ignored.

Pointers are attached to protein->Template entries by calling pdb_utilities.cpp: attach_ptr_pdbATOM. After pointers are attached, coordinates are transformed by protein->transform_matrix if TRANSFORM_MATRIX was defined in the inputfile. If a ROTATE entry was defined, this is converted into a transform matrix, rotated about the normalized user-defined rotation axis, and translated back to the original location (midpoint of user-defined rotation axis).

Optimization of Asn, Gln, and His rotamers is done by readpdbfile.cpp: optimize_asn_gln_his_rotamers if FLIP_ASN_GLN_HIS_FLAG 1 (default). This function generates structures for each Q/N/H sidechain and 180 flips and calculates their energies with the non-Q/H/N atoms (stored in the fixed_energies matrix). It then calculates energies between all possible Q/N/H pairs (stored in the pair_energies matrix). These precalculated energy tables are used for evaluating candidate conformations. The algorithm starts with unflipped rotamers. It then passes through each position and evaluates the energy of the flipped form. If the total energy decreases, the flipped form is kept. After all positions are examined, they are all re-examined and flipped as needed, until a complete sweep is made without improvement. This is similar to the HQM minimization described below. This method is not guaranteed to find the best combination of flipped rotamers; it merely finds a minimum near the input conformation. Nevertheless, since the majority of experimental conformations are correct, this simple method may be sufficient.

If a structure has multiple chains, SUPER_CHAINs are defined. The seq_position used by the internal data-structures is defined as follows:
         1st chain: actual_start_position --> actual_end_position
         2nd chain: 1000+actual_start_position --> 1000+actual_end_position      
         nth chain: (n-1)*1000+actual_start_position --> (n-1)*1000+actual_end_position  

The chain ID character of chain n is stored in char protein->chain_id_list[n].

The SEQPOS_TEXT_MAPPING_LIST protein->seqpos_text_map struct stores a representation of seq_positions useful for input and output (see structure_types.h: SEQPOS_TEXT_MAPPING_LIST). The seqpos_text field is the literal sequence position string from the pdb file (including the letters used for non-standard "wacky" numbering described above) merged with the chain ID. For example, for position 125 on chain B (the second of a two chain structure):
seq_position = 1125
chain_id_list[2] = 'B'
seqpos_text = "125B"

Internally, protein sections separated by missing atoms are represented as if they are different chains; this representation permits building of backbone atoms from extracted dihedrals using the implementation previously developed for bona-fide multi-chain templates (see GENES_pdb.cpp: CHROMOSOME_to_mini_pdbATOM). This internal representation should be completely invisible in the input and output.

For example, consider the residue in the previous example, but with the caveat that chain A (100 residues) is missing residues 15-25. In addition, chain B (150 residues) has residues 52-68 missing. Position 125 on chain B is therefore internally represented on the 4th chain:
chain 1 = A 1-14;        chain_id_list[1] = 'A'
chain 2 = A 26-100;      chain_id_list[2] = 'A'
chain 3 = B 1-51;        chain_id_list[3] = 'B'
chain 4 = B 69-150;      chain_id_list[4] = 'B'

Now, position 125 on chain B:
seq_position = 4125
chain_id_list[4] = 'B'
seqpos_text = "125B"

This seqpos_text representation is used for reading in fixed and variable positions (see variable position section), as well as any output that requires a sequence ID.

If the re-arrangement flag argument ≠ 0, this function will rebuild the structure with ideal bond geometry. Dihedrals are extracted and stored in a CHROMOSOME by GENES_pdb.cpp: pdbATOM_to_CHROMOSOME. The coordinates are then rebuilt by GENES_pdb.cpp: CHROMOSOME_to_pdbATOM. This procedure adds hydrogens and builds in missing atoms. If the flag is set to 1, only the sidechains are rebuilt, while backbone atoms simply have hydrogens built on; the heavy atom coordinates are taken directly from the inputted file. If the flag is set to 2, the backbone dihedrals are also extracted, and the entire backbone rebuilt using ideal bond geometry. This option is not currently used internally within EGAD, since rebuilding using extracted dihedrals can cause large structural deformations (see the adjust backbone torsions section in the minimization with moving backbone section). input_stuff sends default values for this flag depending on the jobtype (see above). If REBUILD_FLAG is defined by the user, that takes precedence.

back to Table of Contents