back to Table of Contents
4. pdb file input, modifiers,
and related jobs
The template pdb file is defined in the inputfile by:
TEMPLATE_PDB
file.pdb
All "ATOM" containing lines in the file are
parsed; these must follow the standard
pdb format. "HETATM", "REMARK", and other such entries are ignored. See below for information on including ligands.
Waters are also ignored.
4.1 Requirements for template pdb files, special
cases, and modifiers
Only a small fraction (1-2%) of the ~750 pdb files downloaded from the
PDB as a test set gave
problems; most of these exited
gracefully with ERROR messages explaining
the problem.
Selenomethionine residues (MSE)
are automatically substituted with methionine (MET).
4.1.1 Missing atoms
Completely missing residues are ignored with a WARNING
message. If a sidechain atom is missing for a residue, a WARNING message is printed, and the missing
atoms built from the existing atoms, assuming ideal bond geometry and all chi=180
.
If any backbone atom for a given sequence ID is given, then it is
required that N, C,
O, and CA are all
defined for that position. For
jobs that require building sidechain coordinates for a given position,
backbone atoms must
be defined for that position. Failure to comply results in exit with ERROR.
4.1.2 User-defined rotamers
A rotamer at any non-Pro/Gly/Cyd position can be substituted with a
user-defined rotamer:
ROTAMER residuecode chi(1) chi(2) .... chi(n)
The substituted rotamer does not have to be the same residuetype as in
the template structure. For example, if we wish to substitute Lys B 15
with a Leu with chi1
= 180, chi2
= -80
:
ROTAMER 15B L
180.0
-80.0
User-defined rotamers may also be listed in a file:
ROTAMER_FILE filename
The ROTAMER_FILE is distinct from
the ROTAMERLIB file. Do not confuse these!
In the rotamer file, the rotamers are listed as described for
individual rotamers. The file should end with a newline.
"ROTAMER" entries from .out
or .pdb
files generated by EGAD may be used for these. The position_type
designations in these entries are ignored.
4.1.3 TRANSFORM_MATRIX
A structure may be transformed by a transform matrix defined as follows:
TRANSFORM_MATRIX x1 y1 z1 w1 x2 y2 z2 w2 x3 y3 z3
w3
where
xtransformed = xo*x1 + yo*y1 + zo*z1 + w1
ytransformed = xo*x2 + yo*y2 + zo*z2 + w2
ztransformed = xo*x3 + yo*y3 + zo*z3 + w3
For example,
TRANSFORM_MATRIX 1 0 0 0 0 1 0 1 0 0 1 0
translates the structure by 1Å along the
y-axis.
The transform matrix must be orthogonal and normalized to avoid
scaling. A non-orthogonal or non-normalized matrix will result in an
appropriate error message and exit the program.
4.1.4 Rebuilding structures from extracted
dihedral angles (REBUILD_FLAG)
For jobs involving atom movements in dihedral space, dihedral angles
are extracted from the inputted structure, and the sidechain atoms
rebuilt (including hydrogens), using ideal geometry. By default,
backbone atoms are not rebuilt (except for hydrogens). Rebuilding also
attaches hydrogen pseudoatoms that carry partial charges to mimic the
pH-dependent average charge for ionizable groups.
However, for jobs that involve the energy evaluation of static
structures, such as POINT_ENERGY or COMPLEX_FORMATION_ENERGY,
the default is to use the inputted structure as is. This way, the user
can be assured that outputted energies are indeed the energy of the
inputted coordinates. However, if hydrogens are not present, the born
radii values will be incorrect, since part of the energy evaluation
involves pre-calculating the effect of covalently bonded atoms, based
on the description in the residue parameter file. In addition, the
energies involving ionizable groups will not be of the appropriate
pH-averaged state.
These facts are printed as a WARNING to stderr. Including the following modifier will
rebuild the sidechains and hydrogens for these jobs:
REBUILD_FLAG 1
4.1.5 Flipping of Asn, Gln, and His rotamers
The orientation of Asn, Gln, and His rotamers in experimental electron
density maps are often ambiguous (Word
1999). By default, the terminal torsions (chi2
for Asn and His, and chi3
for Gln) are permitted to flip 180 and optimized to minimize the
forcefield (vdW, coulombic, torsion) energy. This may be turned off
with:
FLIP_FLAG 0 # default 1
If REBUILD FLAG 0, flipping does
not occur.
4.1.6 Sequence numbering and atom order
Sequences positions within a given chain may start
with any number
>0; sometimes, N-terminal affinity tags are given sequence positions
≤0. Failure to comply results in exit with ERROR.
All the atoms for a given sequence ID (combination of chain and
sequence number) must be within a contiguous block; however, there is
no required order for the atoms within that block. Contiguous residues
are not permitted to have the same sequence ID. Failure to comply
results in exit with ERROR.
Sequence positions within a given chain must not decrease when going
from the N-terminus to the C-terminus. Failure to comply results in
exit with ERROR.
However, non-standard sequence numberings in which the sequence number
stays the same, but has an appended letter (as is common for serine
protease structures), are accepted, and are flagged with a "wacky numbering" message.
Alternate conformers in high resolution structures are flagged with the
"WARNING alternate conformer detected"
message; conformation 'A' is used by
default.
4.1.7 Chain IDs and SUPER_CHAIN
All single-letter chain ID's are accepted, with the exception of 'l'
(lower-case L), which is reserved for the internal representation of
free-moving ligands. The order of the chain ID's in the file is not
important.
All the atoms for a given chain must be within a contiguous block.
Failure to comply results in exit with ERROR.
For the purpose of pulling apart protein complexes in order to
calculate complex formation energy, SUPER_CHAIN
can be defined. A super-chain is a collection of one or more chains
that can be treated together for rigid body movement. A SUPER_CHAIN can be defined in the template pdb
file directly, or in the inputfile. In either case, the usage is the
same:
SUPER_CHAIN chain1 chain2 ......
For example, if we have an antibody-antigen complex, and
the antibody is composed of chains L and H, and the antigen is chain Y,
we can define in the template or inputfile:
SUPER_CHAIN L H
This ensures that chains L
and H will be moved together as a rigid
unit.
If SUPER_CHAIN
is not defined, the first chain listed in a multi-chain structure is
assumed to be a superchain. Regardless of the source of the definition,
a SUPER_CHAIN entry is listed in the
output pdb file for multi-chain structures.
The CHANGE_CHAIN_ID modifier may be used to
change the chain ID for a given chain.
Format:
CHANGE_CHAIN_ID old_ID new_ID
For example, if chain A should be
changed to chain Q:
CHANGE_CHAIN_ID A Q
If only one chain ID is given as an argument, chains without a
chain ID (ie: chain ID = " ") will be modified.
4.1.8 Disulfide bonds
Disulfide bonds are automatically detected (CYS
residues with SG
≤ 2.5Å apart), and treated as immobile disulfide-bonded cysteines
(CYD); a message indicating this is sent to
stderr. Disulfide-bonded CYD
sidechains are not allowed to move, and are not allowed to mutate.
Interactions between the bonded SG atoms
are detected by the various energy calculation functions, and are set
to zero.
Disulfides may be ignored by including the following modifier in the
input file:
IGNORE_DISULFIDE_FLAG 1
If this is the case, the CYS
sidechains will be treated like any other sidechain, and will be
permitted to move and mutate.
Disulfide-bonds between cysteines on different chains (antibodies, for
example) are automatically detected and grouped into the same SUPER_CHAIN for the purposes of rigid body
movement.
4.1.9 Termini
By default, termini are not treated differently than internal backbone
atoms. If structures are rebuilt, only one H
is attached to the N-terminal-most residue. Similarly, an extra O is not attached to the C-terminal-most
residue. If 1H or OXT
atoms are included in the pdb file, they are treated as neutral; this
is flagged with a WARNING message.
There are a few reasons for this default behavior. For one thing, the
presence of charged termini appear to make little difference for
rotamer optimization and energy prediction, so their omission is not of
great consequence. Secondly, crystal structures used as inputs often
have missing residues at the termini; automatically converting these to
charged termini could be misleading. Related to this issue is the fact
that designed proteins, when expressed, may have N or C terminal tags.
However, charged termini may be included if the user explicitly
requests them. For a charged N-terminus, the N-terminal residue in the
template file must include 1H for the
amide proton. The inputfile must also include the following line:
NTE_FLAG 1
Similarly, to include a charged C-terminus, the
C-terminal residue in the template file must include an "OT" or "OXT" oxygen.
The inputfile must also include the following line:
CTE_FLAG 1
If these atoms and flags are set, the remaining atoms for
the defined termini will be rebuilt using ideal bond geometry, and
their partial charges set as defined in the residue parameter file.
Charged termini are currently not supported for pK calculations (JOBTYPE pK).
For
these jobs, it is best to use the default of no charged termini.
4.1.10 Ligands
If free-moving ligands are included in the template pdb file, each
ligand must have a ligand parameter file defined for it in the main
inputfile (see Ligand section).
Ligand
lines must be listed with an "ATOM" or "LIG " at the start of the line; "HETATM" entries are filtered out.
4.2 JOBTYPE HBUILD
This job takes the inputted pdb file, extracts sidechain dihedral
angles, and rebuilds the them, assuming ideal bond geometry; hydrogens
are added during this rebuilding process. Hydrogens are also added to
backbone atoms. The rebuilt structure is saved to a pdb file.
For example (examples/hbuild/hbuild.input):
START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE HBUILD
OUTPUT_PREFIX gb1.hbuilt
END
The re-built structure is saved to gb1.hbuilt.pdb.
Structures resulting from an HBUILD can
almost always serve as valid TEMPLATE
files for other EGAD jobtypes. While it should not be necessary to HBUILD
a structure of interest (this is done automatically internally), doing
so ensures that the structure that is inputted is the same structure
that is actually treated internally as the template structure. HBUILD outputs do not include pseudoatoms, such
as virtual hydrogens attached to deprotonated ionizable groups.
4.3 JOBTYPE
ALIGN_STRUCTURES
This job aligns the TEMPLATE structure to
the ALIGN_TARGET structure.
START
...
TEMPLATE pdb file to be transformed
ALIGN_TARGET pdb file for alignment target
JOBTYPE ALIGN_STRUCTURES
...
END
ALIGN_DEFINITION
seq_position residue atom -> seq_position residue atom
...
END
The alignment is defined by the atoms listed under ALIGN_DEFINITION.
For example (examples/align_structures/atom_align.input):
...
END
ALIGN_DEFINITION
...
151B
CYS
SG
->
195A
SER
OG
...
END
SG of CYS
151B in the TEMPLATE will be mapped to OG of SER 195A in the ALIGN_TARGET. This
example aligns the catalytically important atoms from a cysteine
protease (TEV protease) to a serine protease (trypsin).
A minimum of four atom pairs must be defined. The program will
transform the template structure to minimize the rmsd of the distances
between the ALIGN_DEFINITION atom pairs. The output structure is the template structure
transformed. The rmsd for the defined pairs, as well as the
final transform matrix, are listed in the output pdb file examples/align_structures/align_catalytic_atoms.pdb.
If two structures have the same residue numbering, BACKBONE
or TRACE can also be used (examples/align_structures/bkbn_align.input):
...
END
ALIGN_DEFINITION
BACKBONE 5 -> 100
END
The backbone atoms for residues 5-100
are used for the alignment. In this example, the N-terminal lobe of
ligand-free maltose binding protein (MBP) is aligned to the same lobe
in the ligand-bound structure. An overlay of the aligned structure (examples/align_structures/align_apo_bound_mbp_bkbn.pdb)
with the bound form reveals a large hinge motion accompanying
ligand-binding (Figure
4.3.1).
Alternatively, one could use TRACE, which
aligns only the CA atoms of the defined
residue range.
The output pdb file contains the TRANSFORM_MATRIX
used for the transformation.
4.4 Description of the readpdbfile.cpp:
readpdbfile function
readpdbfile is called by input_stuff.cpp: input_stuff,
passing the filename, protein, and a re-arrangement flag. As discussed
in the input_stuff section, this function is called twice. The first
time, the size variables MAX_ATOMS and MAX_RESIDUES are defined based on the number of
residues in the file:
MAX_RESIDUES
= num_residues
+ MAX_LIGAND + 4;
MAX_ATOMS = MAX_PROT_RES_SIZE*MAX_RESIDUES + 10;
This yields data-structures that are always large enough for the
template under consideration.
If these values are too small, or if the forcefield information has not
been read into protein, 0 is returned to
the calling function, signaling a problem. Otherwise the pdb file is
opened and parsed into the pdbATOM array protein->Template. Ligand atoms are read into
separate temp files; these are loaded later by ligand_stuff.cpp:
initialize_stuff_for_ligands (see the ligand section).
Disulfide bonds are detected by readpdbfile.cpp:
cys_to_cyd_disulfides if IGNORE_DISULFIDE_FLAG
0 (default). A disulfide is defined if a pair of CYS SG atoms are within MAX_DISULFIDE_BOND_LENGTH
(2.5Å; defined in
structure_types.h) of
each other. Both CYS are converted to CYD. In energy calculation functions,
interactions between CYD CG within MAX_DISULFIDE_BOND_LENGTH are ignored.
Pointers are attached to protein->Template
entries by calling pdb_utilities.cpp:
attach_ptr_pdbATOM. After pointers are attached, coordinates are
transformed by protein->transform_matrix
if TRANSFORM_MATRIX was defined in the
inputfile. If a ROTATE
entry was defined, this is converted into a transform matrix, rotated
about the normalized user-defined rotation axis, and translated back to
the original location (midpoint of user-defined rotation axis).
Optimization of Asn, Gln, and His rotamers is done by readpdbfile.cpp: optimize_asn_gln_his_rotamers
if FLIP_ASN_GLN_HIS_FLAG 1
(default). This function generates structures for each Q/N/H sidechain
and 180 flips and calculates their energies with the non-Q/H/N atoms
(stored in the fixed_energies matrix). It
then calculates energies between all possible Q/N/H pairs (stored in
the pair_energies
matrix). These precalculated energy tables are used for evaluating
candidate conformations. The algorithm starts with unflipped rotamers.
It then passes through each position and evaluates the energy of the
flipped form. If the total energy decreases, the flipped form is kept.
After all positions are examined, they are all re-examined and flipped
as needed, until a complete sweep is made without improvement. This is
similar to the HQM minimization described below. This method is not
guaranteed to find the best combination of flipped rotamers; it merely
finds a minimum near the input conformation. Nevertheless, since the
majority of experimental conformations are correct, this simple method
may be sufficient.
If a structure has multiple chains, SUPER_CHAINs
are defined. The seq_position used by the
internal data-structures is defined as follows:
1st chain: actual_start_position --> actual_end_position
2nd chain: 1000+actual_start_position -->
1000+actual_end_position
nth chain: (n-1)*1000+actual_start_position -->
(n-1)*1000+actual_end_position
The chain ID character of chain n is stored in char
protein->chain_id_list[n].
The SEQPOS_TEXT_MAPPING_LIST
protein->seqpos_text_map struct stores a representation of seq_positions useful for input and output (see structure_types.h: SEQPOS_TEXT_MAPPING_LIST).
The seqpos_text
field is the literal sequence position string from the pdb file
(including the letters used for non-standard "wacky" numbering
described above) merged with the chain ID. For example, for position
125 on chain B (the second of a two chain structure):
seq_position
= 1125
chain_id_list[2] = 'B'
seqpos_text = "125B"
Internally, protein sections separated by missing atoms are
represented as if they are different chains; this representation
permits building of backbone atoms from extracted dihedrals using the
implementation previously developed for bona-fide multi-chain templates
(see GENES_pdb.cpp: CHROMOSOME_to_mini_pdbATOM).
This internal representation should be completely invisible in the
input and output.
For example, consider the residue in the previous example, but with the
caveat that chain A (100 residues) is missing residues 15-25. In
addition, chain B (150 residues) has residues 52-68 missing. Position
125 on chain B is therefore internally represented on the 4th chain:
chain 1 = A 1-14;
chain_id_list[1] = 'A'
chain 2 = A 26-100;
chain_id_list[2] = 'A'
chain 3 = B 1-51;
chain_id_list[3] = 'B'
chain 4 = B 69-150;
chain_id_list[4] = 'B'
Now, position 125 on chain B:
seq_position = 4125
chain_id_list[4] = 'B'
seqpos_text = "125B"
This seqpos_text representation is
used for reading in fixed and variable positions (see variable position
section), as well as any output that requires a sequence ID.
If the re-arrangement flag argument ≠ 0, this function will rebuild the structure
with
ideal bond geometry. Dihedrals are extracted and stored in a CHROMOSOME by GENES_pdb.cpp:
pdbATOM_to_CHROMOSOME. The coordinates are then rebuilt by GENES_pdb.cpp: CHROMOSOME_to_pdbATOM. This
procedure adds hydrogens and builds in missing atoms. If the flag is
set to 1,
only the sidechains are rebuilt, while backbone atoms simply have
hydrogens built on; the heavy atom coordinates are taken directly from
the inputted file. If the flag is set to 2,
the backbone dihedrals are also extracted, and the entire backbone
rebuilt using ideal bond geometry. This option is not currently used
internally within EGAD, since rebuilding using extracted dihedrals can
cause large structural deformations (see the adjust backbone torsions
section in the minimization with moving backbone section). input_stuff sends default values for this flag
depending on the jobtype (see above). If REBUILD_FLAG
is defined by the user, that takes precedence.
back to Table of Contents