back to Table of Contents
8. Input of variable and fixed
positions for rotamer
optimization
8.1 Overview of rotamer optimization
Rotamer optimization is the raison d'etre for EGAD; most of
the code in EGAD is dedicated to this problem. There are three main
steps related to this: input of variable positions, generation
of the
pair energy lookup table, and the rotamer
optimization itself. These
topics are covered in this and the following two sections. It is also
discussed in greater detail in Chapter 2 of my thesis.
8.2 Explicit definition of variable and fixed
positions
For any rotamer optimization job, it is necessary to specify which
positions are allowed to move, and to what residuetypes they are
allowed to mutate to. It may also be necessary to designate some
positions as fixed.
In general:
START
...
END
VARIABLE_POSITIONS (usually required for optimization jobs)
seq_position
permitted residuetypes
...
END (required at end of VARIABLE_POSITIONS block)
FIXED
_POSITIONS (optional)
seq_position
...
END (required at end of FIXED_POSITIONS block and at end of file)
The seq_position fields use the sequence
format described in the inputfile section.
The order of positions within each section is not important.
Comments are permitted on a line following the data. However, comment
lines and/or empty lines are not permitted.
FIXED_POSITIONS will not be allowed to
change their amino acid identity or their rotamer.
8.2.1 Permitted residuetype string format
For the reasons discussed above, prolines and glycines are not allowed
as residue choices. These need to be built into the template structure
if they are required at a particular position.
One may list residues directly. One letter codes are listed as a
comma-delimited list, terminating with a period (.).
No spaces are permitted within the permitted residuetype string.
For example:
15
A.
11
Q,W,E,R,T,Y.
25
V,A,N.
The order of residuetypes is not important. Residues may be repeated
within a comma-delimited string.
If a seq_position field is given, but no residuetypes are given, the
residue from the template file is assumed to be the only residuetype
permitted at that position. This option should be used with care;
unless OTHER_RESIDUES specifies otherwise
(see below), using this option for a position causes the program to
assume that only explicitly defined positions are allowed to move
(i.e.: OTHER_RESIDUES none).
One may also use the following shortcuts for residue classes:
all
#
A,D,E,F,H,I,K,L,M,N,Q,R,S,T,V,W,Y.
polar
# A,D,E,H,K,N,Q,R,S,T.
hydrophobic
# A,F,I,L,V,W,Y,M.
hphob
# A,F,I,L,V,W,Y,M.
aromatic
# F,W,Y.
aliphatic
# A,I,L,V.
neutral_polar
# N,Q,S,T.
charged
# D,E,H,K,R.
acidic
# D,E.
basic
# R,K.
For example:
15
all
#
A,D,E,F,H,I,K,L,M,N,Q,R,S,T,V,W,Y.
21
polar
# A,D,E,H,K,N,Q,R,S,T.
67
aliphatic
# A,I,L,V.
Mixing and matching formats within a line is not permitted. For
example, the following lines are NOT valid:
16
polar,W,Y.
# not valid
45
aromatic,acidic.
# not valid
32
A,neutral_polar.
# not valid
58
K,acidic,aromatic,Q.
# not valid
8.2.2 Proline, glycine, and disulfide-bonded
cysteine positions
are not allowed to vary
Defining positions that contain proline, glycine, or disulfide-bonded
cysteines as variable positions is NOT permitted. These will be caught
by the parser and cause the program to exit with an ERROR message. These positions are not permitted
to move because mutations at these positions are likely to cause
changes in the backbone conformation, a violation of the fixed-backbone
assumption.
If mutations at disulfide-bond forming cysteines are desired, include
the template structure modifier IGNORE_DISULFIDE_FLAG
1 in the input file.
If mutations at proline or glycine positions are desired, the template
pdb file must be modified. Try replacing the residue field for these
positions with "ALA" and delete the proline
sidechain and/or the glycine 2HA atoms.
Please be warned that this may give questionable results, for the
reasons discussed above.
8.2.3 OTHER_POSITIONS
modifier
The OTHER_POSITIONS (or OTHER_RESIDUES) in the parameters section of the
inputfile is used to permit the movement of other residues not listed
explicitly in the variable positions section.
Three options are available:
OTHER_POSITIONS
all
# default; all residues permitted to move rotamers
OTHER_POSITIONS
neighbors
# neighboring residues also permitted to move
OTHER_POSITIONS
nearest_neighbors
# only residues directly in contact # with user-
# defined residues allowed to move
OTHER_POSITIONS
none
# only explicitly defined positions allowed to move
Nearest-neighbor positions are defined as those
that have at least one
sidechain atom within 1.5*(sum of vdW radii of the defined residue and
candidate neighbor) Å of any sidechain atom
in a defined variable position. Neighbor positions also include
positions which have at least one sidechain atom within vdW contact of
a nearest-neighbor position sidechain atom.
8.2.4 AVOID_NATIVE_ROTAMER_FLAG
modifier
Including
AVOID_NATIVE_ROTAMER_FLAG 1 # default 0
in the parameter section of the inputfile will exclude rotamers
extracted from the template structure. By default, these rotamers are
permitted if the residue from the template structure is allowed as
choice.
8.3 Shortcuts for non-explicit definition of
variable positions
For some commonly run jobs, shortcuts are available.
In general:
VARIABLE_POSITIONS
shortcut_string
END
FIXED_POSITIONS may be listed afterwards,
if so desired. These positions will not be allowed to change their
identity or their rotamer.
The available shortcuts are:
optimize_all_rotamers (optimize
all the rotamers in the template file simultaneously; the sequence is
identical to the template file)
total_sequence_design (allow all
non-P/G/c/C residuetypes at all non-P/G/c positions)
hp_pattern_design
(use hydrophobic-polar patterning for
total_sequence_design;
core positions are hydrophobic; surface are polar;
core/surface definition similar to )
relax_all_rotamers
(for calculating the point energy of a structure, using local
minimization of vdW energies;
see JOBTYPE point_energy
section)
An example (examples/rotamer_optimization/gb1.dee.rotamer_optim.input):
START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE dee
OUTPUT_PREFIX gb1.dee.rotamer_optim
END
VARIABLE_POSITIONS
optimize_all_rotamers
END
This file specifies that the DEE (dead-end
elimination) method be used to optimize all the rotamers in the
template structure.
As another example, the following file is for a full sequence design of
gb1 using the MC_GA method, with 100
minutes running time for the MC and the GA steps (examples/total_design/gb1.total_design.input):
START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE MC_GA
RUNTIME 100
LOOKUP_TABLE_DIRECTORY ../lookup_tables/gb1.lookup_table
OUTPUT_PREFIX gb1.total_design
END
VARIABLE_POSITIONS
total_sequence_design
END
8.3.1 Input of a SEQUENCE
Instead of listing each position individually, it is possible to input
a protein sequence that will be threaded over the template protein
backbone and rotamer optimized. This is equivalent to listing each
position explicitly with a single amino acid choice at each position.
The functionality may not be used with template structures that have
missing residues. These will be caught by the program, causing an exit
and an ERROR message. Prolines, glycines,
and disulfide-bonded cysteines (c [lower-case C] ) may be included in
the sequence; however, their positions within the inputted sequence
must correspond to their positions in the template pdb file.
General usage:
SEQUENCE
protein_sequence
END
FIXED_POSITIONS may be listed afterwards,
if so desired.
For example, to thread a sequence over a single chain template
backbone, starting at the first position:
SEQUENCE
ASDFGHKLMNQWERTYPPKLMNASDFHCRTYPLHMNGAS
If the sequence (n residues long) is not long enough to cover the full
length of the protein, the sequence is threaded over the first n
positions. The other residues retain their template identities.
It is also possible to thread a sequence starting at a position other
than the first position. For example,
SEQUENCE
51 ASDFGHKLMNQWERTYPPKLMNASDFHCRTYPLHMNGAS
starts at position 51, and threads the sequence over the template. As
above, a sequence of length n is threaded over the first n positions
following and including the defined start position. The other positions
retain their template identities.
These features may also be used for multi-chain proteins. For example:
SEQUENCE
1A HCRTYPLHMNGASRTYPPKLMNCRTYPLHMMNGASRTYP
13B MNGASHMNGASRTYPPPKLMNASDF
chain A has the sequence listed on the first line threaded over it,
starting at its first position. chain B has the sequence listed on the
second line threaded over it, starting at position 13 on chain B.
In all these cases, prolines, glycines, and disulfide-bonded cysteines
must be included at positions corresponding to their locations in the
template pdb file, and are not permitted at positions that are not
already proline, glycine, or disulfide-bonded cysteines respectively.
8.3.2 JOBTYPE
list_variable_positions
JOBTYPE list_variable_positions is a tool for checking which
positions are being allowed to move, which positions are kept fixed,
and the residues that are permitted at each position. It is also useful
for identifying the nearest neighbors and neighbors of a given residue.
The data are output in two files. OUTPUT_PREFIX.variable_positions
gives the equivalent VARIABLE_POSITIONS
section, (with OTHER_RESIDUES none) with
all moving residues listed explicitly.
The output_prefix.detailed_floating_positions_list
file contains much more information. The first section has a line for
each position:
status seq_pos residues
core_surf_int_class number_rotamers MEA_sasa MEA_born
status describes whether the position was
user-defined (user), fixed,
is a nearest_neighbor of a user-defined
residue, or is a neighbor of a user-defined residue. The core_surf_int_class describes whether a position
is core (c), surface (s),
interfacial (i), or is a ligand (l). The MEA_sasa
and MEA_born values are used to define the
core_surf_int_class classification, as
described .
For example (examples/list_variable_positions/gb1_varpos.input):
START
TEMPLATE ../template_structures/gb1.pdb
FORCEFIELD ../energy_function/forcefield
JOBTYPE list_variable_positions
OTHER_RESIDUES neighbors
PREFIX gb1_varpos
VARIABLE_POSITIONS
43
hydrophobic
END
FIXED_POSITIONS
8
12
END
gb1_varpos.detailed_floating_positions_list
:
...
fixed 33 Y c 1 23.845674 3.049009
nearest_neighbor 34 A c 1 17.090264 3.328841
fixed 35 N s 1 96.033067 2.392500
fixed 36 D s 1 98.392464 2.305927
fixed 37 N s 1 49.481070 2.580551
neighbor 39 V c 10 4.476708 3.986714
fixed 40 D s 1 124.194057 2.080821
fixed 42 E s 1 102.218801 2.134633
user 43 A F I L M V W Y s 145 69.989387 2.333668
fixed 44 T s 1 92.861906 2.263579
neighbor 45 Y s 19 42.725660 2.505190
...
In this example, the program finds the neighbors and
nearest-neighbors of position 43 (since OTHER_RESIDUES
neighbors was defined; if this was not defined, by default, all
positions not explicitly defined would be allowed to change their
rotamer). The residues that are allowed at user-defined
position 43 are A,F,I,L,M,V,W,
and Y (hydrophobic).
In the fragment of the file shown here, 39
and 45 are neighbors, while 34 is a nearest neighbor. Positions 8 and 12 (not shown
here) were defined to be fixed, regardless
of their actual neighbor status.
Finally, a comma-delimited list (line begins with COMMA_DELIMITED_LIST)
is also listed; this line may be cut and pasted into Rasmol for
defining the moving residues.
To do the actual run, replace the JOBTYPE
with a rotamer-optimization method.
8.4 Description of variable position input structs
and functions
8.4.1 VARIABLE_POSITION and CHOICE
datastructures
The functions described here are all involved in setting up the main protein->var_pos array (see structure_types.h: VARIABLE_POSITION). Each
non-pro/gly position has a protein->var_pos
element, whether it is a FIXED_POSITION or
not, or whether is it user-defined or not. Each VARIABLE_POSITION
has a number_of_choices (number of
residuetypes permitted at that position; non-explicitly defined
positions have one choice corresponding to the residue for that
position in the template structure). It also has pointers to or values
for other information relevant for this position, such as the internal
sequence position (seq_position), whether
it is fixed or not (fixed_flag), whether it
is considered to be a core or surface position (core_flag),
the BACKBONE for this position (bkbn), and the lookup table branch for this
position (lookup_energy_ptr).
VARIABLE_POSITION also has an array choice of CHOICE.
Each choice element corresponds to a
residuetype permitted at that position. A CHOICE
struct has pointers to the RESPARAM
element corresponding to its residuetype (resparam_ptr),
the lookup table branch for this residuetype at this position (lookup_res_ptr), and to the BACKBONE
corresponding to this position (bkbn). It
also has in_use_flag; 1
indicates that this residuetype is indeed a permitted option at this
position (see rotamer calc slave section).
The resparam_ptr element of CHOICE is a copy of the main protein->resparam entry for the residuetype
of interest. The rotamerlib_ptr element of
this RESPARAM is de-referenced from the
relevant entry of the main protein->rotamerlib
array, and then re-allocated and referenced to rotamers specific for
this residuetype at this particular position. In this way, each
position can have different subsets of possible rotamers for the same
residuetype. This is required to filter rotamers that clash with one
backbone position, but not another (lookup_table.cpp:
generate_lookup_table), or to include the rotamer from the
template pdb file (VARIABLE_POSITION.cpp:
pdb_to_VARIABLE_POSITION).
8.4.2 input_stuff.cpp:
input_VARIABLE_POSITION
input_stuff.cpp: input_VARIABLE_POSITION is
sent a file pointer corresponding to the line following the VARIABLE_POSITIONS line in the inputfile. It
assumes that the protein argument has been initialized by input_stuff.cpp: input_stuff, and has attributes
such as OTHER_RESIDUES and JOBTYPE assigned (to a rotamer optimization
job). It also assumes that the TEMPLATE_PDB
has been loaded. This function may be called by functions other than input_stuff. For example, see rotamer_calc_foreman.cpp: rotamer_calc_foreman
or input_stuff.cpp: sequence_to_var_pos_file
.
Historically, the neighbor distance protein->parameters.neighbordist
used to be a user-definable parameter. Now, it is set based on the
value of the OTHER_POSITIONS modifier, and
is used by the VARIABLE_POSITION.cpp:
modify_VARIABLE_POSITIONS and find_neighbors
functions to define additional moving positions.
Overall, user-defined residuetypes are assigned to each user-defined
variable position protein->var_pos. This
information, along with the OTHER_RESIDUES
parameter, and user-defined FIXED_POSITIONS
is used by VARIABLE_POSITION.cpp:
modify_VARIABLE_POSITION to identify other positions in the
structure that are allowed to change rotamer conformations.
If hp_pattern_design has been defined,
protein and the file pointer are sent to input_stuff.cpp:
hp_pattern_design, which determines the HP pattern of the
residues, creates a new inputfile with the appropriate patterning, and
launches a child process with the new file. If optimize_all_rotamers
has been defined, the first alanine (failing that, valine, serine, or
threonine) is defined as a user-defined variable position. If total_sequence_design has been defined, the protein->Template pdbATOM array is searched
to identify non-pro-gly positions. At each of these positions, all 17
non-pro/gly/cys residuetypes are defined as choices. If variable
positions are explicitly defined, they are read in from the inputfile.
User-defined fixed positions are read into the
fixed_positions int array.
If JOBTYPE list_variable_positions is
defined, the data are written to the OUTPUT_PREFIX.floating_positions_list
and OUTPUT_PREFIX.detailed_floating_positions_list
files.
8.4.3 VARIABLE_POSITIONS.cpp:
modify_VARIABLE_POSITION
This function creates a VARIABLE_POSITION
array fixed_res_VARIABLE_POSITION that
contains the residues and rotamers from the inputted template
structure. This is done by VARIABLE_POSTIONS.cpp:
pdbATOM_to_VARIABLE_POSITION, which extracts the native
residuetypes and rotamers from the template structure for all
non-pro/gly positions. The intrinsic energies for these rotamers are
calculated and stored in the custom RESPARAM
and ROTAMERLIB structs generated here.
Next, the non-user-defined residues neighboring the user-defined
variable positions are identified by VARIABLE_POSTIONS.cpp:
find_neighbors, based on protein->parameters.neighbordist
(defined in input_stuff by the OTHER_RESIDUES parameter). The list of
neighboring positions is converted to the VARIABLE_POSITIONS
array neighbor_varpos with the listofvarpos_to_VARIABLE_POSITION function.
These three VARIABLE_POSITION arrays,
along with the list of fixed positions, are melded into the final protein->var_pos by the meld_VARIABLE_POSITION
function. If CHARGES_PH_INDEPENDENT_FLAG 1
requires explicit consideration of the conjugate acid and base forms of
ionizable residues, the rarer form is appended as a CHOICE to the relevant VARIABLE_POSITIONs.
The core_flag element for each protein->var_pos element is determined using
the method similar to that described by Marshall & Mayo (Marshall
and Mayo 2001). This information
can be used for structure-dependent reference states or for hp
patterning design. Some default values and pointers are set for
elements within VARIABLE_POSITION and CHOICE.
Finally, the fixed atoms (backbone and pro/gly) are extracted, and
their internal energy calculated.
8.4.4 input_stuff.cpp:
sequence_to_var_pos_file
This function converts a SEQUENCE input
into a file containing the variable-positions and fixed-positions parts
of a standard inputfile. This file is opened, and the file pointer sent
to input_VARIABLE_POSITION and processed
as described above.
back to Table of Contents