back to Table of Contents

8. Input of variable and fixed positions for rotamer optimization

8.1 Overview of rotamer optimization

Rotamer optimization is the raison d'etre for EGAD; most of the code in EGAD is dedicated to this problem. There are three main steps related to this: input of variable positions, generation of the pair energy lookup table, and the rotamer optimization itself. These topics are covered in this and the following two sections. It is also discussed in greater detail in Chapter 2 of my thesis.

8.2 Explicit definition of variable and fixed positions

For any rotamer optimization job, it is necessary to specify which positions are allowed to move, and to what residuetypes they are allowed to mutate to. It may also be necessary to designate some positions as fixed.

In general:
START
...
END
VARIABLE_POSITIONS (usually required for optimization jobs)
         seq_position     permitted residuetypes
...
END (required at end of VARIABLE_POSITIONS block)
FIXED    _POSITIONS (optional)
         seq_position
...
END (required at end of FIXED_POSITIONS block and at end of file)


The seq_position fields use the sequence format described in the inputfile section.
The order of positions within each section is not important.
Comments are permitted on a line following the data. However, comment lines and/or empty lines are not permitted.

FIXED_POSITIONS will not be allowed to change their amino acid identity or their rotamer.

8.2.1 Permitted residuetype string format

For the reasons discussed above, prolines and glycines are not allowed as residue choices. These need to be built into the template structure if they are required at a particular position.

One may list residues directly. One letter codes are listed as a comma-delimited list, terminating with a period (.). No spaces are permitted within the permitted residuetype string.
For example:
15       A.
11       Q,W,E,R,T,Y.
25       V,A,N.

The order of residuetypes is not important. Residues may be repeated within a comma-delimited string.

If a seq_position field is given, but no residuetypes are given, the residue from the template file is assumed to be the only residuetype permitted at that position. This option should be used with care; unless OTHER_RESIDUES specifies otherwise (see below), using this option for a position causes the program to assume that only explicitly defined positions are allowed to move (i.e.: OTHER_RESIDUES none).
        
One may also use the following shortcuts for residue classes:
all              # A,D,E,F,H,I,K,L,M,N,Q,R,S,T,V,W,Y.    
polar            # A,D,E,H,K,N,Q,R,S,T.
hydrophobic      # A,F,I,L,V,W,Y,M.
hphob            # A,F,I,L,V,W,Y,M.
aromatic         # F,W,Y.
aliphatic        # A,I,L,V.
neutral_polar    # N,Q,S,T.
charged          # D,E,H,K,R.
acidic           # D,E.
basic            # R,K.

For example:
15       all              # A,D,E,F,H,I,K,L,M,N,Q,R,S,T,V,W,Y.
21       polar            # A,D,E,H,K,N,Q,R,S,T.
67       aliphatic        # A,I,L,V.

Mixing and matching formats within a line is not permitted. For example, the following lines are NOT valid:
16       polar,W,Y.               # not valid
45       aromatic,acidic.         # not valid
32       A,neutral_polar.         # not valid
58       K,acidic,aromatic,Q.     # not valid

8.2.2 Proline, glycine, and disulfide-bonded cysteine positions are not allowed to vary

Defining positions that contain proline, glycine, or disulfide-bonded cysteines as variable positions is NOT permitted. These will be caught by the parser and cause the program to exit with an ERROR message. These positions are not permitted to move because mutations at these positions are likely to cause changes in the backbone conformation, a violation of the fixed-backbone assumption.

If mutations at disulfide-bond forming cysteines are desired, include the template structure modifier IGNORE_DISULFIDE_FLAG 1 in the input file.

If mutations at proline or glycine positions are desired, the template pdb file must be modified. Try replacing the residue field for these positions with "ALA" and delete the proline sidechain and/or the glycine 2HA atoms. Please be warned that this may give questionable results, for the reasons discussed above.

8.2.3 OTHER_POSITIONS modifier

The OTHER_POSITIONS (or OTHER_RESIDUES) in the parameters section of the inputfile is used to permit the movement of other residues not listed explicitly in the variable positions section.

Three options are available:
OTHER_POSITIONS  all                        # default; all residues permitted to move rotamers
OTHER_POSITIONS  neighbors                  # neighboring residues also permitted to move
OTHER_POSITIONS  nearest_neighbors          # only residues directly in contact # with user-
                                                # defined residues allowed to move
OTHER_POSITIONS  none                       # only explicitly defined positions allowed to move

Nearest-neighbor positions are defined as those that have at least one sidechain atom within 1.5*(sum of vdW radii of the defined residue and candidate neighbor) Å of any sidechain atom in a defined variable position. Neighbor positions also include positions which have at least one sidechain atom within vdW contact of a nearest-neighbor position sidechain atom.

8.2.4 AVOID_NATIVE_ROTAMER_FLAG modifier

Including
AVOID_NATIVE_ROTAMER_FLAG 1 # default 0
in the parameter section of the inputfile will exclude rotamers extracted from the template structure. By default, these rotamers are permitted if the residue from the template structure is allowed as choice.

8.3 Shortcuts for non-explicit definition of variable positions

For some commonly run jobs, shortcuts are available.

In general:

VARIABLE_POSITIONS
         shortcut_string
END

FIXED_POSITIONS may be listed afterwards, if so desired. These positions will not be allowed to change their identity or their rotamer.

The available shortcuts are:
optimize_all_rotamers   (optimize all the rotamers in the template file simultaneously; the sequence is identical to the template file) 
total_sequence_design (allow all non-P/G/c/C residuetypes at all non-P/G/c positions)
hp_pattern_design             (use hydrophobic-polar patterning for total_sequence_design;
                                                               core positions are hydrophobic; surface are polar;
                                                               core/surface definition similar to )
relax_all_rotamers          (for calculating the point energy of a structure, using local minimization of vdW energies;
                                                                see JOBTYPE point_energy section)

An example (examples/rotamer_optimization/gb1.dee.rotamer_optim.input):

START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE dee
OUTPUT_PREFIX gb1.dee.rotamer_optim
END
VARIABLE_POSITIONS
         optimize_all_rotamers
END

This file specifies that the DEE (dead-end elimination) method be used to optimize all the rotamers in the template structure.

As another example, the following file is for a full sequence design of gb1 using the MC_GA method, with 100 minutes running time for the MC and the GA steps (examples/total_design/gb1.total_design.input):

START
TEMPLATE_PDB ../template_structures/gb1.pdb
FORCEFIELD_FILE ../energy_function/forcefield
JOBTYPE MC_GA
    RUNTIME 100
LOOKUP_TABLE_DIRECTORY ../lookup_tables/gb1.lookup_table
OUTPUT_PREFIX gb1.total_design
END
VARIABLE_POSITIONS
         total_sequence_design
END

8.3.1 Input of a SEQUENCE

Instead of listing each position individually, it is possible to input a protein sequence that will be threaded over the template protein backbone and rotamer optimized. This is equivalent to listing each position explicitly with a single amino acid choice at each position. The functionality may not be used with template structures that have missing residues. These will be caught by the program, causing an exit and an ERROR message. Prolines, glycines, and disulfide-bonded cysteines (c [lower-case C] ) may be included in the sequence; however, their positions within the inputted sequence must correspond to their positions in the template pdb file.

General usage:
SEQUENCE
         protein_sequence
END

FIXED_POSITIONS may be listed afterwards, if so desired.

For example, to thread a sequence over a single chain template backbone, starting at the first position:
SEQUENCE
ASDFGHKLMNQWERTYPPKLMNASDFHCRTYPLHMNGAS

If the sequence (n residues long) is not long enough to cover the full length of the protein, the sequence is threaded over the first n positions. The other residues retain their template identities.

It is also possible to thread a sequence starting at a position other than the first position. For example,
SEQUENCE
51 ASDFGHKLMNQWERTYPPKLMNASDFHCRTYPLHMNGAS

starts at position 51, and threads the sequence over the template. As above, a sequence of length n is threaded over the first n positions following and including the defined start position. The other positions retain their template identities.

These features may also be used for multi-chain proteins. For example:
SEQUENCE
1A HCRTYPLHMNGASRTYPPKLMNCRTYPLHMMNGASRTYP
13B MNGASHMNGASRTYPPPKLMNASDF

chain A has the sequence listed on the first line threaded over it, starting at its first position. chain B has the sequence listed on the second line threaded over it, starting at position 13 on chain B.

In all these cases, prolines, glycines, and disulfide-bonded cysteines must be included at positions corresponding to their locations in the template pdb file, and are not permitted at positions that are not already proline, glycine, or disulfide-bonded cysteines respectively.

8.3.2 JOBTYPE list_variable_positions

JOBTYPE list_variable_positions
is a tool for checking which positions are being allowed to move, which positions are kept fixed, and the residues that are permitted at each position. It is also useful for identifying the nearest neighbors and neighbors of a given residue.

The data are output in two files. OUTPUT_PREFIX.variable_positions gives the equivalent VARIABLE_POSITIONS section, (with OTHER_RESIDUES none) with all moving residues listed explicitly.

The output_prefix.detailed_floating_positions_list file contains much more information. The first section has a line for each position:
status seq_pos residues core_surf_int_class number_rotamers MEA_sasa MEA_born

status describes whether the position was user-defined (user), fixed, is a nearest_neighbor of a user-defined residue, or is a neighbor of a user-defined residue. The core_surf_int_class describes whether a position is core (c), surface (s), interfacial (i), or is a ligand (l). The MEA_sasa and MEA_born values are used to define the core_surf_int_class classification, as described .

For example (examples/list_variable_positions/gb1_varpos.input):
START
TEMPLATE ../template_structures/gb1.pdb
FORCEFIELD ../energy_function/forcefield
JOBTYPE list_variable_positions
OTHER_RESIDUES neighbors
PREFIX gb1_varpos
VARIABLE_POSITIONS
    43 hydrophobic
END
FIXED_POSITIONS
    8
    12
END

gb1_varpos.detailed_floating_positions_list
:
...
fixed 33 Y c 1 23.845674 3.049009
nearest_neighbor 34 A c 1 17.090264 3.328841
fixed 35 N s 1 96.033067 2.392500
fixed 36 D s 1 98.392464 2.305927
fixed 37 N s 1 49.481070 2.580551
neighbor 39 V c 10 4.476708 3.986714
fixed 40 D s 1 124.194057 2.080821
fixed 42 E s 1 102.218801 2.134633
user 43 A F I L M V W Y s 145 69.989387 2.333668
fixed 44 T s 1 92.861906 2.263579
neighbor 45 Y s 19 42.725660 2.505190
...

In this example, the program finds the neighbors and nearest-neighbors of position 43 (since OTHER_RESIDUES neighbors was defined; if this was not defined, by default, all positions not explicitly defined would be allowed to change their rotamer). The residues that are allowed at user-defined position 43 are A,F,I,L,M,V,W, and Y (hydrophobic). In the fragment of the file shown here, 39 and 45 are neighbors, while 34 is a nearest neighbor. Positions 8 and 12 (not shown here) were defined to be fixed, regardless of their actual neighbor status.

Finally, a comma-delimited list (line begins with COMMA_DELIMITED_LIST) is also listed; this line may be cut and pasted into Rasmol for defining the moving residues.

To do the actual run, replace the JOBTYPE with a rotamer-optimization method.

8.4 Description of variable position input structs and functions

8.4.1 VARIABLE_POSITION and CHOICE datastructures

The functions described here are all involved in setting up the main protein->var_pos array (see structure_types.h: VARIABLE_POSITION). Each non-pro/gly position has a protein->var_pos element, whether it is a FIXED_POSITION or not, or whether is it user-defined or not. Each VARIABLE_POSITION has a number_of_choices (number of residuetypes permitted at that position; non-explicitly defined positions have one choice corresponding to the residue for that position in the template structure). It also has pointers to or values for other information relevant for this position, such as the internal sequence position (seq_position), whether it is fixed or not (fixed_flag), whether it is considered to be a core or surface position (core_flag), the BACKBONE for this position (bkbn), and the lookup table branch for this position (lookup_energy_ptr).

VARIABLE_POSITION also has an array choice of CHOICE. Each choice element corresponds to a residuetype permitted at that position. A CHOICE struct has pointers to the RESPARAM element corresponding to its residuetype (resparam_ptr), the lookup table branch for this residuetype at this position (lookup_res_ptr), and to the BACKBONE corresponding to this position (bkbn). It also has in_use_flag; 1 indicates that this residuetype is indeed a permitted option at this position (see rotamer calc slave section).

The resparam_ptr element of CHOICE is a copy of the main protein->resparam entry for the residuetype of interest. The rotamerlib_ptr element of this RESPARAM is de-referenced from the relevant entry of the main protein->rotamerlib array, and then re-allocated and referenced to rotamers specific for this residuetype at this particular position. In this way, each position can have different subsets of possible rotamers for the same residuetype. This is required to filter rotamers that clash with one backbone position, but not another (lookup_table.cpp: generate_lookup_table), or to include the rotamer from the template pdb file (VARIABLE_POSITION.cpp: pdb_to_VARIABLE_POSITION).

8.4.2
input_stuff.cpp: input_VARIABLE_POSITION

input_stuff.cpp: input_VARIABLE_POSITION is sent a file pointer corresponding to the line following the VARIABLE_POSITIONS line in the inputfile. It assumes that the protein argument has been initialized by input_stuff.cpp: input_stuff, and has attributes such as OTHER_RESIDUES and JOBTYPE assigned (to a rotamer optimization job). It also assumes that the TEMPLATE_PDB has been loaded. This function may be called by functions other than input_stuff. For example, see rotamer_calc_foreman.cpp: rotamer_calc_foreman or input_stuff.cpp: sequence_to_var_pos_file .

Historically, the neighbor distance protein->parameters.neighbordist used to be a user-definable parameter. Now, it is set based on the value of the OTHER_POSITIONS modifier, and is used by the VARIABLE_POSITION.cpp: modify_VARIABLE_POSITIONS and find_neighbors functions to define additional moving positions.

Overall, user-defined residuetypes are assigned to each user-defined variable position protein->var_pos. This information, along with the OTHER_RESIDUES parameter, and user-defined FIXED_POSITIONS is used by VARIABLE_POSITION.cpp: modify_VARIABLE_POSITION to identify other positions in the structure that are allowed to change rotamer conformations.

If hp_pattern_design has been defined, protein and the file pointer are sent to input_stuff.cpp: hp_pattern_design, which determines the HP pattern of the residues, creates a new inputfile with the appropriate patterning, and launches a child process with the new file. If optimize_all_rotamers has been defined, the first alanine (failing that, valine, serine, or threonine) is defined as a user-defined variable position. If total_sequence_design has been defined, the protein->Template pdbATOM array is searched to identify non-pro-gly positions. At each of these positions, all 17 non-pro/gly/cys residuetypes are defined as choices. If variable positions are explicitly defined, they are read in from the inputfile. User-defined fixed positions are read into the fixed_positions int array.

If JOBTYPE list_variable_positions is defined, the data are written to the OUTPUT_PREFIX.floating_positions_list and OUTPUT_PREFIX.detailed_floating_positions_list files.

8.4.3 VARIABLE_POSITIONS.cpp: modify_VARIABLE_POSITION

This function creates a VARIABLE_POSITION array fixed_res_VARIABLE_POSITION that contains the residues and rotamers from the inputted template structure. This is done by VARIABLE_POSTIONS.cpp: pdbATOM_to_VARIABLE_POSITION, which extracts the native residuetypes and rotamers from the template structure for all non-pro/gly positions. The intrinsic energies for these rotamers are calculated and stored in the custom RESPARAM and ROTAMERLIB structs generated here.

Next, the non-user-defined residues neighboring the user-defined variable positions are identified by VARIABLE_POSTIONS.cpp: find_neighbors, based on protein->parameters.neighbordist (defined in input_stuff by the OTHER_RESIDUES parameter). The list of neighboring positions is converted to the VARIABLE_POSITIONS array neighbor_varpos with the listofvarpos_to_VARIABLE_POSITION function. These three VARIABLE_POSITION arrays, along with the list of fixed positions, are melded into the final protein->var_pos by the meld_VARIABLE_POSITION function. If CHARGES_PH_INDEPENDENT_FLAG 1 requires explicit consideration of the conjugate acid and base forms of ionizable residues, the rarer form is appended as a CHOICE to the relevant VARIABLE_POSITIONs.

The core_flag element for each protein->var_pos element is determined using the method similar to that described by Marshall & Mayo (Marshall and Mayo 2001). This information can be used for structure-dependent reference states or for hp patterning design. Some default values and pointers are set for elements within VARIABLE_POSITION and CHOICE.

Finally, the fixed atoms (backbone and pro/gly) are extracted, and their internal energy calculated.

8.4.4 input_stuff.cpp: sequence_to_var_pos_file

This function converts a SEQUENCE input into a file containing the variable-positions and fixed-positions parts of a standard inputfile. This file is opened, and the file pointer sent to input_VARIABLE_POSITION and processed as described above.

back to Table of Contents