3. Inputfiles
3.1 General layout and rules
Input files for EGAD have the following general layout:
# Comments (optional)
START (required)
# comments (optional)
TEMPLATE_PDB filename.pdb (required)
! comments (optional)
    template pdb modifiers (optional)
JOBTYPE type_of_job (optional if VARIABLE_POSITIONS are defined for single-structure job)
    optimization method modifiers (optional)
FORCEFIELD_FILE forcefield_file (required)
    energy function modifiers (optional)
    names of other files or directories (as needed)
    search space modifiers (optional)
OUTPUT_PREFIX output_prefix (optional but recommended)
END (optional here for rotamer optimization jobs; required for all others)
VARIABLE_POSITIONS (required for rotamer optimization jobs)
seq_position permitted residuetypes (shortcuts also permitted)
...
END (required at end of VARIABLE_POSITIONS block)
FIXED_POSITIONS (optional)
seq_position
...
END (required at end of FIXED_POSITIONS block and at end of file)
An inputfile has three sections. The first section (parameter section) gives information
about the template structure, the energy function, the desired job, and
how to run it. For jobs that do not move atoms, an END line is required
following the parameter section. For other jobs, END is optional if a
variable-positions section immediately follows. However, for atom-moving
jobs that do not specify variable positions, END is required here; the
default is then to move all positions.
The second and third sections describe the search space of the problem
in greater detail. The second section (variable-positions
section) lists positions that are allowed to vary, and what they may
vary to; this section is required for jobs that move protein atoms. The
third section (fixed-positions section) lists positions that
must be held fixed; this section is optional.
Each of these blocks must end with END.
Every valid EGAD inputfile MUST begin with START,
contain entries for TEMPLATE_PDB, FORCEFIELD_FILE, and JOBTYPE,
and end with END.
Modifiers are described in the relevant sections; a complete list with
default values is given in the appendix: inputfile options.
Multi-word character string labels or modifiers must have words
separated by underscores "_", not spaces.
Flag modifiers are 1 or 0 (or "true" or "false").
Files may be defined by relative paths.
Do not use "=" or ":" for defining modifiers.
A sample inputfile:
START
TEMPLATE_PDB templates/gb1.pdb
    IGNORE_DISULFIDE_FLAG 1
JOBTYPE monte_carlo
    RUNTIME 10
    LOGFILE_FLAG false
FORCEFIELD_FILE energy_function/forcefield
    SASA_FLAG 0
    WEIGHT_TORSION 0.556
    pH 4.5
    LIGAND malate.param
    LOOKUP_TABLE_DIRECTORY lookup_tables/gb1
    OTHER_RESIDUES neighbors
OUTPUT_PREFIX designed_malatase
END
VARIABLE_POSITIONS
3 all
5 hydrophobic
7 Q.
20 A,L,V,W.
26 all
30 E,F,K,R.
34 all
39 basic
43 polar
52
54 acidic
END
FIXED_POSITIONS
22
16
48
END
This file uses the file templates/gb1.pdb as a template. Disulfide bonds
in the structure will be ignored, and their cysteines treated as
un-crosslinked (IGNORE_DISULFIDE_FLAG 1). The energy function and
forcefield are described in the file energy_function/forcefield. The
energy function modifiers listed in this inputfile supersede those listed
in the forcefield files. In this example, the solvent-accessible
surface-area dependent energies are not calculated (SASA_FLAG 0), and the
weight on torsional energies is altered (WEIGHT_TORSION 0.556). The
electrostatics should behave as if the pH is 4.5 (pH 4.5). There is also
a free-moving ligand described in a ligand parameter file (LIGAND
malate.param). The rotamer optimization is to use Monte Carlo simulated
annealing (JOBTYPE monte_carlo), and must be completed within 10 minutes
of starting the optimization method (RUNTIME 10). A logfile should not be
written (LOGFILE_FLAG false). The pair energy lookup table for these
calculations will be stored in the directory lookup_tables/gb1.
In addition to the positions explicitly listed, non-listed neighboring
positions will be allowed to undergo changes in rotamer conformation
(OTHER_RESIDUES neighbors). The variable positions are permitted
different compositions. Position 52 lists no residue types, so by default
it is set to the residue from the template structure. There are three
positions that are defined as fixed. Defining these as fixed supersedes
any earlier definitions for these positions.
The keywords (in caps here), with the exception of START and END, do not
have to be in all-capital letters. However, using all-caps for the
keywords does distinguish them from the user-defined parameters.
Similarly, indentation for modifiers is not necessary, but does make
things clearer.
Each of the sections and their options will be discussed in greater
detail in the relevant sections of the manual.
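The keyword and flag rules above can be illustrated with a minimal sketch. This is not EGAD's own parsing code: the helper names to_upper, keyword_matches, and parse_flag are hypothetical, and the sketch assumes that START and END must appear in all caps and that the words "true" and "false" are written in lower case, which is one reading of the rules stated in this section.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: upper-case a copy of a word so that
// "jobtype", "JobType" and "JOBTYPE" compare equal.
std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Assumption: START and END must match exactly (all caps);
// every other keyword is matched without regard to case.
bool keyword_matches(const std::string& word, const std::string& keyword) {
    if (keyword == "START" || keyword == "END") return word == keyword;
    return to_upper(word) == to_upper(keyword);
}

// Flag modifiers are 1 or 0 (or "true" or "false").
// Returns false, leaving flag untouched, for any other value.
bool parse_flag(const std::string& value, bool& flag) {
    if (value == "1" || value == "true") { flag = true; return true; }
    if (value == "0" || value == "false") { flag = false; return true; }
    return false;
}
```

With these helpers, a line such as "logfile_flag false" would be accepted as the LOGFILE_FLAG keyword with the flag cleared, while "end" would not be accepted as END.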
3.2 Sequence position strings for input and output
The sequence position formats described here are valid for the
VARIABLE_POSITIONS and FIXED_POSITIONS sections for rotamer optimization
jobs. They are also valid for the related FIX and FLOAT options for
JOBTYPE minimize, as well as for the INCLUDE and EXCLUDE options for
JOBTYPE ideal_backbone_geometry.
This same format is used in output files when a sequence position is
required (output pdb files, of course, are in the standard PDB format).
For a single-chain template structure, the sequence position is taken
directly from the pdb file.
For example:
52 # position 52
Some structures have non-standard "wacky" numbering, in which a letter
is appended to a sequence position number; this is especially true for
serine protease structures. Often, multiple contiguous residues will
have the same numerical sequence position, but different letter
suffixes. This is used to force analogous residues in related proteins
to have the same sequence position. These are treated just like
conventional numbering.
For example:
51 # position 51
52A # position 52A
52B # position 52B
53 # position 53
For multichain structures, the seq_position field should be a string:
numerical_seq_position_chainID_letter.
For example:
23H # position 23 on chain H
11B # position 11 on chain B
If "wacky" numbering is present in a multichain structure, a similar
format is used.
For example:
51A # position 51 on chain A
52AA # position 52A on chain A
52BA # position 52B on chain A
53A # position 53 on chain A
100H # position 100 on chain H
101AH # position 101A on chain H
101BH # position 101B on chain H
101CH # position 101C on chain H
102H # position 102 on chain H
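The formats above leave one ambiguity that a parser has to resolve: "52A" means position 52A in a single-chain template but position 52 on chain A in a multichain one. The following sketch shows one way to handle this by passing the chain count in explicitly. It is an illustration only; the SeqPosition struct and parse_seq_position function are hypothetical names, not EGAD's data structures, and the sketch assumes positive numbers, single-letter suffixes, single-letter chain IDs, and that a chain ID is mandatory for multichain templates.

```cpp
#include <cctype>
#include <string>

// Hypothetical record for one parsed sequence position.
struct SeqPosition {
    int number = 0;        // numerical sequence position, e.g. 52
    char insertion = ' ';  // letter suffix from "wacky" numbering, ' ' if absent
    char chain = ' ';      // chain ID, ' ' for single-chain templates
};

// Parse strings such as "52", "52A", "23H" or "101AH".
// The multichain flag decides how a single trailing letter is read:
// as a numbering suffix (single chain) or as a chain ID (multichain).
// Returns false if the string does not fit the expected pattern.
bool parse_seq_position(const std::string& text, bool multichain, SeqPosition& out) {
    std::size_t i = 0;
    while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) ++i;
    if (i == 0 || i > 9) return false;  // needs a leading number that fits in an int

    const std::string letters = text.substr(i);
    for (char c : letters) {
        if (!std::isalpha(static_cast<unsigned char>(c))) return false;
    }

    SeqPosition pos;
    pos.number = std::stoi(text.substr(0, i));
    if (multichain) {
        if (letters.size() == 1) {
            pos.chain = letters[0];
        } else if (letters.size() == 2) {
            pos.insertion = letters[0];
            pos.chain = letters[1];
        } else {
            return false;  // assumption: chain ID is mandatory here
        }
    } else {
        if (letters.size() > 1) return false;
        if (letters.size() == 1) pos.insertion = letters[0];
    }
    out = pos;
    return true;
}
```

Under these assumptions, "101AH" with multichain set yields number 101, suffix A, chain H, while "52A" with multichain cleared yields number 52 with suffix A and no chain.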
3.3 Comments
Comments may be included in EGAD inputfiles. All text above the START
line is ignored. Within the parameter section, comment lines must begin
with a '# ' or '! '
(akin to C-shell and X-PLOR scripts respectively). Comment lines are
also permitted after an END and before the start of the next section.
Comment lines are NOT permitted in the variable-positions or
fixed-positions sections. In every section, however, comments are
permitted on parse-able lines following the data.
For example:
# This is the same inputfile, with comments by Navin Pokala
START
TEMPLATE_PDB templates/gb1.pdb
    IGNORE_DISULFIDE_FLAG 1 # I don't think this protein has disulfides
JOBTYPE monte_carlo
    RUNTIME 10 ! I'm in a hurry
    LOGFILE_FLAG false
FORCEFIELD_FILE energy_function/forcefield
    SASA_FLAG 0 ! didn't include SASA because it's slow
    WEIGHT_TORSION 0.556
    pH 4.5
    LIGAND malate.param # from Mark Voorhies
    LOOKUP_TABLE_DIRECTORY lookup_tables/gb1
    OTHER_RESIDUES neighbors
OUTPUT_PREFIX designed_malatase
END
# end parameter section
! start var-pos section
VARIABLE_POSITIONS
3 all
5 hydrophobic
7 Q.
20 A,L,V,W.
26 all
30 E,F,K,R. # based on phage display data
34 all
39 basic ! near malate carboxylate
43 polar
52
54 acidic
END
FIXED_POSITIONS # keep these fixed; catalytic residues
22
16 # catalytic thr
48
END
# the end of this inputfile
3.4 OUTPUT_PREFIX
Any output files produced, including logfiles, are saved to the name
defined by the OUTPUT_PREFIX line.
OUTPUT_PREFIX path/output_prefix # path is optional
From the example above, the line:
OUTPUT_PREFIX designed_malatase
means that the log files will be saved to designed_malatase.log, the list of solutions
saved to designed_malatase.out, and the
final structure and energies saved to designed_malatase.pdb.
The prefix may also be a path (either full or relative). For example,
the following are also valid:
OUTPUT_PREFIX my_data/designed_malatase # final structure in my_data/designed_malatase.pdb
OUTPUT_PREFIX /usr/data/test123/designed_malatase # final structure in /usr/data/test123/designed_malatase.pdb
If path/output_prefix.pdb already exists,
the OUTPUT_PREFIX is changed to path/output_prefix_1. If path/output_prefix_1.pdb
exists, the OUTPUT_PREFIX is changed to path/output_prefix_2, and so on. A WARNING message is printed to stdout, along with the new OUTPUT_PREFIX.
If an OUTPUT_PREFIX is not defined, a default prefix is derived from the
name of the template pdb file. For example,
TEMPLATE_PDB templates/gb1.pdb
results in the default
OUTPUT_PREFIX gb1_egad
The resulting files are saved to the directory the job was launched
from.
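The prefix rules in this section (the default name and the _1, _2, ... renaming when a .pdb file already exists) can be sketched as follows. This is an illustration under assumptions, not EGAD's implementation: the function names default_output_prefix and unique_output_prefix are hypothetical, "/" is assumed to be the path separator, and the file-existence check is passed in as a callback so the logic can be tried without touching the disk. Printing the WARNING message is left to the caller.

```cpp
#include <functional>
#include <string>

// Default prefix when OUTPUT_PREFIX is absent: the template file name without
// its directory or ".pdb" extension, plus "_egad"
// (templates/gb1.pdb -> gb1_egad).
std::string default_output_prefix(const std::string& template_pdb) {
    const std::size_t slash = template_pdb.find_last_of('/');
    std::string name = (slash == std::string::npos) ? template_pdb
                                                    : template_pdb.substr(slash + 1);
    const std::string ext = ".pdb";
    if (name.size() > ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0) {
        name.erase(name.size() - ext.size());
    }
    return name + "_egad";
}

// If prefix.pdb exists, try prefix_1, prefix_2, ... until a free name is found.
// file_exists is a caller-supplied check (hypothetical; e.g. a wrapper around stat).
std::string unique_output_prefix(const std::string& prefix,
                                 const std::function<bool(const std::string&)>& file_exists) {
    if (!file_exists(prefix + ".pdb")) return prefix;
    for (int n = 1;; ++n) {
        const std::string candidate = prefix + "_" + std::to_string(n);
        if (!file_exists(candidate + ".pdb")) return candidate;
    }
}
```

For instance, if designed_malatase.pdb and designed_malatase_1.pdb both exist, unique_output_prefix("designed_malatase", ...) returns designed_malatase_2, matching the renaming sequence described above.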
3.5 Description of the input_stuff.cpp: input_stuff function
Be warned: this is an incredibly bloated and ugly function! But it
works. As described in the main function section, the input_stuff.cpp:
input_stuff function initializes the PROTEIN object sent to it. It sets
defaults for several extern variables (energy function and job
parameters, array sizes). input_stuff parses each line of the inputfile
parameter section and assigns the relevant variables (discussed in later
sections). It calls readpdbfile.cpp: readpdbfile to read the pdb file.
The first call to readpdbfile is used to set the MAX_RESIDUES and
MAX_ATOMS array sizes; a second call actually reads in the coordinates
(see pdb file section). input_stuff calls read_forcefield.cpp:
read_forcefield to read in the forcefield, residue parameters, ligand
parameters, and rotamer library (see energy function files section).
Finally, input_stuff calls input_stuff.cpp: input_VARIABLE_POSITION to
parse the variable-position and fixed-position data. These functions set
up VARIABLE_POSITION and CHOICE arrays, which may then be used as
templates for setting up the pair energy lookup table and CHROMOSOMEs
for representing candidate structures in sequence or dihedral space (see
variable positions section). For minimization-related jobtypes,
input_stuff parses related data (see minimization and loop-grafting
sections).
Two useful functions for parsing inputfiles are in io.cpp: word_count
and extract_words. Both functions parse a line until a "#" or "!"
character is reached; anything beyond this comment marker is ignored.
word_count returns the number of white-space and/or tab-delimited
entries. extract_words places each white-space and/or tab-delimited
entry into its own character array, permitting easier parsing by the
calling function.
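The behavior described for these two functions can be sketched in a few lines of modern C++. This is not the code in io.cpp: the actual functions fill character arrays and their signatures are not shown in this manual, so the signatures below (a std::string in, a std::vector of std::string out) are assumptions chosen for brevity.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a line into whitespace- or tab-delimited words, ignoring everything
// from the first '#' or '!' onward (the comment marker and what follows it).
std::vector<std::string> extract_words(const std::string& line) {
    // find_first_of returns npos when no marker is present,
    // in which case substr keeps the whole line.
    const std::string data = line.substr(0, line.find_first_of("#!"));
    std::istringstream stream(data);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) words.push_back(word);
    return words;
}

// Number of words on the line before any comment marker.
std::size_t word_count(const std::string& line) {
    return extract_words(line).size();
}
```

Applied to the line "RUNTIME 10 ! I'm in a hurry", this sketch yields the two words RUNTIME and 10; a line holding only a comment yields none.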