
3. Inputfiles

3.1 General layout and rules

Input files for EGAD have the following general layout:

# Comments (optional)

START    (required)               # comments (optional)
TEMPLATE_PDB     filename.pdb     (required)       ! comments (optional)
    template pdb modifiers (optional)
JOBTYPE type_of_job (optional if VARIABLE_POSITIONS are defined for single-structure job)
    optimization method modifiers (optional)
FORCEFIELD_FILE  forcefield_file (required)
    energy function modifiers (optional)
names of other files or directories (as needed)
search space modifiers (optional)
OUTPUT_PREFIX output_prefix (optional but recommended)
END (optional here for rotamer optimization jobs; required for all others)
VARIABLE_POSITIONS (required for rotamer optimization jobs)
         seq_position     permitted residuetypes   (shortcuts also permitted)
                  ...
END (required at end of VARIABLE_POSITIONS block)
FIXED_POSITIONS (optional)
         seq_position
                  ...
END (required at end of FIXED_POSITIONS block and at end of file)


An inputfile has three sections. The first (the parameter section) describes the template structure, the energy function, the desired job, and how to run it. For jobs that do not move atoms, an END line is required after the parameter section. For jobs that do move atoms, this END is optional, provided a variable-positions section immediately follows. If an atom-moving job does not specify variable positions, END is required here, and the default is to move all positions.

The second and third sections describe the search space of the problem in greater detail. The second section (variable-positions section) lists the positions that are allowed to vary, and what they may vary to; this section is required for jobs that move protein atoms. The third section (fixed-positions section) lists the positions that must be held fixed; this section is optional.

Each of these blocks must end with END. Every valid EGAD inputfile MUST begin with START, contain entries for TEMPLATE_PDB, FORCEFIELD_FILE, and JOBTYPE, and end with END.

Modifiers are described in the relevant sections; a complete list with default values is given in the appendix: inputfile options.

Multi-word character string labels or modifiers must have their words separated by underscores ("_"), not spaces.

Flag modifiers are 1 or 0 (or "true" or "false").

Files may be defined by relative paths.

Do not use "=" or ":" for defining modifiers.
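
To illustrate the flag rule above, the following is a minimal sketch of how a flag value might be interpreted. It is not taken from the EGAD source: parse_flag is a hypothetical helper, and accepting "true"/"false" in any letter case is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper (not part of EGAD): interpret a flag modifier value.
// Accepts "1"/"0" and "true"/"false"; case-insensitivity is an assumption.
std::optional<bool> parse_flag(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (value == "1" || value == "true") return true;
    if (value == "0" || value == "false") return false;
    return std::nullopt;  // e.g. "=1" or "yes" is not a valid flag value
}
```

Returning an empty optional for anything else gives the caller a place to report malformed entries such as SASA_FLAG=0, which the rules above do not allow.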

A sample inputfile:

START
TEMPLATE_PDB              templates/gb1.pdb
    IGNORE_DISULFIDE_FLAG 1
JOBTYPE monte_carlo     
    RUNTIME 10
    LOGFILE_FLAG false
FORCEFIELD_FILE           energy_function/forcefield
    SASA_FLAG 0
    WEIGHT_TORSION 0.556
pH 4.5
LIGAND malate.param
LOOKUP_TABLE_DIRECTORY lookup_tables/gb1
OTHER_RESIDUES neighbors
OUTPUT_PREFIX designed_malatase
END
VARIABLE_POSITIONS
    3 all
    5 hydrophobic
    7 Q.
    20 A,L,V,W.
    26 all
    30 E,F,K,R.
    34 all
    39 basic
    43 polar
    52
    54 acidic
END
FIXED_POSITIONS
    22
    16
    48
END


This file uses the file templates/gb1.pdb as a template. Disulfide bonds in the structure will be ignored, and their cysteines treated as un-crosslinked (IGNORE_DISULFIDE_FLAG 1). The energy function and forcefield are described in the file energy_function/forcefield. The energy function modifiers listed in this inputfile supersede those listed in the forcefield files. In this example, the solvent-accessible surface-area dependent energies are not to be calculated (SASA_FLAG 0), and the weight on torsional energies is altered (WEIGHT_TORSION 0.556). The electrostatics should behave as if the pH is 4.5 (pH 4.5). There is also a free-moving ligand described in a ligand parameter file (LIGAND malate.param). The rotamer optimization is to use Monte Carlo simulated annealing (JOBTYPE monte_carlo), and must be completed within 10 minutes of starting the optimization method (RUNTIME 10). A logfile should not be written (LOGFILE_FLAG false). The pair energy lookup table for these calculations will be stored in the directory lookup_tables/gb1. In addition to the positions explicitly listed, non-listed neighboring positions will be allowed to undergo changes in rotamer conformations (OTHER_RESIDUES neighbors). The variable positions are permitted different compositions. Position 52 lists no residue types, so by default it is set to the residue from the template structure. Three positions are defined as fixed. Defining these as fixed supersedes any earlier definitions for these positions.

The keywords (in caps here), with the exception of START and END, do not have to be in all-capital letters. However, using all-caps for the keywords does distinguish these from the user-defined parameters. Similarly, indentation for modifiers is not necessary, but does make things clearer.

Each of the sections and their options will be discussed in greater detail in the relevant sections of the manual.

3.2 Sequence position strings for input and output

The sequence position formats described here are valid for the VARIABLE_POSITIONS and FIXED_POSITIONS sections for rotamer optimization jobs. They are also valid for the related FIX and FLOAT options for JOBTYPE minimize, as well as for the INCLUDE and EXCLUDE options for JOBTYPE ideal_backbone_geometry. The same format is used in output files wherever a sequence position is required (output pdb files, of course, are in the standard PDB format).

For a single-chain template structure, the sequence position is taken directly from the pdb file.
For example:
52       # position 52

Some structures have non-standard "wacky" numbering, in which a letter is appended to a sequence position number; this is especially true for serine protease structures. Often, multiple contiguous residues will have the same numerical sequence position, but different letter suffixes. This is used to force analogous residues in related proteins to have the same sequence position. These are treated just like conventional numbering.
For example:
51       # position 51
52A      # position 52A
52B      # position 52B
53       # position 53

For multichain structures, the seq_position field should be a single string: the numerical sequence position followed immediately by the chain ID letter.
For example:
23H      # position 23 on chain H
11B      # position 11 on chain B

If "wacky" numbering is present in a multichain structure, a similar format is used.
For example:
51A      # position 51 on chain A
52AA     # position 52A on chain A
52BA     # position 52B on chain A
53A      # position 53 on chain A
100H     # position 100 on chain H
101AH    # position 101A on chain H
101BH    # position 101B on chain H
101CH    # position 101C on chain H
102H     # position 102 on chain H
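
The formats above can be read with a small parser. The sketch below is illustrative only and not EGAD's implementation: SeqPosition and parse_seq_position are hypothetical names, residue numbers are assumed to be non-negative, and the caller is assumed to know already whether the template is multichain. That last point matters because a string such as 52A means "position 52A" for a single-chain template but "position 52 on chain A" for a multichain one.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical container for a parsed sequence position string.
struct SeqPosition {
    int number = 0;
    char insertion = '\0';  // letter suffix such as 'A' in 52A; '\0' if absent
    char chain = '\0';      // chain ID; '\0' for single-chain templates
};

// Sketch: split "52", "52A", "23H" or "101BH" into number, suffix letter, chain.
std::optional<SeqPosition> parse_seq_position(const std::string& s, bool multichain) {
    std::size_t i = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
    if (i == 0 || i > 9) return std::nullopt;  // must start with a short run of digits

    SeqPosition p;
    p.number = std::stoi(s.substr(0, i));
    const std::string suffix = s.substr(i);
    for (char c : suffix)
        if (!std::isalpha(static_cast<unsigned char>(c))) return std::nullopt;

    if (multichain) {
        // Last letter is the chain ID; an optional letter before it is the suffix.
        if (suffix.size() == 1) {
            p.chain = suffix[0];
        } else if (suffix.size() == 2) {
            p.insertion = suffix[0];
            p.chain = suffix[1];
        } else {
            return std::nullopt;  // assumption: a chain ID is mandatory here
        }
    } else {
        if (suffix.size() == 1) p.insertion = suffix[0];
        else if (!suffix.empty()) return std::nullopt;
    }
    return p;
}
```

With multichain set to true, "101BH" yields number 101, suffix B, chain H; with it set to false, "52A" yields number 52 with suffix A and no chain.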

3.3 Comments

Comments may be included in EGAD inputfiles. All text above the START line is ignored. Within the parameter section, comment lines must begin with a '# ' or '! ' (akin to C-shell and X-PLOR scripts, respectively), and comments may also follow the data on parse-able lines. Comment lines are permitted after an END and before the start of the next section. Comment lines are NOT permitted inside the variable-positions or fixed-positions sections, although comments may still follow the data on parse-able lines there.

For example:

# This is the same inputfile, with comments
by Navin Pokala

START
TEMPLATE_PDB              templates/gb1.pdb
    IGNORE_DISULFIDE_FLAG 1    # I don't think this protein has disulfides
JOBTYPE monte_carlo     
    RUNTIME 10                ! I'm in a hurry
    LOGFILE_FLAG false
FORCEFIELD_FILE           energy_function/forcefield
    SASA_FLAG 0    ! didn't include SASA because it's slow
    WEIGHT_TORSION 0.556
pH 4.5  
LIGAND malate.param      # from Mark Voorhies
LOOKUP_TABLE_DIRECTORY lookup_tables/gb1
OTHER_RESIDUES neighbors
OUTPUT_PREFIX designed_malatase
END
# end parameter section
! start var-pos section
VARIABLE_POSITIONS
3 all
5 hydrophobic
7 Q.
20 A,L,V,W.
26 all
30 E,F,K,R.     # based on phage display data
34 all
39 basic        ! near malate carboxylate
43 polar
52
54 acidic
END
FIXED_POSITIONS  # keep these fixed; catalytic residues
22
16               # catalytic thr
48
END
# the end of this inputfile
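
The comment rules can be pictured with a short pre-processing pass. This is only a sketch under stated assumptions, not EGAD's own reader: strip_comment and data_lines_from_start are hypothetical names, and the pass does not enforce the rule that comment-only lines are forbidden inside the variable-positions and fixed-positions sections.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Drop a trailing comment: everything from the first '#' or '!' onward.
std::string strip_comment(const std::string& line) {
    return line.substr(0, line.find_first_of("#!"));
}

// Sketch: ignore everything above START, then keep only lines that still
// contain data once their comments have been removed.
std::vector<std::string> data_lines_from_start(const std::vector<std::string>& raw) {
    std::vector<std::string> out;
    bool started = false;
    for (const std::string& line : raw) {
        const std::string data = strip_comment(line);
        std::istringstream words(data);
        std::string first;
        if (!(words >> first)) continue;  // blank or comment-only line
        if (!started) {
            if (first != "START") continue;  // free text above START is ignored
            started = true;
        }
        out.push_back(data);
    }
    return out;
}
```

Applied to the commented example above, the unmarked "by Navin Pokala" line is skipped because it precedes START, and "RUNTIME 10 ! I'm in a hurry" is reduced to its data.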

3.4 OUTPUT_PREFIX

All output files produced, including logfiles, are named using the prefix defined by the OUTPUT_PREFIX line:
OUTPUT_PREFIX path/output_prefix         # path is optional

From the example above, the line:
         OUTPUT_PREFIX designed_malatase
means that the log files will be saved to designed_malatase.log, the list of solutions saved to designed_malatase.out, and the final structure and energies saved to designed_malatase.pdb.

The prefix may also be a path (either full or relative). For example, the following are also valid:
OUTPUT_PREFIX my_data/designed_malatase # final structure in my_data/designed_malatase.pdb
OUTPUT_PREFIX /usr/data/test123/designed_malatase # final structure in /usr/data/test123/designed_malatase.pdb

        
If path/output_prefix.pdb already exists, the OUTPUT_PREFIX is changed to path/output_prefix_1. If path/output_prefix_1.pdb exists, the OUTPUT_PREFIX is changed to path/output_prefix_2, and so on. A WARNING message is printed to stdout, along with the new OUTPUT_PREFIX.
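
The renaming rule above amounts to a simple search for the first unused suffix. The sketch below is illustrative and not EGAD's code: resolve_output_prefix is a hypothetical name, and the file-existence check is passed in as a predicate (an assumption made so that the example runs without touching the disk). It also omits the WARNING message.

```cpp
#include <functional>
#include <string>

// Sketch: return the prefix unchanged if prefix.pdb is free; otherwise try
// prefix_1, prefix_2, ... until candidate.pdb does not exist.
// 'exists' is an injected predicate (hypothetical) that reports whether a file exists.
std::string resolve_output_prefix(
        const std::string& prefix,
        const std::function<bool(const std::string&)>& exists) {
    if (!exists(prefix + ".pdb")) return prefix;
    for (int n = 1;; ++n) {
        const std::string candidate = prefix + "_" + std::to_string(n);
        if (!exists(candidate + ".pdb")) return candidate;
    }
}
```

In real use the predicate could wrap std::filesystem::exists; in a test it can be a lambda over a fixed set of names.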

If an OUTPUT_PREFIX is not defined, then a default filename is derived from the name of the inputted template pdb file. For example,
TEMPLATE_PDB     templates/gb1.pdb
results in the default
OUTPUT_PREFIX gb1_egad
These files are saved to the directory the job was launched from.
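
The default name can be derived as in the following sketch. default_output_prefix is a hypothetical name, and stripping only the final extension of the template filename is an assumption based on the gb1.pdb example.

```cpp
#include <filesystem>
#include <string>

// Sketch: drop the directory and the final extension of the template
// filename, then append "_egad" (e.g. templates/gb1.pdb -> gb1_egad).
std::string default_output_prefix(const std::string& template_pdb) {
    return std::filesystem::path(template_pdb).stem().string() + "_egad";
}
```

Because the directory part is discarded, the resulting files land in the directory the job was launched from, matching the behavior described above.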

3.5 Description of the input_stuff.cpp: input_stuff function

Be warned: this is an incredibly bloated and ugly function!! But... it works. As described in the main function section, the input_stuff.cpp: input_stuff function initializes the PROTEIN object sent to it. It sets defaults for several extern variables (energy function and job parameters, array sizes). input_stuff parses each line of the inputfile parameter section and assigns the relevant variables (discussed in later sections). It calls readpdbfile.cpp: readpdbfile to read the pdb file. The first call to readpdbfile is used to set the MAX_RESIDUES and MAX_ATOMS array sizes; a second call actually reads in the coordinates (see pdb file section). input_stuff calls read_forcefield.cpp: read_forcefield to read in the forcefield, residue parameters, ligand parameters, and rotamer library (see energy function files section). Finally, input_stuff calls input_stuff.cpp: input_VARIABLE_POSITION to parse the variable-position and fixed-position data. These functions set up VARIABLE_POSITION and CHOICE arrays, which may then be used as templates for setting up the pair energy lookup table and CHROMOSOMEs for representing candidate structures in sequence or dihedral space (see variable positions section). For minimization-related jobtypes, input_stuff parses related data (see minimization and loop-grafting sections).

Two useful functions for parsing inputfiles are in io.cpp: word_count and extract_words. Both functions parse a line until a "#" or "!" character is reached; anything beyond this comment marker is ignored. word_count returns the number of white-space and/or tab-delimited entries. extract_words places each white-space and/or tab-delimited entry into its own character array, permitting easier parsing by the calling function.
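
The behavior described for these two helpers can be approximated as follows. This is a re-implementation sketch, not the code in io.cpp: the signatures are assumptions, and std::string and std::vector stand in for the character arrays used by the original.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sketch: split a line into whitespace/tab-delimited words, ignoring
// everything from the first '#' or '!' comment marker onward.
std::vector<std::string> extract_words(const std::string& line) {
    std::istringstream in(line.substr(0, line.find_first_of("#!")));
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Sketch: number of data words on the line (0 for blank or comment-only lines).
std::size_t word_count(const std::string& line) {
    return extract_words(line).size();
}
```

A caller can use word_count to decide whether a line carries a keyword alone or a keyword plus value, and then read the individual entries from extract_words.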
