back to Table of Contents
18. Analyses and optimizations
that require
consideration of multiple structures/states
A powerful feature of EGAD is its ability to consider multiple
structures or states simultaneously for a given sequence. With
multistate design, one can consider the bound and free states for a
complex simultaneously, multiple conformations, or both,
simultaneously. Multistate optimization, as described by Havranek and
Harbury, is a direct method to design specificity and affinity (Havranek and
Harbury 2003). The goal, as described in section
18.5, is to optimize energy of the desired state, while
simultaneously destabilizing undesired states.
Three examples are given and discussed below. One example tries to
redesign glutamine binding protein so it has higher affinity
(specificity) for glutamate. Another example tries to find mutations
that stabilize the closed (ligand-bound) form of maltose-binding
protein (MBP). The third example tries to switch the peptide binding
specificity of the SH3 domain of src.
While these have not been experimentally tested, and the details of the
designs (such as positions allowed to move) have not been optimized,
these examples demonstrate possible applications of multistate design.
The topics addressed by these examples are areas of active research in
the lab; you may wish to consider forming an active collaboration.
18.1 Target and competitor structures
There are three types of structures considered in design. One is the
template or target that we want the sequence to adopt. Another type of
structure are non-target structured competitors. Finally, there are
non-target, non-structured competitors, such as the unfolded,
specificity, and solubility reference states (discussed in ). Standard
design considers the target and the non-target, non-structured
competitors. The non-target, non-structured competitors are simple to
implement, since they can be represented as baseline corrections that
can be added or subtracted. Non-target structured competitors, on the
other hand, require explicit structures and optimization of candidate
sequences on those structures.
EGAD can consider an arbitrary number of structured competitors. These
are defined in the inputfile as follows:
START
....
TARGET target_structure.pdb
# the structure we want
COMPETITOR some_non_target_structure.pdb
# competitor 1
COMPETITOR another_non_target_structure.pdb # competitor 2
...
END
VARIABLE_POSITIONS
...
END
FIXED_POSITIONS
...
END
If the competitor is the free form of a complex, while the target is a
bound form, the competitor can be defined simply as:
FREE
structure_of_free_structure.pdb
# equivalent to COMPETITOR structure_of_free_structure.pdb solubility
(see below)
Multistate optimization and analysis jobs must be parallelized. The
number of available CPUs must equal or exceed the number of states
(target and competitors). The parallelization may be defined as
described in the parallelization section.
The final optimal sequence is threaded over all the structures, and
saved as OUTPUT_PREFIX.xxx.pdb, where xxx describes the structure (competitor n,
target, etc).
18.2 Defining positions to optimize and/or move
Positions to be optimized for multistate optimization are defined
in exactly the same manner as they are for conventional
single-state optimization. The SEQUENCE,
VARIABLE_POSTIONS, FIXED_POSITIONS, and OTHER_POSITIONS
modifiers described above may be used in exactly the same way.
In all cases, the competitors and target structures must have the same
numbering for residues. Residues that are permitted to move (either in
sequence or conformation) in any one of the structures must also be
able to move in the other structures.
18.3 MULTISTATE_JOBTYPE
There are three optimization types available for multistate
optimization. These are indicated by MULTISTATE_JOBTYPE
(to distinguish them from the standard single-state JOBTYPE entries described earlier). Premature
bail-out is signaled by creating the escape file listed in the logfile,
similar to that described above for single-state optimizations.
18.3.1 MULTISTATE_JOBTYPE
scan
This job is the multistate analog to singlestate
scanning mutagenesis. Positions to be scanned are defined in the
inputfile as described above for the single-state version with SCANNING_MUTAGENESIS.
This job will, in turn, place each of the defined amino acids at each
defined position. These are indicated in the logfile by scan entries. The optimal amino acid at each
position is listed with an optimal_res
entry.
Note that since the non-structured reference states for cysteine have
not been calculated, the energies for positions that contain cysteine
in the template sequence are unreliable.
18.3.2 MULTISTATE_JOBTYPE
ga
This job is the multistate analog to the singlestate
GA discussed above. While this GA, unlike the singlestate,
optimizes on sequence space, all the options described above for
singlestate GA, including sequence filters, are applicable.
The default parameters for multistate GA are different from
single-state GA. The default ga_population
for multistate GA is equal to twice the number of variable (more than
one amino acid choice) positions. The default number of isoenergetic
generations for convergence (ga_convergence)
is 20. Independent GAs are run (default number_ga_cycles 2). The first GA includes the
user-defined initial sequence in the starting population. The remainder
use completely random starting populations. The top solutions from the
independent GAs are used to seed a final GA.
These parameters have not been optimized. For some problems, it may be
necessary to increase the population size or convergence criterion. An
especially important variable worth examining is the number of
independent GAs.
18.3.3 MULTISTATE_JOBTYPE
quench
This job is the multistate analog to the heuristic
quench described above for single-state optimization. This
method finds a local minimum near the user-defined initial sequence
(discussed below). Briefly, a variable position (more than one amino
acid choice) is picked at random. All permitted residues are placed at
the position. The lowest energy choice is kept, and the optimal
sequence updated. This is repeated until all variable positions are
sampled without a drop in the energy.
18.3.4 Initialization of optimization
MULTISTATE_JOBTYPE ga
and quench require the user to define an
initial sequence that is used as a baseline for the scoring function
(described above) and/or as a starting point. These may be defined as
follows:
MULTISTATE_JOBTYPE
jobtype
option
There are four options:
template - use the sequence of the template
target pdb file
optimized - use the optimal sequence
resulting from single-state optimization on the target state
scmf - the target-state optimized sequence
is used as an initial sequence; use self-consistent mean-field
optimization on the target state to bias the sequence space used for GA
random - a random sequence is used
For example:
MULTISTATE_JOBTYPE
ga
template
does a GA, and includes the template sequence in the initial
population.
Or:
MULTISTATE_JOBTYPE
quench
optimized
finds the optimal sequence for the target structure, and then
uses the quench method to maximize the gap
between the target and competitor states.
The optimization of the target state is done with MC_GA
with RUNTIME 60. Following the
optimization on the target state (saved in OUTPUT_PREFIX.target_state_optimization.pdb),
the sequence is threaded over all the structures and saved as OUTPUT_PREFIX.target_state_optimized_sequence.xxx.pdb,
where xxx describes the structure
(competitor n, target, etc).
If the OUTPUT_PREFIX.target_state_optimization.pdb
file exists at the start of the run, the sequence will be
extracted from it, instead of re-doing the target-state optimization
run.
After the multistate optimization is complete, the final sequence is
threaded over all the structures, and saved as OUTPUT_PREFIX.xxx.pdb,
where xxx describes the structure
(competitor n, target, etc).
18.4 Slave parameters, MASTER_OPTIMIZATION_TIME,
and early termination
The user may define the SLAVE_JOBTYPE and SLAVE_RUNTIME. SLAVE_JOBTYPE
is the rotamer-optimization algorithm (default FASTER)
used by the slaves.
SLAVE_RUNTIME is the amount of time (in
decimal minutes) that each slave process may use for each sequence
(default 2.0 minutes). For non-ligand
problems, this default works well. However, if a ligand with thousands
of ligamers is included, an increased SLAVE_RUNTIME
may be required. For these problems, this parameter should be optimized
using single-sequence rotamer optimization on the structure of
interest; find a RUNTIME that usually
results in the optimal solution (based on unlimited time).
RUNTIME (without the SLAVE_
prefix) will give the runtime of the process that calculates the
structure of the initial sequence (default 60 minutes).
MASTER_OPTIMIZATION_TIME is analogous to RUNTIME for single-state optimization:
MASTER_OPTIMIZATION_TIME
n
(n = time in decimal minutes)
When this option is defined, the program periodically
checks the time it has spent. If MASTER_OPTIMIZATION_TIME
minutes have elapsed (including the time required for
target-state optimization), the program will tie up loose ends and
terminate.
Early termination may also be signaled by creating the escape file
defined in the logfile, in a manner similar to the escape file
described above for single-state optimizations.
18.5 Objective functions for multistate design
In multistate optimization, the energy being optimized is the
difference between the target state and the boltzmann-weighted average
of all the non-target states, structured and non-structured (Figure 18.5.1).
Depending on the problem, different structures may need to be
considered for stability, specificity, and solubility. By default, the
target structure is the one whose unfolded states, non-native states,
and aggregates are considered as non-structured competitors. However,
there are scenarios where this may not be the case. For example, if one
wishes to optimize a protein complex, the free states should have their
solubility optimized to prevent the monomers from aggregating, and
their folding stability optimized.
To define a particular structure for the stability, specificity, or
solubility reference state in the inputfile:
...
COMPETITOR
some_non_target_structure.pdb specificity
# specificity calc'd for this
COMPETITOR some_other_non_target_structure.pdb solubility # solubility
calc'd for this
COMPETITOR another_non_target_structure.pdb stability # fold stability
calc'd for this
...
If a FREE structure is defined, the
solubility and folding stability of this state is considered.
The energies in the logfile are given with respect to a sequence in
which all the user-defined or rotamer-moving positions are replaced
with alanine. This arbitrary reference subtracts away internal backbone
and fixed atom energies that differ between structures. This baseline
subtraction is also required to prevent overflow errors resulting from
exponentials of large numbers, but should play no difference otherwise
in the optimization. The baseline structures are saved as OUTPUT_PREFIX.baseline.xxx.pdb, where xxx is the relevant structure (target, competitor.n).
The energy that is optimized is the difference between the
boltzmann-weighted average energy of the non-target states and the
energy of the target state (ddG_gap in the
logfile; Figure 18.5.1). This
difference
is increased by destabilizing the competitor states. Unfortunately, in
many cases, the target state is severely destabilized, eventhough the
competitors are destabilized even more. To prevent these situations, if
the target state is predicted to be destabilized by more than 2*ENERGY_UNCERTAINTY kcal/mol (ENERGY_UNCERTAINTY
= RT (0.596 kcal/mol at 25
C) by default) with respect to the initialization
sequence, a penalty proportional to VDW_CUTOFF
(default 1000) is added. If a candidate
sequence destabilizes the folded state of the defined structure by more
than 2*ENERGY_UNCERTAINTY kcal/mol, a
similar penalty is applied. The energy including these penalties is the
actual energy that is optimized (score in
the logfile). This ensures that candidate solutions only modestly
destabilize the target state.
For many cases, this highly conservative default window may be too
stringent, so this is an additional variable that should be examined
and varied. There may be quite a bit of stability that can be
sacrificed in order to gain specificity or affinity. ENERGY_UNCERTAINTY can be defined in the
inputfile as follows:
ENERGY_UNCERTAINTY
0.596 # uncertainty in energies; default is RT
18.5.1 Custom multistate objective functions
Functions that deal with multistate optimization are primarily in multistate.cpp. The main
function calls multistate.cpp: multistate_design,
which in turn creates and launches foremen for generating lookup
tables, and for rotamer optimizations. The foremen jobs go from main to rotamer_calc_foreman.cpp:
rotamer_calc_foreman, as discussed in more detail in the parallelization
section.
As described above for sequence restraints, custom objective functions
for multistate optimization may be written by the user. These functions
are in multistate_objective_function.cpp.
The function called by multistate.cpp:
CHROMOSOME_array_to_fitness (the objective function wrapper) is multistate_objective_function.cpp:
multistate_objective_function. This function may be modified by
the user. The score, including penalties, is returned, the ENERGY *energy argument contains the energy
components of the target structure of the CHROMOSOME
chr to be scored. This energy has already had the baseline
energies from the all-ala-at-moving-positions structures subtracted.
In the current implementation, the first time this function is called,
the ENERGY *energy is saved; it is assumed
that the CHROMOSOME *chr argument for the
first time this function is called is that of the template or some
other baseline sequence that candidate sequences are compared to for
ensuring that the target state is not too heavily destabilized. In
user-modified variations, this may be read from a file, etc. target_non_target_ensemble_energy_gap actually
calculates the energy difference between the non-target and target
states (∆Ggap from above). Based on the energy components of
the target structure (ddG_pseudo, the
predicted folding stability and E_total,
the energy of the target state, including non-structured reference
states) and comparison to the energy of the baseline sequence, the
penalty discussed above is applied.
If a file needs to be read by a custom objective function, this file (MULTISTATE_OBJ_FUNCT_INPUTFILENAME in the code)
may be defined in the inputfile by including: MULTISTATE_OBJECTIVE_FUNCTION_FILE
custom_objective_function_file
As an example, multistate_objective_function.cpp:
binding_protein_objective_function is listed (commented out).
This example function is probably not useful, except as a template for
generating additional user-defined functions.
18.6 Examples of multstate design
18.6.1 Designing binding specificity: change
glutamine-binding
protein to bind glutamate
This example redesigns glutamine (LLN) binding protein so it has higher
affinity (specificity) for glutamate (LLU). It takes into account not
only
the structure of the protein bound to either ligand, but also the
structures of the unbound states (examples/multistate/glu_bp/design_glu_bp.input).
...
TARGET templates/gbp_bound_LLU.pdb
COMPETITOR templates/gbp_bound_LLN.pdb
FREE templates/gbp_unbound.pdb
...
As discussed above, defining a structure as FREE
ensures that the stability and solubility of this state are considered;
the TARGET structure is still the
optimization target. The FREE structure
also includes both ligands, displaced far from the protein. The TARGET and COMPETITOR
structures include not only the relevant ligand bound to the protein,
but also the other ligand displaced far from the protein,
For simplicity, both ligands are kept fixed to the configuration of
gln/LLN in the structure. In addition, only the residues that contact
the ligand atoms are allowed to change identity and/or rotamer (gbp_binding_residues.detailed_floating_positions_list).
More realistic designs would allow the ligands to move, and permit more
residues to change identity or rotamer.
The master first optimizes the sequence of the target structure by
launching a conventional single-state MC_GA optimization. This is
defined in the master inputfile (design_glu_bp.input):
MULTISTATE_JOBTYPE
GA
optimized
The target-optimized sequence is in design_glu_bp.target_state_optimization.pdb.
This sequence is used to seed a GA population. The GA is run to find
mutations that preferentially stabilize the target state with respect
to the other states. As described above, the final sequence is threaded
over all the structures. The final sequence threaded on the target
(glu/LLU bound) structure is design_glu_bp.target.pdb.
The results are shown and discussed in Figure 18.6.1.1.
18.6.2 Designing conformational specificity:
lock maltose-binding
protein in the closed state
This example tries to find mutations that stabilize the closed
(ligand-bound) form of maltose-binding protein. As discussed above, MBP
undergoes a hinge motion upon binding ligand (Figure
4.3.1). First, saturation mutagenesis is performed on all
positions that undergo a change in SASA, but are not neighbors of the
MAL ligand or neighbors of those neighbors (examples/multistate/mbp/define_variable_positions).
Variable residues are identified by using JOBTYPE
sasa_calc on the apo and bound structures. Neighbors of
MAL-binding residues are identified using JOBTYPE
list_variable_positions, with the MAL-binding residues defined
as described above (ligands/find_maltose_neighbors/find_maltose_neighbors.input);
these, as well as the ligand-binding residues and the ligand will be
kept fixed.
The saturation mutagenesis is done with examples/multistate/mbp/mbp_closed_scan.input.
The master places all residues at each position in turn, and calculates
the energy gap between the apo and bound (closed) states.
Next, a GA is done in which each variable position is restricted to
either its wildtype or optimal residue found in the scanning
mutagenesis; this GA is restrained to find the optimal combination of
six mutations via the MAX_MUTATIONS
modifier described above (mbp_closed_ga.input
and output files).
18.6.3 Design of a specific protein-protein
interface: src SH3
domain peptide-binding site
The wildtype src SH3 domain (referred to here as srcSH3) binds the
RALPPLPRY peptide. This design tries to change the specificity to bind
the EALPPEPRY peptide. This example is very similar to the glu/gln
specificity switch discussed above, except that, in this case, the
"ligand" is a peptide.
Moving positions are defined in examples/multistate/SH3_peptide/define_moving_residues/.
The interface residues are identified by JOBTYPE separate_complex
(SH3.RALPPLPRY.separate_complex.input).
These
residues, as well as the nearest-neighbors are allowed to change
identity (SH3.RALPPLPRY.find_interface_near_neighbors.input).
Since this is a simple demonstration, all other residues are kept fixed
to their pdb structure rotamer conformations.
The actual design is done with examples/multistate/SH3_peptide/SH3.EALPPEPRY.design.input.
As discussed above for the glu/gln example, the master first optimizes
the sequence for the bound target, and then runs a multistate GA,
simultaneously optimizing absolute binding affinity with respect to the
free unbound state, as well as the relative affinity (specificity) of
the bound target (EALPPEPRY) with respect to the original peptide
(RALPPLPRY).
As shown in Figure
18.6.3.1, the target-optimized sequence
does not contain any basic residues near the glutamate substitutions in
the target peptide. While this may seem surprising at first glance,
these results are consistent with prior work on coiled-coil
specificity, both experimental and theoretical, which suggest that
ion-pairs are in fact destabilizing, and serve primarily as specificity
determinants (O'Shea
et al 1992; Hendsch et
al 1994). In contrast, in the multistate design, several basic
residues are elicited near the peptide glutamates. This sequence is
predicted to have a higher absolute energy for the target peptide than
the single-state design. However, unlike the single-state design, this
sequence has a high energy for the wildtype competitor peptide and for
the unbound state (Figure 18.6.3.1c). The
predicted affinity (difference between bound and free) is larger, as is
the specificity (affinity difference between the target and competitor).
back to Table of Contents