back to Table of Contents

18. Analyses and optimizations that require consideration of multiple structures/states

A powerful feature of EGAD is its ability to consider multiple structures or states simultaneously for a given sequence. With multistate design, one can consider the bound and free states for a complex simultaneously, multiple conformations, or both, simultaneously. Multistate optimization, as described by Havranek and Harbury, is a direct method to design specificity and affinity (Havranek and Harbury 2003). The goal, as described in section 18.5, is to optimize energy of the desired state, while simultaneously destabilizing undesired states.

Three examples are given and discussed below. One example tries to redesign glutamine binding protein so it has higher affinity (specificity) for glutamate. Another example tries to find mutations that stabilize the closed (ligand-bound) form of maltose-binding protein (MBP). The third example tries to switch the peptide binding specificity of the SH3 domain of src.

While these have not been experimentally tested, and the details of the designs (such as positions allowed to move) have not been optimized, these examples demonstrate possible applications of multistate design. The topics addressed by these examples are areas of active research in the lab; you may wish to consider forming an active collaboration.

18.1 Target and competitor structures

There are three types of structures considered in design. One is the template or target that we want the sequence to adopt. Another type of structure are non-target structured competitors. Finally, there are non-target, non-structured competitors, such as the unfolded, specificity, and solubility reference states (discussed in ). Standard design considers the target and the non-target, non-structured competitors. The non-target, non-structured competitors are simple to implement, since they can be represented as baseline corrections that can be added or subtracted. Non-target structured competitors, on the other hand, require explicit structures and optimization of candidate sequences on those structures.

EGAD can consider an arbitrary number of structured competitors. These are defined in the inputfile as follows:
START
....
TARGET   target_structure.pdb                      # the structure we want
COMPETITOR some_non_target_structure.pdb         # competitor 1
COMPETITOR another_non_target_structure.pdb # competitor 2
...
END
VARIABLE_POSITIONS
...
END
FIXED_POSITIONS
...
END

If the competitor is the free form of a complex, while the target is a bound form, the competitor can be defined simply as:
FREE structure_of_free_structure.pdb
# equivalent to COMPETITOR structure_of_free_structure.pdb solubility (see below)

Multistate optimization and analysis jobs must be parallelized. The number of available CPUs must equal or exceed the number of states (target and competitors). The parallelization may be defined as described in the parallelization section.

The final optimal sequence is threaded over all the structures, and saved as OUTPUT_PREFIX.xxx.pdb, where xxx describes the structure (competitor n, target, etc).

18.2 Defining positions to optimize and/or move

Positions to be optimized for multistate optimization are defined in exactly the same manner as they are for conventional single-state optimization. The SEQUENCE, VARIABLE_POSTIONS, FIXED_POSITIONS, and OTHER_POSITIONS modifiers described above may be used in exactly the same way.

In all cases, the competitors and target structures must have the same numbering for residues. Residues that are permitted to move (either in sequence or conformation) in any one of the structures must also be able to move in the other structures.

18.3 MULTISTATE_JOBTYPE

There are three optimization types available for multistate optimization. These are indicated by MULTISTATE_JOBTYPE (to distinguish them from the standard single-state JOBTYPE entries described earlier). Premature bail-out is signaled by creating the escape file listed in the logfile, similar to that described above for single-state optimizations.

18.3.1 MULTISTATE_JOBTYPE scan

This job is the multistate analog to singlestate scanning mutagenesis. Positions to be scanned are defined in the inputfile as described above for the single-state version with SCANNING_MUTAGENESIS.

This job will, in turn, place each of the defined amino acids at each defined position. These are indicated in the logfile by scan entries. The optimal amino acid at each position is listed with an optimal_res entry.

Note that since the non-structured reference states for cysteine have not been calculated, the energies for positions that contain cysteine in the template sequence are unreliable.

18.3.2 MULTISTATE_JOBTYPE ga

This job is the multistate analog to the singlestate GA discussed above. While this GA, unlike the singlestate, optimizes on sequence space, all the options described above for singlestate GA, including sequence filters, are applicable.

The default parameters for multistate GA are different from single-state GA. The default ga_population for multistate GA is equal to twice the number of variable (more than one amino acid choice) positions. The default number of isoenergetic generations for convergence (ga_convergence) is 20. Independent GAs are run (default number_ga_cycles 2). The first GA includes the user-defined initial sequence in the starting population. The remainder use completely random starting populations. The top solutions from the independent GAs are used to seed a final GA.

These parameters have not been optimized. For some problems, it may be necessary to increase the population size or convergence criterion. An especially important variable worth examining is the number of independent GAs.

18.3.3 MULTISTATE_JOBTYPE quench

This job is the multistate analog to the heuristic quench described above for single-state optimization. This method finds a local minimum near the user-defined initial sequence (discussed below). Briefly, a variable position (more than one amino acid choice) is picked at random. All permitted residues are placed at the position. The lowest energy choice is kept, and the optimal sequence updated. This is repeated until all variable positions are sampled without a drop in the energy.

18.3.4 Initialization of optimization

MULTISTATE_JOBTYPE ga and quench require the user to define an initial sequence that is used as a baseline for the scoring function (described above) and/or as a starting point. These may be defined as follows:
MULTISTATE_JOBTYPE       jobtype           option

There are four options:
         template - use the sequence of the template target pdb file
         optimized - use the optimal sequence resulting from single-state optimization on the target state
         scmf - the target-state optimized sequence is used as an initial sequence; use self-consistent mean-field optimization on the target state to bias the sequence space used for GA
         random - a random sequence is used

For example:
MULTISTATE_JOBTYPE       ga       template
does a GA, and includes the template sequence in the initial population.

Or:
MULTISTATE_JOBTYPE       quench   optimized
finds the optimal sequence for the target structure, and then uses the quench method to maximize the gap between the target and competitor states.

The optimization of the target state is done with MC_GA with RUNTIME 60. Following the optimization on the target state (saved in OUTPUT_PREFIX.target_state_optimization.pdb), the sequence is threaded over all the structures and saved as OUTPUT_PREFIX.target_state_optimized_sequence.xxx.pdb, where xxx describes the structure (competitor n, target, etc).

If the OUTPUT_PREFIX.target_state_optimization.pdb file exists at the start of the run, the sequence will be extracted from it, instead of re-doing the target-state optimization run.

After the multistate optimization is complete, the final sequence is threaded over all the structures, and saved as OUTPUT_PREFIX.xxx.pdb, where xxx describes the structure (competitor n, target, etc).

18.4 Slave parameters, MASTER_OPTIMIZATION_TIME, and early termination

The user may define the SLAVE_JOBTYPE and SLAVE_RUNTIME. SLAVE_JOBTYPE is the rotamer-optimization algorithm (default FASTER) used by the slaves.

SLAVE_RUNTIME is the amount of time (in decimal minutes) that each slave process may use for each sequence (default 2.0 minutes). For non-ligand problems, this default works well. However, if a ligand with thousands of ligamers is included, an increased SLAVE_RUNTIME may be required. For these problems, this parameter should be optimized using single-sequence rotamer optimization on the structure of interest; find a RUNTIME that usually results in the optimal solution (based on unlimited time).

RUNTIME (without the SLAVE_ prefix) will give the runtime of the process that calculates the structure of the initial sequence (default 60 minutes).

MASTER_OPTIMIZATION_TIME is analogous to RUNTIME for single-state optimization:
MASTER_OPTIMIZATION_TIME n       (n = time in decimal minutes)
When this option is defined, the program periodically checks the time it has spent. If MASTER_OPTIMIZATION_TIME minutes have elapsed (including the time required for target-state optimization), the program will tie up loose ends and terminate.

Early termination may also be signaled by creating the escape file defined in the logfile, in a manner similar to the escape file described above for single-state optimizations.

18.5 Objective functions for multistate design

In multistate optimization, the energy being optimized is the difference between the target state and the boltzmann-weighted average of all the non-target states, structured and non-structured (Figure 18.5.1).

Depending on the problem, different structures may need to be considered for stability, specificity, and solubility. By default, the target structure is the one whose unfolded states, non-native states, and aggregates are considered as non-structured competitors. However, there are scenarios where this may not be the case. For example, if one wishes to optimize a protein complex, the free states should have their solubility optimized to prevent the monomers from aggregating, and their folding stability optimized.

To define a particular structure for the stability, specificity, or solubility reference state in the inputfile:
...
COMPETITOR some_non_target_structure.pdb specificity     # specificity calc'd for this
COMPETITOR some_other_non_target_structure.pdb solubility # solubility calc'd for this
COMPETITOR another_non_target_structure.pdb stability # fold stability calc'd for this
...


If a FREE structure is defined, the solubility and folding stability of this state is considered.

The energies in the logfile are given with respect to a sequence in which all the user-defined or rotamer-moving positions are replaced with alanine. This arbitrary reference subtracts away internal backbone and fixed atom energies that differ between structures. This baseline subtraction is also required to prevent overflow errors resulting from exponentials of large numbers, but should play no difference otherwise in the optimization. The baseline structures are saved as OUTPUT_PREFIX.baseline.xxx.pdb, where xxx is the relevant structure (target, competitor.n).

The energy that is optimized is the difference between the boltzmann-weighted average energy of the non-target states and the energy of the target state (ddG_gap in the logfile; Figure 18.5.1). This difference is increased by destabilizing the competitor states. Unfortunately, in many cases, the target state is severely destabilized, eventhough the competitors are destabilized even more. To prevent these situations, if the target state is predicted to be destabilized by more than 2*ENERGY_UNCERTAINTY kcal/mol (ENERGY_UNCERTAINTY = RT (0.596 kcal/mol at 25 C) by default) with respect to the initialization sequence, a penalty proportional to VDW_CUTOFF (default 1000) is added. If a candidate sequence destabilizes the folded state of the defined structure by more than 2*ENERGY_UNCERTAINTY kcal/mol, a similar penalty is applied. The energy including these penalties is the actual energy that is optimized (score in the logfile). This ensures that candidate solutions only modestly destabilize the target state.

For many cases, this highly conservative default window may be too stringent, so this is an additional variable that should be examined and varied. There may be quite a bit of stability that can be sacrificed in order to gain specificity or affinity. ENERGY_UNCERTAINTY can be defined in the inputfile as follows:
ENERGY_UNCERTAINTY                0.596 # uncertainty in energies; default is RT

18.5.1 Custom multistate objective functions

Functions that deal with multistate optimization are primarily in multistate.cpp. The main function calls multistate.cpp: multistate_design, which in turn creates and launches foremen for generating lookup tables, and for rotamer optimizations. The foremen jobs go from main to rotamer_calc_foreman.cpp: rotamer_calc_foreman, as discussed in more detail in the parallelization section.

As described above for sequence restraints, custom objective functions for multistate optimization may be written by the user. These functions are in multistate_objective_function.cpp. The function called by multistate.cpp: CHROMOSOME_array_to_fitness (the objective function wrapper) is multistate_objective_function.cpp: multistate_objective_function. This function may be modified by the user. The score, including penalties, is returned, the ENERGY *energy argument contains the energy components of the target structure of the CHROMOSOME chr to be scored. This energy has already had the baseline energies from the all-ala-at-moving-positions structures subtracted.

In the current implementation, the first time this function is called, the ENERGY *energy is saved; it is assumed that the CHROMOSOME *chr argument for the first time this function is called is that of the template or some other baseline sequence that candidate sequences are compared to for ensuring that the target state is not too heavily destabilized. In user-modified variations, this may be read from a file, etc. target_non_target_ensemble_energy_gap actually calculates the energy difference between the non-target and target states (∆Ggap from above). Based on the energy components of the target structure (ddG_pseudo, the predicted folding stability and E_total, the energy of the target state, including non-structured reference states) and comparison to the energy of the baseline sequence, the penalty discussed above is applied.

If a file needs to be read by a custom objective function, this file (MULTISTATE_OBJ_FUNCT_INPUTFILENAME in the code) may be defined in the inputfile by including: MULTISTATE_OBJECTIVE_FUNCTION_FILE   custom_objective_function_file

As an example, multistate_objective_function.cpp: binding_protein_objective_function is listed (commented out). This example function is probably not useful, except as a template for generating additional user-defined functions.

18.6 Examples of multstate design

18.6.1 Designing binding specificity: change glutamine-binding protein to bind glutamate

This example redesigns glutamine (LLN) binding protein so it has higher affinity (specificity) for glutamate (LLU). It takes into account not only the structure of the protein bound to either ligand, but also the structures of the unbound states (examples/multistate/glu_bp/design_glu_bp.input).
...
TARGET templates/gbp_bound_LLU.pdb
COMPETITOR templates/gbp_bound_LLN.pdb
FREE templates/gbp_unbound.pdb
...

As discussed above, defining a structure as FREE ensures that the stability and solubility of this state are considered; the TARGET structure is still the optimization target. The FREE structure also includes both ligands, displaced far from the protein. The TARGET and COMPETITOR structures include not only the relevant ligand bound to the protein, but also the other ligand displaced far from the protein,

For simplicity, both ligands are kept fixed to the configuration of gln/LLN in the structure. In addition, only the residues that contact the ligand atoms are allowed to change identity and/or rotamer (gbp_binding_residues.detailed_floating_positions_list). More realistic designs would allow the ligands to move, and permit more residues to change identity or rotamer.

The master first optimizes the sequence of the target structure by launching a conventional single-state MC_GA optimization. This is defined in the master inputfile (design_glu_bp.input):
MULTISTATE_JOBTYPE       GA       optimized

The target-optimized sequence is in design_glu_bp.target_state_optimization.pdb.
This sequence is used to seed a GA population. The GA is run to find mutations that preferentially stabilize the target state with respect to the other states. As described above, the final sequence is threaded over all the structures. The final sequence threaded on the target (glu/LLU bound) structure is design_glu_bp.target.pdb.

The results are shown and discussed in Figure 18.6.1.1.

18.6.2 Designing conformational specificity: lock maltose-binding protein in the closed state

This example tries to find mutations that stabilize the closed (ligand-bound) form of maltose-binding protein. As discussed above, MBP undergoes a hinge motion upon binding ligand (Figure 4.3.1). First, saturation mutagenesis is performed on all positions that undergo a change in SASA, but are not neighbors of the MAL ligand or neighbors of those neighbors (examples/multistate/mbp/define_variable_positions). Variable residues are identified by using JOBTYPE sasa_calc on the apo and bound structures. Neighbors of MAL-binding residues are identified using JOBTYPE list_variable_positions, with the MAL-binding residues defined as described above (ligands/find_maltose_neighbors/find_maltose_neighbors.input); these, as well as the ligand-binding residues and the ligand will be kept fixed.

The saturation mutagenesis is done with examples/multistate/mbp/mbp_closed_scan.input. The master places all residues at each position in turn, and calculates the energy gap between the apo and bound (closed) states.

Next, a GA is done in which each variable position is restricted to either its wildtype or optimal residue found in the scanning mutagenesis; this GA is restrained to find the optimal combination of six mutations via the MAX_MUTATIONS modifier described above (mbp_closed_ga.input and output files).

18.6.3 Design of a specific protein-protein interface: src SH3 domain peptide-binding site

The wildtype src SH3 domain (referred to here as srcSH3) binds the RALPPLPRY peptide. This design tries to change the specificity to bind the EALPPEPRY peptide. This example is very similar to the glu/gln specificity switch discussed above, except that, in this case, the "ligand" is a peptide.

Moving positions are defined in examples/multistate/SH3_peptide/define_moving_residues/. The interface residues are identified by JOBTYPE separate_complex (SH3.RALPPLPRY.separate_complex.input). These residues, as well as the nearest-neighbors are allowed to change identity (SH3.RALPPLPRY.find_interface_near_neighbors.input). Since this is a simple demonstration, all other residues are kept fixed to their pdb structure rotamer conformations.

The actual design is done with examples/multistate/SH3_peptide/SH3.EALPPEPRY.design.input. As discussed above for the glu/gln example, the master first optimizes the sequence for the bound target, and then runs a multistate GA, simultaneously optimizing absolute binding affinity with respect to the free unbound state, as well as the relative affinity (specificity) of the bound target (EALPPEPRY) with respect to the original peptide (RALPPLPRY).

As shown in Figure 18.6.3.1, the target-optimized sequence does not contain any basic residues near the glutamate substitutions in the target peptide. While this may seem surprising at first glance, these results are consistent with prior work on coiled-coil specificity, both experimental and theoretical, which suggest that ion-pairs are in fact destabilizing, and serve primarily as specificity determinants (O'Shea et al 1992; Hendsch et al 1994). In contrast, in the multistate design, several basic residues are elicited near the peptide glutamates. This sequence is predicted to have a higher absolute energy for the target peptide than the single-state design. However, unlike the single-state design, this sequence has a high energy for the wildtype competitor peptide and for the unbound state (Figure 18.6.3.1c). The predicted affinity (difference between bound and free) is larger, as is the specificity (affinity difference between the target and competitor).

back to Table of Contents