back to Table of Contents

10. Rotamer optimization jobs

10.1 Overview

EGAD offers several rotamer optimization algorithms. Described here are the algorithms and user-definable options for these jobs. The strengths, weaknesses, and applications of these different methods are also discussed.

The default method, if none is specified, is JOBTYPE faster with number_MC_cycles 10. For single sequence rotamer optimization, FASTER is robust, and converges to the global minimum quickly, within minutes or seconds. However, for designs, it is recommended that JOBTYPE mc_ga be used instead, since FASTER does not always converge to the global minimum for these large problems. For other problems, such as those using sequence restraints, JOBTYPE ga must be used.

To ensure that the final structure is lowest energy conformation, following the main rotamer optimization, the final sequence is rotamer-optimized without consideration of specificity or solubility (Pokala and Handel 2005). This step is done only for designs; for single sequence optimization, specificity and solubility are not considered.

10.2 Algorithms

10.2.1 JOBTYPE faster

FASTER (Fast and accurate side-chain topology and energy refinement; Desmet et al 2002) is the default rotamer optimization method in EGAD. However, for design problems, it is recommended that another method, such as JOBTYPE mc_ga be used instead.

The implementation is EGAD varies somewhat from that described by Desmet et al. Briefly, if permitted, goldstein DEE is used to eliminate rotamers that cannot be in the lowest energy solution. number_MC_cycles (default 10) MC runs followed by HQM are performed. The best solution is used to seed a quasi-exhaustive search. A random position is picked. At that position, each permitted rotamer is selected, placed in the current solution, and frozen. At all other positions, the optimal rotamer is identified and saved to a new candidate solution consisting of the frozen rotamer and the optimal rotamers. If this solution has a lower energy than the current solution, it is kept. This cycle (iteration) is repeated until all positions have been frozen without improvement. At the end, a final MC/HQM is performed.

For single-sequence rotamer optimization, the optimization is often completed within seconds.

By default, the logfile OUTPUT_PREFIX.log is written to with the following format:
FASTER   iteration #      position         energy(if it drops)      timestamp
For example (examples/rotamer_optimization/gb1.faster.rotamer_optim.log):
FASTER iteration 1 position 49 Fri Mar 4 14:58:54 2005
FASTER iteration 1 position 33 Fri Mar 4 14:58:54 2005
FASTER iteration 1 position 29 best_energy = 1628.266820 Fri Mar 4 14:58:54 2005
FASTER iteration 1 position 29 Fri Mar 4 14:58:54 2005

10.2.2 Monte carlo simulated annealing (JOBTYPE mc)

Accepted aliases:
JOBTYPE mc
JOBTYPE monte_carlo
JOBTYPE simulated_annealling
JOBTYPE sa

Monte carlo simulated annealing (MC) is a classic method for complex optimization problems (Metropolis et al 1953; Lee and Subbiah 1991), and is described in more detail in Figure 10.2.2.1.

In this implementation, number_mc_cycles independent MC runs are performed (starting from a random conformation). The solutions are ranked, and the top unique number_mc_solns solutions are subjected to the heuristic quench minimization procedure discussed below. These minimized solutions are ranked and output into the OUTPUT_PREFIX.mc.hqm.out file.

Default values:
number_mc_cycles         2500     (number of independent MC runs)
number_mc_solns          250      (number of MC solutions that are subjected to heuristic quench minimization)
mc_convergence           0                     (default 0 -> automatically set to 100 times the number of floating positions)

If number_mc_cycles is defined, but number_mc_solns is not, number_mc_solns is set to be 10% of number_mc_cycles. This default seems to work well; therefore, number_mc_solns is usually not defined.

An optimal number_mc_cycles is difficult to define. Unfortunately, the number of rotamer combinations does not appear to be a good metric of problem difficulty. The roughness of the sequence-structure energy landscape is the true measure of mis-convergence; this is not easy to define. For some problems, 2500 is overkill, while for others, it's not enough. In practice, number_mc_cycles 10000 appears to be robust for total design problems. For large total design problems (>150 residues), this results in a runtime of little over a day on a single 1.0 GHz CPU. For total design of smaller proteins, the run can often be completed in a few hours. For designing a ligand binding site, the run may require a few days. For single sequence rotamer optimization, number_mc_cycles 100 is often enough to find the optimal solution; this requires on the order of a few seconds to a several minutes, depending on the size of the protein.

A useful alternative is to define number_mc_cycles to be the number of cycles one would like to run, given no time constraints, and then define RUNTIME. This option returns the best solutions found if the program has not yet performed number_mc_cycles independent MC runs within the desired amount of time (see RUNTIME section below). The solutions are ranked and minimized as described above.

By default, the logfile OUTPUT_PREFIX.log is written (see logfile section below). The format for MC output is:
mc       independent MC run #              energy            timestamp
For example (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
mc 50 1634.096631 Fri Mar 4 14:59:17 2005
mc 60 1631.632009 Fri Mar 4 14:59:18 2005
mc 70 1803.564834 Fri Mar 4 14:59:19 2005
mc 80 1628.781385 Fri Mar 4 14:59:21 2005
mc 90 1801.469873 Fri Mar 4 14:59:22 2005

Unlike for GA, logfile lines are not written during each independent MC run, since each of these are so fast.

10.2.3 Genetic algorithm (JOBTYPE ga)

Accepted aliases:
         JOBTYPE ga
         JOBTYPE genetic_algorithm

Genetic algorithms are powerful methods for complex optimization problems, such as protein design (Desjarlais and Handel 1995). They are essentially evolution in a computer (Figure 10.2.3.1). A population of GA_population random candidate solutions is generated and scored. These solutions are allowed to mate with each other, generating offspring that are hybrids of both parental solutions. The mating frequencies are related to a boltzmann-like probability term; akin to simulated annealing, the simulation temperature required for calculating these is high early in the run, and cools down (see simulation temperature section). Since the mating probability for a given candidate solution is related to its score, better solutions mate more frequently, making their components more numerous in the offspring. This new population of candidate solutions is scored, ranked, and allowed to mate. After GA_convergence number of iterations/generations without improvement, the run is deemed to have converged. The best solutions_per_ga_cycle unique solutions are subjected to heuristic quench minimization (described below).

In this implementation, number_ga_cycles independent GA runs are performed. The top unique solutions_per_ga_cycle from each independent GA run are culled and used to seed a final population. This population is subjected to a final tournament GA run.

Default values:
   ga_population          500     (size of actively mating population; full population is 2x larger)
        number_ga_cycles       10       (number of independent GA runs)
       
solutions_per_ga_cycle 10       (number of best solutions culled from each independent GA run)
       
ga_convergence         300     (number of isoenergetic generations to define convergence)

As with MC, it may not be possible to define the optimal number of independent GA runs required to find the best solution. Therefore, defining number_ga_cycles to a larger number, like 100, and defining RUNTIME may be a good strategy (see RUNTIME section below). If number_ga_cycles independent GA runs cannot be performed within the defined RUNTIME, the solutions that have been identified are used to seed the final tournament GA run.

By default, the logfile OUTPUT_PREFIX.log is written. The format for GA output is (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
ga       generation #     energy   ∆E       timestamp
For example:
ga 110 1628.266820 0.000000 Fri Mar 4 15:04:22 2005
ga 120 1628.266820 0.000000 Fri Mar 4 15:04:23 2005

After each independent GA run is completed, a similar output is written:
GA       independent GA run #              energy            timestamp
GA 1 1628.266820 Fri Mar 4 15:04:41 2005

10.2.4 Heuristic quench minimization (HQM)

Heuristic quench minimization (HQM) cannot be defined as a JOBTYPE by the user. This method is used to minimize, in discrete-rotamer space, solutions from other methods. Briefly, given an initial structure, a position is picked at random. All possible rotamers permitted at that position assayed to find the lowest energy rotamer, and the structure updated. Another position is picked at random from this updated structure, and the process repeated. After some number of cycles without improvement in the energy, the run is deemed to have converged . This method is useful since the current implementation minimizes the candidate solutions to the point that no single rotamer move can result in lower energy.

By default, the logfile OUTPUT_PREFIX.log is written to with the following format (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
quench   solution #                energypre-quench         energypost-quench        timestamp
For example:
quench 4 1628.350176 1628.266820 Fri Mar 4 15:04:10 2005
quench 5 1628.351251 1628.266820 Fri Mar 4 15:04:10 2005
quench 6 1628.386728 1628.266820 Fri Mar 4 15:04:10 2005
quench 7 1628.402922 1628.266820 Fri Mar 4 15:04:10 2005

It should be noted that the while the post-quench energy correlates with the pre-quench energy, it is not a very good correlation; lower-ranked MC or GA solutions often result in minimized energies that are lower than the energy resulting from minimizing the best solution.

10.2.5 JOBTYPE mc_ga or JOBTYPE sa_ga

number_mc_cycles independent MC runs are done, and the best unique number_mc_solutions are ranked and heuristic quench minimized, as for the standard MC described above. These solutions are used to seed number_ga_cycles independent GA runs. The best solution from these GA runs is the final solution.

The options described above for MC and GA are accepted. The default for number_ga_cycles is changed:
number_ga_cycles                  2        (number of independent GA runs)

As described above, RUNTIME may be defined. If defined, the RUNTIME clock is reset after the MC phase, so the GA phase has the same amount of time.

By default, the logfile OUTPUT_PREFIX.log is written to, with the formats described above.

The OUTPUT_PREFIX.mc_hqm.out and OUTPUT_PREFIX.ga.out files list solutions from the MC/HQM and GA steps of the optimization respectively.

10.2.6 Dead-end elimination (JOBTYPE dee)

Dead-end elimination (DEE) is a family of techniques that results in the elimination of rotamers and rotamer pairs that cannot be in the global minimum solution . After all non-optimal rotamers and rotamer pairs are eliminated, only one solution remains: the optimal one. This is the only technique that is guaranteed to return the optimal solution, if it converges (convergence is not guaranteed). In actuality, once the number of possible solutions drops into the range where exhaustive search is possible, exhaustive search is used.

An important caveat is that DEE must be used with energy functions that are pair-wise decomposable, which the EGAD energy function is. However, for design, it is often required to consider solubility and solubility, which is assayed by a non-pairwise method (discussed in Pokala and Handel 2005). At present, DEE cannot handle this. Nevertheless, as discussed below, DEE is can be useful for single-sequence rotamer optimization. For jobs listed above where specificity and solubility are not considered (such as single-sequence rotamer optimization), Goldstein elimination is used to quickly prune dead-ending rotamers.

There are a few user-definable options for DEE. When DEE has determined that the number of solutions remaining is less than max_combo_exhaustive_search, then DEE is aborted in favor of exhaustively searching through all possible combinations.
max_combo_exhaustive_search      100000   (default)

max_nodes is the maximum number of leaf nodes visited by the exhaustive search.
max_nodes        100000   (default)

A powerful criterion used for DEE is the bounding energy E_bounding, which acts as a ceiling for rejection; candidate solutions with energies higher than this cannot be in the global minimum and may be rejected immediately. By default, this is calculated by performing 100 independent MC runs. In order to prevent errors if the MC-calculated solution happens to be the global minimum, E_bounding_ceiling is added on. E_bounding may also be defined explicitly by the user. Usage with defaults:
E_bounding               0.0      (default 0.0 -> calculate with MC)
E_bounding_ceiling       1e-6    
If bounding should not be used for some reason, set:
bounding_flag 0           (default 1 -> use bounding for DEE)

If RUNTIME is defined, and if convergence has not occurred within the defined time, DEE is aborted, and MC run instead. It is possible to have an MC that takes advantage of the known eliminated rotamers and rotamer pairs to limit the search space. However, this is not presently implemented.

By default, the logfile OUTPUT_PREFIX.log is written to with the following format:
dee               message           timestamp
The messages state things like which DEE block is currently being used, and the number of possible solutions remaining, etc.

10.2.7 Self-consistent mean field optimization (JOBTYPE scmf)

Unlike the other methods described here, self-consistent mean field optimization (SCMF) does not result in a final structure. Rather, it results in a probability for each rotamer at each position, and provides a measure of sidechain entropy (Koehl and Delarue 1994). Since the result is a self-consistent ensemble of rotamer and amino acid probabilities, this method is not especially useful for design. However, it can provide information on what amino acid types may be permissible at certain positions (Voigt et al. 2001).

It has been suggested that the mean-field energy resulting from an SCMF run provides a better estimate of the actual energy than a single structure resulting from the other optimization methods (Lee 1994; Havranek and Harbury 2003). However, none of the development of the EGAD energy function has assumed used this.

For algorithm description, please see Figure 10.2.7.1. SCMF is used within EGAD for JOBTYPE pK.

Initial probability distribution scmf_seed can be unbiased, random, or weighted. If unbiased (default), each rotamer at a position is assigned the same initial probability. If random, each rotamer is assigned a random probability. If weighted, the initial probability is defined based on the rotamer-backbone energies.
Usage:
scmf_seed        seed-type        (unbiased, random, or weighted; default unbiased)

SCMF does not converge to a unique solution; different random starting points can result in convergence to one of a handful of self-consistent solutions. If the unbiased seed is used, it will likely converge to the solution with the largest radius of convergence; in many cases, this is not the lowest-energy ensemble (Mendes et al. 1999). Therefore, it is suggested that multiple random seeds be used. This can be indicated with number_scmf_cycles. The default is 1. When number_scmf_cycles>1, the first SCMF run is always seeded with the user-defined scmf_seed probability, while subsequent ones have random seeds.
Usage:
number_scmf_cycles       n        (n = number of independent SCMF runs; default 1)

As with the other methods, RUNTIME may be defined. If number_scmf_trials SCMF runs have not been completed within the time allotted, the lowest-energy solution is output.

As described earlier, SCMF optimizations can often result in oscillations; the ensemble moves from one configuration to another, never actually converging to a unique state. To prevent this, the boltzmann probabilities calculated from the energies that are used to calculate the mean-field energies in the next iteration are actually a weighted average between the actual boltzmann probability, and the probabilities of the previous iteration (Koehl and Delarue 1994). The default value:
scmf_lambda      0.25     (weight on current iteration boltzmann probability for averaging )
The weight on probabilities from the previous cycle is 1.0 - scmf_lambda. The default listed here appears to give timely convergence in most cases tested.

While SCMF does not converge to a single solution, the rotamer probabilities may be used to generate a structure composed of the most probable rotamer at each position. This structure is most likely not the lowest energy structure; indeed, it may be some distance from it. Nevertheless, by default, this structure is the one written to the outputted pdb file.

If a lower energy structure is desired, set:
scmf_quench_flag         true     # default false)
and/or:
number_scmf_solns        n        # n = # of SCMF solutions; default 10; if defined, scmf_quench_flag is true)

These options direct the generation of number_scmf_solns random structures whose rotamers are biased by the mean-field probabilities. These biased-random structures are used as seeds for MC runs whose outputs are subjected to heuristic quench minimization. The lowest energy structure is output.

By default, the logfile OUTPUT_PREFIX.log is written to with the following format:
scmf     iteration# convergence_score    Emf     T∆S      Emf-T∆S   timestamp

The convergence_score field is the sum of the squares of the differences between rotamer probabilities in the current iteration and the previous one; convergence is assumed when this falls below 10-6. The Emf field is the mean-field energy, calculated using the current rotamer probabilities. The T∆S field is the entropy calculated with current rotamer probabilities; the temperature T is the user-defined TEMPERATURE value. The Emf-T∆S field is the free energy of the current mean-field ensemble.

See examples/scmf/gb1.scmf.wt_seq.input and the resulting output for an example of scmf optimization of a single sequence.

If more than one residuetype is allowed at a given position, the probabilities for each residue are given As discussed by Voigt et al , SCMF-derived amino-acid probabilities may be of use for designing libraries for selections. However, since these probabilities are not physical (what is a fractional residue?), caution should be exercised. The final pdb file lists these amino acid probabilities and positional entropies (examples/scmf/gb1.scmf.design.pdb):
...
SCMF_ENTROPY 42 TdS = 1.555334 A 0.245224 D 0.143858 E 0.084495 F 0.000000 H 0.000991 I 0.000000
K 0.010949 L 0.000001 M 0.000000 N 0.153603 Q 0.035709 R 0.015772 S 0.309239 T 0.000158 V 0.0000
01 W 0.000000 Y 0.000001
SCMF_ENTROPY 43 TdS = 0.013396 A 0.000000 D 0.000000 E 0.000000 F 0.000000 H 0.000000 I 0.000000
K 0.000007 L 0.000000 M 0.000000 N 0.000000 Q 0.000000 R 0.999993 S 0.000000 T 0.000000 V 0.0000
00 W 0.000000 Y 0.000000
SCMF_ENTROPY 44 TdS = 1.290649 A 0.056263 D 0.212732 E 0.001518 F 0.000000 H 0.002230 I 0.000000
K 0.001435 L 0.000000 M 0.000000 N 0.643994 Q 0.005267 R 0.001966 S 0.074595 T 0.000000 V 0.0000
00 W 0.000000 Y 0.000000
...

In this example, position 42 has a 24.52% probability for A, 14.39% probability for D, 15.36% probability for N, 30.92% probability for S, and smaller probabilities for other residuetypes. Position 43 is far less degenerate; the probabilities are dominated by R. Like position 42, position 44 has significant degeneracy, but different residues are preferred compared to position 42.

10.2.8 OUTPUT_PREFIX.xxx.out format

MC and GA produce OUTPUT_PREFIX.mc_hqm.out and OUTPUT_PREFIX.ga.out files respectively. These files list the top candidate solutions generated by these methods, ranked by energy.

Format:
...
SOLUTION_# n
ENERGY E_working: xxxx
ROTAMER seq_position core_surf_int residuetype chi[1] ... chi[num_chi]
...
ROTAMER seq_position core_surf_int residuetype chi[1] ... chi[num_chi]
SEQUENCE_n       sequence_of_the_solution_including_non_variable_positions
VARIABLE_SEQUENCE_n      sequence_of_the_solution_only_for_variable_positions
END
SOLUTION_# n+1
...

The SEQUENCE,  VARIABLE_SEQUENCE, E_working (the energy of the candidate solution), and ROTAMER entries are described in the pdb format section. The ROTAMER entries can be used as input for user-defined rotamers in the inputfile. VARIABLE_SEQUENCE entries are not produced for single-sequence rotamer optimizations.

10.3 User-definable options

10.3.1 Simulation temperatures

As described above, MC and GA require temperature-dependent boltzmann probability calculations for making moves. The simulation temperature decreases with each iteration, for the reasons discussed above. These temperatures, and the cooling cycle can be defined by the following parameters:
mc_T0    5000     # (starting temperature for simulated annealing)
mc_dT    -10
      # (temperature change per iteration for simulated annealing)
mc_Tf    100
      # (final temperature for simulated annealing)
ga_T0    5000     # (starting temperature for GA)
ga_dT    -10
      # (temperature change per iteration for GA)
ga_Tf    100
      # (final temperature for GA)

The temperature starts at T0, decreases linearly by dT every iteration. When the temperature reaches Tf, it is held constant for the remainder of the run.

The defaults listed here were determined by trial-and-error for a several test cases. By no means are these optimal for all problems. Indeed, if one is going to perform many optimizations for a given backbone, it may be worthwhile to empirically determine better values with a series of short runs.

Non-linear cooling schedules are presently not implemented; however, based on trial-and-error, it does not seem that these are significantly better than the simple linear schedule used here. Indeed, hyperbolic schedules seem to give worse results.

10.3.2 RUNTIME

It is easy to tell when DEE has converged; only one solution remains. For FASTER, convergence is achieved when no more moves are possible. However, for stochastic methods, it is not quite so straightforward to determine when convergence has been reached. In principle, the other methods should converge to the global minimum, given infinite time. In practice, convergence is defined by some empirically determined number of moves without improvement (discussed below in implementation details). Since these methods often (if not usually) become trapped in some local minimum, it is necessary to run these many times. The question is how many times is enough. Unfortunately, this is not an easy question to answer. While the default values described above appear to be pretty good, they are certainly not optimal. For some problems, they may be overkill, while for others, not enough.

The runtimes discussed above may be reasonable for most situations. However, if there are many situations in which what is desired is the best solution that can be obtained within a given period of time.For these cases, the RUNTIME option is useful:
RUNTIME n                 (n = time in decimal minutes)

When this option is defined, the program periodically checks the time it has spent on a particular algorithm. If RUNTIME minutes have elapsed since the start of the optimization method block, the program moves to the next step (usually heuristic quench minimization on the best solutions).

RUNTIME does not include the time needed to calculate the pair energy table or the heuristic quench minimization on the best solutions. If multiple algorithms are used in hybrid methods such as MC_GA, the clock is reset for each method.

10.3.3 Logfile options

Logfiles are written to assure the user that the process in question hasn't crashed <yet>, and to also give some measure of progress, to permit an estimation of completion time.

If LOGFILE_FLAG 1 (default), logfiles are written to OUTPUT_PREFIX.log.
If LOGFILE_FLAG 0, neither the OUTPUT_PREFIX.log logfile or OUTPUT_PREFIX.out file that contains sub-optimal solutions are written. This option is often used for batch processing when it is not desirable to clutter up a working directory with many files.

By default, the a logfile entry is written only every 10th iteration (for MC, "iteration" = independent MC run; for GA, "iteration" = generation within a GA run). This may be changed with:
LOGFILE_PERIOD   n        (n = logfile line written every nth iteration; default 10)

10.4 Premature program termination

If the user wishes to gracefully exit the program pre-maturely, or (for multi-step optimizations such as MC_GA), jump to the next step, the logfile lists the names of files that may be created to signal this.
touch escape.job_specific_label.pid (pid= process ID number)
will create the "escape hatch" file, and lead to premature program termination or advancement to the next step. Depending on the job, the actual escape or jump may not happen instantaneously.

10.5 Strengths and weaknesses of different rotamer optimization methods

These algorithms have different strengths and weaknesses. An excellent reference is the work of Voigt and Mayo (Voigt et al 2000). In this study, the investigators tested various rotamer optimization methods and compared their accuracy to the known global minimum calculated by DEE.

Dead-end elimination (DEE) is guaranteed to converge to an optimal solution, if it converges at all. For some smaller problems, DEE converges quite rapidly. For larger problems, it can take much longer to converge. If non-pairwise filters are placed upon the search, such as solubility, DEE cannot be used. However, there may be ways to address some of these problems in a pairwise manner. Due to these issues, the current implementation of DEE is recommended for use only for single sequence rotamer optimization. For these problems, DEE almost always converges, and is quite fast.

Unlike DEE, stochastic methods such as the genetic algorithm (GA) and monte carlo simulated annealing (MC) are not guaranteed to reach the global minimum (caveat; see below). For smaller problems, such as single sequence rotamer optimization, they are often significantly slower than DEE, due to the requirement of running multiple individual runs (an individual MC run, of course, is much faster than any DEE run). However, these methods have some advantages over DEE for other problems. These methods can easily address problems with non-pairwise search filters, such as specificity. Unlike DEE, these problems scale well for larger problems. Indeed, for design problems, these methods are significantly faster than DEE (even when the running time includes thousands of independent MC runs). Finally, the fact that these are not guaranteed to find the global minimum is often not a problem. For many problems, sub-optimal, near-global solutions may suffice. It should be noted that for almost all single-sequence rotamer optimization cases tested so far, the MC-calculated solution has been shown to be identical to the DEE-calculated global minimum, strongly suggesting that the implementation in EGAD is robust.

While some success has been reported using heuristic quench minimization (HQM) alone with random starting points, it requires an enormous number of runs to find near-optimal or optimal solutions for larger problems (Wernisch et al 2000); unpublished observations. Since it is a purely downhill method, the solution is highly sensitive to the starting point; the heuristic quench procedure is very good at searching local space, but does not sample the entire space well. In contrast, MC samples the entire space far more efficiently, but it is not so good at the "endgame" of finding the lowest energy near a low-energy, yet suboptimal solution. This is due to the highly stochastic nature of MC moves; a solution may be on a cliff wall, but if the right random rotamer at the right random position is never picked within the proscribed number of random tries, the minimal solution is never found. On the other hand, HQM moves are exhaustive at each selected position, and the current implementation guarantees that each position is examined. The hybrid implementation used here allows the strengths of each algorithm to complement the other's weaknesses. One may also imagine other hybrids, such as an MC in which each move consists of picking a random position, assaying the energy of all rotamers at that position in a manner similar to HQM, and selecting one randomly, biased by the boltzmann probabilities; this would contain the exhaustive nature of HQM moves, yet permit uphill moves to escape local minima.

An individual GA run has a much better chance of finding the optimal solution than an individual MC run. However, each individual run is much slower (minutes vs. seconds). MC is fast, but gets trapped easily in local minima. Therefore, it is necessary to run orders of magnitude more independent runs for this method than for a GA. As a trade-off, it appears that a very good general method is the JOBTYPE mc_ga option, in which solutions from individual MC runs are used to seed a population for GA. If one may imagine a landscape with multiple well-separated valleys, each individual MC run (including HQM) converges to one of the valleys. A GA seeded with these can create hybrid solutions that may have lower energies than the already low-energy parents. After all, a basic premise of GA is that hybrids of good solutions are more likely to lead to better solutions than hybrids of poor solutions; an initial population full of good solutions should have better odds of finding the optimal solution than a completely random initial population. It should be noted that for small problems like single-sequence rotamer optimization or core design, the GA step more often than not does not result in an improved solution; for these problems, the MC-derived solution is often optimal, as discussed above.

The FASTER algorithm has some of the quasi-exhaustive properties of HQM and DEE with the uphill moves of MC. Fixing each rotamer is analogous to making an uphill MC move; this fixed rotamer may have a worse energy than the current solution. However, performing an exhaustive search at all other positions guarantees that all single-move improvements will be identified. This method explores the local landscape of a potential solution quasi-exhaustively by climbing all neighboring ridges (fixing each rotamer) and looking down (exhaustive search at all other positions). For small problems, especially single sequence rotamer optimization, FASTER is indeed fast (seconds), certainly much faster than DEE. Another major advantage that FASTER has over DEE is that, like MC and GA, it can be used with non-pairwise decomposable energies, such as our hydrogen-bonding specificity reference model (Pokala and Handel 2005).

While the default implementation of FASTER is the best for small problems, including single-sequence rotamer optimization, it has problems with large design problems. However, by increasing number_MC_cycles from 10, to say, 200, FASTER may yield similar or better results. Since the optimal number of MC cycles that should be run during FASTER for large design problems has not yet been determined, MC_GA is recommended.

10.6 Implementation details

10.6.1 Data structures and general-use functions

The first rotamer optimization method code written in EGAD was the genetic algorithm, thus explaining the name of the program, as well as the names and organization of many of its data-structures. The CHROMOSOME and GENE (and bkbnGENE for moving backbone....not relevant here) structs are the primary repositories and conduits for the dihedral representation of the protein structure. An individual CHROMOSOME represents a single protein structure, and contains information about the energy, and information for the optimization method, such as mating frequencies for genetic algorithms. GENE genes and bkbnGENE bkbngenes are linked-lists within CHROMOSOME that carry the actual information necessary for building the coordinates, such as dihedral angles. GENE has pointers to elements of lookupEnergy and var_pos, permitting rapid access to pair energy lookup table values during rotamer optimization.

The genes linked list within a CHROMOSOME structure is allocated and initialized by GA_utilities.cpp: inoculate_sidechains, using a VARIABLE_POSITION array as a template. This function calls GA_utilities.h: mutate_sidechain to select a random residuetype and biased-random rotamer for each GENE (for each non-pro/gly position), and assign the appropriate pointers within the GENE elements. The function GA_utilities.cpp: randomize_sidechains works in a similar manner to generate new random choices for an already initialized GENE linked-list within CHROMOSOME. For both these functions, if solubility should be considered, the CHROMOSOME is passed to solubility.cpp: solubilize_CHROMOSOME (see solubility design section).

In contrast to these functions, GA_utilities.cpp: mutagenize_sidechains_CHROMOSOME does not completely randomize a pre-initialized CHROMOSOME; it uses the inputted mut_freq mutation frequency randomly to pick positions to alter.

GA_utilities.cpp and GA_utilities.h contain other useful functions.

Energies are calculated for the entire CHROMOSOME by the function CHROMOSOME_to_lookupEnergy.cpp: CHROMOSOME_to_lookupEnergy. For MC and HQM, the function CHROMOSOME_to_lookupEnergy.cpp: lookup_energy_sasa_charge_other is used to calculate the interaction energy between a single rotamer at a given position and the backbone and rotamers at all other position. Both these functions access the lookup table through pointers in GENE, which point to CHOICE elements within the VARIABLE_POSITION array, which in turn point to the actual lookup table elements. If a pair of residues interact, but the energies for that pair has not yet been read/calculated (flagged as described in the pair energy lookup table section), CHROMOSOME_to_lookupEnergy.cpp: load_calc_save_lookupResX is called to read/calculate, and save this information. If the entire lookup table is read/calculated upfront, these function will not be called.

10.6.2 Genetic algorithm (GA)

10.6.2.1 Control functions for genetic algorithm

If "GA" is defined for protein->parameters.algorithm, main sends protein to GA.cpp: GA_rotamers_master_control after the lookup table is read/generated by lookup_table.cpp: generate_lookup_table. A CHROMOSOME array is initialized and sent to GA.cpp: GA_rotamer_control. After this function is completed, the rotamers, energy and sequences of solutions are written to OUTPUT_PREFIX.out. The lowest-energy CHROMOSOME is copied to protein->final_chr. This is then passed to CHROMOSOME_to_lookupEnergy.cpp: final_chr_to_final_pdb_final_energy to generate the final pdb structures and energies in protein.

GA.cpp: GA_rotamer_control manages the GA. This takes as input an array of CHROMOSOME chrs, which contains chrs_size CHROMOSOMEs that are used to seed the actual GA runs. These are solutions from independent runs of the previous optimization method; however, if this function is called by GA.cpp: GA_rotamers_master_control, chrs should contain random CHROMOSOMEs.

The chrs array is sorted by energy upon entry into GA_rotamer_control. working_chrs and final_chrs CHROMOSOME arrays are initialized; these are respectively the working CHROMOSOME array sent to GA.cpp: evolve_sidechains for independent GA runs, and the CHROMOSOME array which stores the best solutions from all the runs. parameters->number_GA_cycles independent GA runs are performed. The nth GA run includes chrs[n] onward in the initial population stored in working_chr; the remainder of the population 2*parameters->pop_size are filled with random CHROMOSOMEs. If the passed chrs_size<0, the number of chrs array elements is |chrs_size|, and each is randomized before each independent GA run. working_chr is passed to GA.cpp: get_rid_of_isoenergetic_chrs in order to replace isoenergetic (ie: identical) CHROMOSOMEs. After seeding, working_chr is passed to evolve_sidechains, which performs the actual genetic algorithm. The best parameters->num_GA_solns_per_cycle solutions are copied into final_chrs. After all the independent GA runs are completed, and if there is time available, final_chrs is passed to evolve_sidechains for a final tournament GA. The best chrs_size solutions in final_chrs are copied back to chrs for return back to the calling function.

10.6.2.2 Actual genetic algorithm: GA.cpp: evolve_sidechains

All the members of the inputted initial population chrs must be soluble, as assessed by solubility.cpp: is_this_chr_soluble (if SOLUBILITY_CUTOFF_FLAG 0, <default> all are soluble). After being scored for energy by CHROMOSOME_to_lookupEnergy.cpp: CHROMOSOME_to_lookupEnergy, the chrs array is passed to GA.cpp: get_rid_of_isoenergetic_chrs to replace identical members.

The population is sorted by energy by sort_CHROMOSOME, and sent to GA_utilities.cpp: calculate_mating_frequencies. This function calculates the relative (with respect to the best one) real energies pop_energy for each member of the population. The rank is used to define deltaRank for each member of the population. The mating frequencies boltzProbab and rankProbab calculated for each CHROMOSOME by boltzmann_CHROMOSOME. boltzProbab is the boltzmann probability of a given CHROMOSOME, based on the relative real energy pop_energy, while rankProbab is the boltzmann probability based on the pseudo-energy deltaRank. While both result in lower energy CHROMOSOMEs mating more frequently, the double scheme used here ensures that individuals with significantly lower energies than the others will not take over the population; this prevents genetic drift and the founder effect that result in premature convergence. If calculate_mating_frequencies returns 0, the GA run is deemed to have convereged, since the worst and best ranked members are identical; in principle, this should never happen, except for very small problems.

The GA runs until the number of isoenergetic generations convergence_counter parameters->ga_convergence. In each generation, the population is mated by GA_utilities.cpp: mate_sidechains, based on the mating probabilities calculated in the previous cycle. Mating is a very simple procedure. Two random CHROMOSOMEs are selected and copied to two child CHROMSOME elements. The children are allowed to undergo recombination (cross-over frequency of 50%; accomplished by swapping nextgene elements of the genes linked-list at cross-over points. Insoluble offspring are solubilized by calling solubility.cpp: solubilize_CHROMOSOME. As described above, energies of the children are scored, identical members pruned, and mating frequencies calculated.

If ten generations have gone by with the same best energy, the mutation frequency is increased steadily for ten generations. If after ten generations with increased mutation frequency the energy has still not dropped, the mutation frequency is reset to the standard value parameters->mutation_freq for another ten generations, or until the energy drops. If the energy has not dropped after an addition ten generations, the mutation frequency is allowed to rise again, and then relax until the energy drops. Once the energy drops, the mutation frequency is reset to parameters->mutation_freq. The simulated temperature is decreased as discussed above (simulation temperature section).

After convergence is achieved, or if time runs out, the population is sorted, and the best parameters->num_GA_solns_per_cycle solutions are subjected to HQM by calling HQM_rotamers.cpp: HQM_rotamers. This final population is sorted.

10.6.3 Monte carlo simulated annealing (MC)

10.6.3.1 Control function MC.cpp: MC_rotamer_control

If protein->parameters.algorithm starts with "MC", main sends protein to MC.cpp: MC_rotamer_control after the lookup table is read/generated by lookup_table.cpp: generate_lookup_table. The protein->chr CHROMOSOME array is allocated with 2*protein->parameters.number_MC_cycles members. protein->parameters.number_MC_cycles independent MC runs are performed by calling MC.cpp: MC_rotamers with a randomized member of the protein->chr array as an argument.

After the defined number of independent MC runs are completed (or if time runs out) GA.cpp: get_rid_of_isoenergetic_chrs is called to filter protein->chr, when is then sorted. The best protein->parameters.number_MC_solns are subjected to HQM. protein->chr is again filtered to eliminate isoenergetic solutions and sorted.

If protein->parameters.algorithm contains "GA", the quenched protein->chr array is passed to GA.cpp: GA_rotamer_control for performing GA runs, using the MC/HQM solutions as seeds.

The rotamers, energy and sequences of solutions are written to OUTPUT_PREFIX.out. The lowest-energy CHROMOSOME is copied to protein->final_chr. This is then passed to CHROMOSOME_to_lookupEnergy.cpp: final_chr_to_final_pdb_final_energy to generate the final pdb structures and energies in protein.

10.6.3.2 Actual simulated annealing: MC.cpp: MC_rotamers

The inputted CHROMOSOME chr must be soluble, as assessed by solubility.cpp: is_this_chr_soluble (if SOLUBILITY_CUTOFF_FLAG 0, then it is soluble). A working CHROMOSOME dummychr and a working GENE i_gene are allocated. chr is copied to dummychr. Bona fide Floating positions are identified, and their var_pos indicies loaded into the indicies_of_floating_positions array. If SOLUBILITY_CUTOFF_FLAG 1, the SASA, transfer energy, and charge components due to fixed positions are added up by solubility.cpp: fixed_position_sasa_charge.

The actual simulated annealing runs until parameters->mc_convergence iterations pass without decreasing the energy. In each iteration, a random floating position i is picked. The dummychr genes element corresponding to this position is copied to i_gene. The interaction energy of the rotamer currently at position i with all other floating rotamers at all other non-i positions is calculated with CHROMOSOME_to_lookupEnergy.h: lookup_energy_sasa_charge_other. If SOLUBILITY_CUTOFF_FLAG 1, this function also returns the SASA, transfer energy, and charge components of the other floating non-position i rotamers. The internal energy and interaction energy with fixed atoms energy_var_fix is added to this energy.

i_gene is then mutated to a random residuetype and rotamer. If SOLUBILITY_CUTOFF_FLAG 1, i_gene is repeatedly mutated until dummychr as a whole is soluble (see solubility design section). The energy of this new rotamer is calculated as described above for the current rotamer. If the Metropolis criterion is satisfied, the move is accepted, i_gene copied into dummychr, and the energy of the entire structure calculated with CHROMOSOME_to_lookupEnergy.cpp: CHROMOSOME_to_lookupEnergy. If the new dummychr is the lowest energy one visited thus far, it is copied into the inputted chr. Finally, the simulation temperature is dropped as described above.

The implementation described here does not require calculation of the energy for the entire structure in order to make a move; instead, only the energies involving the position under consideration need to be calculated. This saves a tremendous amount of time. Indeed, if complete energy calculation were required, MC would be significantly slower than GA, since the vast majority of MC moves are likely to be rejected.

10.6.4 Heuristic quench minimization (HQM): HQM_rotamers.cpp: HQM_rotamers

The start of this function is virtually identical to the start of MC.cpp: MC_rotamers.

The actual HQM runs until hqm_convergence iterations pass without a decrease in energy. A random floating position is picked. The genes element in dummychr.genes corresponding to this position is identified and copied to the working GENE i_gene. The energy of this rotamer with all other floating sidechains is calculated with CHROMOSOME_to_lookupEnergy.h: lookup_energy_sasa_charge_other, and the energy_var_fix added on.

The lowest energy rotamer at this position is identified by calling HQM_rotamers.h: find_best_res_res_rot. This function exhaustively measures the energy of all rotamers permitted at this position to find the lowest energy one (rotamers that cause dummychr to become insoluble are identified and rejected). If a lower energy rotamer is indeed found, the relevant gene in dummychr.genes is updated.,

An array already_used_positions contains information on whether a position has been used or not. HQM_rotamers.cpp: has_this_position_been_used checks this array and returns 1 if it has been used. If the energy drops, the already_used_positions array is reset. Convergence is assumed when every position has been visited without a drop in energy.

The final dummychr is copied to chr, and the total energy calculated with CHROMOSOME_to_lookupEnergy.cpp: CHROMOSOME_to_lookupEnergy.

As with the MC implementation, the HQM implementation described here does not require calculation of the energy for the entire structure in order to make a move; instead, only the energies involving the position under consideration need to be calculated. Indeed, since all moves are required to be downhill, the energy of the entire structure needs to be calculated only once, at the end.

10.6.5 Fast and accurate side-chain topology and energy refinement (FASTER)

If "FASTER" is defined for protein->parameters.algorithm, main sends protein to somewhat_FASTER.cpp: somewhat_FASTER after the lookup table is read/generated by lookup_table.cpp: generate_lookup_table. protein->parameters.number_MC_cycles independent MC/HQM are run. The best of these is copied to the best_chr CHROMOSOME.

As with HQM and MC, bona fide moving positions are identified and stored in the indicies_of_floating_positions array, which is used for randomly picking positions to freeze.
Similar to HQM, an array already_used_positions contains information on whether a position has been used or not. HQM_rotamers.cpp: has_this_position_been_used checks this array and returns 1 if it has been used. The outer loop selects an unused position, exhaustively freezes each rotamer in turn, and uses to find the optimal rotamer at all other positions. The frozen rotamer and the optimal rotamers are stored in best_rotamer_chr CHROMOSOME. After optimal rotamers are selected for all the non-frozen positions, best_rotamer_chr is scored for the total energy. If its energy is lower than that for the current best_chr, it replaces it, and the convergence counter (converged_flag) is reset. Convergence is assumed if all the positions are visited and frozen without an improvement in energy.

10.6.6 Dead-end elimination (DEE)

The DEE implemented in EGAD was written primarily by Mark Voorhies; the reader is directed to his documentation. At present, the DEE source code is distributed under LGPL license, not GPL.

If "DEE" is defined for protein->parameters.algorithm, main sends protein to DeeControl.cpp: DeeControl after the lookup table is read/generated by lookup_table.cpp: generate_lookup_table. If bounding_flag 1 and protein->parameters.dee_E_bounding 0.0 (defaults set in input_stuff.cpp: input_stuff), protein->parameters.number_MC_cycles is set to 20, and protein is sent to MC.cpp: MC_rotamer_control; this performs 20 MC runs and heuristic quench minimizes the top two solutions. The energy of the best solution in protein->E_working (plus E_bounding_ceiling) is defined as the bounding energy E_ref. dee_utilities.cpp: initialize_lookuptable_for_dee is called to generate RESIMER structs within protein->var_pos; this is used to permit conversion from resimer (residue-rotamer; no distinction between rotamers from different residuetype) arguments to the i_res, i_res_rot description used in the rest of EGAD. The various DEE functions are called in turn. At present, lines are written to the logfile indicating where a monte carlo routine that has access to the DEE elimination table should go. The function ProteinDeeSpace.DeeSoln_to_CHROMOSOME is used to generate the final structure from the elimination table, if the DEE converged. If it did not, protein is sent to MC.cpp: MC_rotamer_control to generate an MC solution; this should be replaced by a monte carlo routine that has access to the DEE elimination table.

10.6.7 Self-consistent mean field optimization (SCMF)

The mini_VARIABLE_POSITION and mini_CHOICE structs are used locally within scmf.cpp. These are stripped down analogs of VARIABLE_POSITION and CHOICE used for storing information during scmf optimization.

protein->parameters.number_scmf_cycles SCMF optimizations are performed. At the start of each run, each rotamer at each position is assigned some initial probability, as described above. The SCMF optimization iterations are repeated until convergence is achieved, or protein->parameters.max_scmf_iterations is exceeded, or time runs out. Based on the probabilities calculated in the previous cycle (stored in ERESROT.P[0]), the mean field energy of each rotamer is calculated. These energies are used to calculate new rotamer probabilities stored in ERESROT.P[1]. The scmf_lambda factor described above is used to generate the weighted average of the current iteration's probabilities and the previous iteration's probabilities. The self consistency is checked by calling scmf.cpp: conf_matrix_diff. The probabilities in ERESROT.P[0] are updated with the ones stored in ERESROT.P[1].

After convergence is achieved, if the total SCMF energy for the current run is lower than any of the previous runs, the structure with the most probable rotamer at each position is placed in protein->final_chr; the rotamer probabilities are copied into the mini_varpos struct described above.

After all the runs are completed, protein->parameters.number_scmf_solns random structures are generated, biased by the rotamer probabilities in the lowest energy SCMF ensemble. Each of these are further optimized by MC/HQM. The lowest energy structure is stored in protein->final_chr. As with the other methods, CHROMOSOME_to_lookupEnergy.cpp: final_chr_to_final_pdb_final_energy is called to generate the final structures and energies for output.

Please note than the residue/rotamer probabilities within the protein->var_pos structure are not altered; subsequent calls to MC or GA will continue to use the default uniform residue and database-derived rotamer probabilities. In order to get these methods to access SCMF-calculated probabilities, pass protein->var_pos to scmf.cpp: convert_scmf_probabs_to_rotamer_probabs, as described below.

The rotamer and residue probabilities, entropies, and energies calculated by SCMF may be used for other tasks. For example, within EGAD, pK_calculate.cpp: pK_calculate uses SCMF-generated probabilities (see pK calculation section). These data are stored within the lookup table datastructure:
lookupEnergy[i].lookupRes[i_res].P                probability of residuetype i_res at position i
lookupEnergy[i].lookupRes[i_res].TdS     entropy of residuetype i_res at position i
lookupEnergy[i].lookupRes[i_res].E_mf    energy of residuetype i_res at position i
ERESROT.P[1]                                                  probability of rotamer i_res_rot at position i

As mentioned above, the SCMF probabilities stored in the lookup table are not accessible by optimization methods like MC or GA; these functions call GA_utilities.h: mutate_sidechain, which uses the probabilities stored in the VARIABLE_POSITION datastructure (default to uniform residue probabilities, and database-derived rotamer probabilities). Passing protein->var_pos to scmf.cpp: convert_scmf_probabs_to_rotamer_probabs after SCMF optimization will modify the probability arrays within protein->var_pos to match the SCMF-calculated probabilities, thus permitting MC and GA to make moves biased by these probabilities.

10.6.8 Simulation temperature

As discussed above, both MC and GA use a high simulation temperature to permit uphill moves early in the optimization. This temperature decays linearly from T0 to Tf, wtih dT change at each cycle (default values listed above). Both MC and GA call GA_utilities.cpp: update_simulation_temperature for temperature updates. In the present implementation, after the final temperature Tf is achieved, the temperature is maintained. While the implementation here appears to work well for general purposes, users may try alternative forms by modifying this function. Many other temperature schedules have been described in the literature, including non-linear schedules, and oscillations after Tf has been reached.

back to Table of Contents