back to Table of Contents
10. Rotamer optimization jobs
10.1 Overview
EGAD offers several rotamer optimization algorithms. Described here are
the algorithms and user-definable options for these jobs. The
strengths, weaknesses, and applications of these
different methods are also discussed.
The default method, if none is specified, is JOBTYPE faster with number_MC_cycles
10. For single sequence rotamer optimization, FASTER is robust,
and converges to the global minimum quickly, within minutes or seconds.
However, for designs, it is recommended that JOBTYPE mc_ga be used
instead, since
FASTER does not always converge to the global minimum for these large
problems. For other problems, such as those using sequence
restraints, JOBTYPE
ga must be used.
To ensure that the final structure is lowest energy conformation,
following the main rotamer optimization, the final sequence is
rotamer-optimized without consideration of specificity or solubility (Pokala and Handel 2005).
This step is done only for designs; for single sequence optimization,
specificity and solubility are not considered.
10.2 Algorithms
10.2.1 JOBTYPE
faster
FASTER (Fast and accurate side-chain topology and energy refinement; Desmet et al 2002) is
the default rotamer optimization method in EGAD. However, for design
problems, it is recommended that another method, such as JOBTYPE mc_ga
be
used instead.
The implementation is EGAD varies somewhat from that described by
Desmet et al. Briefly, if permitted, goldstein DEE
is used to
eliminate rotamers that cannot be in the lowest energy solution. number_MC_cycles (default 10)
MC runs followed by HQM
are performed. The best solution is used to
seed a quasi-exhaustive search. A random position is picked. At that
position, each permitted rotamer is selected, placed in the current
solution, and frozen. At all other positions, the optimal rotamer is
identified and saved to a new candidate solution consisting of the
frozen rotamer and the optimal rotamers. If this solution has a lower
energy than the current solution, it is kept. This cycle (iteration) is
repeated until all positions have been frozen without improvement. At
the end, a final MC/HQM is performed.
For single-sequence rotamer optimization, the optimization is often
completed within seconds.
By default, the logfile OUTPUT_PREFIX.log
is written to with the following format:
FASTER
iteration #
position
energy(if it drops)
timestamp
For example (examples/rotamer_optimization/gb1.faster.rotamer_optim.log):
FASTER iteration 1 position 49 Fri
Mar 4 14:58:54 2005
FASTER iteration 1 position 33 Fri Mar 4 14:58:54 2005
FASTER iteration 1 position 29 best_energy = 1628.266820 Fri Mar 4
14:58:54 2005
FASTER iteration 1 position 29 Fri Mar 4 14:58:54 2005
10.2.2 Monte carlo simulated annealing (JOBTYPE mc)
Accepted aliases:
JOBTYPE mc
JOBTYPE monte_carlo
JOBTYPE simulated_annealling
JOBTYPE sa
Monte carlo simulated annealing (MC) is a classic method for complex
optimization problems (Metropolis
et al 1953; Lee and Subbiah
1991), and is described in more detail in Figure 10.2.2.1.
In this implementation, number_mc_cycles
independent MC runs are performed (starting from a random
conformation). The solutions are ranked, and the top unique number_mc_solns solutions are subjected to the
heuristic quench minimization procedure discussed below. These
minimized solutions are ranked and output into the OUTPUT_PREFIX.mc.hqm.out
file.
Default values:
number_mc_cycles
2500
(number of independent MC runs)
number_mc_solns
250
(number of MC solutions that are subjected
to heuristic quench minimization)
mc_convergence
0
(default
0 -> automatically set to 100 times the number of floating
positions)
If number_mc_cycles is defined, but number_mc_solns is not, number_mc_solns
is set to be 10% of number_mc_cycles. This
default seems to work well; therefore, number_mc_solns
is usually not defined.
An optimal number_mc_cycles is difficult
to define. Unfortunately, the number of rotamer combinations does not
appear to be a good metric of problem difficulty. The roughness of the
sequence-structure energy landscape is the true measure of
mis-convergence; this is not easy to define. For some problems, 2500 is overkill, while for others, it's not
enough. In practice, number_mc_cycles 10000
appears to be robust for total design problems. For large total design
problems (>150 residues), this results in a runtime of little over a
day on a single 1.0 GHz CPU. For total design of smaller proteins, the
run can often be completed in a few hours. For designing
a ligand binding site, the run may require a few days. For
single sequence rotamer optimization, number_mc_cycles
100 is often enough to find the optimal solution; this requires
on the order of a few seconds to a several minutes, depending on the
size of the protein.
A useful alternative is to define number_mc_cycles
to be the number of cycles one would like to run, given no time
constraints, and then define RUNTIME. This
option returns the best solutions found if the program has not yet
performed number_mc_cycles independent MC
runs within the desired amount of time (see RUNTIME
section below). The solutions are ranked and minimized as described
above.
By default, the logfile OUTPUT_PREFIX.log
is written (see logfile section below). The format for MC output is:
mc
independent MC run #
energy
timestamp
For example (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
mc 50
1634.096631 Fri Mar 4 14:59:17 2005
mc 60 1631.632009 Fri Mar 4 14:59:18 2005
mc 70 1803.564834 Fri Mar 4 14:59:19 2005
mc 80 1628.781385 Fri Mar 4 14:59:21 2005
mc 90 1801.469873 Fri Mar 4 14:59:22 2005
Unlike for GA, logfile lines are not written during each independent MC
run, since each of these are so fast.
10.2.3 Genetic algorithm (JOBTYPE
ga)
Accepted aliases:
JOBTYPE ga
JOBTYPE genetic_algorithm
Genetic algorithms are powerful methods for complex optimization
problems, such as protein design (Desjarlais and Handel 1995).
They are essentially evolution in a computer (Figure 10.2.3.1). A population of GA_population random candidate solutions is
generated and scored. These solutions are allowed to mate with each
other, generating offspring that are hybrids of both parental
solutions. The mating frequencies are related to a boltzmann-like
probability term; akin to simulated annealing, the simulation
temperature required for calculating these is high early in the run,
and cools down (see
simulation temperature section).
Since the mating probability for a given candidate solution is related
to its score, better solutions mate more frequently, making their
components more numerous in the offspring. This new population of
candidate solutions is scored, ranked, and allowed to mate. After GA_convergence number of iterations/generations
without improvement, the run is deemed to have converged. The best solutions_per_ga_cycle unique solutions are
subjected to heuristic quench minimization (described below).
In this implementation, number_ga_cycles
independent GA runs are performed. The top unique solutions_per_ga_cycle
from each independent GA run are culled and used to seed a final
population. This population is subjected to a final tournament GA run.
Default values:
ga_population
500
(size of
actively mating population; full population is 2x larger)
number_ga_cycles
10
(number of independent GA runs)
solutions_per_ga_cycle 10
(number of best solutions culled from each independent GA run)
ga_convergence
300
(number of isoenergetic generations to define convergence)
As with MC, it may not be possible to define the optimal number
of independent GA runs required
to find the best solution. Therefore, defining number_ga_cycles
to a larger number, like 100, and
defining RUNTIME may be a good strategy
(see RUNTIME section below). If number_ga_cycles independent GA runs cannot be
performed within the defined RUNTIME, the
solutions that have been identified are used to seed the final
tournament GA run.
By default, the logfile OUTPUT_PREFIX.log
is written. The format for GA output is (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
ga
generation #
energy ∆E timestamp
For example:
ga 110
1628.266820 0.000000 Fri Mar 4 15:04:22 2005
ga 120 1628.266820 0.000000 Fri Mar 4 15:04:23 2005
After each independent GA run is completed, a
similar output is written:
GA
independent GA run #
energy
timestamp
GA 1 1628.266820 Fri Mar 4
15:04:41 2005
10.2.4 Heuristic quench minimization (HQM)
Heuristic quench minimization (HQM) cannot be defined as a JOBTYPE by the user. This method is used to
minimize, in discrete-rotamer space, solutions from other methods.
Briefly, given an initial structure, a position is picked at random.
All possible rotamers permitted at that position assayed to find the
lowest energy rotamer, and the structure updated. Another position is
picked at random from this updated structure, and the process repeated.
After some number of cycles without improvement in the energy, the run
is deemed to have converged . This method is useful since the current
implementation minimizes the candidate solutions to the point that no
single rotamer move can result in lower energy.
By default, the logfile OUTPUT_PREFIX.log
is written to with the following format (examples/rotamer_optimization/gb1.mc_ga.rotamer_optim.log):
quench
solution #
energypre-quench
energypost-quench
timestamp
For example:
quench 4
1628.350176 1628.266820 Fri Mar 4 15:04:10 2005
quench 5 1628.351251 1628.266820 Fri Mar 4 15:04:10 2005
quench 6 1628.386728 1628.266820 Fri Mar 4 15:04:10 2005
quench 7 1628.402922 1628.266820 Fri Mar 4 15:04:10 2005
It should be noted that the while the post-quench energy correlates
with the pre-quench energy, it is not a very good correlation;
lower-ranked MC or GA solutions often result in minimized energies that
are lower than the energy resulting from minimizing the best solution.
10.2.5 JOBTYPE
mc_ga
or JOBTYPE sa_ga
number_mc_cycles independent MC runs are
done, and the best unique number_mc_solutions
are ranked and heuristic quench minimized, as for the standard MC
described above. These solutions are used to seed number_ga_cycles
independent GA runs. The best solution from these GA runs is the final
solution.
The options described above for MC and GA are accepted. The default for
number_ga_cycles is changed:
number_ga_cycles
2
(number of independent GA runs)
As described above, RUNTIME may be
defined. If defined, the RUNTIME clock is
reset after the MC phase, so the GA phase has the same amount of time.
By default, the logfile OUTPUT_PREFIX.log is
written to, with the formats described above.
The OUTPUT_PREFIX.mc_hqm.out and OUTPUT_PREFIX.ga.out files list solutions from
the MC/HQM and GA steps of the optimization respectively.
10.2.6 Dead-end
elimination (JOBTYPE
dee)
Dead-end elimination (DEE) is a family of techniques that results in
the elimination of rotamers and rotamer pairs that cannot be in the
global minimum solution . After all non-optimal rotamers and rotamer
pairs are eliminated, only one solution remains: the optimal one. This
is the only technique that is guaranteed to return the optimal
solution, if it converges (convergence is not guaranteed). In
actuality, once the number of possible solutions drops into the range
where exhaustive search is possible, exhaustive search is used.
An important caveat is that DEE must be used with energy functions that
are pair-wise decomposable, which the
EGAD energy function is. However, for design, it is often required to
consider solubility and solubility, which is assayed by a non-pairwise
method (discussed in Pokala and
Handel 2005). At present, DEE cannot handle this.
Nevertheless, as discussed below, DEE is can be useful for
single-sequence rotamer optimization. For jobs listed above where
specificity and solubility are not considered (such as single-sequence
rotamer optimization), Goldstein elimination is used to quickly prune
dead-ending rotamers.
There are a few user-definable options for DEE. When DEE has determined
that the number of solutions remaining is less than max_combo_exhaustive_search, then DEE is aborted
in favor of exhaustively searching through all possible combinations.
max_combo_exhaustive_search
100000
(default)
max_nodes is the maximum number of leaf
nodes visited by the exhaustive search.
max_nodes
100000
(default)
A powerful criterion used for DEE is the bounding energy E_bounding, which acts as a ceiling for
rejection; candidate solutions with energies higher than this cannot be
in the global minimum and may be rejected immediately. By default, this
is calculated by performing 100 independent MC runs. In order to
prevent errors if the MC-calculated solution happens to be the global
minimum, E_bounding_ceiling is added on. E_bounding may also be defined explicitly by the
user. Usage with defaults:
E_bounding
0.0
(default 0.0 -> calculate with MC)
E_bounding_ceiling
1e-6
If bounding should not be used for some reason, set:
bounding_flag 0
(default 1 -> use bounding for DEE)
If RUNTIME is defined, and if convergence
has not occurred within the defined time, DEE is aborted, and MC run
instead. It is possible to have an MC that takes advantage of the known
eliminated rotamers and rotamer pairs to limit the search space.
However, this is not presently implemented.
By default, the logfile OUTPUT_PREFIX.log is
written to with the following format:
dee
message
timestamp
The messages state things like which DEE block is
currently being used, and the number of possible solutions remaining,
etc.
10.2.7 Self-consistent mean field
optimization (JOBTYPE scmf)
Unlike the other methods described here, self-consistent mean field
optimization (SCMF) does not result in a final structure. Rather, it
results in a probability for each rotamer at each position, and
provides a measure of sidechain entropy (Koehl and Delarue 1994). Since
the result is a
self-consistent ensemble of rotamer and amino acid probabilities, this
method is not especially useful for design. However, it can provide
information on what amino acid types may be permissible at certain
positions (Voigt et al. 2001).
It has been suggested that the mean-field energy resulting from an SCMF
run provides a better estimate of the actual energy than a single
structure resulting from the other optimization methods (Lee 1994; Havranek and Harbury 2003).
However, none
of the development of the EGAD energy function has assumed used this.
For algorithm description, please see Figure
10.2.7.1. SCMF is used within EGAD for JOBTYPE pK.
Initial probability distribution scmf_seed
can be unbiased, random,
or weighted. If unbiased
(default), each rotamer at a position is assigned the same initial
probability. If random, each rotamer is
assigned a random probability. If weighted,
the initial probability is defined based on the rotamer-backbone
energies.
Usage:
scmf_seed
seed-type
(unbiased, random, or weighted; default unbiased)
SCMF does not converge to a unique solution; different random starting
points can result in convergence to one of a handful of self-consistent
solutions. If the unbiased seed is used, it will likely converge to the
solution with the largest radius of convergence; in many cases, this is
not the lowest-energy ensemble (Mendes
et al. 1999). Therefore, it is suggested that
multiple random seeds be used. This can be indicated with number_scmf_cycles. The default is 1. When number_scmf_cycles>1, the first SCMF run is always seeded with the
user-defined scmf_seed probability, while subsequent ones have random seeds.
Usage:
number_scmf_cycles
n
(n = number of independent SCMF runs; default 1)
As with the other methods, RUNTIME
may be defined. If number_scmf_trials SCMF
runs have not been completed within the time allotted, the
lowest-energy solution is output.
As described earlier, SCMF optimizations can often result in
oscillations; the ensemble moves from one configuration to another,
never actually converging to a unique state. To prevent this, the
boltzmann probabilities calculated from the energies that are used to
calculate the mean-field energies in the next iteration are actually a
weighted average between the actual boltzmann probability, and the
probabilities of the previous iteration (Koehl and Delarue 1994). The
default value:
scmf_lambda
0.25
(weight on current iteration boltzmann probability for averaging )
The weight on probabilities from the previous cycle is 1.0 - scmf_lambda. The default listed here
appears to give timely convergence in most cases tested.
While SCMF does not converge to a single solution, the rotamer
probabilities may be used to generate a structure composed of the most
probable rotamer at each position. This structure is most likely not
the lowest energy structure; indeed, it may be some distance from it.
Nevertheless, by default, this structure is the one written to the
outputted pdb file.
If a lower energy structure is desired, set:
scmf_quench_flag
true
# default false)
and/or:
number_scmf_solns
n
# n = # of SCMF solutions; default 10; if defined, scmf_quench_flag is true)
These options direct the generation of number_scmf_solns
random structures whose rotamers are biased by the mean-field
probabilities. These biased-random structures are used as seeds for MC
runs whose outputs are subjected to heuristic quench minimization. The
lowest energy structure is output.
By default, the logfile OUTPUT_PREFIX.log is
written to with the following format:
scmf iteration#
convergence_score Emf
T∆S
Emf-T∆S
timestamp
The convergence_score field is the sum of
the squares of the differences between rotamer probabilities in the
current iteration and the previous one; convergence is assumed when
this falls below 10-6. The Emf field
is the mean-field energy, calculated using the current rotamer
probabilities. The T∆S field is the
entropy calculated with current rotamer probabilities; the temperature T is the user-defined TEMPERATURE
value. The Emf-T∆S field is the free energy of the current
mean-field ensemble.
See examples/scmf/gb1.scmf.wt_seq.input and the resulting output for an example of
scmf optimization of a single sequence.
If more than one residuetype is allowed at a given position, the
probabilities for each residue are given As discussed by Voigt et al ,
SCMF-derived amino-acid probabilities may be of use for designing
libraries for selections. However, since these probabilities are not
physical (what is a fractional residue?), caution should be exercised.
The final pdb file lists these amino acid probabilities and positional
entropies (examples/scmf/gb1.scmf.design.pdb):
...
SCMF_ENTROPY 42 TdS = 1.555334 A 0.245224 D 0.143858 E 0.084495 F
0.000000 H 0.000991 I 0.000000
K 0.010949 L 0.000001 M 0.000000 N 0.153603 Q 0.035709 R 0.015772 S
0.309239 T 0.000158 V 0.0000
01 W 0.000000 Y 0.000001
SCMF_ENTROPY 43 TdS = 0.013396 A 0.000000 D 0.000000 E 0.000000 F
0.000000 H 0.000000 I 0.000000
K 0.000007 L 0.000000 M 0.000000 N 0.000000 Q 0.000000 R 0.999993 S
0.000000 T 0.000000 V 0.0000
00 W 0.000000 Y 0.000000
SCMF_ENTROPY 44 TdS = 1.290649 A 0.056263 D 0.212732 E 0.001518 F
0.000000 H 0.002230 I 0.000000
K 0.001435 L 0.000000 M 0.000000 N 0.643994 Q 0.005267 R 0.001966 S
0.074595 T 0.000000 V 0.0000
00 W 0.000000 Y 0.000000
...
In this example, position 42 has a 24.52%
probability for A, 14.39% probability for D, 15.36% probability for N,
30.92% probability for S, and smaller probabilities for other
residuetypes. Position 43 is far less
degenerate; the probabilities are dominated by R. Like position 42, position 44 has
significant degeneracy, but different residues are preferred compared
to position 42.
10.2.8 OUTPUT_PREFIX.xxx.out
format
MC and GA produce OUTPUT_PREFIX.mc_hqm.out
and OUTPUT_PREFIX.ga.out files
respectively. These files list the top candidate solutions generated by
these methods, ranked by energy.
Format:
...
SOLUTION_# n
ENERGY E_working: xxxx
ROTAMER seq_position core_surf_int residuetype chi[1] ... chi[num_chi]
...
ROTAMER seq_position
core_surf_int residuetype chi[1] ... chi[num_chi]
SEQUENCE_n
sequence_of_the_solution_including_non_variable_positions
VARIABLE_SEQUENCE_n
sequence_of_the_solution_only_for_variable_positions
END
SOLUTION_# n+1
...
The SEQUENCE,
VARIABLE_SEQUENCE, E_working (the
energy of the candidate solution), and ROTAMER
entries are described in the pdb format section. The ROTAMER entries
can be used as input for user-defined
rotamers
in the inputfile. VARIABLE_SEQUENCE
entries are not produced for single-sequence rotamer optimizations.
10.3 User-definable options
10.3.1 Simulation temperatures
As described above, MC and GA require temperature-dependent boltzmann
probability calculations for making moves. The simulation temperature
decreases with each iteration, for the reasons discussed above. These
temperatures, and the cooling cycle can be defined by the following
parameters:
mc_T0
5000
# (starting temperature for simulated annealing)
mc_dT
-10
# (temperature change per iteration for simulated annealing)
mc_Tf
100
# (final temperature for simulated annealing)
ga_T0
5000
# (starting temperature for GA)
ga_dT
-10
# (temperature change per iteration for GA)
ga_Tf
100
# (final temperature for GA)
The temperature starts at T0, decreases
linearly by dT every iteration. When the
temperature reaches Tf, it is held
constant for the remainder of the run.
The defaults listed here were determined by trial-and-error for a
several test cases. By no means are these optimal for all problems.
Indeed, if one is going to perform many optimizations for a given
backbone, it may be worthwhile to empirically determine better values
with a series of short runs.
Non-linear cooling schedules are presently not implemented; however,
based on trial-and-error, it does not seem that these are significantly
better than the simple linear schedule used here. Indeed, hyperbolic
schedules seem to give worse results.
10.3.2 RUNTIME
It is easy to tell when DEE has converged; only one
solution remains. For FASTER, convergence is achieved when no more
moves are possible. However, for stochastic methods, it is not quite so
straightforward to determine when convergence has been reached. In
principle, the other methods should converge to the global minimum,
given infinite time. In practice, convergence is defined by some
empirically determined number of moves without improvement (discussed
below in implementation details). Since these methods often (if not
usually) become trapped in some local minimum, it is necessary to run
these many times. The question is how many times is enough.
Unfortunately, this is not an easy question to answer. While the
default values described above appear to be pretty good, they are
certainly not optimal. For some problems, they may be overkill, while
for others, not enough.
The runtimes discussed above may be reasonable for most situations.
However, if there are many situations in which what is desired is the
best solution that can be obtained within a given period of time.For
these cases, the RUNTIME
option is useful:
RUNTIME n
(n = time in decimal minutes)
When this option is defined, the program periodically checks the time
it has spent on a particular algorithm. If RUNTIME
minutes have elapsed since the start of the optimization method block,
the program moves to the next step (usually heuristic quench
minimization on the best solutions).
RUNTIME does not include the time needed to
calculate the pair energy table or the heuristic quench minimization on
the best solutions. If multiple algorithms are used in hybrid methods
such as MC_GA, the clock is reset for each method.
10.3.3 Logfile options
Logfiles are written to assure the user that the process in question
hasn't crashed <yet>, and to also give some measure of progress,
to permit an estimation of completion time.
If LOGFILE_FLAG 1 (default), logfiles are
written to OUTPUT_PREFIX.log.
If LOGFILE_FLAG 0, neither
the OUTPUT_PREFIX.log logfile or OUTPUT_PREFIX.out file that contains sub-optimal
solutions are written. This option is often used for batch processing
when it is not desirable to clutter up a working directory with many
files.
By default, the a logfile entry is written only every 10th iteration
(for MC, "iteration" = independent MC run; for GA, "iteration" =
generation within a GA run). This may be changed with:
LOGFILE_PERIOD
n
(n = logfile line written every nth iteration; default 10)
10.4 Premature program termination
If the user wishes to gracefully exit the program pre-maturely, or (for
multi-step optimizations such as MC_GA),
jump to the next step, the logfile lists the names of files that may be
created to signal this.
touch escape.job_specific_label.pid (pid= process ID number)
will create the "escape hatch" file, and lead to premature program
termination or advancement to the next step. Depending on the job, the
actual escape or jump may not happen instantaneously.
10.5 Strengths and weaknesses of different
rotamer optimization
methods
These algorithms have different strengths and weaknesses. An excellent
reference is the work of Voigt and Mayo (Voigt et al 2000). In this
study, the
investigators tested various rotamer optimization methods and compared
their accuracy to the known global minimum calculated by DEE.
Dead-end elimination (DEE) is guaranteed to converge to an optimal
solution, if it converges at all. For some smaller problems, DEE
converges quite rapidly. For larger problems, it can take much longer
to converge. If non-pairwise filters are placed upon the search, such
as solubility, DEE cannot be used. However, there may be ways to
address some of these problems in a pairwise manner. Due to these
issues, the current implementation of DEE is recommended for use only
for single sequence rotamer optimization. For these problems, DEE
almost always converges, and is quite fast.
Unlike DEE, stochastic methods such as the genetic algorithm (GA) and
monte carlo simulated annealing (MC) are not guaranteed to reach the
global minimum (caveat; see below). For smaller problems, such as
single sequence rotamer optimization, they are often significantly
slower than DEE, due to the requirement of running multiple individual
runs (an individual MC run, of course, is much faster than any DEE
run). However, these methods have some advantages over DEE for other
problems. These methods can easily address problems with non-pairwise
search filters, such as specificity. Unlike DEE, these problems scale
well for larger problems. Indeed, for design problems, these methods
are significantly faster than DEE (even when the running time includes
thousands of independent MC runs). Finally, the fact that these are not
guaranteed to find the global minimum is often not a problem. For many
problems, sub-optimal, near-global solutions may suffice. It should be
noted that for almost all single-sequence rotamer optimization cases
tested so far, the MC-calculated solution has been shown to be
identical to the DEE-calculated global minimum, strongly suggesting
that the implementation in EGAD is robust.
While some success has been reported using heuristic quench
minimization (HQM) alone with random starting points, it requires an
enormous number of runs to find near-optimal or optimal solutions for
larger problems (Wernisch et
al 2000); unpublished observations. Since it is a purely
downhill method, the solution is highly sensitive to the starting
point; the heuristic quench procedure is very good at searching local
space, but does not sample the entire space well. In contrast, MC
samples the entire space far more efficiently, but it is not so good at
the "endgame" of finding the lowest energy near a low-energy, yet
suboptimal solution. This is due to the highly stochastic nature of MC
moves; a solution may be on a cliff wall, but if the right random
rotamer at the right random position is never picked within the
proscribed number of random tries, the minimal solution is never found.
On the other hand, HQM moves are exhaustive at each selected position,
and the current implementation guarantees that each position is
examined. The hybrid implementation used here allows the strengths of
each algorithm to complement the other's weaknesses. One may also
imagine other hybrids, such as an MC in which each move consists of
picking a random position, assaying the energy of all rotamers at that
position in a manner similar to HQM, and selecting one randomly, biased
by the boltzmann probabilities; this would contain the exhaustive
nature of HQM moves, yet permit uphill moves to escape local minima.
An individual GA run has a much better chance of finding the optimal
solution than an individual MC run. However, each individual run is
much slower (minutes vs. seconds). MC is fast, but gets trapped easily
in local minima. Therefore, it is necessary to run orders of magnitude
more independent runs for this method than for a GA. As a trade-off, it
appears that a very good general method is the JOBTYPE
mc_ga option, in which solutions from individual MC runs are
used to seed a population for GA. If one may imagine a landscape with
multiple well-separated valleys, each individual MC run (including HQM)
converges to one of the valleys. A GA seeded with these can create
hybrid solutions that may have lower energies than the already
low-energy parents. After all, a basic premise of GA is that hybrids of
good solutions are more likely to lead to better solutions than hybrids
of poor solutions; an initial population full of good solutions should
have better odds of finding the optimal solution than a completely
random initial population. It should be noted that for small problems
like single-sequence rotamer optimization or core design, the GA step
more often than not does not result in an improved solution; for these
problems, the MC-derived solution is often optimal, as discussed above.
The FASTER algorithm has some of the quasi-exhaustive properties
of HQM and DEE with the uphill moves of MC. Fixing each rotamer is
analogous to making an uphill MC move; this fixed rotamer may have a
worse energy than the current solution. However, performing an
exhaustive search at all other positions guarantees that all
single-move improvements will be identified. This method explores the
local landscape of a potential solution quasi-exhaustively by climbing
all neighboring ridges (fixing each rotamer) and looking down
(exhaustive search at all other positions). For small problems,
especially single sequence rotamer optimization, FASTER is indeed fast
(seconds), certainly much faster than DEE. Another major advantage that
FASTER has over DEE is that, like MC and GA, it can be used with
non-pairwise decomposable energies, such as our hydrogen-bonding
specificity reference model (Pokala
and Handel 2005).
While the default implementation of FASTER is the best for small
problems, including single-sequence rotamer optimization, it has
problems with large design problems. However, by increasing number_MC_cycles from 10,
to say, 200, FASTER may yield similar or
better results. Since the optimal number of MC cycles that should be
run during FASTER for large design problems has not yet been
determined, MC_GA is recommended.
10.6 Implementation details
10.6.1 Data structures and general-use
functions
The first rotamer optimization method code written in EGAD was the
genetic algorithm, thus explaining the name of the program, as well as
the names and organization of many of its data-structures. The CHROMOSOME and GENE
(and bkbnGENE for moving backbone....not
relevant here) structs are the primary repositories and conduits for
the dihedral representation of the protein structure. An individual CHROMOSOME represents a single protein
structure, and contains information about the energy, and information
for the optimization method, such as mating frequencies for genetic
algorithms. GENE genes
and bkbnGENE bkbngenes
are linked-lists within CHROMOSOME that
carry the actual information necessary for building the coordinates,
such as dihedral angles. GENE has pointers
to elements of lookupEnergy and var_pos, permitting rapid access to pair energy
lookup table values during rotamer optimization.
The genes linked list within a CHROMOSOME structure is allocated and
initialized by GA_utilities.cpp:
inoculate_sidechains, using a VARIABLE_POSITION
array as a template. This function calls GA_utilities.h:
mutate_sidechain to select a random residuetype and
biased-random rotamer for each GENE (for
each non-pro/gly position), and assign the appropriate pointers within
the GENE elements. The function GA_utilities.cpp: randomize_sidechains works in
a similar manner to generate new random choices for an already
initialized GENE linked-list within CHROMOSOME. For both these functions, if
solubility should be considered, the CHROMOSOME
is passed to solubility.cpp: solubilize_CHROMOSOME
(see solubility design section).
In contrast to these functions, GA_utilities.cpp:
mutagenize_sidechains_CHROMOSOME does not completely randomize a
pre-initialized CHROMOSOME; it uses the
inputted mut_freq mutation frequency
randomly to pick positions to alter.
GA_utilities.cpp and GA_utilities.h
contain other useful functions.
Energies are calculated for the entire CHROMOSOME
by the function CHROMOSOME_to_lookupEnergy.cpp:
CHROMOSOME_to_lookupEnergy. For MC and HQM, the function CHROMOSOME_to_lookupEnergy.cpp:
lookup_energy_sasa_charge_other is used to calculate the
interaction energy between a single rotamer at a given position and the
backbone and rotamers at all other position. Both these functions
access the lookup table through pointers in GENE,
which point to CHOICE elements within the VARIABLE_POSITION array, which in turn point to
the actual lookup table elements. If a pair of residues interact, but
the energies for that pair has not yet been read/calculated (flagged as
described in the pair energy lookup table section), CHROMOSOME_to_lookupEnergy.cpp:
load_calc_save_lookupResX is called to read/calculate, and save
this information. If the entire lookup table is read/calculated
upfront, these function will not be called.
10.6.2 Genetic algorithm (GA)
10.6.2.1 Control functions for genetic
algorithm
If "GA" is defined for protein->parameters.algorithm,
main sends protein
to GA.cpp: GA_rotamers_master_control after
the lookup table is read/generated by lookup_table.cpp:
generate_lookup_table. A CHROMOSOME
array is initialized and sent to GA.cpp: GA_rotamer_control.
After this function is completed, the rotamers, energy and sequences of
solutions are written to OUTPUT_PREFIX.out.
The lowest-energy CHROMOSOME is copied to protein->final_chr. This is then passed to CHROMOSOME_to_lookupEnergy.cpp:
final_chr_to_final_pdb_final_energy to generate the final pdb
structures and energies in protein.
GA.cpp: GA_rotamer_control manages the GA.
This takes as input an array of CHROMOSOME chrs,
which contains chrs_size CHROMOSOMEs that
are used to seed the actual GA runs. These are solutions from
independent runs of the previous optimization method; however, if this
function is called by GA.cpp:
GA_rotamers_master_control, chrs should contain random CHROMOSOMEs.
The chrs array is sorted by energy upon
entry into GA_rotamer_control. working_chrs and final_chrs
CHROMOSOME arrays are initialized; these are respectively the
working CHROMOSOME array sent to GA.cpp: evolve_sidechains for independent GA
runs, and the CHROMOSOME array which
stores the best solutions from all the runs. parameters->number_GA_cycles
independent GA runs are performed. The nth
GA run includes chrs[n] onward in the
initial population stored in working_chr;
the remainder of the population 2*parameters->pop_size
are filled with random CHROMOSOMEs. If the
passed chrs_size<0, the number of chrs array elements is |chrs_size|,
and each is randomized before each independent GA run. working_chr is passed to GA.cpp:
get_rid_of_isoenergetic_chrs in order to replace isoenergetic
(ie: identical) CHROMOSOMEs. After
seeding, working_chr is passed to evolve_sidechains, which performs the actual
genetic algorithm. The best parameters->num_GA_solns_per_cycle
solutions are copied into final_chrs.
After all the independent GA runs are completed, and if there is time
available, final_chrs is passed to evolve_sidechains for a final tournament GA. The
best chrs_size solutions in final_chrs are copied back to chrs for return back to the calling function.
10.6.2.2 Actual genetic algorithm: GA.cpp:
evolve_sidechains
All the members of the inputted initial population chrs must be
soluble, as assessed by solubility.cpp:
is_this_chr_soluble (if SOLUBILITY_CUTOFF_FLAG
0, <default> all are soluble). After being scored for
energy by CHROMOSOME_to_lookupEnergy.cpp:
CHROMOSOME_to_lookupEnergy, the chrs
array is passed to GA.cpp:
get_rid_of_isoenergetic_chrs to replace identical members.
The population is sorted by energy by sort_CHROMOSOME,
and sent to GA_utilities.cpp:
calculate_mating_frequencies. This function calculates the
relative (with respect to the best one) real energies pop_energy for each member of the population.
The rank is used to define deltaRank for
each member of the population. The mating frequencies boltzProbab and rankProbab
calculated for each CHROMOSOME by boltzmann_CHROMOSOME. boltzProbab
is the boltzmann probability of a given CHROMOSOME,
based on the relative real energy pop_energy,
while rankProbab is the boltzmann
probability based on the pseudo-energy deltaRank.
While both result in lower energy CHROMOSOMEs
mating more frequently, the double scheme used here ensures that
individuals with significantly lower energies than the others will not
take over the population; this prevents genetic drift and the founder
effect that result in premature convergence. If calculate_mating_frequencies
returns 0, the GA run is deemed to have
convereged, since the worst and best ranked members are identical; in
principle, this should never happen, except for very small problems.
The GA runs until the number of isoenergetic generations convergence_counter ≤
parameters->ga_convergence. In each generation, the
population is mated by GA_utilities.cpp:
mate_sidechains, based on the mating probabilities calculated in
the previous cycle. Mating is a very simple procedure. Two random CHROMOSOMEs are selected and copied to two child
CHROMSOME elements. The children are
allowed to undergo recombination (cross-over frequency of 50%;
accomplished by swapping nextgene elements
of the genes linked-list at cross-over points. Insoluble offspring are
solubilized by calling solubility.cpp:
solubilize_CHROMOSOME. As described above, energies of the
children are scored, identical members pruned, and mating frequencies
calculated.
If ten generations have gone by with the same best energy, the mutation
frequency is increased steadily for ten generations. If after ten
generations with increased mutation frequency the energy has still not
dropped, the mutation frequency is reset to the standard value parameters->mutation_freq for another ten
generations, or until the energy drops. If the energy has not dropped
after an addition ten generations, the mutation frequency is allowed to
rise again, and then relax until the energy drops. Once the energy
drops, the mutation frequency is reset to parameters->mutation_freq.
The simulated temperature is decreased as discussed above (simulation
temperature section).
After convergence is achieved, or if time runs out, the population is
sorted, and the best parameters->num_GA_solns_per_cycle
solutions are subjected to HQM by calling HQM_rotamers.cpp:
HQM_rotamers. This final population is sorted.
10.6.3 Monte carlo simulated annealing (MC)
10.6.3.1 Control function MC.cpp:
MC_rotamer_control
If protein->parameters.algorithm starts
with "MC", main
sends protein to MC.cpp:
MC_rotamer_control after the lookup table is read/generated by lookup_table.cpp: generate_lookup_table. The protein->chr CHROMOSOME array is allocated
with 2*protein->parameters.number_MC_cycles
members. protein->parameters.number_MC_cycles
independent MC runs are performed by calling MC.cpp:
MC_rotamers with a randomized member of the protein->chr
array as an argument.
After the defined number of independent MC runs are completed (or if
time runs out) GA.cpp:
get_rid_of_isoenergetic_chrs is called to filter protein->chr, when is then sorted. The best protein->parameters.number_MC_solns are
subjected to HQM. protein->chr is
again filtered to eliminate isoenergetic solutions and sorted.
If protein->parameters.algorithm contains
"GA", the quenched protein->chr
array is passed to GA.cpp: GA_rotamer_control
for performing GA runs, using the MC/HQM solutions as seeds.
The rotamers, energy and sequences of solutions are written to OUTPUT_PREFIX.out. The lowest-energy CHROMOSOME is copied to protein->final_chr.
This is then passed to CHROMOSOME_to_lookupEnergy.cpp:
final_chr_to_final_pdb_final_energy to generate the final pdb
structures and energies in protein.
10.6.3.2 Actual simulated annealing: MC.cpp:
MC_rotamers
The inputted CHROMOSOME chr must be
soluble, as assessed by solubility.cpp:
is_this_chr_soluble (if SOLUBILITY_CUTOFF_FLAG
0, then it is soluble). A working CHROMOSOME
dummychr and a working GENE i_gene
are allocated. chr is copied to dummychr. Bona fide Floating positions
are identified, and their var_pos indicies
loaded into the indicies_of_floating_positions
array. If SOLUBILITY_CUTOFF_FLAG 1, the
SASA, transfer energy, and charge components due to fixed positions are
added up by solubility.cpp:
fixed_position_sasa_charge.
The actual simulated annealing runs until parameters->mc_convergence
iterations pass without decreasing the energy. In each iteration, a
random floating position i is picked. The dummychr genes element corresponding to this
position is copied to i_gene. The
interaction energy of the rotamer currently at position i with all other floating rotamers at all other
non-i positions is calculated with CHROMOSOME_to_lookupEnergy.h:
lookup_energy_sasa_charge_other. If SOLUBILITY_CUTOFF_FLAG
1, this function also returns the SASA, transfer energy, and
charge components of the other floating non-position i rotamers. The internal energy and interaction
energy with fixed atoms energy_var_fix is
added to this energy.
i_gene is then mutated to a random
residuetype and rotamer. If SOLUBILITY_CUTOFF_FLAG
1, i_gene is repeatedly mutated
until dummychr as a whole is soluble (see
solubility design section). The energy of this new rotamer is
calculated as described above for the current rotamer. If the
Metropolis criterion is satisfied, the move is accepted, i_gene copied into dummychr,
and the energy of the entire structure calculated with CHROMOSOME_to_lookupEnergy.cpp:
CHROMOSOME_to_lookupEnergy. If the new dummychr
is the lowest energy one visited thus far, it is copied into the
inputted chr. Finally, the simulation
temperature is dropped as described above.
The implementation described here does not require calculation of the
energy for the entire structure in order to make a move; instead, only
the energies involving the position under consideration need to be
calculated. This saves a tremendous amount of time. Indeed, if complete
energy calculation were required, MC would be significantly slower than
GA, since the vast majority of MC moves are likely to be rejected.
10.6.4 Heuristic quench minimization (HQM): HQM_rotamers.cpp: HQM_rotamers
The start of this function is virtually identical to the start of MC.cpp: MC_rotamers.
The actual HQM runs until hqm_convergence
iterations pass without a decrease in energy. A random floating
position is picked. The genes element in dummychr.genes
corresponding to this position is identified and copied to the working GENE i_gene. The energy of this rotamer with all
other floating sidechains is calculated with CHROMOSOME_to_lookupEnergy.h:
lookup_energy_sasa_charge_other, and the energy_var_fix
added on.
The lowest energy rotamer at this position is identified by calling HQM_rotamers.h: find_best_res_res_rot. This
function exhaustively measures the energy of all rotamers permitted at
this position to find the lowest energy one (rotamers that cause dummychr to become insoluble are identified and
rejected). If a lower energy rotamer is indeed found, the relevant gene
in dummychr.genes is updated.,
An array already_used_positions contains
information on whether a position has been used or not. HQM_rotamers.cpp: has_this_position_been_used
checks this array and returns 1 if it has been used. If the energy
drops, the already_used_positions array is
reset. Convergence is assumed when every position has been visited
without a drop in energy.
The final dummychr is copied to chr, and the total energy calculated with CHROMOSOME_to_lookupEnergy.cpp:
CHROMOSOME_to_lookupEnergy.
As with the MC implementation, the HQM implementation described here
does not require calculation of the energy for the entire structure in
order to make a move; instead, only the energies involving the position
under consideration need to be calculated. Indeed, since all moves are
required to be downhill, the energy of the entire structure needs to be
calculated only once, at the end.
10.6.5 Fast and accurate side-chain topology
and energy
refinement (FASTER)
If "FASTER" is defined for protein->parameters.algorithm, main sends protein
to somewhat_FASTER.cpp: somewhat_FASTER after
the lookup table is read/generated by lookup_table.cpp:
generate_lookup_table. protein->parameters.number_MC_cycles
independent MC/HQM are run. The best of these is copied to the best_chr CHROMOSOME.
As with HQM and MC, bona fide moving positions are identified
and stored in the indicies_of_floating_positions
array, which is used for randomly picking positions to freeze.
Similar to HQM, an array already_used_positions
contains information on whether a position has been used or not. HQM_rotamers.cpp: has_this_position_been_used
checks this array and returns 1 if it has been used. The outer loop
selects an unused position, exhaustively freezes each rotamer in turn,
and uses to find the optimal rotamer at all other positions. The frozen
rotamer and the optimal rotamers are stored in best_rotamer_chr
CHROMOSOME. After optimal rotamers are selected for all the
non-frozen positions, best_rotamer_chr is
scored for the total energy. If its energy is lower than that for the
current best_chr, it replaces it, and the
convergence counter (converged_flag) is
reset. Convergence is assumed if all the positions are visited and
frozen without an improvement in energy.
10.6.6 Dead-end elimination (DEE)
The DEE implemented in EGAD was written primarily by Mark Voorhies; the
reader is directed to his documentation. At present, the DEE
source code is distributed under LGPL
license, not
GPL.
If "DEE" is defined for protein->parameters.algorithm, main sends protein
to DeeControl.cpp: DeeControl after the
lookup table is read/generated by lookup_table.cpp:
generate_lookup_table. If bounding_flag 1
and protein->parameters.dee_E_bounding 0.0
(defaults set in input_stuff.cpp: input_stuff),
protein->parameters.number_MC_cycles is
set to 20, and protein
is sent to MC.cpp: MC_rotamer_control;
this performs 20 MC runs and heuristic quench minimizes the top two
solutions. The energy of the best solution in protein->E_working
(plus E_bounding_ceiling)
is defined as the bounding energy E_ref. dee_utilities.cpp: initialize_lookuptable_for_dee
is called to generate RESIMER structs
within protein->var_pos; this is used
to permit conversion from resimer
(residue-rotamer; no distinction between rotamers from different
residuetype) arguments to the i_res, i_res_rot description
used in the rest of EGAD. The various DEE functions are called in turn.
At
present, lines are written to the logfile indicating where a monte
carlo routine that has access to the DEE elimination table should go.
The function ProteinDeeSpace.DeeSoln_to_CHROMOSOME
is used to generate the final structure from the elimination table, if
the DEE converged. If it did not, protein is sent to MC.cpp: MC_rotamer_control to generate an MC
solution; this should be replaced by a monte carlo routine that has
access to the DEE elimination table.
10.6.7 Self-consistent mean field
optimization (SCMF)
The mini_VARIABLE_POSITION and mini_CHOICE structs
are used locally within scmf.cpp. These
are stripped down analogs of VARIABLE_POSITION
and CHOICE used for storing information
during scmf optimization.
protein->parameters.number_scmf_cycles
SCMF optimizations are performed. At the start of each run, each
rotamer at each position is assigned some initial probability, as
described above. The SCMF optimization iterations are repeated until
convergence is achieved, or protein->parameters.max_scmf_iterations
is exceeded, or time runs out. Based on the probabilities calculated in
the previous cycle (stored in ERESROT.P[0]),
the mean field energy of each rotamer is calculated. These energies are
used to calculate new rotamer probabilities stored in ERESROT.P[1]. The scmf_lambda
factor described above is used to generate the weighted average of the
current iteration's probabilities and the previous iteration's
probabilities. The self consistency is checked by calling scmf.cpp: conf_matrix_diff. The probabilities in
ERESROT.P[0] are updated with the ones
stored in ERESROT.P[1].
After convergence is achieved, if the total SCMF energy for the current
run is lower than any of the previous runs, the structure with the most
probable rotamer at each position is placed in protein->final_chr;
the rotamer probabilities are copied into the mini_varpos
struct described above.
After all the runs are completed, protein->parameters.number_scmf_solns
random structures are generated, biased by the rotamer probabilities in
the lowest energy SCMF ensemble. Each of these are further optimized by
MC/HQM. The lowest energy structure is stored in protein->final_chr.
As with the other methods, CHROMOSOME_to_lookupEnergy.cpp:
final_chr_to_final_pdb_final_energy is called to generate the
final structures and energies for output.
Please note than the residue/rotamer probabilities within the protein->var_pos structure are not altered;
subsequent calls to MC or GA will continue to use the default uniform
residue and database-derived rotamer probabilities. In order to get
these methods to access SCMF-calculated probabilities, pass protein->var_pos to scmf.cpp:
convert_scmf_probabs_to_rotamer_probabs, as described below.
The rotamer and residue probabilities, entropies, and energies
calculated by SCMF may be used for other tasks. For example, within
EGAD, pK_calculate.cpp: pK_calculate uses
SCMF-generated probabilities (see pK calculation
section). These data are stored within the lookup table
datastructure:
lookupEnergy[i].lookupRes[i_res].P
probability of residuetype i_res at
position i
lookupEnergy[i].lookupRes[i_res].TdS
entropy of residuetype i_res at position i
lookupEnergy[i].lookupRes[i_res].E_mf
energy of residuetype i_res at position i
ERESROT.P[1]
probability of rotamer i_res_rot at
position i
As mentioned above, the SCMF probabilities stored in the lookup table
are not accessible by optimization methods like MC or GA; these
functions call GA_utilities.h: mutate_sidechain,
which uses the probabilities stored in the VARIABLE_POSITION
datastructure (default to uniform residue probabilities, and
database-derived rotamer probabilities). Passing protein->var_pos
to scmf.cpp:
convert_scmf_probabs_to_rotamer_probabs after SCMF optimization
will modify the probability arrays within protein->var_pos
to match the SCMF-calculated probabilities, thus permitting MC and GA
to make moves biased by these probabilities.
10.6.8 Simulation temperature
As discussed above, both MC and GA use a high simulation temperature to
permit uphill moves early in the optimization. This temperature decays
linearly from T0 to Tf,
wtih dT change at each cycle (default
values listed above). Both MC and GA call GA_utilities.cpp:
update_simulation_temperature for temperature updates. In the
present implementation, after the final temperature Tf is achieved, the temperature is maintained.
While the implementation here appears to work well for general
purposes, users may try alternative forms by modifying this function.
Many other temperature schedules have been described in the literature,
including non-linear schedules, and oscillations after Tf has been reached.
back to Table of Contents