Documentation by Hahnbeom Park (hahnbeom@u.washington.edu)
This protocol brings large scale structure refinement starting from a pool of
HybridizeMover works as the basic sampling unit during the overall iterative global energy optimization process. The objective function for global optimization is the Rosetta all-atom energy by default, but user-defined restraints (for instance, co-evolution restraints) can also be incorporated as a weighted sum into the total score.
Protein structure determination using metagenome sequence data. Sergey Ovchinnikov, Hahnbeom Park, Neha Varghese, Po-Ssu Huang, Georgios A. Pavlopoulos, David E. Kim, Hetunandan Kamisetty, Nikos C. Kyrpides, David Baker. Science 2017, 355:294-298.
Protein homology model refinement by large scale energy optimization. Hahnbeom Park, Sergey Ovchinnikov, David E Kim, Frank DiMaio, and David Baker. Proc Natl Acad Sci USA 2018.
The combination of the Rosetta app and the Python master script described here carries out genetic-algorithm-inspired structural refinement. The key concepts in a genetic algorithm are a) parent selection, b) crossover or mutational (structural) operations that generate new offspring structures from those parents, c) pool control after new structures are generated, and d) optionally, logic for preventing early convergence, that is, for maintaining sufficient structural diversity during the procedure. The app and script contain logic borrowed from Conformational Space Annealing (CSA), such as annealing the distance threshold used for clustering as iterations proceed and selecting parents based on the number of times each has been used without discovering a new competitive structure (nuse); this improves a), c), and d) over typical genetic algorithms. The structural operations in b) rely mostly on HybridizeMover, which is optimized for crossover-style structural operations in homology modeling problems.
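The CSA-style ideas named above (nuse-based parent selection, an annealed distance threshold, and distance-aware pool control) can be illustrated with a small, generic sketch. This is not the logic of IterationMaster.py or iterhybrid_selector; the `Member` class, the function names, the linear annealing schedule, and the replacement rule are all assumptions chosen to show the general technique.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical pool member; 'coord' is a stand-in for a real structure."""
    name: str
    energy: float
    coord: float = 0.0
    nuse: int = 0  # times used as a parent so far


def select_parents(pool, n_parents):
    """Prefer the least-used members, breaking ties by lower energy."""
    ranked = sorted(pool, key=lambda m: (m.nuse, m.energy))
    chosen = ranked[:n_parents]
    for m in chosen:
        m.nuse += 1
    return chosen


def annealed_cutoff(iteration, n_iter, d_start, d_end):
    """Move the clustering distance threshold linearly from d_start to d_end."""
    if n_iter <= 1:
        return d_end
    frac = min(max(iteration / (n_iter - 1), 0.0), 1.0)
    return d_start + (d_end - d_start) * frac


def update_pool(pool, cand, distance, cutoff):
    """Generic CSA-style pool control.

    If the candidate is within 'cutoff' of its nearest pool member, it
    competes with that member; otherwise it competes with the highest-energy
    member. The candidate replaces its competitor only if its energy is lower.
    """
    nearest = min(pool, key=lambda m: distance(m, cand))
    if distance(nearest, cand) <= cutoff:
        target = nearest
    else:
        target = max(pool, key=lambda m: m.energy)
    if cand.energy < target.energy:
        pool[pool.index(target)] = cand
        return True
    return False
```

A large cutoff early on forces new structures to compete with similar ones, which preserves diversity; shrinking it later lets the pool concentrate on low-energy regions.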
Step 0. Diversification stage: generate "relatively" diverse models (i.e. models that share the same topology but are not too close to each other). This is not part of this documentation, since various methods can be used for this.
Step 1. Evolution stage: iterate the process below N times (typically N~50).
Required to begin the first iteration. Command line using Rosetta public app:
$ROSETTA/main/source/bin/iterhybrid_selector.linuxgccrelease \
-in:file:silent $1 -in:file:template_pdb $2 -cm:similarity_cut $3 \
-out:file:silent ref.out -out:nstruct $4 \
-out:prefix iter0 -score:weights ref2015_cart \
-silent_read_through_errors -in:file:silent_struct_type binary -out:file:silent_struct_type binary -mute core basic \
(optional.1 for rescoring with restraints) -cst_fa_file fa.cst -set_weights atom_pair_constraint 1.0
(optional.2 for dumping adaptive cst) -constraint:dump_cst_set cen.pair.cst
(optional.3 for deformation penalty) -cm:refsimilarity_cut $5
(optional.4 for quota setup for each input silent, should match number of input silents) -cm:quota_per_silent 0.7 0.3
$5, optional: estimated Similarity-To-ReferenceStructure on the GDT-HA scale; a penalty is applied to any structure that becomes more dissimilar to the reference structure than this value. Default is 25.0.
IMPORTANT: "-out:prefix iter0" is necessary to reformat the input silent file so that it is readable by IterationMaster.py. If you get the failure message "ERROR: pdbs not extracted correctly!", please check that you included this option correctly.
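For scripted setups, the first-iteration command above can be assembled as an argument list. The sketch below only builds the list from the flags shown in this section and does not run anything. The function and parameter names are hypothetical, and the mapping of parameters to $1-$5 is inferred from the flags they follow.

```python
def iter0_selector_command(rosetta_root, in_silent, template_pdb,
                           similarity_cut, nstruct, refsimilarity_cut=None):
    """Build the argv list for the first iterhybrid_selector call.

    All flags are copied from the command line in this documentation;
    the parameter names are illustrative only.
    """
    cmd = [
        f"{rosetta_root}/main/source/bin/iterhybrid_selector.linuxgccrelease",
        "-in:file:silent", str(in_silent),
        "-in:file:template_pdb", str(template_pdb),
        "-cm:similarity_cut", str(similarity_cut),
        "-out:file:silent", "ref.out",
        "-out:nstruct", str(nstruct),
        # Needed so that IterationMaster.py can read the output silent file.
        "-out:prefix", "iter0",
        "-score:weights", "ref2015_cart",
        "-silent_read_through_errors",
        "-in:file:silent_struct_type", "binary",
        "-out:file:silent_struct_type", "binary",
        "-mute", "core", "basic",
    ]
    if refsimilarity_cut is not None:
        # optional.3: deformation penalty relative to the reference structure
        cmd += ["-cm:refsimilarity_cut", str(refsimilarity_cut)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping "-out:prefix iter0" hard-coded avoids the extraction error described above.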
Generating adaptive restraints from pool of structures: see optional.2 above.
All the python/bash scripts required for iterative process can be found at:
$ROSETTA/main/source/scripts/python/public/iterative_hybridize/
Copy the files in that directory to wherever convenient (say $SCRIPTDIR). Then prepare the following files and copy them to the working directory; note that the file names must match EXACTLY.
* init.pdb : Reference structure (e.g. homology model) in pdb format
* input.fa : sequence in fasta format
* t000_.3mers, t000_.9mers: Rosetta fragment library files
(please refer to https://www.rosettacommons.org/docs/wiki/application_documentation/utilities/app-fragment-picker for picking fragments)
* cen.cst, fa.cst : Rosetta restraint file used at centroid / full-atom stage
See "Generating adaptive restraints from pool of structures" above.
"fa.cst" is used for model ranking during the process and can be ignored.
* ref.out : Rosetta silent file containing pool of structures for evolutionary algorithm.
The pool size during the evolutionary process follows the number of structures in this file.
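Because the file names must match exactly, a quick check of the working directory before launching can save a failed run. The sketch below is a hypothetical helper, not part of the distributed scripts. Treating "cen.cst" as required and "fa.cst" as optional is an assumption based on the note above.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED = ["init.pdb", "input.fa", "t000_.3mers", "t000_.9mers",
            "cen.cst", "ref.out"]
OPTIONAL = ["fa.cst"]  # assumption: may be left out


def missing_inputs(workdir="."):
    """Return the required input files that are absent from workdir."""
    root = Path(workdir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

Calling `missing_inputs()` in the working directory returns an empty list when all required files are present.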
Once files are prepared, run command line below (for default options):
python $SCRIPTDIR/IterationMaster.py -iha [model accuracy] -nodefile [nodefile] >& iterhyb.log
* model accuracy : % in GDT-HA scale;
(20/40/60) mean (completely wrong / roughly correct / correct)
* nodefile: a text file containing nodes to distribute;
for example, using 4 cores on n001 looks like:
n001
n001
n001
n001
More options:
-native [pdb] # native structure for monitoring model accuracy during process
-debug # turn on debug mode
-niter [int] # number of iterations, default=50
-simple # run simpler protocol with predefined options
# used for Robetta server & refinement w/ co-evolution data
-mulfactor_phase0 [int] # scale factor for the amount of sampling in the initial phase;
# default=2 (twice as much sampling at the beginning)
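The nodefile format described above (one line per core, repeating the node name) is easy to generate programmatically. The helper below is a hypothetical sketch; its name and the dictionary-based input are not part of the distributed scripts.

```python
def write_nodefile(path, cores_per_node):
    """Write a nodefile with one line per core to use on each node.

    cores_per_node maps a node name to the number of cores to use on it,
    e.g. {"n001": 4} writes "n001" four times.
    """
    lines = []
    for node, ncores in cores_per_node.items():
        lines.extend([node] * ncores)
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines
```

The resulting file can then be passed to IterationMaster.py through the -nodefile option.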
Command line using Rosetta public app:
$ROSETTA/main/source/bin/iterhybrid_selector.linuxgccrelease \
-in:file:silent $1 \
-in:file:template_pdb $2 \
-in:file:template_silent $6 -cm:similarity_cut $3 -cm:similarity_limit 0.2 \
-out:nstruct $4 -out:file:silent sel.out \
-out:prefix iter.$niter \
-silent_read_through_errors -in:file:silent_struct_type binary -out:file:silent_struct_type binary -mute core basic
(optional.1 for rescoring) -score:weights ref2015_cart -cst_fa_file fa.cst -set_weights atom_pair_constraint 1.0
(optional.2 for dumping adaptive cst) -constraint:dump_cst_set cen.pair.cst
(optional.3 for deformation penalty) -cm:refsimilarity_cut $5
(optional.4 for parent information update) -cm:seeds $7
Output models after each iteration are always clustered and sorted by energy (plus the full-atom restraint score, if provided), so picking the lowest-index model(s) is the most direct way of selecting representative models. These "model[1-5].pdb" files can be found in "workdir/iter_[niter]/" if the whole process finishes normally.
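Locating the representative models can be scripted. The sketch below assumes the "iter_[niter]" directory naming and "model[1-5].pdb" file names given above; the function name is hypothetical and not part of the distributed scripts.

```python
import re
from pathlib import Path


def final_models(workdir=".", n_models=5):
    """Return existing model[1..n_models].pdb paths from the last iteration.

    The last iteration is the iter_N directory with the largest integer N.
    """
    root = Path(workdir)
    iters = []
    for d in root.iterdir():
        m = re.fullmatch(r"iter_(\d+)", d.name)
        if m and d.is_dir():
            iters.append((int(m.group(1)), d))
    if not iters:
        return []
    _, last = max(iters, key=lambda t: t[0])
    paths = [last / f"model{i}.pdb" for i in range(1, n_models + 1)]
    return [p for p in paths if p.is_file()]
```

Comparing the iteration index as an integer avoids mistaking "iter_9" for a later iteration than "iter_10", which a plain string sort would do.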
Alternatively, structure averaging over the full trajectory can be performed:
cat iter_*/gen.out > gen.total.out
$ROSETTA/main/source/bin/avrg_silent.linuxgccrelease -database $ROSETTADB \
-in:file:template_pdb iter_[niter]/model1.pdb -out:prefix avrg \
-cm:similarity_cut 0.5 \
-in:file:silent gen.total.out -silent_read_through_errors > avrg.log
"avrg.relaxed.pdb", generated by this command, is the structure-averaged and regularized model. -cm:similarity_cut uses the same structural distance metric described for the iterhybrid_selector app: the smaller the value, the closer the structures are; roughly 0.2 corresponds to family-level similarity and 0.6 to fold-level similarity.
See Analyzing Results: Tips for analyzing results generated using Rosetta