The Features Scientific Benchmark is used to compare batches of structures coming from different sources with respect to local chemical and geometric features. This tutorial describes how to run the features scientific benchmark. The general steps are
The features scientific benchmark stores feature data in databases.
Generating the feature database involves extracting feature information from each structure. Usually this requires specifying the following information
The coordinates of the structures used to extract the feature information can be supplied in any format recognized by the rosetta_scripts application.
pdb :
Silent Files :
Database Input :
Use the ReportToDB mover in a RosettaScripts XML script to specify which features should be extracted to the features database. (Note: The TrajectoryReportToDB mover can also be used in RosettaScripts or C++ to report features to the database multiple times along a trajectory for a single output.)
Each FeaturesReporter is responsible for extracting a certain type of feature to the features database. Select a set of FeaturesReporters and include them as subtags of the ReportToDB mover tag in the rosetta_scripts XML. See this page for more information on the mover beyond what is covered here. The SimpleMetricFeatures reporter allows any Simple Metric to be output to a features database.
<ROSETTASCRIPTS>
    <SCOREFXNS>
        <ScoreFunction name="s" weights="score12_w_corrections"/>
    </SCOREFXNS>
    <MOVERS>
        <ReportToDB name="features" database_name="scores.db3">
            <ScoreTypeFeatures/>
            <StructureScoresFeatures scorefxn="s"/>
        </ReportToDB>
    </MOVERS>
    <PROTOCOLS>
        <Add mover_name="features"/>
    </PROTOCOLS>
</ROSETTASCRIPTS>
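When the set of FeaturesReporters changes from run to run, a script like the one above can be generated instead of edited by hand. The following is a minimal Python sketch that uses only the tag and attribute names shown in the example. The function `build_features_script` and its parameters are hypothetical helpers for illustration and are not part of Rosetta.

```python
import xml.etree.ElementTree as ET


def build_features_script(reporters, database_name="scores.db3",
                          weights="score12_w_corrections"):
    """Return a RosettaScripts XML string containing one ReportToDB mover.

    reporters: list of (tag_name, attributes_dict) pairs, one per
    FeaturesReporter subtag. Illustrative helper only, not a Rosetta API.
    """
    root = ET.Element("ROSETTASCRIPTS")

    # Score function referenced by reporters that need one (named "s" here).
    scorefxns = ET.SubElement(root, "SCOREFXNS")
    ET.SubElement(scorefxns, "ScoreFunction", name="s", weights=weights)

    # One ReportToDB mover with each FeaturesReporter as a subtag.
    movers = ET.SubElement(root, "MOVERS")
    report = ET.SubElement(movers, "ReportToDB", name="features",
                           database_name=database_name)
    for tag_name, attributes in reporters:
        ET.SubElement(report, tag_name, attributes)

    # Apply the mover in the protocol.
    protocols = ET.SubElement(root, "PROTOCOLS")
    ET.SubElement(protocols, "Add", mover_name="features")

    return ET.tostring(root, encoding="unicode")


script = build_features_script([
    ("ScoreTypeFeatures", {}),
    ("StructureScoresFeatures", {"scorefxn": "s"}),
])
print(script)
```

The output is a single-line XML document with the same structure as the hand-written script; write it to a file such as parser_script.xml before passing it to rosetta_scripts.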
<ReportToDB> tag
name : Mover identifier so it can be included in the PROTOCOLS block of the RosettaScripts
database_name (&string): Name of the output database. Can also be specified via cmd-line.
batch_description (&string): (Optional) Batch description. Can also be specified via cmd-line.
Database Connection Options : for options of how to connect to the database
sample_source (&string) : Short text description stored in the sample_source table
protocol_id (&int) : (optional) Set the protocol_id in the protocols table rather than auto-incrementing it.
task_operations (&task): Restrict extracting features to a relevant subset of residues. Since task operations were designed as tasks for side-chain remodeling, residue features are reported when the residue is "packable". If a features reporter involves more than one residue, the convention is that it is only reported if each residue is specified; however, this feature is not supported by every Reporter.
database_separate_db_per_mpi_process (&bool) (Default=false) : Write a separate database for each MPI process. For use when running Features Reporters under MPI with SQLite3.
cache_size (&int) : The maximum amount of memory to use before writing to the database ( sqlite3 only ).
<feature> tag (subtag of <ReportToDB>)
Since ReportToDB is simply a mover, it can be included in any Rosetta protocol. For example, to extract the features from a set of pdb files listed in structures.list, with the above script saved as parser_script.xml, execute the following command:
rosetta_scripts.linuxgccrelease -output:nooutput -l structures.list -parser:protocol parser_script.xml
This will generate an SQLite3 database file scores.db3 containing the features defined in each of the specified FeaturesReporters for each structure in structures.list. See the features integration test (rosetta/main/test/integration/tests/features) for a working example.
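A quick way to check that the run wrote something is to open the resulting file with Python's standard sqlite3 module and count the rows in each table. This is a minimal sketch: the helper name `table_row_counts` is hypothetical, and no particular table names are assumed, since the tables present depend on which FeaturesReporters were selected.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in an SQLite file."""
    connection = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every table; skip SQLite's internal tables.
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name")]
        return {
            name: connection.execute(
                f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

For example, `table_row_counts("scores.db3")` returns a dictionary that can be printed or compared between runs. Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result may simply mean the path is wrong.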
The Features Reporters can be run in parallel either through MPI or through a batch-type run.
For MPI-based runs, make sure to compile MPI-mode Rosetta.
By default, Rosetta is compiled with SQLite3 support. SQLite3 does not support multiple processes writing to one database in parallel, so during an MPI run each process writes to its own database, and the databases are merged at the end with a script.
In order to use MPI with Features runs for Sqlite3 database output, just add -separate_db_per_mpi_process
to the command line or add an option to ReportToDB in your xml script, for example:
<ROSETTASCRIPTS>
<MOVERS>
<ReportToDB name="features" database_name="example.db3" batch_description="example" database_separate_db_per_mpi_process="1">
Run Features (for example): mpiexec -np 101 rosetta_scripts.mpi.linuxclangrelease -parser:protocol antibody_features.xml -ignore_unrecognized_res
On completion, merge the databases (see merging for more)
These work without any additional MPI flags, but you will need to compile and run Rosetta with the appropriate flags and drivers. See the database input/output page for more information.
Batch runs can be done by manually partitioning a sample source into batches, generating a features database for each batch, and merging them together. See the features_parallel integration test (rosetta/main/test/integration/tests/features_parallel) for a working example.
For example if there are 1000 structures split into 4 batches then the scripts for the run processing the first batch would contain:
<ReportToDB name="features_reporter" database_name="features.db3_01" sample_source="batch1" protocol_id="1" first_struct_id="1">
...
</ReportToDB>
and the script for the run processing the second batch (structures 251 to 500) would contain:
<ReportToDB name="features_reporter" database_name="features.db3_02" sample_source="batch2" protocol_id="2" first_struct_id="251">
...
</ReportToDB>
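The per-batch values can be computed rather than worked out by hand. The sketch below assumes that struct ids are assigned consecutively starting at first_struct_id, that batches are contiguous, and that database files follow the features.db3_NN naming used above. The helper `batch_layout` is hypothetical and not part of Rosetta.

```python
def batch_layout(n_structures, n_batches, base_name="features.db3"):
    """Split struct ids 1..n_structures into contiguous, near-equal batches.

    Returns one dict per batch with the values to put in its ReportToDB tag.
    Assumes ids are consecutive from first_struct_id (an assumption, see text).
    """
    size, remainder = divmod(n_structures, n_batches)
    layout = []
    first_struct_id = 1
    for batch in range(1, n_batches + 1):
        # The first `remainder` batches take one extra structure each.
        count = size + (1 if batch <= remainder else 0)
        layout.append({
            "database": f"{base_name}_{batch:02d}",
            "sample_source": f"batch{batch}",
            "protocol_id": batch,
            "first_struct_id": first_struct_id,
            "n_structures": count,
        })
        first_struct_id += count
    return layout


for entry in batch_layout(1000, 4):
    print(entry)
```

For 1000 structures in 4 batches, each batch holds 250 structures, so the batches start at struct ids 1, 251, 501 and 751.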
On completion, merge the databases (see merging for more)
After the runs are complete, locate the merge.sh script (rosetta/main/test/scientific/cluster/features/sample_sources/merge.sh) and run
bash /path/to/merge.sh features.db3 features.db3_*
This merges the features from each of the features.db3_xx databases into features.db3.
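merge.sh is the supported way to combine the databases. To illustrate what a merge amounts to at the SQL level, here is a minimal Python sketch using SQLite's ATTACH DATABASE. It is not a reimplementation of merge.sh: it assumes every source database has the same schema and that row ids do not collide between batches. The function name `merge_sqlite_databases` is hypothetical.

```python
import sqlite3


def merge_sqlite_databases(target_path, source_paths):
    """Copy the rows of every user table in each source into the target.

    Illustrative only. INSERT OR IGNORE keeps rows shared by all batches
    (for example lookup tables) once, but it would also silently drop rows
    whose primary keys genuinely collide, so ids must not overlap.
    """
    target = sqlite3.connect(target_path)
    try:
        for source in source_paths:
            target.execute("ATTACH DATABASE ? AS src", (source,))
            tables = target.execute(
                "SELECT name, sql FROM src.sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            ).fetchall()
            for name, create_sql in tables:
                exists = target.execute(
                    "SELECT name FROM main.sqlite_master "
                    "WHERE type = 'table' AND name = ?", (name,)
                ).fetchall()
                if not exists:
                    # Recreate the table in the target with the source schema.
                    target.execute(create_sql)
                target.execute(
                    f'INSERT OR IGNORE INTO main."{name}" '
                    f'SELECT * FROM src."{name}"')
            # Commit before detaching; DETACH fails inside a transaction.
            target.commit()
            target.execute("DETACH DATABASE src")
    finally:
        target.close()
```

A call such as `merge_sqlite_databases("features.db3", ["features.db3_01", "features.db3_02"])` mirrors the shell command above. The no-collision assumption is the reason each batch is given its own first_struct_id and protocol_id.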
The features scientific benchmark has sample source templates, which are used to set up the configuration for feature extraction from an input dataset.
Each sample_source template is a folder in main/tests/features/sample_sources/ with the following files:
Specify which sample sources to use by editing rosetta/main/test/scientific/cluster/features/sample_sources/benchmark.list. See the Sample Sources page for details about each sample source.
Run the features scientific benchmark using the features.py script:
Rosetta/main/tests/features/features.py [OPTIONS] [RUN]
These are the command line options used to run features.py
[ACTION]
[OPTIONS]
--run-type :
--output-dir : The base directory into which the feature databases will be generated (each sample source will be in its own directory).
The features scientific benchmark can be run locally or on an MPI-based cluster. The computational time and space requirements differ for the different stages of the analysis.
Each batch of structures should be as large and representative as possible. If you are generating structure predictions to compare against experimental data, here are some guidelines:
The features scientific benchmark supports single-threaded, MPI and Condor computational environments. The feature extraction process only uses the rosetta_scripts application and the jd2 job distributor, so if you are able to get those to work on your platform, it should be possible to get the features scientific benchmark to work as well. Specific configuration information for the following job schedulers is provided.