Comparing sample sources takes one or more databases of features and a set of analysis scripts and generated plots and statistics.
The compare_sample_sources.R
script (located in rosetta/main/tests/features/
) runs feature analysis scripts on feature databases. It takes as input:
It then passes this configuration information to each analysis script. Usually the analysis script will generate the plots and statistics in
path_to_output_dir/<sample_source1>[_<sample_source2>[_...]]/<output_format>/
There are several ways to run compare_sample_sources.R
, which are described below. If you forget later, try
compare_sample_sources.R --help
To run a single analysis script on one or more database,
cd project/features
path/to/rosetta/main/tests/features/compare_sample_sources.R [OPTIONS] --script <analysis_script.R> path/to/features_<ss_id1>.db3 [path/to/features_<ss_id2>.db3 [...]]
Note:
--script
is [".", "path/to/rosetta/main/tests/features"]
.project/features/build/
project/features/compare_sample_sources_iscript.R
. In an interactive R session, >source("compare_sample_sources_iscript.R")
will return the script and leave you to do further interactive analysis.
To run one or more features analysis with a JSON configuration file:
cd project/features
path/to/rosetta/main/tests/features/compare_sample_sources.R --config analysis_configurations/<features_analysis>.json
The advantage of running compare_sample_source.R
from a configuration script include:
The format for the analysis configuration uses the .json
format, which is basically nested python dictionaries and lists with numbers and strings as leaves. Many text readers and programming languages can unerstand JSON files (ex python, subl, PyCharm). Here is an example script that was to generate figure 3 in the Rosetta Scientific Benchmarking paper:
{
"output_dir" : "build/general_analysis",
"sample_source_comparisons" : [
{
"sample_sources" : [
{
"database_path" : "/home/momeara/scr2/data/sp2_8b/top8000_sp2_r47244_120205/features_top8000_sp2_r47244_120205.db3",
"id" : "Native",
"reference" : true
},
{
"database_path" : "/home/momeara/scr2/dun10_vs_bicubic/top8000_relax_r46440_111213/features_top8000_relax_r46440_111213.db3",
"id" : "Relaxed Native Score12",
"reference" : false,
"model" : "Score12"
},
{
"database_path" : "/home/momeara/scr2/data/sp2_no_lj_correction/top8000_relax_sp2_no_lj_correction_r48561_120518/features_top8000_relax_sp2_no_lj_correction_r48561_120518.db3",
"id" : "Relaxed Native NewHB",
"reference" : false,
"model" : "NewHB"
},
{
"database_path" : "/home/momeara/scr2/data/sp2_8b/top8000_relax_sp2_r47244_120205/features_top8000_relax_sp2_r47244_120205.db3",
"id" : "Relaxed Native NewHB LJcorr",
"reference" : false,
"model" : "NewHB"
}
],
"analysis_scripts" : [
"scripts/analysis/plots/hbonds/hydroxyl_sites/OHdonor_AHdist.R"
],
"output_formats" : [
"output_small_eps"
]
}
]
}
Notes:
sample_sources
block, the database_path
and id
are the only required tags. Additional tags are passed into the analysis script as columns in the sample_sources
parameter to the run
function.analysis scripts
, output_dir
, output_formats
are parsed in the same way as if they are on the command line.compare_sample_sources_iscript.R
When using sqlite3 feature database, I like to setup the following directory structure. (Note: I generate the sample_source directories as a result of a cluster run using rosetta/main/tests/features/features.py
and templates in rosetta/main/tests/features/sample_sources/
. This is described in detail here.)
project/
sample_source_id1_r#######_YYMMDD/ #e.g. Top8000
features_sample_source_id1_r#######_YYMMDD.db3 #experimental, X-ray
features.xml
flags
<log_files>
sample_source_id2_r#######_YYMMDD/ #e.g. Top8000_relax
features_sample_source_id2_r#######_YYMMDD.db3 #relaxed natives
features.xml
flags
<log_files>
sample_source_id3_r#######_YYMMDD/ #e.g. Top8000_relax_new
features_sample_source_id3_r#######_YYMMDD.db3 #relaxed_natives with
features.xml #alternative score function
flags
<log_files>
features/ #run compare_sample_sources.R from here
compare_sample_sources_iscript.R #autogenerated to re-run interactively
analysis_configurations/
feature_analysis1.json #These are passed in
feature_analysis2.json #with the --config flag
feature_analysis3.json
build/
sample_source_id1_sample_source_id2_sample_source_id3/
analysis_script_id1/
output_pdf_huge/
<plots.pdf>
output_pdf_print/
<plots.pdf>
...
analysis_script_id2/
...
Notes and thoughts on the directory layout:
project
directory helps maintain separation when doing different projects or major iterations of the project and keeps it separate from the rosetta
code base. It is much easier to stage commits and update revisions if your workspace isn't mixed in! And when you get it in a good state, it is easier to wrap it up and share it with collaborators, publish, etc.flags
, features.xml
and log files with the features database makes it is easy to remember how it was generated and re-run it in case it needs to be updated.analysis_configurations
together makes it easy to share and re-run the analysis scripts.