The scripts and input files that accompany this demo can be found in the demos/public directory of the Rosetta weekly releases.
KEYWORDS: STRUCTURE_PREDICTION LOOPS
This tutorial will guide you through modeling XP_005597007, a predicted cyclic GMP-AMP synthase in Equus caballus (horse). XP_005597007 is homologous to the human protein PDB: 4O68. The crystal structure for 4O68 will be used as the template to guide comparative modeling. Specific regions of low sequence identity will then be remodeled using Rosetta's CCD and KIC loop building protocols.
Comparative modeling requires various input files that are either generated manually or downloaded from the internet. These files have already been created and are available in their appropriate directories, but it is recommended that you try to gather or generate them yourself. Boldface indicates specific filenames. Italics indicate webpage entries such as query terms or menu selections.
To start, create your own working directory by typing:
mkdir my_model
Prepared files can be copied from the indicated directories into your working directory at any step if you wish to skip creating a particular file yourself.
With many comparative modeling applications, you will only have your target's amino acid sequence to start with.
Remove the following residues from the beginning of the sequence:
MAASASAADRTEAQSSGLERASSGRRHLALTSRRSTSTPSVAAIGGWARLFGQRAVARWQGSFCGSTSVIFCLL
Remove GEF from the end of the sequence.
XP_truncated.fasta should look like this:
>XP0
ISAPNEFDVMFKLEVPRIELEEYCNSGAHYFVKFKRNPKGNPLSQFLEEEILSASKMLSKFRKIIK
EEIKHVEDTDVIVERKRRGSPAVTLLIRKPKEISVDIILALESKSSWPASTKEGLPINNWLGTKVKNSLR
RQPFYLVPKHAKEGNGFQEETWRLSFSHIEKDILKNHGQSKTCCETHGVKCCRKDCLKLMKYLLEQLKKK
FGNRKELDKFCSYHVKTAFFHVCTQDPHDSQWHSNDLESCFDNCVTYFLHCLKTERLEHYFIPGVNLFSQ
DQIEKISKEFLSKQIEYERNNGYPVF
The prepared XP_truncated.fasta can be found in the 1_setup/ directory.
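The truncation described above can also be scripted. The following is a minimal sketch, not part of the Rosetta distribution: the function names `trim_termini` and `format_fasta` are hypothetical, and the example uses a toy sequence rather than the real target. It is written to run under either Python 2.7 or Python 3.

```python
def trim_termini(sequence, n_term, c_term):
    """Remove an exact N-terminal prefix and C-terminal suffix from a sequence.

    Raises ValueError if the sequence does not begin or end with the
    residues that are supposed to be removed.
    """
    if not sequence.startswith(n_term):
        raise ValueError("sequence does not start with the expected N-terminal residues")
    if not sequence.endswith(c_term):
        raise ValueError("sequence does not end with the expected C-terminal residues")
    if len(n_term) + len(c_term) > len(sequence):
        raise ValueError("the two segments to remove overlap")
    end = len(sequence) - len(c_term)
    return sequence[len(n_term):end]


def format_fasta(name, sequence, width=70):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy example: strip "MAA" from the start and "GEF" from the end.
    trimmed = trim_termini("MAAGEFKLMGEF", "MAA", "GEF")
    print(format_fasta("XP0", trimmed))
```

To apply this to the real target, pass the full XP_005597007 sequence together with the N-terminal segment and the C-terminal GEF listed above, then write the returned string to XP_truncated.fasta.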
The homologous human protein will be used as the template for comparative modeling. This structure is available on the RCSB Protein Data Bank (PDB). The raw structures from the PDB often contain information not necessary for comparative modeling such as attached T4 lysozyme and/or specific ligands. Once a PDB is downloaded for use as a template, this extra information must be removed before it can be used for comparative modeling with RosettaCM.
In addition to extra residues, these PDB files contain additional information that is not useful for Rosetta and may cause problems during the modeling. A script has been prepared to remove all of this extraneous information. (Note: clean_pdb.py expects to get the PDB filename in all capital letters, without the ".pdb" ending, followed by the chain letter of the chain to extract from the file.)
~/rosetta_workshop/rosetta/tools/protein_tools/scripts/clean_pdb.py 4O68_TRUNCATED A
This will generate 4O68_TRUNCATED_A.pdb and 4O68_A.fasta.
Comparative modeling uses template structures to guide initial placement of target amino acids in three-dimensional space. This is done according to the sequence alignment of target and template. Residues in the target sequence will be assigned the coordinates of those residues they align with in the template structure. Residues in the target sequence that do not have an alignment partner in any template will be filled in during loop building.
The prepared alignment can be found in the 2_threading/ directory.
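To make the threading rule concrete, the sketch below walks a pairwise alignment and records, for each target residue, the template residue it aligns with (or `None` for a gap). This illustrates the idea only and is not how thread_pdb_from_alignment.py is implemented. The function name `map_target_to_template` is hypothetical, and the positions it returns count along each sequence; they are not PDB residue numbers.

```python
def map_target_to_template(target_aln, template_aln, gap="-"):
    """Map 1-based target positions to 1-based template positions.

    Both arguments are rows of the same alignment, so they must have equal
    length. Target residues aligned to a gap are mapped to None; these are
    the residues that would have to be built during loop modeling.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    target_pos = 0
    template_pos = 0
    for target_char, template_char in zip(target_aln, template_aln):
        if template_char != gap:
            template_pos += 1
        if target_char != gap:
            target_pos += 1
            mapping[target_pos] = template_pos if template_char != gap else None
    return mapping


if __name__ == "__main__":
    # Toy alignment: target residue 2 has no partner in the template.
    mapping = map_target_to_template("ACD-EF", "A-DGEF")
    unaligned = sorted(pos for pos, partner in mapping.items() if partner is None)
    print(mapping)
    print("Target residues without template coordinates: %s" % unaligned)
```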
Loop building will use fragments to remodel loops and fill in missing residues that didn't align with any template residues.
Prepared fragment files can be found in the 3_loopbuild/ directory.
Thread the target sequence over the template PDB using the included script:
python2.7 ~/rosetta_workshop/rosetta/tools/protein_tools/scripts/thread_pdb_from_alignment.py \
--template=4O68_A --target=XP0 --chain=A --align_format=clustal \
XP_4O68.aln 4O68_TRUNCATED_A.pdb XP0_on_4O68.pdb
Despite the high sequence identity, certain residues in XP0 could not be aligned to residues in the template. These residues were not assigned any coordinates during the threading process. It is necessary to fill in these missing residues to complete the modeling. In addition to missing residues, loops may also be defined for regions of low identity or important regions where a greater degree of sampling is necessary.
This step requires the following files to be in the same directory:
The loops file tells Rosetta which regions of the protein to remodel using CCD and KIC.
Any residues that were not assigned coordinates during the threading have the coordinates 0.000 0.000 0.000 in XP0_on_4O68.pdb. View this file to find which residues must be remodeled and included in loop definitions. Additionally, you may wish to include regions of lower identity based on the alignment. The loop file contains one loop region per line. The two residue numbers on each line must be residues that have previously been assigned coordinates. These residues are the "anchor" residues and will not be remodeled during the loop building process. In other words, if you wish to remodel residues 10-20, you will define the loop region as "9 21". For the provided alignment, the following loop regions have been defined for remodeling. Depending on your alignment, you may need slightly different loop values.
LOOP 34 41
LOOP 93 99
LOOP 181 186
A previously generated loops file can be found in the 3_loopbuild/ directory.
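Finding the unplaced residues by eye can be tedious, so the search can be sketched in a few lines of Python. This is an illustrative helper, not a Rosetta tool, and the function names are hypothetical. It assumes standard fixed-column PDB ATOM records and sequential residue numbering, and it emits only the two-column "LOOP start end" form shown above. It does not add low-identity regions and does not check that the anchor residues actually exist (for example, when a gap sits at a terminus), so treat its output as a starting point.

```python
def unplaced_residues(pdb_lines):
    """Return sorted residue numbers that have an atom at 0.000 0.000 0.000."""
    found = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed PDB columns: resSeq 23-26, x 31-38, y 39-46, z 47-54.
        x = float(line[30:38])
        y = float(line[38:46])
        z = float(line[46:54])
        if x == 0.0 and y == 0.0 and z == 0.0:
            found.add(int(line[22:26]))
    return sorted(found)


def group_runs(numbers):
    """Group sorted integers into (first, last) runs of consecutive values."""
    runs = []
    for number in numbers:
        if runs and number == runs[-1][1] + 1:
            runs[-1][1] = number
        else:
            runs.append([number, number])
    return [tuple(run) for run in runs]


def loop_lines(runs):
    """Extend each run by one anchor residue on either side."""
    return ["LOOP %d %d" % (first - 1, last + 1) for first, last in runs]


if __name__ == "__main__":
    with open("XP0_on_4O68.pdb") as handle:
        missing = unplaced_residues(handle)
    for line in loop_lines(group_runs(missing)):
        print(line)
```

For example, a run of unplaced residues 10-20 becomes "LOOP 9 21", matching the anchor rule described above.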
The build_loops.options file is already provided for you in the 3_loopbuild/ directory.
~/rosetta_workshop/rosetta/main/source/bin/loopmodel.default.linuxgccrelease \
@build_loops.options -database ~/rosetta_workshop/rosetta/main/database
Note: You will see several "[ WARNING ] missing heavyatom" messages at the beginning. This is normal and can be ignored.
This will take a while to run. You may want to open up a different terminal window and start the Rosetta Clustering Tutorial section below, coming back when the run is finished.
When the job is finished running, extract the PDB files from the silent files with the following command line.
~/rosetta_workshop/rosetta/main/source/bin/score_jd2.default.linuxgccrelease \
-database ~/rosetta_workshop/rosetta/main/database -in:file:silent XP0_on_4O68_loops.out \
-in:file:fullatom -out:pdb
See the "Score and extract PDBs" and "Score vs. RMSD plots" sections of the De Novo Folding tutorial for further instructions on analysis.
To cluster large sets of models, see the Rosetta Clustering tutorial below.
Prepare silent files or a list of PDBs. Since we only generated 5 models previously, we will use a prepared silent file that contains enough models for clustering to be effective.
Prepare Options file:
The cluster.options file is already provided in the 4_cluster/ directory.
Rosetta ignores comment lines beginning with #.
Avoid mixing tabs and spaces. Be consistent in your formatting (tab-delimited or colon-separated).
Run the clustering.py script, which will execute the Rosetta cluster application and output a series of summary files with names specified on the command line (here, "cluster_summary.txt" and "cluster_histogram.txt").
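One way to check an options file against this advice is sketched below. The function name `mixed_whitespace_lines` is hypothetical, and the check reflects one reading of "mixing": it flags any non-comment line that contains both a tab and a space. It says nothing about whether Rosetta would accept or reject such a line.

```python
def mixed_whitespace_lines(text):
    """Return 1-based numbers of non-comment lines containing both tabs and spaces."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment lines (starting with #) are skipped.
        if not stripped or stripped.startswith("#"):
            continue
        if "\t" in line and " " in line:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    with open("cluster.options") as handle:
        for number in mixed_whitespace_lines(handle.read()):
            print("line %d mixes tabs and spaces" % number)
```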
python2.7 ~/rosetta_workshop/rosetta/tools/protein_tools/scripts/clustering.py \
--rosetta ~/rosetta_workshop/rosetta/main/source/bin/cluster.default.linuxgccrelease \
--database ~/rosetta_workshop/rosetta/main/database --options cluster.options \
--silent=XP0_production.out cluster_summary.txt cluster_histogram.txt
The cluster_summary.txt and associated files are provided in the 4_cluster/ directory.
Sort the cluster_summary.txt file numerically on its fourth column. As written, the -r flag lists the highest values first; drop -r to sort from lowest to highest:
sort -rnk4 cluster_summary.txt > cluster_summary_sorted.txt
Look at the top 5 clusters by size:
head -n 5 cluster_summary_sorted.txt
Extract the models you are interested in viewing from the binary silent file. Make sure the silent file XP0_production.out is in the directory you are running from; if it is not, copy it there:
~/rosetta_workshop/rosetta/main/source/bin/score_jd2.default.linuxgccrelease \
-database ~/rosetta_workshop/rosetta/main/database -in:file:silent XP0_production.out \
-in:file:silent_struct_type binary -in:file:fullatom -out:pdb -out:file:fullatom \
-in:file:tags XP0_on_4O68_loopsXP0_on_4O68_0450_0001 XP0_on_4O68_loopsXP0_on_4O68_0492_0001 \
XP0_on_4O68_loopsXP0_on_4O68_0497_0001 XP0_on_4O68_loopsXP0_on_4O68_0386_0001 \
XP0_on_4O68_loopsXP0_on_4O68_0352_0001
View the models using the protein visualization tool of your choice.