This document was last updated on June 22, 2010, by Ron Jacak. The corresponding PI is Brian Kuhlman (bkuhlman@email.unc.edu).
The code for the sequence_recovery application is in src/apps/public/sequence_recovery.cc
. A demo for the application is located in demo/sequence_recovery/
.
Sequence recovery tests have been used extensively during the development of Rosetta. The following publications show output that can be obtained from this application:
This application takes as input a set of native proteins and a corresponding set of redesigned proteins and outputs a table of core, overall, and surface sequence recoveries by amino acid type. A table of the native amino acid and to what residue type Rosetta mutated these positions is also output. The output is explained in more detail in the following sections.
Sequence recovery is measured by reading in both sets of structures and then calculating the percent identity between them by iterating over all residues. The algorithm is very straight-forward. Both sets of structures are read in as Pose objects and stored in Pose vectors. The algorithm iterates over every residue of each pose in the native structures vector, and checks whether that residue is the same in the corresponding designed protein. In addition to the overall percent identity, it also computes the percent identity for core and surface positions. Core positions are defined as those having >= 24 neighbors (CBeta distance within 10 Angstroms) and surface positions are defined as having <= 16 neighbors (CBeta distance within 10 Angstroms). An optional task factory can be supplied so that percent identity is only calculated for a subset of positions within the supplied structures.
The application assumes that the order of the PDBs in both lists of structures is identical, i.e. each native PDB structure has one corresponding redesigned structure and the structures are listed in the same order.
At least two lists of PDB files must be supplied. It is acceptable to have only one PDB file in each list, if one wants to measure the sequence recovery of just one protein. If the sequence recovery of multiple PDBs is desired, the structures in each list must be in the same order.
-native_pdb_list <listfile> The list of native proteins to compare against
-redesign_pdb_list <listfile> The list of redesigned proteins to compare to the native proteins
By default, the application measures sequence recovery over all positions. A file containing task operations to apply to each pose to restrict which residues are included for measuring sequence recovery can be specified using the -parse_tagfile flag. A sample tag file is shown in the examples below. Most users will want to use the default behaviour.
-parse_tagfile <file> An XML file containing a set of task operations to limit which residues are included in calculating
percent identity
-database <path/to/rosetta/main/database> Specifies the location of the rosetta database
-se_cutoff The maximum number of neighbors a residue can have to be considered surface exposed (default: 16).
-seq_recov_filename The name of the file to output the sequence recovery table to (default: sequencerecovery.txt)
-sub_matrix_filename The name of the file to output the substitution matrix to (default: submatrix.txt)
-ignore_unrecognized_res Command line option to exclude unrecognized residues
A sample command line for running the sequence_recovery application would be as follows:
sequencerecovery.macosgccrelease -database /path/to/rosetta/main/database/ -native_pdb_list seqrecovery.list -redesign_pdb_list redesigned.list
-ignore_unrecognized_res
A sample command line for running the sequence_recovery application using a tagfile to restrict the calculation to only surface residues (even though this is done by default) would be as follows:
sequencerecovery.macosgccrelease -database path/to/rosetta/main/database -native_pdb_list seqrecovery.list -redesign_pdb_list redesigned.list
-parse_tagfile taskoperations.tagfile -ignore_unrecognized_res
Example taskoperations.tagfile
<TASKOPERATIONS>
<RestrictNonSurfaceToRepacking name=rnstr surface_exposed_nb_count_cutoff=16 exclude_interfaces_from_restriction=false/>
</TASKOPERATIONS>
Another example where one would want to use the tagfile option would be in calculating the number of native residues recovered over all interface residues in a set of redesigned protein-protein complexes. The tagfile would specify a task operation that restricts the set of residues being considered to only those at the interface between the two proteins.
The easiest way to prepare the structure list files is to execute the following commands:
ls *.pdb > natives.list; sort natives.list -o natives.list;
ls *.pdb > redesigns.list; sort redesigns.list -o redesigns.list;
The sequence_recovery application outputs two text files: 1) a file containing the core, overall, and surface native sequence recovery by amino acid type and 2) a table showing the breakdown of what residues were designed in place of each type.
Example output of using the standard Rosetta energy function "score12" on a set of 32 proteins is shown below. The first line denotes what each column represents.
Residue | No.correct core | No.native core | No.designed core | No.correct/No.native core | No.correct/No.designed core | No.correct | No.native | No.designed | No.correct/No.native | No.correct/No.designed | Residue | No.correct surface | No.native surface | No.designed surface | No.correct/No.native surface | No.correct/No.designed surface
ALA 57 103 94 0.55 0.61 118 385 259 0.31 0.46 ALA 9 144 30 0.06 0.30
CYS 0 12 0 0.00 --- 0 27 0 0.00 --- CYS 0 2 0 0.00 ---
ASP 10 20 29 0.50 0.34 80 268 354 0.30 0.23 ASP 49 167 221 0.29 0.22
GLU 2 7 16 0.29 0.12 54 289 308 0.19 0.18 GLU 31 180 158 0.17 0.20
PHE 37 60 69 0.62 0.54 98 193 302 0.51 0.32 PHE 6 23 68 0.26 0.09
GLY 56 69 66 0.81 0.85 302 400 368 0.75 0.82 GLY 164 210 205 0.78 0.80
HIS 10 18 41 0.56 0.24 30 118 176 0.25 0.17 HIS 5 37 32 0.14 0.16
ILE 41 83 70 0.49 0.59 105 242 231 0.43 0.45 ILE 7 33 39 0.21 0.18
LYS 2 11 9 0.18 0.22 62 325 285 0.19 0.22 LYS 30 193 148 0.16 0.20
LEU 66 122 92 0.54 0.72 180 378 454 0.48 0.40 LEU 18 54 174 0.33 0.10
MET 10 31 25 0.32 0.40 20 90 70 0.22 0.29 MET 1 20 5 0.05 0.20
ASN 3 14 11 0.21 0.27 33 206 158 0.16 0.21 ASN 20 110 120 0.18 0.17
PRO 6 18 7 0.33 0.86 61 229 77 0.27 0.79 PRO 29 144 40 0.20 0.73
GLN 3 17 6 0.18 0.50 18 186 172 0.10 0.10 GLN 10 94 110 0.11 0.09
ARG 4 14 20 0.29 0.20 37 185 262 0.20 0.14 ARG 10 85 76 0.12 0.13
SER 20 37 106 0.54 0.19 79 258 348 0.31 0.23 SER 25 119 128 0.21 0.20
THR 19 41 62 0.46 0.31 61 291 225 0.21 0.27 THR 30 145 110 0.21 0.27
VAL 47 98 67 0.48 0.70 107 308 194 0.35 0.55 VAL 8 60 37 0.13 0.22
TRP 10 23 21 0.43 0.48 32 76 145 0.42 0.22 TRP 4 12 56 0.33 0.07
TYR 9 40 27 0.22 0.33 39 158 224 0.25 0.17 TYR 6 23 98 0.26 0.06
Total 412 838 0.492 1516 4612 0.329 Total 462 1855 0.249
As can be seen from the native sequence recovery table, the standard Rosetta energy function "score12" obtains an overall native sequence recovery of 32.9% on a set of 32 proteins. Core recovery is much higher with 49.2% of the native residues being recovered. In addition to providing the core, overall, and surface native sequence recoveries, users can see which residue types Rosetta is recovering best and worst. For example, Rosetta designed less than one-third as many prolines as are present in the native structures overall, and nearly twice as many tryptophans. These tryptophans appear to be getting placed on the surface as the number of tryptophans on the surface is nearly four times the number in the native proteins. Serine is overrepresented in the core, with there being nearly three times as many serines in the core of the redesigns as in the core of the native structures. This implies that either Rosetta is having trouble fitting other residues in the core (perhaps due to clashes from the Lennard-Jones term), the hydrogen bond energy terms are weighted too heavily and favoring too many serines, or the desolvation penalty for serine is not high enough.
The corresponding substitution matrix for the above run is below:
AA_TYPE nat_ALA nat_CYS nat_ASP nat_GLU nat_PHE nat_GLY nat_HIS nat_ILE nat_LYS nat_LEU nat_MET nat_ASN nat_PRO nat_GLN nat_ARG nat_SER nat_THR nat_VAL nat_TRP nat_TYR
sub_ALA 118 6 6 2 5 8 1 8 6 7 7 5 15 9 7 19 9 15 6 0
sub_CYS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sub_ASP 29 3 80 22 4 14 7 6 28 19 4 34 31 13 8 20 12 12 0 8
sub_GLU 26 1 22 54 6 7 9 9 26 13 7 18 12 16 23 17 19 14 3 6
sub_PHE 9 1 12 11 98 5 11 4 15 18 3 4 3 6 9 6 16 11 10 50
sub_GLY 10 1 7 2 2 302 3 1 9 4 1 10 1 4 2 6 0 1 0 2
sub_HIS 7 1 7 9 17 5 30 6 5 18 2 7 8 11 5 5 9 6 5 13
sub_ILE 6 1 4 10 3 2 3 105 10 16 2 5 0 2 8 5 11 34 4 0
sub_LYS 20 0 14 34 1 6 8 7 62 18 4 19 4 20 15 19 16 10 3 5
sub_LEU 8 2 21 31 5 6 6 19 41 180 15 15 17 21 13 7 17 18 5 7
sub_MET 5 1 1 3 1 0 1 4 4 11 20 0 1 4 7 1 1 0 4 1
sub_ASN 22 0 18 5 0 14 5 1 9 5 4 33 7 8 8 12 3 2 0 2
sub_PRO 7 0 2 1 0 0 0 2 2 1 0 0 61 0 0 0 0 0 1 0
sub_GLN 17 0 12 20 3 7 3 6 20 6 1 7 8 18 9 8 18 9 0 0
sub_ARG 13 0 13 22 3 7 6 11 31 15 6 13 6 19 37 21 19 12 1 7
sub_SER 62 3 20 11 2 11 5 4 14 16 3 12 40 6 10 79 30 14 1 5
sub_THR 7 7 6 13 5 2 2 12 16 15 5 8 6 9 5 17 61 27 1 1
sub_VAL 3 0 4 6 1 0 0 27 9 5 2 2 0 4 1 3 18 107 0 2
sub_TRP 4 0 9 10 8 2 6 4 8 6 2 4 5 6 9 2 13 5 32 10
sub_TYR 12 0 10 23 29 2 12 6 10 5 2 10 4 10 9 11 19 11 0 39
The output sequence recovery and substitution matrix files can be easily imported into Excel (split on whitespace) to further analyze the results.
The sequence_recovery application is being released for the first time with Rosetta v3.3.