Autogenerated Tag Syntax Documentation:
Align the PDBInfo of a pose to some given sequences. This mover does not alter the geometry or sequence of the pose; only the PDBInfo is altered. The goal of this mover is to renumber/re-chain a pose so that its PDBInfo is in sync with some reference. There are two modes. In 'single' mode, all the target sequences are appended in order and the entire pose is aligned to that combined sequence. In 'multiple' mode, we attempt to align the sequence of each chain of the pose to each of the target sequences. If we find a perfect match, we renumber based on that target sequence. For more information, please reference the online documentation. WARNING 1: If you have a very long sequence (1000s of residues), this protocol can use a lot of memory due to the Smith-Waterman alignment. Be cautious of this. WARNING 2: This will most likely not give you the correct results if your pose or target sequence contains carbohydrates or other residues that are represented using [ ] or ( ).
<AlignPDBInfoToSequences name="(&string;)" mode="(&string;)"
json_fns="(&string;)" throw_on_fail="(false &bool;)"
sequence_alignment_cut_max="(10 &positive_integer;)" >
<Target name="(&string;)" sequence="(&string;)" chains="(&string;)"
segmentIDs="(&string;)" insCodes="(&string;)"
residue_numbers="(&string;)" />
</AlignPDBInfoToSequences>
Subtag Target: Targets and settings of the sequences to align to.
Purpose: The goal of this application is to provide the ability to align the pose's sequence (PDBInfo) to reference sequences. This is particularly useful when performing structure prediction. For example, if you are predicting the structure of residues 341-400 with Rosetta, the output PDB will often start at residue 1 and end at residue 60. Adding this mover ensures that the PDB's numbering is correct (i.e., numbered as residues 341-400) at the end.
Example application:
<ROSETTASCRIPTS>
<MOVERS>
<AlignPDBInfoToSequences name="apts" mode="multiple" >
<Target sequence="MKVKIKCWNGVATWLWVANDENCGICRMAFNGCCPDCKVPGDDCPLVWGQCSHCFHMHCILKWLHAQQVQQHCPMCRQEWKFKE" chains="C,D" />
<Target sequence="LSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKA" chains="U,u" />
</AlignPDBInfoToSequences>
</MOVERS>
<PROTOCOLS>
<Add mover="apts" />
</PROTOCOLS>
<OUTPUT />
</ROSETTASCRIPTS>
Each Target must have sequence and chains defined, but can also have segmentIDs and/or insCodes defined if you would like to set those as well.
If you don't want to use Target blocks, you can supply one or more JSON files (comma-separated) with the "json_fns" option. The JSON files must follow this format: a list of dictionaries with the required keys chains and sequence, and the optional keys segmentIDs and insCodes.
[
{
"chains": ["C"],
"sequence": "MKVKIKCWNGVATWLWVANDENCGICRMAFNGCCPDCKVPGDDCPLVWGQCSHCFHMHCILKWLHAQQVQQHCPMCRQEWKFKE"
},
{
"chains": ["D"],
"sequence": "MSTLFPSLFPRVTETLWFNLDRPCVEETELQQQEQQHQAWLQSIAEKDNNLVPIGKPASEHYDDEEEEDDEDDEDSEEDSEDDEDMQDMDEMNDYNESPDDGEVNEVDMEGNEQDQDQWMI"
},
{
"chains": ["G", "W"],
"sequence": "MLRRKPTRLELKLDDIEEFENIRKDLETRKKQKEDVEVVGGSDGEGAIGLSSDPKSREQMINDRIGYKPQPKPNNRSSQFGSLEF"
}
]
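As a minimal sketch, a mover declaration that reads its targets from JSON files instead of Target blocks might look like the following. The file names targets_1.json and targets_2.json are hypothetical; only the "json_fns" attribute and its comma-separated form come from the description above.

```xml
<MOVERS>
    <!-- targets_1.json and targets_2.json are hypothetical file names -->
    <AlignPDBInfoToSequences name="apts" mode="multiple" json_fns="targets_1.json,targets_2.json" />
</MOVERS>
```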
Because no standard format can really handle the level of control I want, I've decided to go with the following:
In 'single' mode, given the JSON file (or its equivalent Target block):
[
{
"sequence": "SEQ",
"chains": ["A", "B", "C"],
"segmentIDs": ["SEQ"]
},
{
"sequence": "HERE",
"chains": ["D"],
"segmentIDs": ["HERE"]
}
]
the mover will alter the PDBInfo of a pose with two chains with sequences ["SEQ", "HERE"] to:
the residue numbering:
[1, 2, 3, 1, 2, 3, 4]
chains:
["A", "B", "C", "D", "D", "D", "D"]
segmentIDs:
["SEQ", "SEQ", "SEQ", "HERE", "HERE", "HERE", "HERE"]
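For comparison, the same 'single' mode targets might be written as Target blocks roughly as follows. This is a sketch: it assumes that segmentIDs takes a comma-separated list in the same way chains does in the example application, which is not spelled out here.

```xml
<AlignPDBInfoToSequences name="apts" mode="single" >
    <!-- assumption: segmentIDs is comma-separated, like chains -->
    <Target sequence="SEQ" chains="A,B,C" segmentIDs="SEQ" />
    <Target sequence="HERE" chains="D" segmentIDs="HERE" />
</AlignPDBInfoToSequences>
```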
This is useful, but normally you have 20+ chains and their order isn't consistent between experiments, so I would suggest using the 'multiple' mode instead.
In 'multiple' mode, given a pose with sequences ["SEQ", "HERE", "SEQ"], you could align it with the JSON (or equivalent Target block):
[
{
"sequence": "SEQ",
"chains": ["A", "B"],
"segmentIDs": ["SEQ1", "SEQ2"]
},
{
"sequence": "HERE",
"chains": ["D"],
"segmentIDs": ["HERE"]
}
]
and this will result in:
residue numbering:
[1, 2, 3, 1, 2, 3, 4, 1, 2, 3]
chains:
["A", "A", "A", "D", "D", "D", "D", "B", "B", "B"]
segmentIDs:
["SEQ1", "SEQ1", "SEQ1", "HERE", "HERE", "HERE", "HERE", "SEQ2", "SEQ2", "SEQ2"]
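The Target-block form of this 'multiple' mode configuration could be sketched as below. As an assumption, segmentIDs is written as a comma-separated list mirroring chains, so that the first matching chain gets A/SEQ1 and the second gets B/SEQ2.

```xml
<AlignPDBInfoToSequences name="apts" mode="multiple" >
    <!-- assumption: segmentIDs is comma-separated and paired with chains by position -->
    <Target sequence="SEQ" chains="A,B" segmentIDs="SEQ1,SEQ2" />
    <Target sequence="HERE" chains="D" segmentIDs="HERE" />
</AlignPDBInfoToSequences>
```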
So in 'single' mode you have more per-residue control but less flexibility on the pose ordering, whereas in 'multiple' mode you have less per-residue control but much more flexibility on the pose ordering.
I find myself using the 'multiple' mode significantly more often.
{
"sequence": "HERE",
"chains": ["D"],
"segmentIDs": ["HERE"],
"residue_numbering": [22]
},
Given the above configuration, the resulting PDB numbering would be: 22, 23, 24, 25.
So by setting residue_numbering to a single number, you can set the starting number of the chain.
You can also explicitly set the residue number of every residue via a list that is the same length as sequence, e.g.:
"residue_numbering": [20, 50, 51, 99]
would yield PDB numbering of 20, 50, 51, 99.
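The tag syntax lists a residue_numbers attribute on Target, which presumably plays the role of the residue_numbering key in the JSON. A hedged sketch, assuming that correspondence and assuming a comma-separated value format:

```xml
<AlignPDBInfoToSequences name="apts" mode="multiple" >
    <!-- assumption: a single value sets the number of the first residue, giving 22, 23, 24, 25 -->
    <!-- assumption: residue_numbers="20,50,51,99" would instead number each residue explicitly -->
    <Target sequence="HERE" chains="D" residue_numbers="22" />
</AlignPDBInfoToSequences>
```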
One big problem (or feature) with this mover is that the sequences of the pose chains must match the target sequences exactly. Even if you're off by just one residue, the Smith-Waterman alignment can fail and your pose will not be properly aligned. Be extra careful when blindly using this protocol in combination with downstream protocols!
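Because a failed alignment can otherwise go unnoticed, it may be worth enabling the throw_on_fail attribute from the tag syntax (default false). The sketch below assumes, from the attribute's name only, that it makes the mover raise an error when an alignment fails rather than continuing; verify this behavior before relying on it.

```xml
<AlignPDBInfoToSequences name="apts" mode="multiple" throw_on_fail="true" >
    <!-- assumption: with throw_on_fail="true" a chain that matches no target stops the run -->
    <Target sequence="SEQ" chains="A,B" />
</AlignPDBInfoToSequences>
```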