AlignPDBInfoToSequences

Autogenerated Tag Syntax Documentation:

Align the PDBInfo of a pose to some given sequences. This mover does not alter the geometry or sequence of the pose; only the PDBInfo is altered. The goal of this mover is to renumber/re-chain a pose so that its PDBInfo is in sync with some reference. There are two modes. In 'single' mode, all the target sequences are appended in order and the entire pose is aligned to that combined sequence. In 'multiple' mode, we attempt to align the sequence of each chain of the pose to each of the target sequences. If we find a perfect match, we renumber based on that target sequence. For more information, please refer to the online documentation.

WARNING 1: If you have a very long sequence (thousands of residues), this protocol can use a lot of memory because of the Smith-Waterman alignment. Be cautious of this.

WARNING 2: This will most likely not give you the correct results if your pose or target sequence contains carbohydrates or other residues that are represented using [ ] or ( ).

<AlignPDBInfoToSequences name="(&string;)" mode="(&string;)"
        json_fns="(&string;)" throw_on_fail="(false &bool;)"
        sequence_alignment_cut_max="(10 &positive_integer;)" >
    <Target name="(&string;)" sequence="(&string;)" chains="(&string;)"
            segmentIDs="(&string;)" insCodes="(&string;)"
            residue_numbers="(&string;)" />
</AlignPDBInfoToSequences>

mode: (REQUIRED) Which mode to run in. options: ['single', 'multiple']
json_fns: The name(s) of the JSON sequence file(s), separated by ','
throw_on_fail: Throw an exception on failure. Default: false.
sequence_alignment_cut_max: Maximum number of cuts to allow when doing the sequence alignment. Default: 10.

Subtag Target: Targets and settings of the sequences to align to.

sequence: (REQUIRED) sequence of current chain/protein
chains: (REQUIRED) chains to set with current protein
segmentIDs: segmentIDs to set for the current protein
insCodes: insertion codes to set for the current protein
residue_numbers: Residue numbers to set for current chain (must be of size 1 (used as a starting number), or the size of 'sequence', or empty)

Purpose: The goal of this mover is to provide the ability to align the pose's numbering (PDBInfo) to reference sequences. This is particularly useful when performing structure prediction. For example, if you are predicting the structure of residues 341-400 with Rosetta, the output PDB will often start with residue 1 and end at residue 60. Adding this mover makes sure that the PDB's numbering is correct (i.e., numbered as residues 341 to 400) at the end.

Example application:

<ROSETTASCRIPTS>
	<MOVERS>
		<AlignPDBInfoToSequences name="apts" mode="multiple" >
			<Target sequence="MKVKIKCWNGVATWLWVANDENCGICRMAFNGCCPDCKVPGDDCPLVWGQCSHCFHMHCILKWLHAQQVQQHCPMCRQEWKFKE" chains="C,D" />
			<Target sequence="LSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKA" chains="U,u" />
		</AlignPDBInfoToSequences>
	</MOVERS>
	<PROTOCOLS>
		<Add mover="apts" />
	</PROTOCOLS>
	<OUTPUT />
</ROSETTASCRIPTS>

Each Target must have sequence and chains defined, but can also have segmentIDs and/or insCodes defined if you would like to define those as well.

If you don't want to use Target blocks, you can supply one or more JSON files (comma-separated) with the "json_fns" option. Each JSON file must contain a list of dictionaries with the required keys "chains" and "sequence" and, optionally, the keys "segmentIDs" and "insCodes":

[
	{
		"chains": ["C"],
		"sequence": "MKVKIKCWNGVATWLWVANDENCGICRMAFNGCCPDCKVPGDDCPLVWGQCSHCFHMHCILKWLHAQQVQQHCPMCRQEWKFKE"
	},
	{
		"chains": ["D"],
		"sequence": "MSTLFPSLFPRVTETLWFNLDRPCVEETELQQQEQQHQAWLQSIAEKDNNLVPIGKPASEHYDDEEEEDDEDDEDSEEDSEDDEDMQDMDEMNDYNESPDDGEVNEVDMEGNEQDQDQWMI"
	},
	{
		"chains": ["G", "W"],
		"sequence": "MLRRKPTRLELKLDDIEEFENIRKDLETRKKQKEDVEVVGGSDGEGAIGLSSDPKSREQMINDRIGYKPQPKPNNRSSQFGSLEF"
	}
]
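A target file in this format can also be generated programmatically. The following is a minimal Python sketch, not part of Rosetta; the function name `dump_targets` and the output path are hypothetical. It only checks that the two required keys are present and then writes standard JSON with the `json` module from the Python standard library.

```python
import json

# Keys that every target dictionary must define, per the format described above.
REQUIRED_KEYS = {"chains", "sequence"}


def dump_targets(targets, path):
    """Check a list of target dictionaries and write it to `path` as JSON.

    Hypothetical helper for illustration only; it is not a Rosetta utility.
    """
    for index, target in enumerate(targets):
        missing = REQUIRED_KEYS - target.keys()
        if missing:
            raise ValueError(
                f"target {index} is missing required keys: {sorted(missing)}"
            )
    with open(path, "w") as handle:
        json.dump(targets, handle, indent=2)
```

The resulting file name would then be passed to the mover through the "json_fns" option.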

Option Descriptions

Because there's no standard format that can really handle the level of control I want, I've decided to go with the following.

In 'single' mode, given the JSON file (or its equivalent Target block):

[
   {
       "sequence": "SEQ",
       "chains": ["A", "B", "C"],
       "segmentIDs": ["SEQ"]
   },
   {
       "sequence": "HERE",
       "chains": ["D"],
       "segmentIDs": ["HERE"]
   }
]

the mover will alter the PDBInfo of a pose with two chains, with sequences ["SEQ", "HERE"], to:

the residue numbering:

[1, 2, 3, 1, 2, 3, 4]

chains:

["A", "B", "C", "D", "D", "D", "D"]

segmentIDs:

["SEQ", "SEQ", "SEQ", "HERE", "HERE", "HERE", "HERE"]
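The per-residue labels in the 'single' mode example above can be reproduced with a small Python sketch. This is an illustration of the expected output only, not the mover's implementation: it skips the alignment step entirely and assumes, based on the example, that a list of length 1 is applied to every residue of the target while a list as long as the sequence is applied residue by residue. The function names are hypothetical.

```python
def expand(values, length):
    """Spread a target-level list over `length` residues.

    Assumption inferred from the example: one entry is broadcast to all
    residues, and `length` entries are used one per residue.
    """
    if len(values) == 1:
        return list(values) * length
    if len(values) == length:
        return list(values)
    raise ValueError(f"expected 1 or {length} entries, got {len(values)}")


def single_mode_labels(targets):
    """Return (residue numbers, chains, segmentIDs) for targets appended in order."""
    numbering, chains, segment_ids = [], [], []
    for target in targets:
        n = len(target["sequence"])
        numbering.extend(range(1, n + 1))  # each target restarts at 1
        chains.extend(expand(target["chains"], n))
        segment_ids.extend(expand(target["segmentIDs"], n))
    return numbering, chains, segment_ids


targets = [
    {"sequence": "SEQ", "chains": ["A", "B", "C"], "segmentIDs": ["SEQ"]},
    {"sequence": "HERE", "chains": ["D"], "segmentIDs": ["HERE"]},
]
numbering, chains, segment_ids = single_mode_labels(targets)
```

With the two targets from the example, the three returned lists are the residue numbering, chains, and segmentIDs listed above.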

This is useful, but normally you have 20+ chains and their order isn't consistent between experiments, so instead I would suggest using the 'multiple' mode.

In 'multiple' mode:

Given a pose with chain sequences ["SEQ", "HERE", "SEQ"], you could align it with the JSON (or equivalent Target block):

[
   {
       "sequence": "SEQ",
       "chains": ["A", "B"],
       "segmentIDs": ["SEQ1", "SEQ2"]
   },
   {
       "sequence": "HERE",
       "chains": ["D"],
       "segmentIDs": ["HERE"]
   }
]

This will result in the following:

residue numbering:

[1, 2, 3, 1, 2, 3, 4, 1, 2, 3]

chains:

["A", "A", "A", "D", "D", "D", "D", "B", "B", "B"]

segmentIDs:

["SEQ1", "SEQ1", "SEQ1", "HERE", "HERE", "HERE", "HERE", "SEQ2", "SEQ2", "SEQ2"]
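The bookkeeping behind the 'multiple' mode example above can be sketched in Python. This is not the mover's implementation: it replaces the Smith-Waterman alignment with a plain string equality test, and it assumes, based on the example, that the chains and segmentIDs of a target are handed out in listed order to successive pose chains with that sequence. The function name is hypothetical.

```python
def multiple_mode_labels(pose_chain_sequences, targets):
    """Return (residue numbers, chains, segmentIDs) for a pose, chain by chain."""
    used = [0] * len(targets)  # number of pose chains already assigned to each target
    numbering, chains, segment_ids = [], [], []
    for sequence in pose_chain_sequences:
        for t_index, target in enumerate(targets):
            k = used[t_index]
            # Exact string match stands in for the real sequence alignment.
            if target["sequence"] == sequence and k < len(target["chains"]):
                used[t_index] += 1
                n = len(sequence)
                numbering.extend(range(1, n + 1))
                chains.extend([target["chains"][k]] * n)
                segment_ids.extend([target["segmentIDs"][k]] * n)
                break
        else:
            raise ValueError(f"no unused target matches chain sequence {sequence!r}")
    return numbering, chains, segment_ids


targets = [
    {"sequence": "SEQ", "chains": ["A", "B"], "segmentIDs": ["SEQ1", "SEQ2"]},
    {"sequence": "HERE", "chains": ["D"], "segmentIDs": ["HERE"]},
]
numbering, chains, segment_ids = multiple_mode_labels(["SEQ", "HERE", "SEQ"], targets)
```

Because each pose chain is matched independently, reordering the pose chains changes only the order of the output, not which labels each chain receives.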

So in 'single' mode you have more per-residue control but less flexibility in the pose ordering, whereas in 'multiple' mode you have less per-residue control but lots of flexibility in the pose ordering.

I find myself using the "multiple" format significantly more often.

Residue numbering

 {
     "sequence": "HERE",
     "chains": ["D"],
     "segmentIDs": ["HERE"],
     "residue_numbering": [22]
 },

Given the above configuration, the resulting PDB numbering would be 22, 23, 24, 25. Setting residue_numbering to a single number therefore sets the starting number of the chain. You can also set the number of every residue explicitly by supplying a list that is the same size as the sequence, e.g.:

     "residue_numbering": [20, 50, 51, 99]

would yield the PDB numbering 20, 50, 51, 99.
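The three accepted sizes (empty, one number, or one number per residue) can be summarised in a short Python sketch. It is illustrative rather than taken from the Rosetta source: the function name is hypothetical, and the choice to start at 1 when no numbers are given is inferred from the examples on this page.

```python
def expand_residue_numbers(sequence, residue_numbers=None):
    """Return one residue number per residue of `sequence`."""
    n = len(sequence)
    if not residue_numbers:
        # Assumption from the examples: numbering starts at 1 by default.
        return list(range(1, n + 1))
    if len(residue_numbers) == 1:
        start = residue_numbers[0]  # a single value is the starting number
        return list(range(start, start + n))
    if len(residue_numbers) == n:
        return list(residue_numbers)  # explicit number for every residue
    raise ValueError(
        f"residue numbers must be empty, of size 1, or of size {n}; "
        f"got {len(residue_numbers)}"
    )


from_start = expand_residue_numbers("HERE", [22])
explicit = expand_residue_numbers("HERE", [20, 50, 51, 99])
```

Any other list length is rejected, which mirrors the size rule stated for this option.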

Caveats--common errors, when not to use this mover

One big problem (or feature) with this mover is that the sequences of the pose chains must identically match the target sequences. Even if you're off by just 1 residue, the Smith-Waterman alignment can fail and your pose will not be properly aligned. Be extra careful when blindly using this protocol in combination with downstream protocols!
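One way to guard against this is to compare the chain sequences with the targets before running the mover. The sketch below is a hypothetical pre-check in plain Python, independent of Rosetta. It assumes you have already extracted the pose's chain sequences as strings and that you are using 'multiple' mode, where each chain is matched on its own. It uses `difflib.get_close_matches` from the standard library only to suggest the nearest target for a mismatched chain.

```python
import difflib


def unmatched_chains(pose_chain_sequences, target_sequences):
    """List pose chains that have no identical target sequence.

    Returns tuples of (chain index, chain sequence, closest target or None).
    """
    problems = []
    for index, sequence in enumerate(pose_chain_sequences):
        if sequence in target_sequences:
            continue  # identical match, nothing to report
        # cutoff=0.0 always returns the best candidate if any target exists.
        closest = difflib.get_close_matches(
            sequence, target_sequences, n=1, cutoff=0.0
        )
        problems.append((index, sequence, closest[0] if closest else None))
    return problems


report = unmatched_chains(["SEQ", "HERX"], ["SEQ", "HERE"])
```

An empty report means every chain has an identical target; any entry points at a chain that is likely to be left misaligned.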