CS-Rosetta RNA: Using chemical shifts with RNA structure refinement

KEYWORDS: STRUCTURE_PREDICTION NUCLEIC_ACIDS RNA

The cs_rosetta_rna application refines (minimizes) and scores an RNA structure under the hybrid CS-ROSETTA-RNA all-atom energy function:

E(hybrid) = E(Rosetta) + E(shift)

where E(Rosetta) is the standard Rosetta all-atom energy function for RNA [1], and E(shift) is the chemical shift pseudo-energy term [2]. Input RNA PDB structures can be generated by the FARFAR [1] and/or Stepwise Assembly [3] structure modeling methods, or can be experimental NMR or crystallographic structures.
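As a worked example of this decomposition, the sketch below splits the example total from the Outputs section into its two parts. It assumes that the `rna_chem_shift` row of the score breakdown corresponds to E(shift) and that all remaining rows together make up E(Rosetta); the variable names are illustrative only.

```python
# Values copied from the example score breakdown in the Outputs section.
SHIFT_WEIGHT = 4.000   # weight of the rna_chem_shift term
SHIFT_RAW = 1.232      # raw rna_chem_shift score
TOTAL = -60.558        # total weighted score, taken here as E(hybrid)

# Assumption: E(shift) is the weighted rna_chem_shift score.
e_shift = SHIFT_WEIGHT * SHIFT_RAW
# Assumption: everything else in the breakdown is E(Rosetta).
e_rosetta = TOTAL - e_shift

print(f"E(shift)   = {e_shift:.3f}")    # E(shift)   = 4.928
print(f"E(Rosetta) = {e_rosetta:.3f}")  # E(Rosetta) = -65.486
```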

Example inputs

rosetta_inputs/GA-AG_mismatch/1MIS_NMR.pdb
NMR PDB structure of a tandem GA:AG mismatch internal loop.
rosetta_inputs/GA-AG_mismatch/1MIS_exp_1H_chem_shifts.str
Experimental non-exchangeable 1H chemical shift data for the tandem GA:AG mismatch internal loop.

Each line in this file represents one chemical shift data point and contains the following nine space-delimited columns (based on the STAR v2.1 format):
```
col 1: Atom_shift_assign_ID (INT)
col 2: Residue_author_seq_code (INT)
col 3: Residue_seq_code (INT)
col 4: Residue_label (STRING)
col 5: Atom_name (STRING)
col 6: Atom_type (STRING)
col 7: Chem_shift_value (FLOAT)
col 8: Chem_shift_value_error (STRING)
col 9: Chem_shift_ambiguity_code (STRING)
```
Note that Residue_seq_code (col 3), Residue_label (col 4), and Atom_name (col 5) must be consistent with the data in the PDB file. Columns 8 and 9 are currently not used internally by the cs_rosetta_rna application.
rosetta_inputs/GA-AG_mismatch/1MIS_params
Parameters file (in FARNA format) for the tandem GA:AG mismatch.

Also, the input data files for all 23 RNA motifs benchmarked in ref. [2] are provided in the Supplemental Data Zip file, available at the following URI: http://dx.doi.org/10.1038/nmeth.2876
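The nine-column chemical shift format described above can be read with a few lines of Python. This is a minimal sketch, not part of Rosetta: it assumes the file contains only whitespace-delimited data lines (no headers or comments), and `ShiftRecord`, `parse_shift_line` and `parse_shift_text` are hypothetical names.

```python
from typing import List, NamedTuple


class ShiftRecord(NamedTuple):
    """One data line of the chemical shift file (hypothetical container)."""
    assign_id: int          # col 1: Atom_shift_assign_ID
    author_seq_code: int    # col 2: Residue_author_seq_code
    seq_code: int           # col 3: Residue_seq_code
    residue_label: str      # col 4: Residue_label
    atom_name: str          # col 5: Atom_name
    atom_type: str          # col 6: Atom_type
    shift_ppm: float        # col 7: Chem_shift_value
    shift_error: str        # col 8: kept as text, unused by cs_rosetta_rna
    ambiguity_code: str     # col 9: kept as text, unused by cs_rosetta_rna


def parse_shift_line(line: str) -> ShiftRecord:
    """Split one data line into its nine columns and convert the types."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    return ShiftRecord(
        int(fields[0]), int(fields[1]), int(fields[2]),
        fields[3], fields[4], fields[5],
        float(fields[6]), fields[7], fields[8],
    )


def parse_shift_text(text: str) -> List[ShiftRecord]:
    """Parse every non-blank line of a chemical shift file's contents."""
    return [parse_shift_line(line) for line in text.splitlines() if line.strip()]


example = "1  1  1 G H5'  H   4.180 . .\n2  1  1 G H5'' H   4.540 . .\n"
records = parse_shift_text(example)
print(records[0].atom_name, records[0].shift_ppm)
```

Rejecting lines that do not have exactly nine columns catches truncated lines early, before the file is passed to cs_rosetta_rna.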

Example command-lines

Refine (minimize) and score a PDB structure under the hybrid CS-ROSETTA-RNA all-atom energy function:

<path-to-rosetta-bin>/cs_rosetta_rna.<release> \
    -mode minimize_pdb \
    -pdb <input_pdb> \
    -score:rna_chemical_shift_exp_data <exp_cs_data_file> \
    -params_file <input_param_file> \
    -analytic_etable_evaluation false

Score (but not refine) a PDB structure under the hybrid CS-ROSETTA-RNA all-atom energy function:

<path-to-rosetta-bin>/cs_rosetta_rna.<release> \
    -mode score_pdb \
    -pdb <input_pdb> \
    -score:rna_chemical_shift_exp_data <exp_cs_data_file> \
    -params_file <input_param_file> \
    -analytic_etable_evaluation false

Refine (minimize) and score the tandem GA:AG_mismatch NMR PDB structure under the CS-ROSETTA-RNA all-atom energy function:

~/Rosetta/rosetta_git/Rosetta/main/source/bin/cs_rosetta_rna.graphics.macosgccrelease \
    -mode minimize_pdb \
    -pdb rosetta_inputs/GA-AG_mismatch/1MIS_NMR.pdb \
    -score:rna_chemical_shift_exp_data rosetta_inputs/GA-AG_mismatch/1MIS_exp_1H_chem_shifts.str \
    -params_file rosetta_inputs/GA-AG_mismatch/1MIS_params \
    -analytic_etable_evaluation false

Score (but not refine) the tandem GA:AG_mismatch NMR PDB structure under the hybrid CS-ROSETTA-RNA all-atom energy function:

~/Rosetta/rosetta_git/Rosetta/main/source/bin/cs_rosetta_rna.graphics.macosgccrelease \
    -mode score_pdb \
    -pdb rosetta_inputs/GA-AG_mismatch/1MIS_NMR.pdb \
    -score:rna_chemical_shift_exp_data rosetta_inputs/GA-AG_mismatch/1MIS_exp_1H_chem_shifts.str \
    -params_file rosetta_inputs/GA-AG_mismatch/1MIS_params \
    -analytic_etable_evaluation false

Refine (minimize) and score the UAAC tetraloop NMR PDB structure under the CS-ROSETTA-RNA all-atom energy function:

~/Rosetta/rosetta_git/Rosetta/main/source/bin/cs_rosetta_rna.graphics.macosgccrelease \
    -mode minimize_pdb \
    -pdb rosetta_inputs/UAAC_loop/4A4R_NMR.pdb \
    -score:rna_chemical_shift_exp_data rosetta_inputs/UAAC_loop/4A4R_exp_1H_chem_shifts.str \
    -params_file rosetta_inputs/UAAC_loop/4A4R_params \
    -analytic_etable_evaluation false

Score (but not refine) the UAAC tetraloop NMR PDB structure under the hybrid CS-ROSETTA-RNA all-atom energy function:

~/Rosetta/rosetta_git/Rosetta/main/source/bin/cs_rosetta_rna.graphics.macosgccrelease \
    -mode score_pdb \
    -pdb rosetta_inputs/UAAC_loop/4A4R_NMR.pdb \
    -score:rna_chemical_shift_exp_data rosetta_inputs/UAAC_loop/4A4R_exp_1H_chem_shifts.str \
    -params_file rosetta_inputs/UAAC_loop/4A4R_params \
    -analytic_etable_evaluation false
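When many structures need to be processed, the command lines above can be assembled programmatically. The sketch below only builds the argument list, using the flags shown in this section; `build_cs_rosetta_rna_command` is a hypothetical helper, and the binary name keeps the `<release>` placeholder because it depends on the local build.

```python
from typing import List, Optional


def build_cs_rosetta_rna_command(
    binary: str,
    mode: str,
    pdb: str,
    shift_file: str,
    params_file: Optional[str] = None,
) -> List[str]:
    """Return the argument list for one cs_rosetta_rna run (hypothetical helper)."""
    if mode not in ("minimize_pdb", "score_pdb"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [
        binary,
        "-mode", mode,
        "-pdb", pdb,
        "-score:rna_chemical_shift_exp_data", shift_file,
    ]
    # The parameters file is optional, although running without it is not recommended.
    if params_file is not None:
        cmd += ["-params_file", params_file]
    cmd += ["-analytic_etable_evaluation", "false"]
    return cmd


cmd = build_cs_rosetta_rna_command(
    "cs_rosetta_rna.<release>",  # placeholder: substitute the local binary path
    "minimize_pdb",
    "rosetta_inputs/GA-AG_mismatch/1MIS_NMR.pdb",
    "rosetta_inputs/GA-AG_mismatch/1MIS_exp_1H_chem_shifts.str",
    "rosetta_inputs/GA-AG_mismatch/1MIS_params",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholder has been replaced with a real binary path.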

Optional arguments

-score:rna_chemical_shift_H5_prime_mode MODE

Specify how to handle assignment of the diastereotopic H5' and H5'' proton pair. Valid modes:
- LEAST_SQUARE_IGNORE_DUPLICATES (default)
  
  In this mode, the H5' and H5'' protons are assigned according to which choice gives better agreement between the experimental and back-calculated chemical shifts. Use this mode if the experimental non-exchangeable 1H chemical shifts are not unambiguously assigned.
- UNIQUE
  
  In this mode, the assignments of the H5' and H5'' protons are used "as is", and cs_rosetta_rna will not attempt to switch them. Use this mode only if the experimental non-exchangeable 1H chemical shift data have unambiguous assignments of the diastereotopic H5' and H5'' protons. Note that this is uncommon.

Outputs

A breakdown of the hybrid CS-ROSETTA-RNA all-atom energy terms, e.g.:

------------------------------------------------------------
 Scores                       Weight   Raw Score Wghtd.Score
------------------------------------------------------------
 fa_atr                       0.230    -125.447     -28.853
 fa_rep                       0.120       8.314       0.998
 fa_intra_rep                 0.003      81.488       0.236
 fa_intra_RNA_base_phos_atr   0.230       0.000       0.000
 fa_intra_RNA_base_phos_rep   0.120       0.000       0.000
 lk_nonpolar                  0.320       2.123       0.679
 lk_nonpolar_intra_RNA        0.320       3.768       1.206
 fa_elec_rna_phos_phos        1.050      -0.074      -0.078
 ch_bond                      0.420     -30.523     -12.820
 rna_torsion                  2.900       2.721       7.892
 rna_sugar_close              0.700       3.171       2.220
 fa_stack                     0.125    -199.844     -24.981
 geom_sol_fast                0.620      56.483      35.020
 geom_sol_fast_intra_RNA      0.620       1.978       1.226
 hbond_sr_bb_sc               0.620       0.000       0.000
 hbond_lr_bb_sc               2.400       0.000       0.000
 hbond_sc                     2.400     -20.116     -48.279
 hbond_intra                  2.400       0.000       0.000
 atom_pair_constraint         1.000       0.000       0.000
 angle_constraint             1.000       0.000       0.000
 rna_bulge                    0.450       0.000       0.000
 rna_chem_shift               4.000       1.232       4.928
 linear_chainbreak            5.000       0.009       0.047
-----------------------------------------------------------
 Total weighted score:                              -60.558

The total hybrid CS-ROSETTA-RNA all-atom energy, e.g.:
```
hybrid_CS-ROSETTA-RNA_all-atom energy: -60.5579
```
The chemical shift RMSD, e.g.:
```
chem_shift_RMSD: 0.143299
```
The chem_shift_RMSD (in ppm) is the root-mean-square deviation between the 'back-calculated' and the experimental 1H chemical shifts. A low chem_shift_RMSD indicates that the RNA 3D structure agrees well with the experimental 1H chemical shift data.
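For reference, a root-mean-square deviation over paired shift values can be computed as sketched below. This illustrates the general RMSD formula only; how cs_rosetta_rna pairs the values internally (for example, for H5'/H5'' protons) is not shown, and both the function name and the numbers are illustrative.

```python
import math
from typing import Sequence


def chem_shift_rmsd(calc_ppm: Sequence[float], exp_ppm: Sequence[float]) -> float:
    """Root-mean-square deviation (ppm) between paired calculated and experimental shifts."""
    if len(calc_ppm) != len(exp_ppm) or len(exp_ppm) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(c - e) ** 2 for c, e in zip(calc_ppm, exp_ppm)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values, not taken from any real data set.
exp_ppm = [4.18, 4.54, 5.60]
calc_ppm = [4.28, 4.44, 5.60]
rmsd = chem_shift_rmsd(calc_ppm, exp_ppm)
print(f"{rmsd:.4f} ppm")
```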
The RNA PDB structure after refinement under the hybrid CS-ROSETTA-RNA all-atom energy function (only with -mode minimize_pdb). The refined PDB is written to the run directory under the filename: <in_pdb_basename>_out.
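To collect the two summary values from many runs, the captured program output can be scanned for the lines shown above. This is a sketch under the assumption that the lines appear in the log exactly as in the examples (possibly with some prefix text on the same line); `extract_summary` is a hypothetical helper.

```python
import re
from typing import Dict

# Matches "<key>: <number>" for the two summary lines shown in the examples above.
_SUMMARY = re.compile(
    r"(hybrid_CS-ROSETTA-RNA_all-atom energy|chem_shift_RMSD):\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def extract_summary(log_text: str) -> Dict[str, float]:
    """Return the summary values found in captured cs_rosetta_rna output."""
    return {key: float(value) for key, value in _SUMMARY.findall(log_text)}


log_text = (
    "hybrid_CS-ROSETTA-RNA_all-atom energy: -60.5579\n"
    "chem_shift_RMSD: 0.143299\n"
)
summary = extract_summary(log_text)
print(summary["chem_shift_RMSD"])
```

If a key occurs more than once in the log, the last occurrence wins, since later matches overwrite earlier ones in the dictionary.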

Best Practices

Figure 1: Breakdown of the secondary structure of the tandem GA:AG mismatch internal loop:

                       1    6                              1
                    5'-CGGACG-3'                        5'-CG
Entire structure:      ||**||               H1 helix:      ||
                    3'-GCAGGC-5'                        3'-GC
                       12   7                              12

                                                                6
                         GA                                    CG-3'
2x2 mismatch:            **                 H2 helix:          ||
                         AG                                    GC-5'
                                                                7

How many canonical base-pairs should be included at each helical boundary?

2 base-pairs should be included at each helical boundary (for rationale, see ref. [2]).

For example, in the case of the tandem GA:AG mismatch internal loop, the structure consists of a 2-base-pair H1 helix, a 2x2 mismatch, and a 2-base-pair H2 helix.
Which atoms' chemical shift data should be included?

The chemical shift data of all non-exchangeable protons should be included in the chemical shift data file.

The non-exchangeable protons consist of the H1', H2', H3', H4', H5' and H5'' ribose protons, and the H2, H5, H6 and H8 base protons.

Data lines belonging to other atom types will be ignored.
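To check a data file before running the application, the atom names can be compared against the list above. This is a minimal sketch assuming the atom names are written exactly as listed (e.g. `H5''` rather than an older naming variant); the names `NON_EXCHANGEABLE_PROTONS` and `is_non_exchangeable_proton` are hypothetical.

```python
# Non-exchangeable protons as listed above: ribose protons, then base protons.
NON_EXCHANGEABLE_PROTONS = frozenset({
    "H1'", "H2'", "H3'", "H4'", "H5'", "H5''",
    "H2", "H5", "H6", "H8",
})


def is_non_exchangeable_proton(atom_name: str) -> bool:
    """True if the atom name is one of the listed non-exchangeable protons."""
    return atom_name in NON_EXCHANGEABLE_PROTONS


atom_names = ["H1'", "H8", "H1", "C8", "H5''"]
kept = [name for name in atom_names if is_non_exchangeable_proton(name)]
print(kept)  # ["H1'", 'H8', "H5''"]
```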
Which nucleotides' chemical shift data should be included?

The chemical shift data of all nucleotides EXCEPT those located at the 5' and 3' edges of each strand should be included in the chemical shift data file.

For example, in the case of the tandem GA:AG mismatch internal loop, the chemical shift data of all nucleotides except C1, G6, C7 and G12 should be included.
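The edge rule above can be expressed as a short sketch. It assumes each strand is described by its first and last residue numbers (1-6 and 7-12 for the tandem GA:AG mismatch); `residues_to_include` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def residues_to_include(strands: Sequence[Tuple[int, int]]) -> List[int]:
    """Residue numbers whose shifts should be kept: all except each strand's 5' and 3' edge."""
    edges = {pos for first, last in strands for pos in (first, last)}
    return [
        pos
        for first, last in strands
        for pos in range(first, last + 1)
        if pos not in edges
    ]


# Tandem GA:AG mismatch: strand 1 is residues 1-6, strand 2 is residues 7-12.
included = residues_to_include([(1, 6), (7, 12)])
print(included)  # [2, 3, 4, 5, 8, 9, 10, 11]
```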
How to prepare the parameters file.
1. Add an "OBLIGATE PAIR" line for each helical base-pair located at the 5' and 3' edges of the structure.
  
  In the case of the tandem GA:AG mismatch internal loop, the OBLIGATE PAIRS are "C1-G12" and "G6-C7":
```
OBLIGATE   PAIR 1 12 H H A
OBLIGATE   PAIR 6 7 H H A
```
  Note that the cs_rosetta_rna app will refine (minimize) ALL nucleotides EXCEPT those specified as "OBLIGATE PAIR", which will be kept static.
2. Add "ALLOW_INSERT" lines to include all non-canonical loop nucleotide positions:
  
  In the case of the tandem GA:AG mismatch internal loop, the "ALLOW_INSERT" nucleotide positions are G3, A4, G9 and A10:
```
ALLOW_INSERT 3 4
ALLOW_INSERT 9 10
```
3. Add a "CUTPOINT_CLOSED" line for the position immediately 5' of the first non-canonical loop nucleotide position.
  
  In the case of the tandem GA:AG mismatch internal loop, the first non-canonical loop nucleotide position is G3. The "CUTPOINT_CLOSED" position is the position immediately 5' of G3, which is G2:
```
CUTPOINT_CLOSED 2
```
  Note that if the "CUTPOINT_CLOSED" line is not included in the parameter file, the cs_rosetta_rna app will still run, selecting a random loop position as the cutpoint_closed position. However, it is recommended that the "CUTPOINT_CLOSED" line be explicitly included to prevent this random selection.
4. Add a "CUTPOINT_OPEN" line for every position immediately 5' of a chain-break.
  
  In the case of the tandem GA:AG mismatch internal loop, there is a chain-break between G6 and C7. The "CUTPOINT_OPEN" position is the position immediately 5' of the chain-break, which is G6:
```
CUTPOINT_OPEN 6
```
Adding all the above parameter lines together, we get the parameter file for the tandem GA:AG mismatch ("rosetta_inputs/GA-AG_mismatch/1MIS_params"):
```
OBLIGATE   PAIR 1 12 H H A
OBLIGATE   PAIR 6 7 H H A
ALLOW_INSERT 3 4
ALLOW_INSERT 9 10
CUTPOINT_CLOSED 2
CUTPOINT_OPEN 6
```
Finally, the cs_rosetta_rna app can also run WITHOUT an input parameter file, although this is not recommended. In this case, a simple fold-tree with no chain-breaks/cutpoints will be used and all nucleotides will be refined (minimized).
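The four steps above can be sketched as a small generator that writes the parameter lines in the same order. `build_params` is a hypothetical helper, and the trailing `H H A` fields of each OBLIGATE PAIR line are copied verbatim from the example rather than derived from anything.

```python
from typing import Sequence, Tuple


def build_params(
    obligate_pairs: Sequence[Tuple[int, int]],
    allow_insert: Sequence[Tuple[int, int]],
    cutpoints_closed: Sequence[int],
    cutpoints_open: Sequence[int],
) -> str:
    """Assemble parameter-file text from the four kinds of lines described above."""
    lines = []
    # Step 1: helical base-pairs at the edges of the structure (kept static).
    for i, j in obligate_pairs:
        lines.append(f"OBLIGATE   PAIR {i} {j} H H A")
    # Step 2: non-canonical loop nucleotide positions, as written in the example.
    for start, end in allow_insert:
        lines.append(f"ALLOW_INSERT {start} {end}")
    # Step 3: position immediately 5' of the first loop nucleotide.
    for pos in cutpoints_closed:
        lines.append(f"CUTPOINT_CLOSED {pos}")
    # Step 4: positions immediately 5' of chain-breaks.
    for pos in cutpoints_open:
        lines.append(f"CUTPOINT_OPEN {pos}")
    return "\n".join(lines) + "\n"


# Tandem GA:AG mismatch internal loop.
params_text = build_params(
    obligate_pairs=[(1, 12), (6, 7)],
    allow_insert=[(3, 4), (9, 10)],
    cutpoints_closed=[2],
    cutpoints_open=[6],
)
print(params_text, end="")
```

For these inputs the printed text matches the parameter file shown above line for line.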
How to specify the chemical shift data for the diastereotopic H5' and H5'' proton pairs.
- If two chemical shift data points are measured for the diastereotopic H5' and H5'' proton pair and unambiguous assignment is possible, then include the correct unambiguous assignment in the data lines, e.g.:
```
1  1  1 G H5'  H   4.180 . . 
2  1  1 G H5'' H   4.540 . .  
```
  In this case, please explicitly include the command line option: -score:rna_chemical_shift_H5_prime_mode UNIQUE
- If two chemical shift data points are measured for the diastereotopic H5' and H5'' proton pair BUT unambiguous assignment is not possible, then include either of the two possible assignments in the data lines, e.g.:
```
1  1  1 G H5'  H   4.180 . . 
2  1  1 G H5'' H   4.540 . .  

          OR

1  1  1 G H5'  H   4.540 . . 
2  1  1 G H5'' H   4.180 . .  
```
  The cs_rosetta_rna app will automatically select the assignment that leads to better agreement between the experimental and back-calculated chemical shifts.
- If only one chemical shift data point is measured for the diastereotopic H5' and H5'' proton pair AND unambiguous assignment is not possible, then please still include two chemical shift data lines (with the same chemical shift value), one for each proton, e.g.:
```
1  1  1 G H5'  H   4.180 . . 
2  1  1 G H5'' H   4.180 . . 
```
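The automatic selection described for ambiguous H5'/H5'' pairs can be illustrated with a least-squares comparison of the two possible assignments. This is only a sketch of the idea, not the Rosetta implementation; `assign_h5_pair` and the back-calculated values in the example are hypothetical.

```python
from typing import Tuple


def assign_h5_pair(
    exp_pair: Tuple[float, float],
    calc_h5p: float,
    calc_h5pp: float,
) -> Tuple[float, float]:
    """Return (shift for H5', shift for H5'') using whichever ordering has the smaller squared error."""
    first, second = exp_pair
    # Squared error if the experimental values are used in the given order.
    keep_error = (first - calc_h5p) ** 2 + (second - calc_h5pp) ** 2
    # Squared error if the two experimental values are swapped.
    swap_error = (second - calc_h5p) ** 2 + (first - calc_h5pp) ** 2
    return (first, second) if keep_error <= swap_error else (second, first)


# Experimental pair from the example lines; back-calculated values are made up.
assigned = assign_h5_pair((4.180, 4.540), calc_h5p=4.50, calc_h5pp=4.20)
print(assigned)  # (4.54, 4.18)
```

When both data lines carry the same value, as in the single-measurement case above, the two orderings give the same error and the choice has no effect.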

References

[1] Das, R., Karanicolas, J. & Baker, D. Nat Methods 7, 291-294 (2010).

[2] Sripakdeevong, P. et al. Nat Methods 11, 413-416 (2014).

[3] Sripakdeevong, P., Kladwang, W. & Das, R. Proc Natl Acad Sci U S A 108, 20573-20578 (2011).