Documentation created by Brahm Yachnin (brahm.yachnin@rutgers.edu), Khare laboratory, and Chris Bailey-Kellogg (cbk@cs.dartmouth.edu). Last edited May 24, 2021.
The mhc_energy_tools
are companion Python scripts for use with the MHCEpitopeEnergy design-centric guidance scoreterm. In brief, the mhc_score.py
script is used to analyze existing protein sequences, PDB files, or lists of peptides to identify immunogenic hotspot sequences. The mhc_gen_db.py
script is used to generate a SQL database based from a FASTA or PDB file, with various methods of indicating designable residues and their identities. These databases can then be used to score poses while packing with a mhc_energy
term and the MHCEpitopePredictorExternal
predictor (see MHCEpitopeEnergy for more details).
If you use the Python tools, either in the context of mhc_epitope or independently, please cite the appropriate papers as described here.
The MHCEpitopeEnergy energy method, mhc_epitope scoreterm, mhc energy tools, and benchmark are described in the following publication. If you make use of them, please cite the following paper:
Yachnin BJ, Mulligan VK, Khare SD, and Bailey-Kellogg C. (2021). MHCEpitopeEnergy, a flexible Rosetta-based biotherapeutic deimmunization platform. J Chem Inf Model 61(5):2368-2382. doi: 10.1021/acs.jcim.1c00056. https://pubmed.ncbi.nlm.nih.gov/33900750/
In addition to the Rosetta functionality, mhc_epitope makes use of several de-immunization resources developed outside of the Rosetta community. As a condition for using this code, it is imperative that the resources that you use are appropriately cited in any resulting publication.
The ProPred matrices are provided in the Rosetta database (/main/database/scoring/score_functions/mhc_epitope/propred8.txt
) and are used by the "default" configuration file, propred8_5.mhc
.
The matrices are obtained from http://crdd.osdd.net/raghava/propred/ If you use results based on these predictions, please cite Singh, H. and Raghava, G.P.S. (2001) ProPred: Prediction of DR binding sites. Bioinformatics, 17(12):1236-37. http://www.ncbi.nlm.nih.gov/pubmed/11751237
NetMHCII can be incorporated into external databases using the mhc-energy-tools. If you use NetMHCII, please cite Improved methods for predicting peptide binding affinity to MHC class II molecules. Jensen KK, Andreatta M, Marcatili P, Buus S, Greenbaum JA, Yan Z, Sette A, Peters B, Nielsen M. Immunology. 2018 Jan 6. doi: 10.1111/imm.12889
IEDB is a public database of experimentally-validated immune epitopes. The iedb_data.mhc
file references the database file iedb_data.db
that contains data derived from the IEDB. If you use either of these files, you must cite the IEDB. In addition, the mhc-energy-tools provides resources to update and build custom database files based on IEDB data.
Please see the following instructions from the IEDB related to using any of these resources:
Users are requested to cite the IEDB when they present information obtained from the IEDB: http://www.iedb.org. The journal reference for the IEDB was updated in 2018 and should be cited as: Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019 Jan 8;47(D1):D339-D343. doi: 10.1093/nar/gky1006. PMID: 30357391
All publications or presentations referring to data generated by use of the IEDB Resource Analysis tools should include citations to the relevant reference(s), found at http://tools.iedb.org/main/references/
mhc_score.py
can take various types of input for scoring.
mhc_score.py ILVEQACFPSL
.--fa mysequence.fas
names a file in FASTA-ish format with a SINGLE sequence to be used as input. The >
title line is optional.--fsa sequences.fas
names a file in FASTA-ish format with MULTIPLE sequences to be used as input. The >
title line is mandatory for each sequence.--pdb myprotein.pdb
names a PDB file, from which the sequence will be pulled and used as input.
--chain X
specifies which chain of the PDB file to look at. If not specified, the entire PDB file will be scored.--pep peptides.txt
names a file with one sequence per line. Each line will be used as an input sequence.Additional notes:
The Predictors determine how to score the sequences.
--propred
uses the Propred matrices to score the sequences. This is the default behaviour. It will look first in the current directory, then in the Rosetta database. It will attempt to look for the Rosetta database first using relative paths, assuming tools
and main
are cloned in the same directory. If that fails, if the environment variable $ROSETTA
exists, it will assume main
is cloned within that directory and will look for the database there.--matrix MATRIX
uses the MATRIX
file to score the sequences.--netmhcii
uses the netmhcII executable, pointed to by the environment variable $NMHOME
, to score the sequences.--db DB
uses a pre-computed SQL database, as generated using mhc_gen_db.py
, to score the sequences.--csv CSV
uses a pre-computed CSV file, as generated using mhc_gen_db.py
, to score the sequences.--allele_set SET
tells the Predictor to use SET
as the set of alleles.
all
, southwood98
, or test
are allowed. If using test
, only the DRB1_0101 allele will be used. The default (--allele_set all
) is to use all alleles specified in the matrix.test
will score only the DRB1_0101 allele, and greenbaum11
(27 alleles), southwood98
(8 alleles) and paul15
(7 alleles) will use published lists of alleles. all
will use all 61 available alleles. test
is the default. More alleles takes longer (not a huge issue with a single protein sequence), but may be more accurate.--alleles ALLELE_LIST
allows you to provide a custom list of alleles to use. Not used by default.--epi_thresh EPI_THRESH
sets the threshold of what constitutes a "hit." By default, this is 5.00, which means that the top 5% of binders are considered hits.--noncanon {error,warn,silent}
allows you to specify the behaviour if a residue other than the 20 canonical AAs is encountered (for example, ligands, non-canonical amino acids, etc.). Options are error
(script will exit), warn
(a warning will be printed), or silent
. HOW ARE NCAAS DEALT WITH HERE? AS CHAIN BREAKS? WHAT IS THE DEFAULT BEHAVIOUR?--netmhcii_score {rank,absolute}
specifies whether to use the ranked score (i.e. is this in the top X% of binders) or absolute score (i.e. the affinity of binding) to determine if something is a hit. This will have an impact on the --epi_thresh
parameter. By default, rank
is used.--db_unseen {error,warn,score}
specifies, in a database predictor, what to do if a peptide that is absent from the database is encountered. Options are error
(script will exit), warn
(a warning will be printed, and a score of 0 will be assigned to that peptide), or score
(the peptide will be scored using --db_unseen_score
). warn
is the default. IS THIS AN ACCURATE DESCRIPTION OF THE BEHAVIOUR?
--db_unseen_score SCORE
specifies the penalty to apply if we encounter an "unseen" peptide and we are using --db_unseen score
. A penalty of 100 is the default.The scoring results can be output in various ways. In any case where a filename is specified, using the $
character will substitute the sequence name in cases where multiple sequences are being scored.
--report {total,hits,full}
specifies what kind of report to output to standard output. total
will provide the total score for each sequence. hits
will provide a report by allele for each peptide that is identified as a hit (i.e. score > 0). full
will provide a report by allele for all peptides in the sequence. Default is total
.
--report_file FILE
will output the report generated by --report
to a file in CSV format. If using multiple sequences, the $
symbol in the filename will be substituted with the sequence name or number.--plot_hits_file FILE
will output the results to a graphical file format. matplotlib
must be installed for this to work. Plot format will be determined by the filename extension given.Add some examples, probably from the demos.
The mhc_gen_db.py
script uses a lot of the same machinery as mhc_score.py
, and so many of the same options work the same way as described above.
The input options work in the same way as mhc_score.py
, but only one sequence can be used. This means the only available options are as follows:
--fa
--pdb
/--chain
You may not specify a multi-FASTA file (--fsa
), a peptides file (--pep
), or sequences from standard input or the command line.
The option --firstres
allows you to specify the residue number of the first residue in your PDB file. In cases where you have a PSSM based on PDB sequence (that doesn't necessarily start with residue 1), you will need this to correct the offset in the chain.
Inherent in making a database of scores is the need to determine what your design space should be. This is a complex question to do the combinatorial aspects of sliding window peptides, as outlined in detail here. Because it is impossible to score all of design space even in a very small region, we provide a number of ways of describing which amino acids to allow at each position, and which positions to sample.
Note that regardless of these options, the entire input sequence will be scored and stored in the database.
You may use one (and only one) of the following ways of specifying what amino acids to try at each position.
--aa_csv CSV_FILE
specifies a CSV file where each line includes a position number followed by all allowed amino acids (one-letter code) at each position, each separated by commas. (Wild-type is automatically added in if missing.) See tools/mhc_energy_tools/demo/in/2sak_A.region_muts.csv
for an example.--pssm PSSM_FILE
specifies a PSSM to use. Two PSSM formats are officially supported, though the script will try to parse unsupported formats as well. The two formats are the output from the web and command line versions of PSIBLAST. For more details on how to make a PSSM, see here. For examples of supported PSSMs, see tools/mhc_energy_tools/pssm_examples
.
--pssm_thresh PSSM_THRESH
sets the PSSM threshold above which a residue will be allowed. The PSSM should contain the log (base 2) of the observed substitution frequency at a given position divided by the expected substitution frequency at that position. A score > 1 means that the residue is observed more often than would be expected at random. The default is 1, though this can lead to very large database sizes if not increased.There are two options to limit which positions to sample. This is useful if, for example, you have a PSSM for the entire protein, but only want to score epitopes in specific regions.
--positions
allows you to specify which regions to allow to mutate. Anything not included will not be scores (except the wild-type sequence). It should be formatted as follows: --positions 3-30
to target a single region (residues 3-30, inclusive) and --positions 3-30,113-142
to target multiple regions (residues 3-30 and 113-142, inclusive).--lock
prevents regions from mutating. It is specified in the same format as --positions
.--positions
and --lock
are specified, --lock
overrides --positions
. A lock
position will not be mutated even if specified by --positions
.All predictors available in mhc_score.py
are available with mhc_gen_db.py
, and are configured in the same way. Specifically:
--propred
--netmhcii
--matrix
In addition, the --netmhcii_raw NETMHC_OUTPUT
option allows the raw output from a NetMHCII run to be used as input.
You may also use a SQL DB or CSV file generated by this script as your predictor. To do so, provide the path to said database/CSV file with the --db_in
or --csv_in
options.
All of the options from mhc_score.py
are available with mhc_gen_db.py
to configure the predictors, except --db_unseen
. Specifically:
--allele_set
--alleles
--epi_thresh
--noncanon
--netmhcii_score
(for NetMHCII only)The database will be output to a file specified by either the --db
or --csv
options. If the file already exists at that location, the script will normally fail. You can specify to add to the existing database by providing the --augment
argument for SQL DBs only. If you do so, it will verify that the database is using the same type of predictor and same list of alleles. If so, it will add any peptides that are missing from that database to the database. For either SQL DBs or CSV DBs, you may specify the --overwrite
command to delete the existing file and start from scratch.
If no database is given, the script will run, following instructions from the additional output options, without scoring or saving the database.
Additional output options:
--estimate_size
will calculate the number of peptides to be scored as configured. This is a good check to do before actually creating your database to get an idea of how big it will be. You can get a "back-of-napkin" estimate of the size of a SQL database by multiplying the number of peptides by 100, giving you an approximate file size in bytes for a 7 allele database. A database containing 3 million peptides will be about 300 MB, for example. The CSV file database will be about half the size.--peps_out PEPTIDE.TXT
will save all of the peptides to be scored in PEPTIDES.TXT
, with one peptide per line.--res_out RESFILE.RES
will generate a resfile based on the design space specified. Only positions with more than one allowed amino acid will be listed in the resfile, using PIKAA to list them. Recall that wild-type is always allowed, so this is listing PIKAA lines for all positions that allow at least one non-native amino acid to be sampled.
--chain X
, in addition to specifying what chain to use when a PDB file is used as input, will decide what chain ID to use in the resfile if a FASTA file is used as input. If input is using --fa
, --chain
is mandatory.--res_header COMMAND
applies COMMAND
to the resfile header, above the start
keyword, which will apply that command to all unspecified residues. By default, no header is included.--firstres
parameters can be used to specify the real number of the first residue, allowing the PSSM and PDB sequence numbering to match up.Because NetMHCII is slow compared to matrix-based scoring, spreading the computation for NetMHCII databases over multiple processors is enabled. Use the following commands to control multiprocessor database generation:
--nproc #
determines the number of processors to use. By default, it uses 1 (no multiprocessing).--batch #
determines the number of peptides to pass to each processor at a time. NetMHCII will score that batch, and then another batch will be submitted to that processor until all peptides are scored.Add some examples, probably from the demos.
The mhc_data_db.py
script is similar to mhc_gen_db.py
, but uses the IEDB database of experimentally-validated epitopes to construct the database. The user must download the IEDB to a local mysql server (can be done with this script) or as CSV files and provided as input.
If you have a working mysql server, the easiest way of obtaining the IEDB is to download it using this script. You can do this using the following options:
--iedb_fresh_mysql DATABASE_NAME
indicates that you want to download a fresh copy of the IEDB. DATABASE_NAME is a local mysql database that you can write to. Note that if this database exists, it will be overwritten! By default, "iedb" will be used as the database name.--mysql_user USER
and --mysql_pw PASSWORD
are the username and password needed to access the local mysql database. By default, these will be "root" and no password.Alternatively, you can download the IEDB as CSV files from their website.
When generating your database/CSV file to use with Rosetta, you must specify how to access the IEDB.
If you are using a local mysql database, provide the following options:
iedb_mysql DATABASE_NAME
gives the name of the local mysql database (downloaded as described above).--mysql_user USER
and --mysql_pw PASSWORD
are the username and password needed to access the local mysql database. By default, these will be "root" and no password.If you are using an IEDB CSV file, provide the following option:
iedb_csv CSV_FILE
, where CSV_FILE is the name of the IEDB CSV file.You may specify what alleles to use. Four allele sets are supported: test, greenbaum11, paul15, southwood98, and hlaII. These can be selected using the --allele_set
option. Alternatively, individual alleles can be specified using the --alleles
option (only if not using a predictor, see below).
IEDB has MHC ligand binding data, and also MHC ligand elution data. If you are using a local mysql database, you can specify to use one, the other, or both:
--assay_mhc_ligand_binding
controls whether to include the ligand binding data. Can be all
to use, or none
to leave out.--assay_mhc_ligand_elution
controls whether to include the ligand binding data. Can be all
to use, or none
to leave out.If you are using an IEDB CSV file, you must curate the CSV appropriately yourself.
In addition, the IEDB contains data with peptide of arbitrary length. To work with mhc_epitope
, we must reduce these to an appropriate 9mer size. This can be done in one of three ways:
--cores all
will add all 9mers from the peptide. For example, if the peptide is an 11mer, three entries will be added to the database: one ranging from positions 1-9, one from 2-10, and one from 3-11.--cores predicted_good
means that any 9mer that beats a particular threshold, according to a predictor (see below) will be added to the database.--cores predicted_best
is not currently implemented, but will add the best 9mer (strongest binder), according to a predictor (see below).--cores full
will put the entire peptide, regardless of its length, into the database. This behaviour is not supported in Rosetta, but might be useful for other applications.If you choose to select cores (i.e. predicted_good
or predicted_best
), you need to specify a predictor to score/rank them. Currently, you may use the following:
--propred
to use Propred.--pred_csv CSV_FILE
to use a precomputed CSV file generated by this script or mhc_gen_db.py
.--pred_db SQL_DB_FILE
to use a precomputed SQL DB file generated by this script or mhc_gen_db.py
.--matrix MATRIX_FILE
to use a generic scoring matrix.Note that the same allele_set
specified for the IEDB will be used with your predictor.
In addition, you may pass options to control the predictor:
--epi_thresh THRESH
sets the predictor's threshold for counting a peptide as a binder. Defaults to 5.0.--noncanon
sets how to deal with noncanonical residues.The database will be output to a file specified by either the --db
or --csv
options. If the file already exists at that location, the script will normally fail. You can specify to add to the existing database by providing the --augment
argument for SQL DBs only. If you do so, it will verify that the database is using the same type of predictor and same list of alleles. If so, it will add any peptides that are missing from that database to the database. For either SQL DBs or CSV DBs, you may specify the --overwrite
command to delete the existing file and start from scratch.
Add some examples, probably from the demos.
To do (installation and setting environment)
The following is the complete list of alleles supported for each Predictor. In brackets, which allele_sets it occurs in is indicated (test, paul15, greenbaum11).