RosettaEvolutionaryLigand (REvoLd) is an evolutionary algorithm to efficiently sample small molecules from fragment-based combinatorial make-on-demand libraries like Enamine REAL. It has been designed to optimize a normalized fitness score based on RosettaLigand, but can be used with any RosettaScript. Within 24 hours it can bring you from knowing nothing about potential ligands to a list of promising compounds. All you need is an input structure as a target, the combinatorial library definition and some idea about where the binding site is.
REvoLd is typically run multiple times with independent starts to sample the chemical space better. Each run finds different low-energy binders. A single run samples between 1,000 and 4,000 ligands. Depending on your available hardware and time, we suggest 10 to 20 runs.
Eisenhuth, Paul, et al. "REvoLd: Ultra-Large Library Screening with an Evolutionary Algorithm in Rosetta." arXiv preprint arXiv:2404.17329 (2024).
This section is intended to get you running REvoLd as quickly as possible and to provide all the information you need. However, much of it depends on work outside the scope of REvoLd, so it might change without being updated here. If you struggle to compile REvoLd or retrieve the Rosetta code, please contact the RosettaCommons. If you can't acquire the combinatorial library through BioSolveIT, contact Enamine directly, use any other provider, or create one yourself.
We highly recommend running REvoLd only with MPI support through mpirun/mpiexec/srun/etc. Depending on the size of your combinatorial library, you might need 200-300 GB of RAM (in total, not per CPU). We recommend using 50-60 CPUs per run.
We use the Enamine REAL space in our paper and in the drug discovery campaigns we are participating in. However, any combinatorial library is suitable. The required fields are described under input.
Enamine Ltd. has outsourced the licensing of their REAL space input data to BioSolveIT. You can contact them at https://www.biosolveit.de/contact/. In our experience, if you plan to use the REAL input data exclusively for academic research, the NDA process is straightforward and generally unproblematic. Feel free to mention Leipzig University and your plan to use REvoLd for your academic research.
The code is available on GitHub: https://github.com/RosettaCommons/rosetta
You can download it with:
git clone https://github.com/RosettaCommons/rosetta.git
This will create a directory called rosetta in your current working directory containing the code.
Make sure you have cloned the latest Rosetta version and have an MPI compiler available (for example mpiCC).
# navigate into your local Rosetta clone
cd Path/to/rosetta/source
# overwrite compile settings
cp tools/build/site.settings.release tools/build/site.settings
# start compile
./scons.py -j <num of processors> revold mode=release extras=mpi
You might run into MPI version issues, or SCons might give you the error “mpiCC not found”. In such cases, uncomment the following two lines in tools/build/site.settings under the "override" portion and change the paths to point to your MPI installation:
"cxx" : "/path/to/mpicxx",
"cc" : "/path/to/mpicc",
Additionally, make sure SCons tries to use the MPI compiler that is actually available on your machine. If you have mpicc available but SCons tries to use mpiCC, the build will fail. In that case, open tools/build/basics.settings and change all occurrences of mpiCC to mpicc.
REvoLd requires a single protein structure as target. Remember to prepare it. Additionally, you need the definition of the combinatorial space to sample from. This requires two files, one defining reactions and one defining reagents (or fragments and rules for how to link them). They can be obtained under an NDA from vendors like Enamine or from any other source, including self-made libraries. The definition consists of two whitespace-separated files with header lines that include the following fields:
reactions: reaction_id (a name or number to identify the reaction), components (number of fragments participating in the reaction), Reaction (SMARTS string defining the reaction or fragment coupling)
reagents: SMILES (defining the reagent), synton_id (unique identifier for the reagent), synton# (specifying the position when applying the SMARTS reaction, [1,...,components]), reaction_id (matching identifier to link the reagent to a reaction)
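Before starting a long run, it can help to confirm that both library files carry the fields listed above and that they agree with each other. The following is a minimal sketch of such a check, not part of REvoLd: the function names `read_table` and `check_library` are hypothetical, the toy data is made up, and the sketch assumes that no field value contains whitespace.

```python
import io

# Header fields named in the text above.
REACTION_FIELDS = {"reaction_id", "components", "Reaction"}
REAGENT_FIELDS = {"SMILES", "synton_id", "synton#", "reaction_id"}


def read_table(handle, required):
    """Read a whitespace-separated table with a header line into a list of dicts."""
    header = handle.readline().split()
    missing = required - set(header)
    if missing:
        raise ValueError(f"missing header fields: {sorted(missing)}")
    rows = []
    for line in handle:
        values = line.split()
        if values:  # skip blank lines
            rows.append(dict(zip(header, values)))
    return rows


def check_library(reactions, reagents):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    components = {r["reaction_id"]: int(r["components"]) for r in reactions}
    for reagent in reagents:
        rid = reagent["reaction_id"]
        if rid not in components:
            problems.append(f"{reagent['synton_id']}: unknown reaction_id {rid}")
            continue
        position = int(reagent["synton#"])
        if not 1 <= position <= components[rid]:
            problems.append(
                f"{reagent['synton_id']}: synton# {position} outside 1..{components[rid]}"
            )
    return problems


# Made-up toy data, only to exercise the checks (not a real library).
toy_reactions = io.StringIO(
    "reaction_id components Reaction\n"
    "r1 2 [C:1](=O)O.[N:2]>>[C:1](=O)[N:2]\n"
)
toy_reagents = io.StringIO(
    "SMILES synton_id synton# reaction_id\n"
    "CC(=O)O s1 1 r1\n"
    "CN s2 3 r1\n"
)
for problem in check_library(
    read_table(toy_reactions, REACTION_FIELDS),
    read_table(toy_reagents, REAGENT_FIELDS),
):
    print(problem)
```

For real files, pass open file handles instead of the `io.StringIO` objects. The check only covers the header fields and the link between reagents and reactions; it does not validate the SMILES or SMARTS strings themselves.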
Lastly, REvoLd requires a RosettaScript, which will be applied multiple times to each protein-ligand complex for docking and scoring. We are using the RosettaLigand script (XML file, paper). The XML file is slightly changed from the original publication to fix some syntax errors and to provide a binding site large enough to fit most small molecules. You might want to change box_size in the Transform tag and width in the ScoringGrid tag to suit your goals.
Summary of required files: a prepared target protein structure, a reactions file, a reagents file, and a RosettaScript XML file for docking and scoring.
You can use any Rosetta options on the command line or in a flags file. The following are the required options for REvoLd, followed by its specific optional settings.
-in:file:s Protein structure used as target
-parser:protocol RosettaScript used for refining and scoring protein-ligand complexes
-ligand_evolution:xyz Centroid position for initial ligand placement
-ligand_evolution:reagent_file Path to reagents file
-ligand_evolution:reaction_file Path to reactions file
-ligand_evolution:main_scfx Name of the scoring function specified in the docking protocol which should be used for scoring compounds.
-ligand_evolution:options Path to the REvoLd evolution options file, allows changing the evolutionary optimization, detailed below
-ligand_evolution:external_scoring Triggers vHTS mode, detailed below
-ligand_evolution:smiles_file Triggers vHTS mode, detailed below
-ligand_evolution:n_scoring_runs How often should the scoring protocol be applied to each complex. Defaults to 150.
-ligand_evolution:ligand_chain Name of the ligand chain used in the docking protocol. Defaults to X.
-ligand_evolution:pose_output_directory Directory to which all calculated poses will be written. Defaults to the run directory.
-ligand_evolution:score_mem_path Path to a former results file to load as score memory. Needs to be in REvoLd output format. REvoLd will check if a ligand is present in the memory file and use its score instead of docking it.
-ligand_evolution:main_term Name of the main term used as fitness function. Defaults to lid_root2. Terms are detailed below.
-ligand_evolution:n_generations For how many generations should REvoLd optimize. Defaults to 30.
mpirun -np 20 bin/revold.mpi.linuxgccrelease \
-in:auto_setup_metals \
-in:file:s 5ZBQ_0001_A.pdb \
-parser:protocol docking_perturb.xml \
-ligand_evolution:xyz -46.972 -19.708 70.869 \
-ligand_evolution:main_scfx hard_rep \
-ligand_evolution:reagent_file reagents_short.txt \
-ligand_evolution:reaction_file reactions_short.txt
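The value passed to -ligand_evolution:xyz has to come from somewhere. If your target structure was solved with a bound ligand, one possible starting point is the centroid of that ligand's atoms. The sketch below is an illustration under that assumption: `ligand_centroid` is a hypothetical helper, the residue name LIG and the two atoms are made up, and the parsing relies only on the fixed coordinate columns of the PDB format.

```python
def ligand_centroid(pdb_lines, residue_name):
    """Average the coordinates of all HETATM atoms whose residue name matches."""
    coords = []
    for line in pdb_lines:
        if line.startswith("HETATM") and line[17:20].strip() == residue_name:
            # Fixed PDB columns (1-based): x = 31-38, y = 39-46, z = 47-54.
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    if not coords:
        raise ValueError(f"no HETATM atoms found for residue {residue_name}")
    count = len(coords)
    return tuple(sum(axis) / count for axis in zip(*coords))


# Two made-up atoms formatted into the fixed PDB columns, for demonstration only.
template = "HETATM{:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}"
toy_lines = [
    template.format(1, "C1", "LIG", "X", 1, 10.0, 20.0, 30.0),
    template.format(2, "C2", "LIG", "X", 1, 12.0, 22.0, 34.0),
]
x, y, z = ligand_centroid(toy_lines, "LIG")
print(f"-ligand_evolution:xyz {x:.3f} {y:.3f} {z:.3f}")
```

With a real file, pass the open file handle as `pdb_lines`. If no bound ligand is available, the same averaging could be applied to the atoms of residues lining the suspected pocket, which would require adapting the filter.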
Important: Never start multiple REvoLd runs in the same directory, as they will overwrite each other's results.
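One way to keep independent runs apart is to give each run its own working directory. The sketch below only prepares the directories and prints the command for each; `plan_runs` is a hypothetical helper, the `/abs/path/to/...` locations are placeholders, and the flags are copied from the example command above. Input paths are written as absolute paths because each run starts in a different directory.

```python
import shlex
import tempfile
from pathlib import Path


def plan_runs(base_dir, n_runs, command):
    """Create one sub-directory per run and return (directory, command) pairs."""
    plan = []
    for index in range(1, n_runs + 1):
        run_dir = Path(base_dir) / f"run_{index:02d}"
        # exist_ok=False refuses to reuse a directory from an earlier run.
        run_dir.mkdir(parents=True, exist_ok=False)
        plan.append((run_dir, list(command)))
    return plan


# Placeholder paths; replace them with the real locations on your system.
command = [
    "mpirun", "-np", "20", "/abs/path/to/bin/revold.mpi.linuxgccrelease",
    "-in:file:s", "/abs/path/to/5ZBQ_0001_A.pdb",
    "-parser:protocol", "/abs/path/to/docking_perturb.xml",
    "-ligand_evolution:xyz", "-46.972", "-19.708", "70.869",
    "-ligand_evolution:main_scfx", "hard_rep",
    "-ligand_evolution:reagent_file", "/abs/path/to/reagents_short.txt",
    "-ligand_evolution:reaction_file", "/abs/path/to/reactions_short.txt",
]

# A temporary directory keeps this demonstration free of side effects.
with tempfile.TemporaryDirectory() as scratch:
    for run_dir, cmd in plan_runs(scratch, 3, command):
        print(f"cd {run_dir} && {shlex.join(cmd)}")
        # To actually launch a run: subprocess.run(cmd, cwd=run_dir, check=True)
```

On a cluster, the printed lines would more likely end up in separate job scripts than in a single Python process; the point of the sketch is only the one-directory-per-run layout.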
After running REvoLd you will find several files in your run directory:
Details can be found in the publication. In short, REvoLd starts with a population of ligands (default 200) randomly sampled from the combinatorial input space. Each ligand is added to a copy of the target pose and placed at the specified xyz position. The docking protocol is then applied n_scoring_runs times (default 150), and after each application the protein-ligand pose is scored with the specified main scoring function. The resulting scores are used to calculate the REvoLd-specific scores, referred to below as fitness scores:
The specified main term (default lid_root2) is used to select the fittest docking result for each ligand, and the corresponding pose is written to file. Its score is also used as the fitness of the ligand. After scoring each ligand, the population needs to be shrunk to simulate selective pressure (by default to a size of 50 ligands). This is done through a selector (default tournament selection). Finally, the evolutionary optimization cycle starts and is repeated until the maximum number of generations is reached (default 30). After each generation, all ligand information is saved.
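To illustrate how a tournament selector can shrink a population, here is a generic textbook-style sketch. It is not REvoLd's implementation: the function `tournament_select` is hypothetical, its parameters `tournament_size` and `acceptance_chance` are chosen to resemble the options file but their exact semantics in REvoLd are an assumption here, and lower fitness is treated as better because the scores are energies.

```python
import random


def tournament_select(fitness, n_select, tournament_size, acceptance_chance, rng):
    """Pick n_select distinct ligands; lower fitness counts as better."""
    remaining = dict(fitness)
    selected = []
    while remaining and len(selected) < n_select:
        size = min(tournament_size, len(remaining))
        # Draw random contestants, then rank them from best to worst.
        contestants = sorted(rng.sample(sorted(remaining), size), key=remaining.get)
        winner = contestants[-1]  # fallback if nobody is accepted below
        for candidate in contestants:
            # Walk down the ranking; each candidate wins with acceptance_chance.
            if rng.random() < acceptance_chance:
                winner = candidate
                break
        selected.append(winner)
        del remaining[winner]  # a ligand can be selected only once
    return selected


# Made-up fitness values for 200 hypothetical ligands.
rng = random.Random(0)
population = {f"ligand_{i}": rng.uniform(-3.0, 0.0) for i in range(200)}
survivors = tournament_select(population, 50, 15, 0.75, rng)
print(len(survivors))
```

Compared with simply keeping the 50 best ligands, a tournament lets some weaker ligands survive, which keeps more diversity in the population.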
Each generation starts by selecting individuals from the current population to produce offspring. This step can be flexibly modified through combinations of selectors and offspring factories. The resulting offspring can be identical to their parents (to preserve well-fit ligands for future generations), mutations that switch only a single fragment, or crossovers between two parents that combine fragments from both.
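The three kinds of offspring described above can be sketched on a toy ligand representation. Everything here is an illustrative assumption rather than REvoLd's data model: a ligand is stored as a reaction id plus one synton id per position, the `library` dictionary is made up, and crossover is restricted to parents sharing the same reaction to keep the example short.

```python
import random
from typing import NamedTuple


class Ligand(NamedTuple):
    reaction_id: str
    reagents: tuple  # one synton_id per position of the reaction


def identity(parent):
    """Pass a parent on unchanged."""
    return parent


def mutate(parent, library, rng):
    """Swap exactly one fragment for another reagent valid at the same position."""
    position = rng.randrange(len(parent.reagents))
    choices = [
        r for r in library[parent.reaction_id][position]
        if r != parent.reagents[position]
    ]
    if not choices:
        return parent  # no alternative reagent available at this position
    reagents = list(parent.reagents)
    reagents[position] = rng.choice(choices)
    return Ligand(parent.reaction_id, tuple(reagents))


def crossover(first, second, rng):
    """Take each position from one of the two parents (same reaction only)."""
    if first.reaction_id != second.reaction_id:
        raise ValueError("this sketch only crosses parents of the same reaction")
    reagents = tuple(rng.choice(pair) for pair in zip(first.reagents, second.reagents))
    return Ligand(first.reaction_id, reagents)


# Made-up library: reaction "r1" with two positions and their allowed syntons.
library = {"r1": [["a1", "a2", "a3"], ["b1", "b2"]]}
rng = random.Random(0)
mother = Ligand("r1", ("a1", "b1"))
father = Ligand("r1", ("a3", "b2"))
print(identity(mother), mutate(mother, library, rng), crossover(mother, father, rng))
```

The mutators in the options file additionally take similarity bounds and reaction weights, which this sketch leaves out.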
REvoLd can also be used to score a list of SMILES distributed over multiple processors if the aforementioned options external_scoring and smiles_file are set. Each processor from the mpirun call will select its own equally sized chunk of SMILES from the smiles_file, turn each SMILES into a protein-ligand complex, place it at the specified xyz position, and apply the specified RosettaScript n_scoring_runs times. No PDB files will be written during vHTS to reduce the disk space requirements, but each process will create a file called external_results_.csv. For each ligand, this file contains as many lines as specified by external_scoring, corresponding to the n best results from docking. Each line looks like smiles;term1;term2;...;termN
REvoLd's evolutionary optimization cycle is very modular and can be changed. This is usually not necessary if you simply want to use REvoLd, since the default settings are benchmarked and have shown reliable performance. If you do want to change them, use -ligand_evolution:options to specify an XML file:
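Since every process writes its own results file, the lines usually need to be merged and reduced to one value per ligand before ranking. The sketch below assumes only the line layout given above (smiles;term1;...;termN); `best_per_ligand` is a hypothetical helper, which column holds which score term is not specified here and must be checked against your own output, and lower values are assumed to be better. Lines whose chosen term is not a number, such as a possible header, are skipped.

```python
def best_per_ligand(lines, term_index=0):
    """Map each SMILES to its lowest value of the chosen score term."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        smiles, *terms = line.split(";")
        try:
            value = float(terms[term_index])
        except (ValueError, IndexError):
            continue  # header-like or malformed line
        if smiles not in best or value < best[smiles]:
            best[smiles] = value
    return best


# Made-up result lines for demonstration only.
toy_lines = [
    "CCO;-10.5;1.0",
    "CCO;-12.0;0.5",
    "CCN;-8.0;2.0",
]
ranking = sorted(best_per_ligand(toy_lines).items(), key=lambda item: item[1])
for smiles, score in ranking:
    print(smiles, score)
```

To combine the output of several processes, chain the lines of all per-process files (for example with `itertools.chain`) and pass them in as a single iterable.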
<Population main_selector="std_tournament" supported_size="50"/>
<PopulationInit init_type="random" size="200"/>
<PopulationInit init_type="best_loaded" size="25" selection="1000"/>
<Scorer similarity_penalty="0.5" similarity_penalty_threshold="0.95"/>
<Selector name="remove_elitist" type="elitist" size="15" remove="True"/>
<Selector name="std_tournament" type="tournament" size="15" remove="False" tournament_size="15" acceptance_chance="0.75"/>
<Selector name="std_roulette" type="roulette" size="15" remove="False" consider_positive="False"/>
<Factory name="std_mutator" type="mutator" size="30" reaction_weight="1.0" reagent_weight="2.0" min_similarity="0.6" max_similarity="0.99"/>
<Factory name="drastic_mutator" type="mutator" size="30" reaction_weight="0.0" reagent_weight="1.0" min_similarity="0.0" max_similarity="0.25"/>
<Factory name="reaction_mutator" type="mutator" size="30" reaction_weight="1.0" reagent_weight="0.0" min_similarity="0.6" max_similarity="0.99"/>
<Factory name="large_crossover" type="crossover" size="60"/>
<Factory name="std_identity" type="identity" size="15"/>
<!--order is important here-->
<EvolutionProtocol selector="std_roulette" factory="std_mutator"/>
<EvolutionProtocol selector="std_roulette" factory="large_crossover"/>
<EvolutionProtocol selector="std_roulette" factory="drastic_mutator"/>
<EvolutionProtocol selector="std_roulette" factory="reaction_mutator"/>
<EvolutionProtocol selector="remove_elitist" factory="std_identity"/>
<EvolutionProtocol selector="std_roulette" factory="std_mutator"/>
<EvolutionProtocol selector="std_roulette" factory="large_crossover"/>
The tags are explained in more detail:
Extensive checks with informative error messages are run when a new evolutionary protocol is parsed. If you are interested in writing your own protocol, we suggest simply experimenting with this system.
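REvoLd performs its own checks when it parses the options file, but a quick cross-reference check before submitting a job can save a queue slot. The sketch below is independent of REvoLd and only verifies that every selector and factory referenced by name is also defined. `check_protocol` is a hypothetical helper, and the enclosing `Options` root element is an assumption made so that the fragment is well-formed XML; use whatever root element your actual file has.

```python
import xml.etree.ElementTree as ET


def check_protocol(xml_text):
    """Return a list of references to selectors or factories that are not defined."""
    root = ET.fromstring(xml_text)
    selectors = {element.get("name") for element in root.iter("Selector")}
    factories = {element.get("name") for element in root.iter("Factory")}
    problems = []
    for population in root.iter("Population"):
        if population.get("main_selector") not in selectors:
            problems.append(f"undefined main_selector: {population.get('main_selector')}")
    for step in root.iter("EvolutionProtocol"):
        if step.get("selector") not in selectors:
            problems.append(f"undefined selector: {step.get('selector')}")
        if step.get("factory") not in factories:
            problems.append(f"undefined factory: {step.get('factory')}")
    return problems


# Shortened example built from tags shown above; the <Options> root is hypothetical.
example = """
<Options>
  <Population main_selector="std_tournament" supported_size="50"/>
  <Selector name="std_tournament" type="tournament" size="15" remove="False"
            tournament_size="15" acceptance_chance="0.75"/>
  <Selector name="std_roulette" type="roulette" size="15" remove="False"
            consider_positive="False"/>
  <Factory name="std_mutator" type="mutator" size="30" reaction_weight="1.0"
           reagent_weight="2.0" min_similarity="0.6" max_similarity="0.99"/>
  <EvolutionProtocol selector="std_roulette" factory="std_mutator"/>
  <EvolutionProtocol selector="std_roulette" factory="large_crossover"/>
</Options>
"""
for problem in check_protocol(example):
    print(problem)
```

In this shortened example the second EvolutionProtocol refers to large_crossover, which is deliberately left undefined so that the check has something to report.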