SILCS-BML¶
Background¶
Bayesian Machine Learning (BML) Markov Chain Monte Carlo optimization allows improved predictability of the LGFE from SILCS-MC docking or pose refinement through reweighting of the FragMaps when experimental data is available for compounds in a training set. As the standard FragMaps act as priors for the optimization, the number of compounds in the training set may be relatively small (i.e., as little as 10 compounds). The ability to use a small training set allows the approach to be used in early stage ligand optimization, enabling improved predictions, and facilitates the approach to be used and updated iteratively during the drug design process, which can involve splitting the compounds into congeneric series. SILCS-BML optimization involves reweighting of the FragMap contributions to the LGFE to better reproduce selected target data. BML reweighting can target the RMSD, Pearson correlation (R), or Percent Correct (PC) metrics. The reweighting should be performed in the presence of restraints, including flat-bottom (default), hard wall, or harmonic potentials. Output from the program is directly input into a new params file to be used in subsequent SILCS-MC docking. Reweighting includes k-fold cross validation. Note that the quality of the reweighted parameters is judged based on the results from the subsequent SILCS-MC docking. In cases of overfitting, the new LGFE scores will yield poor correlations. Additional details on SILCS-BML are available in references [23] and [15]
SILCS-BML Using the CLI¶
SILCS-BML is currently available only in CLI. BML is performed in five steps as follows:
Generate fitting data:
$SILCSBIODIR/silcs-bml/1_gen_fitting_data posepdbsdir=<location and name of directory with docking poses as pdbs>The above command will set up SILCS-BML runs by generating training and validation sets based on PDB files of docked ligands under investigation and a user-provided list of experimental binding affinity data for the same ligands. The docked ligands should correspond to those generated from SILCS-MC docking, located in the
minconfpdb
directory with GFE scores of the different classified atoms included in each PDB file.Note
SILCS-BML requires pdbs generated from SILCS-MC docking and experimental data provided as a list of experimental binding affinity data (
exp.dat
file in the directory in which the1_gen_fitting_data
command is being executed or filepath specified by the user as an optional parameter). The SILCS-MC docked poses in PDB format can be generated using thesilcs-mc/2_cals_lgfe_min_avg_sd
command with optionpdb=true
.Required parameter:
Path and name of directory containing ligand docked poses:
posepdbsdir=<location and name of directory with docking poses as pdbs>The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory will correspond to
3_silcsmc/minconfpdb
if SILCS-MC was implemented with default parameters as described in SILCS-MC Docking Using the CLI.Optional parameters:
Path and name of data file containing experimental binding affinities in kcal/mol:
exp=<experimental data in same alphanumeric order as pose pdbs; default=exp.dat>The binding affinity data must be listed in a single column text file with one value per line. The data should be ordered such that the binding affinities of the ligands are listed in the same order as the pdb file names of the same ligands in alphanumeric order. By default, the file name will be
exp.dat
.The number of sets used for cross-validation:
kfold=<k-fold validation; default=5>The supplied data sample will be split into a number of sets for cross-validation. By default,
kfold=5
, and the sample data will be split into 5 sets.Note
The 0th set
set=0
will always be prepared and it will use all data as training data for the SILCS-BML refinement.Path and name of output directory:
silcsbmldir=<location and name of output directory; default=6_silcsbml>Option to skip the confirmation of the order of the experimental data:
confirm=<the order of the exp data has been confirmed; true/false; default=false>When
confirm=false
, the ligand pdb file names will be printed on the terminal in the order in which the experimental data is expected to be (alphanumeric order). The user will be prompted to cancel or continue the command after viewing the order of the ligands with eithery
to continue orn
to cancel. If the user is confident in the order of the experimental data, this confirmation stage can be skipped withconfirm=true
.
Run the refinement code:
$SILCSBIODIR/silcs-bml/2_run_refine posepdbsdir=<location and name of directory with docking poses as pdbs>The
2_run_refine
command will execute the SILCS-BML fitting based on the setup from the previous step. The initial RMSD, Pearson correlation coefficient, and percent correct of the ligand LGFEs and experimental binding affinities will be printed on the terminal next to the final RMSD, Pearson correlation coefficient, and percent correct of the final LGFEs optimized by reweighting FragMap contributions targeting the user-defined metric (target
). Additionally, the sets with the best RMSD (lowest RMSD), Pearson correlation coefficient (highest R), and percent correct (highest PC) will be printed on the terminal. This information will also be saved in6_silcsbml/2_run_refine.out
.Required parameters:
Path and name of directory containing ligand docked poses:
posepdbsdir=<location and name of directory with docking poses as pdbs>The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory should correspond to the same directory specified in the first
$SILCSBIODIR/silcs-bml/1_gen_fitting_data
step.Optional parameters:
Path and name of output directory:
silcsbmldir=<location and name of output directory; default=6_silcsbml>The path and name of the output directory should correspond to the same path and name specified in the first
$SILCSBIODIR/silcs-bml/1_gen_fitting_data
step.Target metric for optimization (integer):
target=<choose optimization target (1: RMSD | 2: Pearson Correlation | 3: Percent correct); default=3>SILCS-BML allows user to define the target property to be optimized with 3 options to choose from: Root mean square deviation (RMSD) with
target=1
, Pearson correlation (R) withtarget=2
, or percent correct (PC) withtarget=3
.Option to scale weights to yield original LGFE range (integer):
scale=<scale weights to yield original LGFE range (1:Yes | 0:No); default=1>The SILCS-BML fitting can be constrained such that the resulting LGFE values of the ligands are in the same range of the original set of LGFE values. This scaling can be turned off (
scale=0
) or kept on (scale=1
).Restraint type:
restraint=<choose restraint type (1: Flat-bottom | 2: Hard wall | 3: Harmonic); default=1>Currently, 3 types of restraint are offered: flat-bottom (
restraint=1
), hard wall (restraint=2
), and harmonic (restraint=3
). By default, flat-bottom restraints are used (restraint=1
).The restraint types are described in further detail below for weighting factor \(x\), user defined force constant \(\kappa\), and user-defined upper and lower limits of \(x\), \({x_u}\) and \({x_l}\), respectively. User-defined parameters \(\kappa\), \({x_u}\), and \({x_l}\) can be defined using the
kappa
,upper
, andlower
option bulletted below, after the description of restraint types.[1]. Flat-bottom potential restraint (
restraint=1
)\[ \begin{align}\begin{aligned}y = \kappa (\frac{1}{1+\exp{\frac{x-x_l}{d}}}) \textrm{ if}, x < 1\\y = \kappa (\frac{1}{1+\exp{\frac{x_u-x}{d}}}) \textrm{ if}, x > 1\end{aligned}\end{align} \]Where \(d\) determines how flat the bottom is between \({x_l}\) and \({x_u}\) limits. The value of \(d\) is fixed at 0.01.
Below is a plot showing this restraint for \(\kappa = 1\) (
kappa=1
), \({x_l} = 0.6\) (lower=0.6
), and \({x_u} = 1.4\) (upper=1.4
):[2]. Hard-wall potential restraint (
restraint=2
)[3]. Harmonic potential restraint (
restraint=3
)Restraint force constant (integer):
kappa=<force constant for restraint; default=300>Upper limit of weighting factor (decimal):
upper=<upper limit of weighting factor for restraint type 1 or 2; default=2>Lower limit of weighting factor (decimal):
lower=<lower limit of weighting factor for restraint type 1 or 2; default=0.1>Minimum percentage of compounds with specific FragMap type to be included in optimization (decimal):
mincmpds=<minimum percentage of compounds with specific FragMap type for that type to be included in the optimization; default=30>This option selects FragMap types to be fully optimized based on a minimum percentage of compounds with that specific FragMap type. FragMap types represented by less than the defined percentage will be optimized to a limited extent defined as (0.8~1.2) scaling of the input weight of that FragMap type. This prevents overfitting of underrepresented FragMaps. By default,
mincmpds=30
.
Generate SILCS-MC Parameter File:
$SILCSBIODIR/silcs-bml/3_gen_silcsmc_inp posepdbsdir=<location and name of directory with docking poses as pdbs>After determining the optimal FragMap weighting in the second step, the ligands can be re-docked using SILCS-MC parameter files reflecting the tuned FragMap weights. The SILCS-MC parameter file with the optimal FragMap weighting for each cross-validation set (
params_reweight_*.inp
) can be generated using the above command. The number ofparams_reweight_*.inp
files will correspond to the number of cross-validation sets specified using thek-fold
parameter in the first1_gen_fitting_data
step.Required parameter:
Path and name of directory containing ligand docked poses:
posepdbsdir=<location and name of directory with docking poses as pdbs>The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory should correspond to the same directory specified in the
$SILCSBIODIR/silcs-bml/1_gen_fitting_data
and$SILCSBIODIR/silcs-bml/2_run_refine
steps.Optional parameters:
Path and name of directory containing SILCS-BML fitting output:
silcsbmldir=<location and name of output directory; default=6_silcsbml>The path and name of the output directory should correspond to the same path and name specified in the previous two steps.
Path and name of SILCS-MC parameter file:
paramsfile=<silcs-mc params.inp file to be reweighted; default=params.inp>The
3_gen_silcsmc_inp
command will edit theparams.inp
file (typically3_silcsmc/params.inp
from original SILCS-MC docking) that was used to dock and produce the poses stored in the directory specified for theposepdbsdir
parameter. The new parameter files will be stored in6_silcsbml
(or the user-specifiedsilcsbmldir
) asparams_reweight_*.inp
.
Run SILCS-MC with the BML optimized FragMaps:
$SILCSBIODIR/silcs-bml/4_run_silcsmc_bml prot=<prot PDB> ligdir=<location and name of directory containing ligand mol2/sdf>The above command will initiate SILCS-MC docking runs using the new parameter file(s) (
params_reweight_*.inp
) generated in the previous step, which specifies the reweighted FragMaps for the LGFE calculations. By default, the parameter set with the best percent correct will be used. The user can specify which parameter set to use with the optionalrunset
parameter.Required parameters:
Path and name of target protein PDB file:
prot=<prot PDB>Path and name of directory containing ligand structure files:
ligdir=<location and name of directory containing ligand mol2/sdf>Typically,
ligdir
can be the same directory as theposepdbsdir
parameter used in the previous1_gen_fitting_data
step assuming the user has used the standard3_silcsmc/minconfpdb
directory as theposepdbsdir
parameter.If the
ligdir
parameter is used, each molecule must be stored in its own individual file (Mol2 or SDF file) as only one molecule per file in the specifiedligdir
will be processed for SILCS-MC. If the user has an SD/SDF file containing multiple molecules, use thesdfile
parameter instead of theligdir
parameter.Optional parameters:
Specify cross-validation set SILCS-MC parameter file (
params_reweight_*.inp
) to use:runset=<set # to run silcs-mc with re-weighted paramfile; All / BestPC / BestR / No. between [0-kfold]; default=BestPC>By default, the
paramsfile
originating from the cross-validation set with the best final percent correct will be used. The user can choose to specify which set to use with the optionalrunset
parameter. The user can choose to perform SILCS-MC docking using theparams_reweight_*.inp
file(s) generated for:
all (
runset=All
) cross-validation setsonly the best percent correct (
runset=BestPC
) cross-validation setonly the best Pearson correlation coefficient (
runset=BestR
) cross-validation seta specific cross-validation set (
runset=<any integer number from 0 to k-fold>
)Path and name of directory containing SILCS-BML fitting output:
silcsbmldir=<location and name of output directory; default=6_silcsbml>The path and name of the output directory should correspond to the same path and name specified in the previous three steps.
Path and name of directory containing FragMaps:
mapsdir=<location and name of directory containing FragMaps; default=maps>Path and name of SD file containing all ligands:
sdfile=<location and name of SD/SDF file>Option to bundle jobs:
bundle=<bundle multiple (single) jobs into larger jobs; default=true>Number of jobs to bundle into one:
nproc=<number of jobs to bundle; default=8>
Collect LGFE and poses from SILCS-MC simulations:
$SILCSBIODIR/silcs-bml/5_calc_lgfe_bml ligdir=<location and name of directory containing ligand mol2/sdf>After the SILCS-MC docking simulations are complete, the above command will collect the resulting LGFE and docked poses of all ligands. The new RMSD, Pearson correlation coefficient, and percent correct of the re-docked ligands’ LGFE scores versus the experimental binding affinities will be printed to the terminal. The following output files will be generated for the user to review:
6_silcsbml/report_bml_<runset>.csv
containing the same information as printed on the terminal
6_silcsbml/3_silcsmc_bml_<runset>/minconfpdb
containing the docked poses of the ligands
6_silcsbml/3_silcsmc_bml_<runset>/lgfe.csv
containing the new ligand LGFE scores
6_silcsbml/report_bml_<runset>.png
containing a correlation plot of the the experimental binding affinities versus ligand LGFE scores before and after SILCS-BML fittingRequired parameter:
Path and name of directory containing ligand structure files:
ligdir=<location and name of directory containing ligand mol2/sdf>The same
ligdir
as the previous step should be used unless the user has specifiedsdfile
instead ofligdir
in the previous step.Optional parameters:
Specify cross-validation set SILCS-MC parameter file (
params_reweight_*.inp
) to use:runset=<set # to run silcs-mc with re-weighted paramfile; All / BestPC / BestR / No. between [0-kfold]; default=BestPC>The
runset
parameter should be the same as the previous4_run_silcsmc_bml
step. Description of therunset
parameter can be found in the previous step.Path and name of directory containint SILCS-BML fitting output:
silcsbmldir=<location and name of output directory; default=6_silcsbml>The path and name of the output directory should correspond to the same path and name specified in the previous step.
Path and name of SD file containing all ligands:
sdfile=<location and name of SD file>Number of top scoring poses to be collected:
npose=<number of best scoring poses sorted by LGFE; default=1>Option to output poses in PDB format:
pdb=<write PDB file in minconfpdb folder; true/false; default=false>When the
pdb
keywoad is set totrue
, the top scoring poses will be saved in PDB format along with SDF file format. By default,pdb=false
.Option to print ligand SMILES strings in output
6_silcsbml/3_silcsmc_bml_<runset>/lgfe.csv
file:smiles=<write SMILES string to lgfe.csv; default=false>If
smiles=true
is used, the6_silcsbml/3_silcsmc_bml_<runset>/lgfe.csv
output file will contain a column with the SMILCS string of each ligand.Calculate 4DBA descriptors and LGFE-4DBA scores:
fourdba=<calc 4 descriptors and LGFE-4DBA score; true/false; default=false>The bioavailablility of compounds can be evaluated using 4D Bioavailability (4DBA) scores. Compounds with higher 4DBA scores are more likely to have oral drug-like characteristics When
fourdba=true
additional calculations will be performed to calculate the 4DBA score for the ligands in addtion to LGFE score and LGFE-4DBA score. Please refer to 4D Bioavailability (4DBA) Calculation for additional details on 4DBA scores.Path to python executable:
python=<path to python executable>If the default path to python does not include the necessary modules, the path to the python executable with the necessary module installed must be specified with the
python
parameter. The default python path can be checked using thewhich python
command. Alternatively, entering the$SILCSBIODIR/silcs-bml/5_calc_lgfe_bml
command will output the default path for python. The matplotlib package is required for plotting. Iffourdba=true
, then the rdkit, tqdm, ipython and scipy packages are also required. Please refer to Python 3 Requirement for more information.
Note
Note that once the new params_reweight_*.inp
files with the optimized FragMap
weighting factors for each cross-validation set is produced it may be used for
SILCS-MC docking and pose refinement using the 1_run_silcsmc_custom command and
specifying paramsfile=params_reweight_<best performing cross-validation set>.inp
where params_reweight_*.inp
is from step 3 (3_gen_silcsmc_inp
) above.