SILCS-BML --------- Background ^^^^^^^^^^ Bayesian Machine Learning (BML) Markov Chain Monte Carlo optimization allows improved predictability of the LGFE from SILCS-MC docking or pose refinement through reweighting of the FragMaps when experimental data is available for compounds in a training set. As the standard FragMaps act as priors for the optimization, the number of compounds in the training set may be relatively small (i.e., as little as 10 compounds). The ability to use a small training set allows the approach to be used in early stage ligand optimization, enabling improved predictions, and facilitates the approach to be used and updated iteratively during the drug design process, which can involve splitting the compounds into congeneric series. SILCS-BML optimization involves reweighting of the FragMap contributions to the LGFE to better reproduce selected target data. BML reweighting can target the RMSD, Pearson correlation (R), or Percent Correct (PC) metrics. The reweighting should be performed in the presence of restraints, including flat-bottom (default), hard wall, or harmonic potentials. Output from the program is directly input into a new params file to be used in subsequent SILCS-MC docking. Reweighting includes k-fold cross validation. Note that the quality of the reweighted parameters is judged based on the results from the subsequent SILCS-MC docking. In cases of overfitting, the new LGFE scores will yield poor correlations. Additional details on SILCS-BML are available in references :cite:`Ustach:2019` and :cite:`Goel:2021` SILCS-BML Using the CLI ^^^^^^^^^^^^^^^^^^^^^^^ SILCS-BML is currently available only in CLI. BML is performed in five steps as follows: 1. **Generate fitting data:** :: $SILCSBIODIR/silcs-bml/1_gen_fitting_data posepdbsdir= The above command will set up SILCS-BML runs by generating training and validation sets based on PDB files of docked ligands under investigation and a user-provided list of experimental binding affinity data for the same ligands. The docked ligands should correspond to those generated from SILCS-MC docking, located in the ``minconfpdb`` directory with GFE scores of the different classified atoms included in each PDB file. .. note:: SILCS-BML requires pdbs generated from SILCS-MC docking and experimental data provided as a list of experimental binding affinity data (``exp.dat`` file in the directory in which the ``1_gen_fitting_data`` command is being executed or filepath specified by the user as an *optional parameter*). The SILCS-MC docked poses in PDB format can be generated using the ``silcs-mc/2_cals_lgfe_min_avg_sd`` command with option ``pdb=true``. *Required parameter*: * Path and name of directory containing ligand docked poses: :: posepdbsdir= The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory will correspond to ``3_silcsmc/minconfpdb`` if SILCS-MC was implemented with default parameters as described in :ref:`silcsmc_docking_using_cli`. *Optional parameters*: * Path and name of data file containing experimental binding affinities in kcal/mol: :: exp= The binding affinity data must be listed in a single column text file with one value per line. The data should be ordered such that the binding affinities of the ligands are listed in the same order as the pdb file names of the same ligands in alphanumeric order. By default, the file name will be ``exp.dat``. * The number of sets used for cross-validation: :: kfold= The supplied data sample will be split into a number of sets for cross-validation. By default, ``kfold=5``, and the sample data will be split into 5 sets. .. note:: The 0th set ``set=0`` will always be prepared and it will use all data as training data for the SILCS-BML refinement. * Path and name of output directory: :: silcsbmldir= * Option to skip the confirmation of the order of the experimental data: :: confirm= When ``confirm=false``, the ligand pdb file names will be printed on the terminal in the order in which the experimental data is expected to be (alphanumeric order). The user will be prompted to cancel or continue the command after viewing the order of the ligands with either ``y`` to continue or ``n`` to cancel. If the user is confident in the order of the experimental data, this confirmation stage can be skipped with ``confirm=true``. 2. **Run the refinement code:** :: $SILCSBIODIR/silcs-bml/2_run_refine posepdbsdir= The ``2_run_refine`` command will execute the SILCS-BML fitting based on the setup from the previous step. The initial RMSD, Pearson correlation coefficient, and percent correct of the ligand LGFEs and experimental binding affinities will be printed on the terminal next to the final RMSD, Pearson correlation coefficient, and percent correct of the final LGFEs optimized by reweighting FragMap contributions targeting the user-defined metric (``target``). Additionally, the sets with the best RMSD (lowest RMSD), Pearson correlation coefficient (highest R), and percent correct (highest PC) will be printed on the terminal. This information will also be saved in ``6_silcsbml/2_run_refine.out``. *Required parameters*: * Path and name of directory containing ligand docked poses: :: posepdbsdir= The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory should correspond to the same directory specified in the first ``$SILCSBIODIR/silcs-bml/1_gen_fitting_data`` step. *Optional parameters*: * Path and name of output directory: :: silcsbmldir= The path and name of the output directory should correspond to the same path and name specified in the first ``$SILCSBIODIR/silcs-bml/1_gen_fitting_data`` step. * Target metric for optimization (integer): :: target= SILCS-BML allows user to define the target property to be optimized with 3 options to choose from: Root mean square deviation (RMSD) with ``target=1``, Pearson correlation (R) with ``target=2``, or percent correct (PC) with ``target=3``. * Option to scale weights to yield original LGFE range (integer): :: scale= The SILCS-BML fitting can be constrained such that the resulting LGFE values of the ligands are in the same range of the original set of LGFE values. This scaling can be turned off (``scale=0``) or kept on (``scale=1``). * Restraint type: :: restraint= Currently, 3 types of restraint are offered: flat-bottom (``restraint=1``), hard wall (``restraint=2``), and harmonic (``restraint=3``). By default, flat-bottom restraints are used (``restraint=1``). The restraint types are described in further detail below for weighting factor :math:`x`, user defined force constant :math:`\kappa`, and user-defined upper and lower limits of :math:`x`, :math:`{x_u}` and :math:`{x_l}`, respectively. User-defined parameters :math:`\kappa`, :math:`{x_u}`, and :math:`{x_l}` can be defined using the ``kappa``, ``upper``, and ``lower`` option bulletted below, after the description of restraint types. [1]. Flat-bottom potential restraint (``restraint=1``) .. math:: y = \kappa (\frac{1}{1+\exp{\frac{x-x_l}{d}}}) \textrm{ if}, x < 1 y = \kappa (\frac{1}{1+\exp{\frac{x_u-x}{d}}}) \textrm{ if}, x > 1 Where :math:`d` determines how flat the bottom is between :math:`{x_l}` and :math:`{x_u}` limits. The value of :math:`d` is fixed at 0.01. Below is a plot showing this restraint for :math:`\kappa = 1` (``kappa=1``), :math:`{x_l} = 0.6` (``lower=0.6``), and :math:`{x_u} = 1.4` (``upper=1.4``): .. figure:: ../images/silcs-bml/flat-bottom.png :width: 80% [2]. Hard-wall potential restraint (``restraint=2``) .. math:: y = \mathrm{0} \textrm{ if}, \mathrm{lower < x < upper} y = \kappa \textrm{ else} Below is a plot showing this restraint for :math:`\kappa = 1` (``kappa=1``), :math:`{x_l} = 0.6` (``lower=0.6``), and :math:`{x_u} = 1.4` (``upper=1.4``): .. figure:: ../images/silcs-bml/hard-wall.png :width: 80% [3]. Harmonic potential restraint (``restraint=3``) .. math:: y = \kappa (x-1)^2 .. figure:: ../images/silcs-bml/harmonic.png :width: 80% * Restraint force constant (integer): :: kappa= * Upper limit of weighting factor (decimal): :: upper= * Lower limit of weighting factor (decimal): :: lower= * Minimum percentage of compounds with specific FragMap type to be included in optimization (decimal): :: mincmpds= This option selects FragMap types to be fully optimized based on a minimum percentage of compounds with that specific FragMap type. FragMap types represented by less than the defined percentage will be optimized to a limited extent defined as (0.8~1.2) scaling of the input weight of that FragMap type. This prevents overfitting of underrepresented FragMaps. By default, ``mincmpds=30``. 3. **Generate SILCS-MC Parameter File:** :: $SILCSBIODIR/silcs-bml/3_gen_silcsmc_inp posepdbsdir= After determining the optimal FragMap weighting in the second step, the ligands can be re-docked using SILCS-MC parameter files reflecting the tuned FragMap weights. The SILCS-MC parameter file with the optimal FragMap weighting for each cross-validation set (``params_reweight_*.inp``) can be generated using the above command. The number of ``params_reweight_*.inp`` files will correspond to the number of cross-validation sets specified using the ``k-fold`` parameter in the first ``1_gen_fitting_data`` step. *Required parameter*: * Path and name of directory containing ligand docked poses: :: posepdbsdir= The ligand docked poses must be in PDB format derived from SILCS-MC docking. This directory should correspond to the same directory specified in the ``$SILCSBIODIR/silcs-bml/1_gen_fitting_data`` and ``$SILCSBIODIR/silcs-bml/2_run_refine`` steps. *Optional parameters*: * Path and name of directory containing SILCS-BML fitting output: :: silcsbmldir= The path and name of the output directory should correspond to the same path and name specified in the previous two steps. * Path and name of SILCS-MC parameter file: :: paramsfile= The ``3_gen_silcsmc_inp`` command will edit the ``params.inp`` file (typically ``3_silcsmc/params.inp`` from original SILCS-MC docking) that was used to dock and produce the poses stored in the directory specified for the ``posepdbsdir`` parameter. The new parameter files will be stored in ``6_silcsbml`` (or the user-specified ``silcsbmldir``) as ``params_reweight_*.inp``. 4. **Run SILCS-MC with the BML optimized FragMaps:** :: $SILCSBIODIR/silcs-bml/4_run_silcsmc_bml prot= ligdir= The above command will initiate SILCS-MC docking runs using the new parameter file(s) (``params_reweight_*.inp``) generated in the previous step, which specifies the reweighted FragMaps for the LGFE calculations. By default, the parameter set with the best percent correct will be used. The user can specify which parameter set to use with the optional ``runset`` parameter. *Required parameters*: * Path and name of target protein PDB file: :: prot= * Path and name of directory containing ligand structure files: :: ligdir= Typically, ``ligdir`` can be the same directory as the ``posepdbsdir`` parameter used in the previous ``1_gen_fitting_data`` step assuming the user has used the standard ``3_silcsmc/minconfpdb`` directory as the ``posepdbsdir`` parameter. If the ``ligdir`` parameter is used, each molecule must be stored in its own individual file (Mol2 or SDF file) as only one molecule per file in the specified ``ligdir`` will be processed for SILCS-MC. If the user has an SD/SDF file containing multiple molecules, use the ``sdfile`` parameter instead of the ``ligdir`` parameter. *Optional parameters*: * Specify cross-validation set SILCS-MC parameter file (``params_reweight_*.inp``) to use: :: runset= By default, the ``paramsfile`` originating from the cross-validation set with the best final percent correct will be used. The user can choose to specify which set to use with the optional ``runset`` parameter. The user can choose to perform SILCS-MC docking using the ``params_reweight_*.inp`` file(s) generated for: - all (``runset=All``) cross-validation sets - only the best percent correct (``runset=BestPC``) cross-validation set - only the best Pearson correlation coefficient (``runset=BestR``) cross-validation set - a specific cross-validation set (``runset=``) * Path and name of directory containing SILCS-BML fitting output: :: silcsbmldir= The path and name of the output directory should correspond to the same path and name specified in the previous three steps. * Path and name of directory containing FragMaps: :: mapsdir= * Path and name of SD file containing all ligands: :: sdfile= * Option to bundle jobs: :: bundle= * Number of jobs to bundle into one: :: nproc= 5. **Collect LGFE and poses from SILCS-MC simulations:** :: $SILCSBIODIR/silcs-bml/5_calc_lgfe_bml ligdir= After the SILCS-MC docking simulations are complete, the above command will collect the resulting LGFE and docked poses of all ligands. The new RMSD, Pearson correlation coefficient, and percent correct of the re-docked ligands' LGFE scores versus the experimental binding affinities will be printed to the terminal. The following output files will be generated for the user to review: - ``6_silcsbml/report_bml_.csv`` containing the same information as printed on the terminal - ``6_silcsbml/3_silcsmc_bml_/minconfpdb`` containing the docked poses of the ligands - ``6_silcsbml/3_silcsmc_bml_/lgfe.csv`` containing the new ligand LGFE scores - ``6_silcsbml/report_bml_.png`` containing a correlation plot of the the experimental binding affinities versus ligand LGFE scores before and after SILCS-BML fitting *Required parameter*: * Path and name of directory containing ligand structure files: :: ligdir= The same ``ligdir`` as the previous step should be used unless the user has specified ``sdfile`` instead of ``ligdir`` in the previous step. *Optional parameters*: * Specify cross-validation set SILCS-MC parameter file (``params_reweight_*.inp``) to use: :: runset= The ``runset`` parameter should be the same as the previous ``4_run_silcsmc_bml`` step. Description of the ``runset`` parameter can be found in the previous step. * Path and name of directory containint SILCS-BML fitting output: :: silcsbmldir= The path and name of the output directory should correspond to the same path and name specified in the previous step. * Path and name of SD file containing all ligands: :: sdfile= * Number of top scoring poses to be collected: :: npose= * Option to output poses in PDB format: :: pdb= When the ``pdb`` keywoad is set to ``true``, the top scoring poses will be saved in PDB format along with SDF file format. By default, ``pdb=false``. * Option to print ligand SMILES strings in output ``6_silcsbml/3_silcsmc_bml_/lgfe.csv`` file: :: smiles= If ``smiles=true`` is used, the ``6_silcsbml/3_silcsmc_bml_/lgfe.csv`` output file will contain a column with the SMILCS string of each ligand. * Calculate 4DBA descriptors and LGFE-4DBA scores: :: fourdba= The bioavailablility of compounds can be evaluated using 4D Bioavailability (4DBA) scores. Compounds with higher 4DBA scores are more likely to have oral drug-like characteristics When ``fourdba=true`` additional calculations will be performed to calculate the 4DBA score for the ligands in addtion to LGFE score and LGFE-4DBA score. Please refer to :ref:`4dba_calculation` for additional details on 4DBA scores. * Path to python executable: :: python= If the default path to python does not include the necessary modules, the path to the python executable with the necessary module installed must be specified with the ``python`` parameter. The default python path can be checked using the ``which python`` command. Alternatively, entering the ``$SILCSBIODIR/silcs-bml/5_calc_lgfe_bml`` command will output the default path for python. The `matplotlib` package is required for plotting. If ``fourdba=true``, then the `rdkit`, `tqdm`, `ipython` and `scipy` packages are also required. Please refer to :ref:`python3_requirement` for more information. .. note:: Note that once the new ``params_reweight_*.inp`` files with the optimized FragMap weighting factors for each cross-validation set is produced it may be used for SILCS-MC docking and pose refinement using the 1_run_silcsmc_custom command and specifying ``paramsfile=params_reweight_.inp`` where ``params_reweight_*.inp`` is from step 3 (``3_gen_silcsmc_inp``) above.