.. _substrate-prediction:

CGC Substrate Prediction
=========================

Introduction
-------------

Identifying the target substrates of CAZyme gene clusters (CGCs) is crucial for understanding their biological functions.
Since 2023, run_dbCAN has been able to predict substrates for identified CGCs using two complementary methods:

1. **Sequence Similarity Method**: Compares CAZymes in the CGC against dbCAN-PUL, a reference database of characterized Polysaccharide Utilization Loci (PULs).

2. **Domain Composition Method**: Analyzes the combination of CAZy domains present in the CGC.

These methods are detailed in our `dbCAN3 paper <https://academic.oup.com/nar/article/51/W1/W115/7147496>`_ and provide predictions for various plant cell wall polysaccharides,
including cellulose, xylan, pectin, and more.

.. note::
   Substrate prediction now uses DIAMOND BLASTp rather than standard BLASTp for sequence similarity searches,
   which significantly improves performance without sacrificing accuracy.

Basic Usage
------------

After running CGC identification, you can predict substrates for the identified CGCs:

.. code-block:: shell

   run_dbcan substrate_prediction --output_dir <OUTPUT_DIRECTORY> --db_dir <DATABASE_DIRECTORY>

The command automatically processes the CGC data found in the output directory (generated by the previous steps) and generates substrate predictions.

Examples
---------

.. code-block:: shell
   :caption: Prokaryotic genome analysis

   run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_fna --db_dir db

.. code-block:: shell
   :caption: Prokaryotic protein analysis

   run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_faa --db_dir db

.. code-block:: shell
   :caption: JGI fungal protein analysis

   run_dbcan substrate_prediction --output_dir output_Xylhe1_faa --db_dir db

.. code-block:: shell
   :caption: Custom fungal protein analysis

   run_dbcan substrate_prediction --output_dir output_Xylona_heveae_TC161_faa --db_dir db
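
When several output directories need substrate predictions, the same command can be wrapped in a loop.
The following is only a minimal sketch and is not part of run_dbCAN: the function name ``predict_substrates_batch`` is hypothetical,
and it relies solely on the ``--output_dir`` and ``--db_dir`` options used in the examples above.

.. code-block:: shell

   # Hypothetical helper (not shipped with run_dbCAN):
   # run substrate prediction on several output directories in turn.
   # Usage: predict_substrates_batch <DATABASE_DIRECTORY> <OUTPUT_DIRECTORY>...
   predict_substrates_batch() {
       db_dir="$1"
       shift
       for out_dir in "$@"; do
           # Skip anything that is not an existing directory.
           if [ ! -d "$out_dir" ]; then
               echo "skipping $out_dir: not a directory" >&2
               continue
           fi
           echo "predicting substrates for $out_dir" >&2
           run_dbcan substrate_prediction --output_dir "$out_dir" --db_dir "$db_dir"
       done
   }

For instance, ``predict_substrates_batch db output_EscheriaColiK12MG1655_fna output_EscheriaColiK12MG1655_faa``
would run the first two examples above one after the other.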

Parameters
------------

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Parameter
     - Description
   * - ``--output_dir``
     - Directory containing the CGC data; the output files are also written here
   * - ``--db_dir``
     - Directory containing substrate prediction reference databases

For the remaining parameters, see the API documentation in the navigation on the left.

Output Files
-------------

Substrate prediction generates the following output files:

* ``substrate_prediction.tsv`` - Main output with predicted substrates for each CGC
* ``PUL_blast.out`` - Raw DIAMOND search results
* ``synteny_pdf/`` - Directory of synteny plots showing the gene cluster mappings between PULs and CGCs
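
A quick way to confirm that a run produced these files is to check for them by name.
The sketch below is not part of run_dbCAN: the function name ``check_substrate_outputs`` is hypothetical, and it assumes
the three entries listed above sit directly inside the directory passed as ``--output_dir``.
This page does not state whether all three are always written (for example when no CGC matches a PUL),
so a "missing" line is a prompt to inspect the run rather than proof of a failure.

.. code-block:: shell

   # Hypothetical helper (not shipped with run_dbCAN):
   # report which of the documented output files are present.
   # Usage: check_substrate_outputs <OUTPUT_DIRECTORY>
   check_substrate_outputs() {
       out_dir="$1"
       missing=0
       # Plain files listed under "Output Files".
       for f in substrate_prediction.tsv PUL_blast.out; do
           if [ -f "$out_dir/$f" ]; then
               echo "found:   $f"
           else
               echo "missing: $f"
               missing=1
           fi
       done
       # Directory holding the synteny plots.
       if [ -d "$out_dir/synteny_pdf" ]; then
           echo "found:   synteny_pdf/"
       else
           echo "missing: synteny_pdf/"
           missing=1
       fi
       # Exit status 0 only if everything was found.
       return "$missing"
   }

Calling ``check_substrate_outputs <OUTPUT_DIRECTORY>`` prints one line per expected entry and returns a non-zero status if any is absent.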



