Quick Start Guide
=================

This guide helps you get started with run_dbCAN using example data and explains the generated output files.

The tool offers two approaches:

1. **Automated analysis** - Complete workflow with a single command

2. **Step-by-step analysis** - Breaking down the process for troubleshooting or customization

This guide runs the **Automated analysis** on each example file. For the **Step-by-step analysis**, please refer to the `user_guide` documentation.

Example Data
--------------

We provide several example datasets in the `example_data <https://bcb.unl.edu/dbCAN2/download/test/>`_ directory for testing purposes.


Database Download
------------------
First, download the database files required for the analysis.
**Make sure the run_dbcan environment is installed and activated before you begin.**

.. code-block:: shell

    # Download database files
    run_dbcan database \
      --db_dir db

    # Optional: use --aws_s3 for faster and more stable downloads from AWS S3
    # run_dbcan database --db_dir db --aws_s3


CAZyme Annotation
------------------

Let's annotate Carbohydrate-Active enZYmes (CAZymes) in our example data.

**Example 1: Prokaryotic Genome (DNA)**

.. code-block:: shell

    # Download example prokaryotic genome (E. coli K-12 MG1655)
    wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.fna -O EscheriaColiK12MG1655.fna

    # Run CAZyme annotation
    run_dbcan CAZyme_annotation \
      --input_raw_data EscheriaColiK12MG1655.fna \
      --mode prok \
      --output_dir output_EscheriaColiK12MG1655_fna \
      --db_dir db

.. _Escherichia coli Strain MG1655: https://www.ncbi.nlm.nih.gov/nuccore/U00096.2

**Example 2: Prokaryotic Proteome (Protein)**

.. code-block:: shell

    # Download example prokaryotic proteome
    wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.faa -O EscheriaColiK12MG1655.faa

    # Run CAZyme annotation (use --mode protein for protein sequences)
    run_dbcan CAZyme_annotation \
      --input_raw_data EscheriaColiK12MG1655.faa \
      --mode protein \
      --output_dir output_EscheriaColiK12MG1655_faa \
      --db_dir db

**Example 3: Eukaryotic Proteome (NCBI)**

.. code-block:: shell

    # Download example eukaryotic proteome
    wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.faa -O Xylona_heveae_TC161.faa

    # Run CAZyme annotation
    run_dbcan CAZyme_annotation \
      --input_raw_data Xylona_heveae_TC161.faa \
      --mode protein \
      --output_dir output_Xylona_heveae_TC161_faa \
      --db_dir db

.. _Xylona heveae TC161: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_001619985.1/

**Example 4: Eukaryotic Proteome (JGI)**

.. code-block:: shell

    # Download example JGI format proteome
    wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.aa.fasta -O Xylhe1_GeneCatalog_proteins_20130827.aa.fasta

    # Run CAZyme annotation
    run_dbcan CAZyme_annotation \
      --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
      --mode protein \
      --output_dir output_Xylhe1_faa \
      --db_dir db

.. _Xylhe1: https://mycocosm.jgi.doe.gov/Xylhe1/Xylhe1.home.html

CAZyme Annotation Output Files
------------------------------

After running CAZyme annotation, you'll find these output files:

``uniInput.faa``
    Unified input file for all tools, generated by Prodigal (for nucleotide input) or provided by the user (for protein input).

``dbCANsub_hmm_results.tsv``
    Results from pyHMMER search using dbCAN_sub-HMM database.

``diamond.out``
    Results from the DIAMOND blastp search against the CAZy database.

``dbCAN_hmm_results.tsv``
    Results from pyHMMER search using dbCAN-HMM database.

``overview.tsv``
    Consolidated summary of CAZyme predictions across all tools. We recommend focusing on results predicted by at least two tools.
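
The two-tool recommendation above can be applied with a small filter. The following is a minimal sketch and not part of run_dbCAN: ``filter_overview`` is a hypothetical helper, and the column name ``#ofTools`` is an assumption about the ``overview.tsv`` header. Check the first line of your own file and pass a different column name if it differs.

.. code-block:: shell

    # Hypothetical helper: keep the header plus rows supported by >= MIN tools.
    # ASSUMPTION: the tool-count column is named "#ofTools".
    # Usage: filter_overview FILE [MIN] [COLUMN_NAME]
    filter_overview() {
        overview_file=$1
        min_tools=${2:-2}
        count_column=${3:-"#ofTools"}
        awk -F '\t' -v min="$min_tools" -v col="$count_column" '
            NR == 1 {
                # Locate the tool-count column by its header name
                for (i = 1; i <= NF; i++) if ($i == col) idx = i
                if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
                print
                next
            }
            # Print data rows whose tool count reaches the threshold
            ($idx + 0) >= min
        ' "$overview_file"
    }

For example, ``filter_overview output_EscheriaColiK12MG1655_fna/overview.tsv 2`` would print the header followed by the rows predicted by two or more tools.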

CGC (CAZyme Gene Cluster) Annotation
------------------------------------

Next, let's identify and analyze CAZyme gene clusters (CGCs).

**Example 1: Prokaryotic Genome with Generated GFF**

.. code-block:: shell

    # Run CGC annotation with automatically generated GFF
    run_dbcan easy_CGC \
      --input_raw_data EscheriaColiK12MG1655.fna \
      --mode prok \
      --output_dir output_EscheriaColiK12MG1655_fna_CGC \
      --db_dir db \
      --input_gff gff \
      --gff_type prodigal

**Example 2: Prokaryotic Proteome with External GFF**

.. code-block:: shell

    # Download example GFF file
    wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.gff -O EscheriaColiK12MG1655.gff

    # Run CGC annotation with provided GFF
    run_dbcan easy_CGC \
      --input_raw_data EscheriaColiK12MG1655.faa \
      --mode protein \
      --output_dir output_EscheriaColiK12MG1655_faa_CGC \
      --db_dir db \
      --input_gff EscheriaColiK12MG1655.gff \
      --gff_type NCBI_prok

**Example 3: Eukaryotic Proteome with External GFF**

.. code-block:: shell

    # Download example eukaryotic GFF file
    wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.gff -O Xylona_heveae_TC161.gff

    # Run CGC annotation
    run_dbcan easy_CGC \
      --input_raw_data Xylona_heveae_TC161.faa \
      --mode protein \
      --output_dir output_Xylona_heveae_TC161_faa_CGC \
      --db_dir db \
      --input_gff Xylona_heveae_TC161.gff \
      --gff_type NCBI_euk

**Example 4: JGI Format Data**

.. code-block:: shell

    # Download JGI format GFF file
    wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.gff -O Xylhe1_GeneCatalog_proteins_20130827.gff

    # Run CGC annotation
    run_dbcan easy_CGC \
      --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
      --mode protein \
      --output_dir output_Xylhe1_faa_CGC \
      --db_dir db \
      --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
      --gff_type JGI

CGC Annotation Output Files
-----------------------------

In addition to the CAZyme annotation outputs, CGC analysis produces:

``non_CAZyme.faa``
    Non-CAZyme protein sequences extracted from ``uniInput.faa`` based on the results in ``overview.tsv``.

``diamond.out.tc``
    DIAMOND BLAST results against TCDB for transporter protein annotation.

``TF_hmm_results.tsv``
    pyHMMER results using TF-HMM database for transcription factor identification.

``STP_hmm_results.tsv``
    pyHMMER results using STP-HMM for signal transduction protein identification.

``total_cgc_info.tsv``
    Comprehensive annotation of all signature proteins (CAZymes, TC, TF, STP).

``cgc.gff``
    Input file for CGCFinder in GFF format, generated from the input GFF and signature annotations.

``cgc_standard_out.tsv``
    Standard output from CGCFinder showing identified CAZyme gene clusters.
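
A quick sanity check on the CGCFinder output is to count how many distinct clusters it reports. The sketch below is illustrative only: ``count_cgcs`` is a hypothetical helper, and the header names ``CGC#`` and ``Contig ID`` are assumptions about ``cgc_standard_out.tsv``. It also assumes cluster IDs may repeat across contigs, so it counts unique (contig, cluster) pairs.

.. code-block:: shell

    # Hypothetical helper: count distinct (contig, CGC) pairs in a CGCFinder table.
    # ASSUMPTION: the header contains columns named "CGC#" and "Contig ID".
    # Usage: count_cgcs FILE [CGC_COLUMN] [CONTIG_COLUMN]
    count_cgcs() {
        cgc_file=$1
        cgc_column=${2:-"CGC#"}
        contig_column=${3:-"Contig ID"}
        awk -F '\t' -v cgc="$cgc_column" -v contig="$contig_column" '
            NR == 1 {
                # Locate both columns by header name
                for (i = 1; i <= NF; i++) {
                    if ($i == cgc) c = i
                    if ($i == contig) g = i
                }
                if (!c || !g) {
                    print "expected columns not found" > "/dev/stderr"
                    bad = 1
                    exit 1
                }
                next
            }
            # Count each (contig, cluster) pair the first time it appears
            !seen[$g SUBSEP $c]++ { n++ }
            END { if (!bad) print n + 0 }
        ' "$cgc_file"
    }

For example, ``count_cgcs output_EscheriaColiK12MG1655_fna_CGC/cgc_standard_out.tsv`` would print a single number.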

Substrate Prediction
----------------------

Finally, let's predict substrates for the identified CAZymes and CGCs.

**Example 1: Prokaryotic Genome**

.. code-block:: shell

    # Run substrate prediction
    run_dbcan easy_substrate \
      --input_raw_data EscheriaColiK12MG1655.fna \
      --mode prok \
      --output_dir output_EscheriaColiK12MG1655_fna_sub \
      --db_dir db \
      --input_gff gff \
      --gff_type prodigal

**Example 2: Prokaryotic Proteome**

.. code-block:: shell

    # Run substrate prediction
    run_dbcan easy_substrate \
      --input_raw_data EscheriaColiK12MG1655.faa \
      --mode protein \
      --output_dir output_EscheriaColiK12MG1655_faa_sub \
      --db_dir db \
      --input_gff EscheriaColiK12MG1655.gff \
      --gff_type NCBI_prok

**Example 3: Eukaryotic Proteome**

.. code-block:: shell

    # Run substrate prediction
    run_dbcan easy_substrate \
      --input_raw_data Xylona_heveae_TC161.faa \
      --mode protein \
      --output_dir output_Xylona_heveae_TC161_faa_sub \
      --db_dir db \
      --input_gff Xylona_heveae_TC161.gff \
      --gff_type NCBI_euk

**Example 4: JGI Format Data**

.. code-block:: shell

    # Run substrate prediction
    run_dbcan easy_substrate \
      --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
      --mode protein \
      --output_dir output_Xylhe1_faa_sub \
      --db_dir db \
      --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
      --gff_type JGI

Substrate Prediction Output Files
-----------------------------------

In addition to the outputs described in the previous sections, substrate prediction produces:

``substrate_prediction.tsv``
    Final output containing predicted substrates for each CAZyme gene cluster.

``PUL_blast.out``
    DIAMOND blastp results from comparing CGCs against the dbCAN-PULs database.

``synteny_pdf/``
    Directory containing synteny plots showing gene cluster mappings between PULs and CGCs.
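
After a run finishes, it can help to confirm that the files listed in this guide are actually present. The following is a minimal sketch, not a run_dbCAN feature: ``check_outputs`` is a hypothetical helper that takes an output directory and a list of expected names, reports each as found or missing, and returns a non-zero status if any are missing.

.. code-block:: shell

    # Hypothetical helper: report which expected outputs exist in a directory.
    # Usage: check_outputs OUTPUT_DIR NAME [NAME ...]
    check_outputs() {
        out_dir=$1
        shift
        missing=0
        for name in "$@"; do
            if [ -e "$out_dir/$name" ]; then
                echo "found    $name"
            else
                echo "missing  $name"
                missing=1
            fi
        done
        # Non-zero status signals that at least one name was absent
        return "$missing"
    }

    # Example call (commented out; adjust the directory to your own run):
    # check_outputs output_EscheriaColiK12MG1655_fna_sub \
    #   overview.tsv cgc_standard_out.tsv substrate_prediction.tsv \
    #   PUL_blast.out synteny_pdf

The names passed in the example are the output files described in the sections above.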