Quick Start Guide

This guide helps you get started with run_dbCAN using example data and explains the generated output files.

The tool offers two approaches:

  1. Automated analysis - Complete workflow with a single command

  2. Step-by-step analysis - Breaking down the process for troubleshooting or customization

This guide runs the automated analysis on each example file. For the step-by-step analysis, please refer to the user_guide documentation.

Example Data

We provide several example datasets in the example_data directory for testing purposes.

Database Download

First, download the database files required for the analysis. Make sure you have successfully installed run_dbcan and activated the `run_dbcan` environment.

# Download database files
run_dbcan database \
  --db_dir db

# Optional: use --aws_s3 for faster and more stable downloads from AWS S3
# run_dbcan database --db_dir db --aws_s3
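
Before moving on, it can help to confirm that the download actually populated the database directory. The following is a minimal sketch, not part of run_dbcan: `db_dir_ready` is a hypothetical helper that only checks that the directory exists and is not empty. It does not validate individual database files.

```shell
# Hypothetical helper (not a run_dbcan command): succeed only if the
# directory exists and contains at least one entry.
db_dir_ready() {
  dir="$1"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Usage, assuming the database was downloaded into ./db as above:
# db_dir_ready db || echo "db is missing or empty; rerun the download" >&2
```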

CAZyme Annotation

Let’s annotate Carbohydrate-Active enZYmes (CAZymes) in our example data.

Example 1: Prokaryotic Genome (DNA)

# Download example prokaryotic genome (E. coli K-12 MG1655)
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.fna -O EscheriaColiK12MG1655.fna

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna \
  --db_dir db

Example 2: Prokaryotic Proteome (Protein)

# Download example prokaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.faa -O EscheriaColiK12MG1655.faa

# Run CAZyme annotation (specify input format for protein sequences)
run_dbcan CAZyme_annotation \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa \
  --db_dir db

Example 3: Eukaryotic Proteome (NCBI)

# Download example eukaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.faa -O Xylona_heveae_TC161.faa

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa \
  --db_dir db

Example 4: Eukaryotic Proteome (JGI)

# Download example JGI format proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.aa.fasta -O Xylhe1_GeneCatalog_proteins_20130827.aa.fasta

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
  --mode protein \
  --output_dir output_Xylhe1_faa \
  --db_dir db
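
Because the examples call wget with -q, a failed download produces no visible error and can leave an empty or non-FASTA file behind. The sketch below shows one way to sanity-check an input before annotation. `looks_like_fasta` is a hypothetical helper, not part of run_dbcan, and it only tests that the first non-blank line starts with `>`.

```shell
# Hypothetical helper (not a run_dbcan command): return 0 if the file
# exists, is non-empty, and its first non-blank line starts with '>'.
looks_like_fasta() {
  fasta_file="$1"
  # -s is false for missing and for zero-length files
  [ -s "$fasta_file" ] || return 1
  # Take the first character of the first line that has any fields
  first_char=$(awk 'NF { print substr($0, 1, 1); exit }' "$fasta_file")
  [ "$first_char" = ">" ]
}

# Usage, after the wget step and before run_dbcan:
# looks_like_fasta EscheriaColiK12MG1655.fna || echo "download failed or not FASTA" >&2
```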

CAZyme Annotation Output Files

After running CAZyme annotation, you’ll find these output files:

uniInput.faa

Unified input file for all tools, generated by Prodigal (for nucleotide input) or provided by the user (for protein input).

dbCANsub_hmm_results.tsv

Results from pyHMMER search using dbCAN_sub-HMM database.

diamond.out

Results from DIAMOND BLAST search against CAZy database.

dbCAN_hmm_results.tsv

Results from pyHMMER search using dbCAN-HMM database.

overview.tsv

Consolidated summary of CAZyme predictions across all tools. We recommend focusing on results predicted by at least two tools.
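
The two-tool recommendation can be applied with a small filter. The following is a minimal sketch, not part of run_dbcan: `filter_overview` is a hypothetical helper, and it assumes that overview.tsv is tab-separated and has a header column named `#ofTools` holding the number of supporting tools. Check the header of your own file and adjust the column name if it differs.

```shell
# Hypothetical helper: print the header plus every row supported by at
# least MIN tools (default 2).
# ASSUMPTION: tab-separated file with a header column named "#ofTools".
filter_overview() {
  overview_file="$1"
  min_tools="${2:-2}"
  awk -F '\t' -v min="$min_tools" -v col_name="#ofTools" '
    NR == 1 {
      # Locate the tool-count column by name instead of by position
      for (i = 1; i <= NF; i++) if ($i == col_name) col = i
      if (!col) { print "column " col_name " not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($col + 0) >= (min + 0)
  ' "$overview_file"
}

# Usage, assuming the output directory from Example 1:
# filter_overview output_EscheriaColiK12MG1655_fna/overview.tsv 2 > overview_min2.tsv
```

Looking the column up by name keeps the filter working if the column order changes, and makes it fail loudly if the assumed header is absent.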

CGC (CAZyme Gene Cluster) Annotation

Next, let’s identify and analyze CAZyme gene clusters (CGCs).

Example 1: Prokaryotic Genome with Generated GFF

# Run CGC annotation with automatically generated GFF
run_dbcan easy_CGC \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna_CGC \
  --db_dir db \
  --input_gff gff \
  --gff_type prodigal

Example 2: Prokaryotic Proteome with External GFF

# Download example GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.gff -O EscheriaColiK12MG1655.gff

# Run CGC annotation with provided GFF
run_dbcan easy_CGC \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa_CGC \
  --db_dir db \
  --input_gff EscheriaColiK12MG1655.gff \
  --gff_type NCBI_prok

Example 3: Eukaryotic Proteome with External GFF

# Download example eukaryotic GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.gff -O Xylona_heveae_TC161.gff

# Run CGC annotation
run_dbcan easy_CGC \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa_CGC \
  --db_dir db \
  --input_gff Xylona_heveae_TC161.gff \
  --gff_type NCBI_euk

Example 4: JGI Format Data

# Download JGI format GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.gff -O Xylhe1_GeneCatalog_proteins_20130827.gff

# Run CGC annotation
run_dbcan easy_CGC \
  --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
  --mode protein \
  --output_dir output_Xylhe1_faa_CGC \
  --db_dir db \
  --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
  --gff_type JGI

CGC Annotation Output Files

In addition to the CAZyme annotation outputs, CGC analysis produces:

non_CAZyme.faa

Non-CAZyme protein sequences extracted from uniInput.faa based on overview results.

diamond.out.tc

DIAMOND BLAST results against TCDB for transporter protein annotation.

TF_hmm_results.tsv

pyHMMER results using TF-HMM database for transcription factor identification.

STP_hmm_results.tsv

pyHMMER results using STP-HMM database for signal transduction protein identification.

total_cgc_info.tsv

Comprehensive annotation of all signature proteins (CAZymes, TC, TF, STP).

cgc.gff

Input file for CGCFinder in GFF format, generated from the input GFF and signature annotations.

cgc_standard_out.tsv

Standard output from CGCFinder showing identified CAZyme gene clusters.
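
To confirm that a CGC run completed, the file names listed above can be checked in one pass. This is a minimal sketch, not part of run_dbcan: `check_cgc_outputs` is a hypothetical helper that only tests for the presence of each file and says nothing about its contents.

```shell
# Hypothetical helper: report which CGC output files named in this guide
# are missing from an output directory. Returns non-zero if any is absent.
check_cgc_outputs() {
  cgc_out_dir="$1"
  rc=0
  for f in non_CAZyme.faa diamond.out.tc TF_hmm_results.tsv \
           STP_hmm_results.tsv total_cgc_info.tsv cgc.gff \
           cgc_standard_out.tsv; do
    if [ ! -e "$cgc_out_dir/$f" ]; then
      echo "missing: $f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, assuming the output directory from Example 2:
# check_cgc_outputs output_EscheriaColiK12MG1655_faa_CGC
```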

Substrate Prediction

Finally, let’s predict substrates for the identified CAZymes and CGCs.

Example 1: Prokaryotic Genome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna_sub \
  --db_dir db \
  --input_gff gff \
  --gff_type prodigal

Example 2: Prokaryotic Proteome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa_sub \
  --db_dir db \
  --input_gff EscheriaColiK12MG1655.gff \
  --gff_type NCBI_prok

Example 3: Eukaryotic Proteome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa_sub \
  --db_dir db \
  --input_gff Xylona_heveae_TC161.gff \
  --gff_type NCBI_euk

Example 4: JGI Format Data

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
  --mode protein \
  --output_dir output_Xylhe1_faa_sub \
  --db_dir db \
  --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
  --gff_type JGI

Substrate Prediction Output Files

In addition to previous outputs, substrate prediction produces:

substrate_prediction.tsv

Final output containing predicted substrates for each CAZyme gene cluster.

PUL_blast.out

DIAMOND blastp results from comparing CGCs against dbCAN-PULs database.

synteny_pdf/

Directory containing synteny plots showing gene cluster mappings between PULs and CGCs.
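
For a quick look at what a substrate prediction run produced, the outputs can be summarized from the shell. The following is a minimal sketch under stated assumptions, not part of run_dbcan: `summarize_substrate_outputs` is a hypothetical helper, it assumes the first line of substrate_prediction.tsv is a header row, and it assumes the plots in synteny_pdf/ carry a `.pdf` extension.

```shell
# Hypothetical helper: count prediction rows and synteny plots.
# ASSUMPTIONS: substrate_prediction.tsv starts with one header line;
# plots in synteny_pdf/ are named *.pdf.
summarize_substrate_outputs() {
  sub_out_dir="$1"
  # Non-blank lines after the header
  rows=$(awk 'NR > 1 && NF' "$sub_out_dir/substrate_prediction.tsv" | wc -l | tr -d ' ')
  # PDF files anywhere under synteny_pdf/ (0 if the directory is absent)
  pdfs=$(find "$sub_out_dir/synteny_pdf" -type f -name '*.pdf' 2>/dev/null | wc -l | tr -d ' ')
  echo "predictions=$rows synteny_plots=$pdfs"
}

# Usage, assuming the output directory from Example 1:
# summarize_substrate_outputs output_EscheriaColiK12MG1655_fna_sub
```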