Quick Start Guide

This guide helps you get started with run_dbCAN using example data and explains the generated output files.

The tool offers two approaches:

Automated analysis - Complete workflow with a single command
Step-by-step analysis - Breaking down the process for troubleshooting or customization

This guide runs the automated analysis on each example file. For the step-by-step analysis, please refer to the user_guide documentation.

Example Data

We provide several example datasets in the example_data directory for testing purposes.

Database Download

First, download the database files required for the analysis. Make sure run_dbcan is installed and the `run_dbcan` environment is activated.

# Download database files
run_dbcan database \
  --db_dir db

# Optional: use --aws_s3 for faster and more stable downloads from AWS S3
# run_dbcan database --db_dir db --aws_s3
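Before moving on, it can help to confirm that the download actually left files in the database directory. The following is a minimal sketch: `db_dir_ready` is a hypothetical helper name, and it only checks that the directory exists and is not empty. It does not verify that any particular database file is present or complete.

```shell
# Hypothetical helper: succeed only if the directory exists and has at least one entry.
db_dir_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if db_dir_ready db; then
  echo "db looks populated"
else
  echo "db is missing or empty; rerun: run_dbcan database --db_dir db" >&2
fi
```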

CAZyme Annotation

Let’s annotate Carbohydrate-Active enZYmes (CAZymes) in our example data.

Example 1: Prokaryotic Genome (DNA)

# Download example prokaryotic genome (E. coli K-12 MG1655)
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.fna -O EscheriaColiK12MG1655.fna

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna \
  --db_dir db

Example 2: Prokaryotic Proteome (Protein)

# Download example prokaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.faa -O EscheriaColiK12MG1655.faa

# Run CAZyme annotation (use --mode protein for protein sequences)
run_dbcan CAZyme_annotation \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa \
  --db_dir db

Example 3: Eukaryotic Proteome (NCBI)

# Download example eukaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.faa -O Xylona_heveae_TC161.faa

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa \
  --db_dir db

Example 4: Eukaryotic Proteome (JGI)

# Download example JGI format proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.aa.fasta -O Xylhe1_GeneCatalog_proteins_20130827.aa.fasta

# Run CAZyme annotation
run_dbcan CAZyme_annotation \
  --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
  --mode protein \
  --output_dir output_Xylhe1_faa \
  --db_dir db
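The protein examples above differ only in the input file and output directory, so they can be wrapped in a small loop. This is a sketch under stated assumptions: `annotate_proteomes` and the `DRY_RUN` variable are hypothetical names introduced here, and the output directory is derived by replacing dots in the file name with underscores, which mirrors the naming in Examples 2 and 3. Only flags already shown in this guide are used.

```shell
# Hypothetical helper: annotate several protein FASTA files with the same settings.
# Usage: annotate_proteomes <db_dir> <file.faa> [<file.faa> ...]
# Set DRY_RUN=1 to print the commands instead of running them.
annotate_proteomes() {
  db_dir="$1"
  shift
  for faa in "$@"; do
    # Assumed naming scheme: output_<file name with dots replaced by underscores>
    out_dir="output_$(basename "$faa" | tr '.' '_')"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo run_dbcan CAZyme_annotation --input_raw_data "$faa" \
        --mode protein --output_dir "$out_dir" --db_dir "$db_dir"
    else
      run_dbcan CAZyme_annotation --input_raw_data "$faa" \
        --mode protein --output_dir "$out_dir" --db_dir "$db_dir"
    fi
  done
}

# Example: preview the commands for two of the example proteomes
# DRY_RUN=1; annotate_proteomes db EscheriaColiK12MG1655.faa Xylona_heveae_TC161.faa
```

Previewing with `DRY_RUN=1` first is a cheap way to check the derived output directory names before starting a long run.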

CAZyme Annotation Output Files

After running CAZyme annotation, you’ll find these output files:

uniInput.faa: Unified input file for all tools, generated by Prodigal (for nucleotide input) or provided by the user (for protein input).
dbCANsub_hmm_results.tsv: Results from the pyHMMER search using the dbCAN_sub-HMM database.
diamond.out: Results from the DIAMOND BLAST search against the CAZy database.
dbCAN_hmm_results.tsv: Results from the pyHMMER search using the dbCAN-HMM database.
overview.tsv: Consolidated summary of CAZyme predictions across all tools. We recommend focusing on results predicted by at least two tools.
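The "at least two tools" recommendation can be applied with a short filter. The sketch below assumes that overview.tsv is tab-separated and that its header row contains a column named `#ofTools` holding the number of supporting tools; check the header of your own file, since that column name is an assumption here. `filter_overview` is a hypothetical helper name.

```shell
# Hypothetical helper: print the header plus rows supported by at least MIN tools.
# Usage: filter_overview <overview.tsv> [MIN]   (MIN defaults to 2)
# Assumption: tab-separated file whose header has a column named "#ofTools".
filter_overview() {
  awk -F'\t' -v min="${2:-2}" '
    NR == 1 {
      # Locate the tool-count column by name instead of hard-coding its position
      for (i = 1; i <= NF; i++) if ($i == "#ofTools") col = i
      if (!col) { print "column #ofTools not found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    $col + 0 >= min
  ' "$1"
}

# Example:
# filter_overview output_EscheriaColiK12MG1655_fna/overview.tsv > overview_min2.tsv
```

Looking the column up by name keeps the filter working if the column order differs between versions, and it fails loudly if the expected column is absent.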

CGC (CAZyme Gene Cluster) Annotation

Next, let’s identify and analyze CAZyme gene clusters (CGCs).

Example 1: Prokaryotic Genome with Generated GFF

# Run CGC annotation with automatically generated GFF
run_dbcan easy_CGC \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna_CGC \
  --db_dir db \
  --input_gff gff \
  --gff_type prodigal

Example 2: Prokaryotic Proteome with External GFF

# Download example GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.gff -O EscheriaColiK12MG1655.gff

# Run CGC annotation with provided GFF
run_dbcan easy_CGC \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa_CGC \
  --db_dir db \
  --input_gff EscheriaColiK12MG1655.gff \
  --gff_type NCBI_prok

Example 3: Eukaryotic Proteome with External GFF

# Download example eukaryotic GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.gff -O Xylona_heveae_TC161.gff

# Run CGC annotation
run_dbcan easy_CGC \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa_CGC \
  --db_dir db \
  --input_gff Xylona_heveae_TC161.gff \
  --gff_type NCBI_euk

Example 4: JGI Format Data

# Download JGI format GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.gff -O Xylhe1_GeneCatalog_proteins_20130827.gff

# Run CGC annotation
run_dbcan easy_CGC \
  --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
  --mode protein \
  --output_dir output_Xylhe1_faa_CGC \
  --db_dir db \
  --input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
  --gff_type JGI

CGC Annotation Output Files

In addition to the CAZyme annotation outputs, CGC analysis produces:

non_CAZyme.faa: Non-CAZyme protein sequences extracted from uniInput.faa based on overview results.
diamond.out.tc: DIAMOND BLAST results against TCDB for transporter protein annotation.
TF_hmm_results.tsv: pyHMMER results using the TF-HMM database for transcription factor identification.
STP_hmm_results.tsv: pyHMMER results using the STP-HMM database for signal transduction protein identification.
total_cgc_info.tsv: Comprehensive annotation of all signature proteins (CAZymes, TC, TF, STP).
cgc.gff: Input file for CGCFinder in GFF format, generated from the input GFF and signature annotations.
cgc_standard_out.tsv: Standard output from CGCFinder showing identified CAZyme gene clusters.
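A quick tally of signature gene types gives a first impression of what the predicted clusters contain. This sketch assumes that cgc_standard_out.tsv is tab-separated with a header row that includes a column named `Gene Type`; that column name is an assumption, so confirm it against your own output. `tally_gene_types` is a hypothetical helper name.

```shell
# Hypothetical helper: count rows per gene type in a CGCFinder output table.
# Usage: tally_gene_types <cgc_standard_out.tsv>
# Assumption: tab-separated file whose header has a column named "Gene Type".
tally_gene_types() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "Gene Type") col = i
      if (!col) { print "column Gene Type not found" > "/dev/stderr"; bad = 1; exit 1 }
      next
    }
    { count[$col]++ }
    END {
      # exit inside a rule still runs END, so stop here on error
      if (bad) exit 1
      for (t in count) print t "\t" count[t]
    }
  ' "$1" | sort
}

# Example:
# tally_gene_types output_EscheriaColiK12MG1655_fna_CGC/cgc_standard_out.tsv
```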

Substrate Prediction

Finally, let’s predict substrates for the identified CAZymes and CGCs.

Example 1: Prokaryotic Genome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data EscheriaColiK12MG1655.fna \
  --mode prok \
  --output_dir output_EscheriaColiK12MG1655_fna_sub \
  --db_dir db \
  --input_gff gff \
  --gff_type prodigal

Example 2: Prokaryotic Proteome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data EscheriaColiK12MG1655.faa \
  --mode protein \
  --output_dir output_EscheriaColiK12MG1655_faa_sub \
  --db_dir db \
  --input_gff EscheriaColiK12MG1655.gff \
  --gff_type NCBI_prok

Example 3: Eukaryotic Proteome

# Run substrate prediction
run_dbcan easy_substrate \
  --input_raw_data Xylona_heveae_TC161.faa \
  --mode protein \
  --output_dir output_Xylona_heveae_TC161_faa_sub \
  --db_dir db \
  --input_gff Xylona_heveae_TC161.gff \
  --gff_type NCBI_euk