Quick Start Guide
This guide helps you get started with run_dbCAN using example data and explains the generated output files.
The tool offers two approaches:
Automated analysis - Complete workflow with a single command
Step-by-step analysis - Breaking down the process for troubleshooting or customization
This guide runs the automated analysis on each example file. For the step-by-step analysis, please refer to the user_guide documentation.
Example Data
We provide several example datasets in the example_data directory for testing purposes.
Database Download
First, download the database files required for the analysis. Make sure you have successfully installed run_dbCAN and activated the `run_dbcan` environment.
# Download database files
run_dbcan database \
--db_dir db
# Optional: use --aws_s3 for faster and more stable downloads from AWS S3
# run_dbcan database --db_dir db --aws_s3
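Before moving on, it can help to confirm that the download actually populated the database directory. The following is a minimal sketch, assuming only that a successful download leaves at least one entry in `db`. `check_db_dir` is a hypothetical helper, not part of run_dbCAN, and it does not validate the individual database files.

```shell
# Hypothetical helper (not part of run_dbCAN): succeed only if the
# database directory exists and contains at least one entry.
check_db_dir() {
    db_path="${1:-db}"
    if [ ! -d "$db_path" ]; then
        echo "missing: $db_path" >&2
        return 1
    fi
    # ls -A lists everything except . and ..; empty output means an empty directory
    if [ -z "$(ls -A "$db_path")" ]; then
        echo "empty: $db_path" >&2
        return 1
    fi
    echo "ok: $db_path"
}
```

For example, `check_db_dir db` prints `ok: db` and returns zero once the directory has content, so it can guard the annotation commands in a script.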
CAZyme Annotation
Let’s annotate Carbohydrate-Active enZYmes (CAZymes) in our example data.
Example 1: Prokaryotic Genome (DNA)
# Download example prokaryotic genome (E. coli K-12 MG1655)
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.fna -O EscheriaColiK12MG1655.fna
# Run CAZyme annotation
run_dbcan CAZyme_annotation \
--input_raw_data EscheriaColiK12MG1655.fna \
--mode prok \
--output_dir output_EscheriaColiK12MG1655_fna \
--db_dir db
Example 2: Prokaryotic Proteome (Protein)
# Download example prokaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.faa -O EscheriaColiK12MG1655.faa
# Run CAZyme annotation (use --mode protein for protein sequences)
run_dbcan CAZyme_annotation \
--input_raw_data EscheriaColiK12MG1655.faa \
--mode protein \
--output_dir output_EscheriaColiK12MG1655_faa \
--db_dir db
Example 3: Eukaryotic Proteome (NCBI)
# Download example eukaryotic proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.faa -O Xylona_heveae_TC161.faa
# Run CAZyme annotation
run_dbcan CAZyme_annotation \
--input_raw_data Xylona_heveae_TC161.faa \
--mode protein \
--output_dir output_Xylona_heveae_TC161_faa \
--db_dir db
Example 4: Eukaryotic Proteome (JGI)
# Download example JGI format proteome
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.aa.fasta -O Xylhe1_GeneCatalog_proteins_20130827.aa.fasta
# Run CAZyme annotation
run_dbcan CAZyme_annotation \
--input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
--mode protein \
--output_dir output_Xylhe1_faa \
--db_dir db
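The four runs above differ only in the input file, the mode, and the output directory, so the command line can be generated per file. The sketch below prints the command rather than running it, so you can review it first. `annotation_cmd` is a hypothetical helper, and two things in it are assumptions: that a `.fna` extension means a prokaryotic genome (`--mode prok`) while any other extension means protein input, and that the output directory is named after the input file as in Examples 1 to 3.

```shell
# Hypothetical helper: print (not run) the CAZyme_annotation command for one input file.
# ASSUMPTION: *.fna is a prokaryotic genome (--mode prok); everything else is protein.
annotation_cmd() {
    input="$1"
    case "$input" in
        *.fna) mode=prok ;;
        *)     mode=protein ;;
    esac
    # Output directory: "output_" + file name with dots replaced by underscores
    name=$(basename "$input" | tr '.' '_')
    echo "run_dbcan CAZyme_annotation --input_raw_data $input --mode $mode --output_dir output_$name --db_dir db"
}
```

A loop such as `for f in *.fna *.faa; do annotation_cmd "$f"; done` then lists one command per input file in the current directory.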
CAZyme Annotation Output Files
After running CAZyme annotation, you’ll find these output files:
uniInput.faa - Unified input file for all tools, generated by Prodigal (for nucleotide input) or provided by the user (for protein input).
dbCANsub_hmm_results.tsv - Results from the pyHMMER search using the dbCAN_sub-HMM database.
diamond.out - Results from the DIAMOND BLAST search against the CAZy database.
dbCAN_hmm_results.tsv - Results from the pyHMMER search using the dbCAN-HMM database.
overview.tsv - Consolidated summary of CAZyme predictions across all tools. We recommend focusing on results predicted by at least two tools.
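The recommendation to focus on predictions supported by at least two tools can be applied with a small filter. This is a sketch under one important assumption: that overview.tsv is tab-separated with a header row and a tool-count column named `#ofTools`. That column name is not stated in this guide, so check it with `head -n 1 overview.tsv` and pass the real name as the third argument if it differs. `filter_overview` is a hypothetical helper, not a run_dbCAN command.

```shell
# Hypothetical helper: print the header plus the rows of an overview file whose
# tool-count column is at least a threshold (default 2).
# ASSUMPTION: the tool-count column is named "#ofTools" unless given as $3.
filter_overview() {
    ov_col="${3:-}"
    if [ -z "$ov_col" ]; then ov_col='#ofTools'; fi
    awk -F'\t' -v col="$ov_col" -v min="${2:-2}" '
        NR == 1 {
            # Locate the column by its header name rather than by position
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        $idx + 0 >= min + 0
    ' "$1"
}
```

For example, `filter_overview output_EscheriaColiK12MG1655_fna/overview.tsv 2 > overview_2tools.tsv` keeps only the rows that meet the threshold.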
CGC (CAZyme Gene Cluster) Annotation
Next, let’s identify and analyze CAZyme gene clusters (CGCs).
Example 1: Prokaryotic Genome with Generated GFF
# Run CGC annotation with automatically generated GFF
run_dbcan easy_CGC \
--input_raw_data EscheriaColiK12MG1655.fna \
--mode prok \
--output_dir output_EscheriaColiK12MG1655_fna_CGC \
--db_dir db \
--input_gff gff \
--gff_type prodigal
Example 2: Prokaryotic Proteome with External GFF
# Download example GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_prok_test/EscheriaColiK12MG1655.gff -O EscheriaColiK12MG1655.gff
# Run CGC annotation with provided GFF
run_dbcan easy_CGC \
--input_raw_data EscheriaColiK12MG1655.faa \
--mode protein \
--output_dir output_EscheriaColiK12MG1655_faa_CGC \
--db_dir db \
--input_gff EscheriaColiK12MG1655.gff \
--gff_type NCBI_prok
Example 3: Eukaryotic Proteome with External GFF
# Download example eukaryotic GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/NCBI_euk_test/Xylona_heveae_TC161.gff -O Xylona_heveae_TC161.gff
# Run CGC annotation
run_dbcan easy_CGC \
--input_raw_data Xylona_heveae_TC161.faa \
--mode protein \
--output_dir output_Xylona_heveae_TC161_faa_CGC \
--db_dir db \
--input_gff Xylona_heveae_TC161.gff \
--gff_type NCBI_euk
Example 4: JGI Format Data
# Download JGI format GFF file
wget -q https://bcb.unl.edu/dbCAN2/download/test/JGI_test/Xylhe1_GeneCatalog_proteins_20130827.gff -O Xylhe1_GeneCatalog_proteins_20130827.gff
# Run CGC annotation
run_dbcan easy_CGC \
--input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
--mode protein \
--output_dir output_Xylhe1_faa_CGC \
--db_dir db \
--input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
--gff_type JGI
CGC Annotation Output Files
In addition to the CAZyme annotation outputs, CGC analysis produces:
non_CAZyme.faa - Non-CAZyme protein sequences extracted from uniInput.faa based on the overview results.
diamond.out.tc - DIAMOND BLAST results against TCDB for transporter protein annotation.
TF_hmm_results.tsv - pyHMMER results using the TF-HMM database for transcription factor identification.
STP_hmm_results.tsv - pyHMMER results using the STP-HMM database for signal transduction protein identification.
total_cgc_info.tsv - Comprehensive annotation of all signature proteins (CAZymes, TC, TF, STP).
cgc.gff - Input file for CGCFinder in GFF format, generated from the input GFF and signature annotations.
cgc_standard_out.tsv - Standard output from CGCFinder showing identified CAZyme gene clusters.
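To get a quick overview of a CGC result table, you can tally the values in one of its columns, for example how many genes of each signature type were reported. The sketch below assumes only that the file is tab-separated with a header row. This guide does not list the column headers of cgc_standard_out.tsv, so the hypothetical helper `count_by_column` takes the column name as an argument; inspect the header with `head -n 1` to find the right one.

```shell
# Hypothetical helper: tally the values found in one named column of a TSV file.
# Prints "count<TAB>value" lines in no particular order.
count_by_column() {
    awk -F'\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column by its header name
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { n[$idx]++ }
        END { for (k in n) print n[k] "\t" k }
    ' "$1"
}
```

Pipe the result through `sort -rn` to list the most frequent values first.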
Substrate Prediction
Finally, let’s predict substrates for the identified CAZymes and CGCs.
Example 1: Prokaryotic Genome
# Run substrate prediction
run_dbcan easy_substrate \
--input_raw_data EscheriaColiK12MG1655.fna \
--mode prok \
--output_dir output_EscheriaColiK12MG1655_fna_sub \
--db_dir db \
--input_gff gff \
--gff_type prodigal
Example 2: Prokaryotic Proteome
# Run substrate prediction
run_dbcan easy_substrate \
--input_raw_data EscheriaColiK12MG1655.faa \
--mode protein \
--output_dir output_EscheriaColiK12MG1655_faa_sub \
--db_dir db \
--input_gff EscheriaColiK12MG1655.gff \
--gff_type NCBI_prok
Example 3: Eukaryotic Proteome
# Run substrate prediction
run_dbcan easy_substrate \
--input_raw_data Xylona_heveae_TC161.faa \
--mode protein \
--output_dir output_Xylona_heveae_TC161_faa_sub \
--db_dir db \
--input_gff Xylona_heveae_TC161.gff \
--gff_type NCBI_euk
Example 4: JGI Format Data
# Run substrate prediction
run_dbcan easy_substrate \
--input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta \
--mode protein \
--output_dir output_Xylhe1_faa_sub \
--db_dir db \
--input_gff Xylhe1_GeneCatalog_proteins_20130827.gff \
--gff_type JGI
Substrate Prediction Output Files
In addition to previous outputs, substrate prediction produces:
substrate_prediction.tsv - Final output containing predicted substrates for each CAZyme gene cluster.
PUL_blast.out - DIAMOND blastp results from comparing CGCs against the dbCAN-PULs database.
synteny_pdf/ - Directory containing synteny plots showing gene cluster mappings between PULs and CGCs.
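After a full substrate prediction run, you can check that every output named in this guide is present. The sketch below assumes that all of these files are written directly into the output directory; a run that finds no CAZymes or CGCs may legitimately lack some of them. `check_outputs` is a hypothetical helper, not part of run_dbCAN.

```shell
# Hypothetical helper: report which of the outputs named in this guide are
# absent from an output directory. Returns non-zero if any is missing.
check_outputs() {
    out_dir="$1"
    missing=0
    for f in \
        uniInput.faa dbCANsub_hmm_results.tsv diamond.out \
        dbCAN_hmm_results.tsv overview.tsv \
        non_CAZyme.faa diamond.out.tc TF_hmm_results.tsv STP_hmm_results.tsv \
        total_cgc_info.tsv cgc.gff cgc_standard_out.tsv \
        substrate_prediction.tsv PUL_blast.out synteny_pdf
    do
        # -e is true for regular files and for directories (synteny_pdf/)
        if [ ! -e "$out_dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_outputs output_EscheriaColiK12MG1655_fna_sub` prints one `missing:` line per absent file and nothing when all are present.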