CAZyme Annotation

CAZyme Annotation#

Introduction#

CAZyme annotation is a critical step in identifying and classifying Carbohydrate-Active Enzymes (CAZymes) in biological sequences. The run_dbcan tool enables comprehensive annotation of CAZymes from various input types:

Prokaryotic genomes (nucleotide sequences)
Metagenomic contigs (nucleotide sequences)
Protein sequences (prokaryotic or eukaryotic)

The annotation process integrates multiple analytical tools to ensure high sensitivity and specificity in CAZyme identification.

Command Syntax#

run_dbcan CAZyme_annotation --input_raw_data <INPUT_FILE> --output_dir <OUTPUT_DIRECTORY> --db_dir <DATABASE_DIRECTORY> --mode <MODE>

Key Parameters#

Parameter	Description
`--input_raw_data`	Path to input sequence file (FASTA format)
`--output_dir`	Directory for output files
`--db_dir`	Directory containing database files
`--mode`	Analysis mode: `prok` (prokaryote), `meta` (metagenome), or `protein` (protein sequences)
`--methods`	Optional: Specify tools to use (`diamond`, `hmm`, and/or `dbCANsub`, default is `all`) Usage: `--methods diamond --methods hmm --methods dbCANsub` for multiple methods or just choose one/two.

Usage Examples#

Analyzing Prokaryotic Genomes#

When working with bacterial or archaeal genomes, use the prok mode:

run_dbcan CAZyme_annotation --input_raw_data EscheriaColiK12MG1655.fna --output_dir output_EscheriaColiK12MG1655_fna --db_dir db --mode prok

Analyzing Protein Sequences#

For pre-translated protein sequences, use the protein mode:

run_dbcan CAZyme_annotation --input_raw_data EscheriaColiK12MG1655.faa --output_dir output_EscheriaColiK12MG1655_faa --db_dir db --mode protein

Analyzing Eukaryotic Proteins#

Eukaryotic proteins are processed the same way using protein mode:

run_dbcan CAZyme_annotation --input_raw_data Xylona_heveae_TC161.faa --output_dir output_Xylona_heveae_TC161_faa --db_dir db --mode protein

run_dbcan CAZyme_annotation --input_raw_data Xylhe1_GeneCatalog_proteins_20130827.aa.fasta --output_dir output_Xylhe1_faa --db_dir db --mode protein

Tip

For large eukaryotic datasets, consider change the computational resources with --threads to specify the number of CPU cores. The default is all cores of your machine.

Output Files#

The annotation process generates several key output files in your specified output directory:

uniInput.faa - Unified input file for all tools
overview.txt - Summary of identified CAZymes
dbCAN_hmm_results.tsv - Detailed HMMER results
diamond.out - DIAMOND search results
dbCANsub_hmm_results.tsv - dbCAN sub-HMM results including substrate specificity

Customizing the Analysis#

To customize which analytical methods are used:

Using specific tools#

run_dbcan CAZyme_annotation --input_raw_data input.fna --output_dir output --db_dir db --mode prok --methods hmm --methods diamond

Available method combinations: hmm, diamond, dbCANsub, or any combination.

Next Steps

After completing CAZyme annotation, you may want to proceed to CGC Information Generation to identify CAZyme gene clusters.

Tip

Optional signal peptide and transmembrane topology columns (SignalP 6.0 / DeepTMHMM) require separate installation and testing. See SignalP 6.0 and DeepTMHMM (optional tools).