CAZyme Gene Cluster (CGC) Identification
Introduction
After preparing the CGC annotation information and generating a cgc.gff file, the next step is to identify CAZyme Gene Clusters (CGCs) in your genome. The cgc_finder command analyzes the annotated genes to detect clusters involved in carbohydrate metabolism.
Basic Usage
Once the cgc.gff file is created (automatically saved in your output directory), you can run the CGC finder:
run_dbcan cgc_finder --output_dir <OUTPUT_DIRECTORY>
Examples for Different Genome Types
run_dbcan cgc_finder --output_dir output_EscheriaColiK12MG1655_faa
run_dbcan cgc_finder --output_dir output_EscheriaColiK12MG1655_fna/
run_dbcan cgc_finder --output_dir output_Xylona_heveae_TC161_faa/
run_dbcan cgc_finder --output_dir output_Xylhe1_faa/
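When several genomes are processed, the same command can be driven from a small script. The following is only a sketch: it assumes run_dbcan is available on your PATH, and the helper names build_cgc_finder_cmd and run_cgc_finder are hypothetical, not part of run_dbcan. It refuses to start when a directory has no cgc.gff file, since cgc_finder expects that file to exist.

```python
import subprocess
from pathlib import Path


def build_cgc_finder_cmd(output_dir):
    """Return the cgc_finder command line for one output directory."""
    return ["run_dbcan", "cgc_finder", "--output_dir", str(output_dir)]


def run_cgc_finder(output_dir):
    """Run cgc_finder only if the directory already holds a cgc.gff file."""
    gff = Path(output_dir) / "cgc.gff"
    if not gff.is_file():
        raise FileNotFoundError(f"{gff} not found; generate cgc.gff first")
    # check=True raises CalledProcessError if run_dbcan exits with an error.
    subprocess.run(build_cgc_finder_cmd(output_dir), check=True)


# Preview the commands without executing them.
for directory in [
    "output_EscheriaColiK12MG1655_faa",
    "output_Xylona_heveae_TC161_faa/",
]:
    print(" ".join(build_cgc_finder_cmd(directory)))
```

Calling run_cgc_finder(directory) inside the loop instead of print would execute each command in turn and stop at the first failure.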
CGC Prediction Rules
run_dbCAN supports two complementary rules for predicting CGCs:
Null Gene Search (Default)
Searches forward and backward from each significant gene, allowing up to a defined number of consecutive non-significant (null) genes. When a core/additional gene is found within that window, the search continues from that gene in the next iteration.
Distance-Based Search (AntiSMASH-like)
Searches forward and backward for core/additional genes within a base-pair distance (default: 15 kb). The distance is measured between consecutive significant genes.
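The two rules can be pictured as two tests applied to a pair of consecutive significant genes. The sketch below is an illustration, not run_dbcan's implementation: the Gene class and both function names are hypothetical, genes are assumed to be sorted by position on one contig, and the distance is assumed to run from the end of one gene to the start of the next.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int  # start coordinate (bp)
    end: int    # end coordinate (bp)
    kind: str   # "CAZyme", "TC", "TF", "STP", or "null"


def linked_by_null_genes(genes, i, j, max_null):
    """True if at most max_null genes sit between significant genes i and j."""
    return (j - i - 1) <= max_null


def linked_by_distance(genes, i, j, max_bp=15000):
    """True if gene j starts within max_bp of the end of gene i (assumed measure)."""
    return genes[j].start - genes[i].end <= max_bp


genes = [
    Gene(100, 1000, "CAZyme"),
    Gene(1200, 2000, "null"),
    Gene(2100, 3000, "null"),
    Gene(3200, 4000, "TC"),
    Gene(30000, 31000, "TF"),
]

# CAZyme -> TC: two null genes in between, 2200 bp apart.
print(linked_by_null_genes(genes, 0, 3, max_null=2), linked_by_distance(genes, 0, 3))
# TC -> TF: adjacent in gene order, but 26000 bp apart.
print(linked_by_null_genes(genes, 3, 4, max_null=2), linked_by_distance(genes, 3, 4))
```

In this toy layout the TC and TF genes are neighbors by gene count yet far apart in base pairs, so the two rules disagree on that pair.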
Note
You can use either rule individually or combine both for stricter CGC prediction criteria.
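To see why combining the rules is stricter, the sketch below groups significant genes and requires every enabled rule to hold before two consecutive significant genes are linked. It is a simplified illustration under stated assumptions, not the actual CGCFinder code: find_clusters is a hypothetical name, genes are (start, end, kind) tuples sorted along one contig, distance is measured end-to-start, and groups with fewer than two significant genes are dropped. Any further filtering by cluster composition is not shown.

```python
def find_clusters(genes, max_null=None, max_bp=None):
    """Group significant genes; a link must pass every enabled rule.

    genes: list of (start, end, kind) tuples sorted by position;
    kind "null" marks a non-significant gene.
    Returns lists of indices into genes.
    """
    clusters, current, prev = [], [], None
    for idx, (start, end, kind) in enumerate(genes):
        if kind == "null":
            continue
        if prev is not None:
            ok = True
            if max_null is not None:  # null gene rule enabled
                ok = ok and (idx - prev - 1) <= max_null
            if max_bp is not None:    # distance rule enabled
                ok = ok and (start - genes[prev][1]) <= max_bp
            if not ok:
                clusters.append(current)
                current = []
        current.append(idx)
        prev = idx
    if current:
        clusters.append(current)
    # Assumption: a cluster needs at least two significant genes.
    return [c for c in clusters if len(c) >= 2]


genes = [
    (100, 1000, "CAZyme"),
    (1200, 2000, "null"),
    (2100, 3000, "null"),
    (3200, 4000, "TC"),
    (30000, 31000, "TF"),
]

print(find_clusters(genes, max_null=2))                # [[0, 3, 4]]
print(find_clusters(genes, max_null=2, max_bp=15000))  # [[0, 3]]
```

With only the null gene rule, all three significant genes form one group. Adding the distance rule splits off the distant TF gene, so the combined result is a subset of the single-rule result.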
Advanced Usage
To customize CGC prediction parameters:
run_dbcan cgc_finder --output_dir output_dir --use_null_genes --num_null_gene 5 --use_distance --base_pair_distance 15000 --additional_genes TC --additional_genes TF --additional_genes STP
Key Parameters
| Parameter | Description |
|---|---|
| --output_dir | Directory containing the cgc.gff file |
| --use_null_genes | Enable null gene search strategy (true/false) |
| --num_null_gene | Maximum number of consecutive non-significant genes allowed |
| --use_distance | Enable distance-based search strategy (true/false) |
| --base_pair_distance | Maximum distance (bp) between significant genes |
| --additional_genes | Types of additional genes to include (TC: Transporter, TF: Transcription Factor, STP: Signal Transduction Protein) |
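If you script parameter sweeps, the command line can be assembled programmatically. The helper below is a hypothetical sketch (build_cgc_finder_cmd is not part of run_dbcan). It uses only the options shown in the advanced usage command and passes --use_null_genes and --use_distance as bare flags, as that command does. If your installed version expects an explicit true/false value, adjust accordingly.

```python
def build_cgc_finder_cmd(output_dir, num_null_gene=None,
                         base_pair_distance=None, additional_genes=()):
    """Assemble a cgc_finder command; unset rules are simply omitted."""
    cmd = ["run_dbcan", "cgc_finder", "--output_dir", str(output_dir)]
    if num_null_gene is not None:
        cmd += ["--use_null_genes", "--num_null_gene", str(num_null_gene)]
    if base_pair_distance is not None:
        cmd += ["--use_distance", "--base_pair_distance", str(base_pair_distance)]
    # The option is repeated once per additional gene type.
    for gene_type in additional_genes:
        cmd += ["--additional_genes", gene_type]
    return cmd


cmd = build_cgc_finder_cmd(
    "output_dir",
    num_null_gene=5,
    base_pair_distance=15000,
    additional_genes=["TC", "TF", "STP"],
)
print(" ".join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) to execute the command.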
Hint
CAZyme Gene Pairs
You can pass CAZyme to --additional_genes to generate CGCs that include CAZyme pairs. This is useful when you want CAZymes that occur together, such as a glycoside hydrolase (GH) and a glycosyltransferase (GT), to be reported in the same cluster.
Additional Gene Types
You can specify multiple additional gene types using the --additional_genes parameter:
Using multiple parameters:
--additional_genes TC --additional_genes TF
Custom Gene Types
Beyond the standard types (TC, TF, STP), you can include custom gene types such as peptidases or sulfatases. However, this requires:
Annotating these functions in your genome.
Manually updating the cgc.gff file to include these annotations.
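The manual update of cgc.gff can be scripted. The sketch below rewrites one attribute in column 9 of GFF-style lines for a chosen set of genes. It makes loud assumptions: the attribute keys protein_id and CGC_annotation, the label Peptidase, and the function name relabel_gff_lines are all illustrative placeholders. Inspect your own cgc.gff to confirm the actual keys and label format before adapting it.

```python
def relabel_gff_lines(lines, labels, id_key="protein_id",
                      label_key="CGC_annotation"):
    """Set label_key=<label> on rows whose id_key value appears in labels.

    id_key and label_key are assumed attribute names, not confirmed ones.
    Comment lines and rows without nine tab-separated columns pass through.
    """
    out = []
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split("\t")
        if text.startswith("#") or len(fields) != 9:
            out.append(text)
            continue
        # Parse "key=value;key=value" pairs from the attribute column.
        attrs = dict(a.partition("=")[::2] for a in fields[8].split(";") if a)
        gene_id = attrs.get(id_key)
        if gene_id in labels:
            attrs[label_key] = labels[gene_id]
            fields[8] = ";".join(f"{k}={v}" for k, v in attrs.items())
        out.append("\t".join(fields))
    return out


row = "chr1\t.\tCDS\t10\t500\t.\t+\t.\tprotein_id=P1;CGC_annotation=null"
print(relabel_gff_lines([row], {"P1": "Peptidase"})[0])
```

Write the result to a copy of the file rather than overwriting the original, so the automatically generated cgc.gff can be restored if needed.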
Non-coding Elements
While tRNAs and other non-coding genes are included in the cgc.gff file, they are treated as null genes rather than formal components of CAZyme Gene Clusters. You can include them with the --additional_genes parameter if your analysis needs them, but this deviates from the standard CGC definition.
Output Files
The CGC finder generates several output files:
cgc_standard.tsv - Text file listing all identified CGCs and their components.
cgc.gff - GFF-like format annotated with functional genes for CGCFinder and visualization.