CGC Substrate Prediction

Introduction#

Identifying the target substrates of CAZyme gene clusters (CGCs) is crucial for understanding their biological functions. Since 2023, run_dbCAN has been able to predict substrates for identified CGCs using two complementary methods:

  1. Sequence Similarity Method: Compares CAZymes in the CGC against a reference database of characterized Polysaccharide Utilization Loci (dbCAN-PUL)

  2. Domain Composition Method: Analyzes the combination of CAZy domains present in the CGC

These methods are detailed in our dbCAN3 paper <https://academic.oup.com/nar/article/51/W1/W115/7147496> and provide predictions for various plant cell wall polysaccharides, including cellulose, xylan, pectin, and more.

Note

Substrate prediction now uses DIAMOND BLASTP rather than standard BLASTP for sequence similarity searches, which significantly improves performance without sacrificing accuracy.

Basic Usage#

After running CGC identification, you can predict substrates for the identified CGCs:

run_dbcan substrate_prediction --output_dir <OUTPUT_DIRECTORY> --db_dir <DATABASE_DIRECTORY>

The command reads the CGC data in the output directory (generated by the previous steps) and writes the substrate predictions to the same directory.

Examples#

Prokaryotic genome analysis#

run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_fna --db_dir db

Prokaryotic protein analysis#

run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_faa --db_dir db

JGI fungal protein analysis#

run_dbcan substrate_prediction --output_dir output_Xylhe1_faa --db_dir db

Custom fungal protein analysis#

run_dbcan substrate_prediction --output_dir output_Xylona_heveae_TC161_faa --db_dir db
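When several result directories need substrate predictions, the per-directory commands above can be wrapped in a loop. The following is a minimal sketch and not part of run_dbCAN itself: predict_all and the RUN_DBCAN variable are hypothetical helper names introduced here, and the only run_dbcan call used is the substrate_prediction subcommand with the --output_dir and --db_dir options shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with run_dbCAN): run substrate prediction
# for several CGC result directories, one after another.
# RUN_DBCAN may be overridden, e.g. RUN_DBCAN=echo, to preview the commands.
RUN_DBCAN="${RUN_DBCAN:-run_dbcan}"

predict_all() {
    # First argument: database directory; remaining arguments: output directories.
    local db_dir="$1"
    shift
    local out_dir
    for out_dir in "$@"; do
        if [ ! -d "$out_dir" ]; then
            # Skip directories that do not exist instead of aborting the batch.
            echo "skipping $out_dir: directory not found" >&2
            continue
        fi
        "$RUN_DBCAN" substrate_prediction --output_dir "$out_dir" --db_dir "$db_dir"
    done
}
```

For example, predict_all db output_EscheriaColiK12MG1655_fna output_Xylhe1_faa would process both directories in turn. Setting RUN_DBCAN=echo first prints the commands without running them, which is a cheap way to check the directory list.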

Parameters#

Parameter       Description
--output_dir    Directory that contains the CGC data and receives the output files
--db_dir        Directory that contains the substrate prediction reference databases

For the full list of parameters, see the API documentation.

Output Files#

Substrate prediction generates the following output files:

  • substrate_prediction.tsv - Main output with predicted substrates for each CGC

  • PUL_blast.out - Raw DIAMOND search results

  • synteny_pdf/ - Directory of synteny plots showing the gene cluster mappings between PULs and CGCs
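
After a run, it can be useful to confirm that the three outputs listed above are present before moving on to downstream analysis. The sketch below assumes they sit directly inside the directory passed as --output_dir. check_substrate_outputs is a hypothetical helper name, not a run_dbCAN command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the substrate prediction outputs exist.
# Assumes the files are written directly into the given output directory.
check_substrate_outputs() {
    local out_dir="$1"
    local status=0
    local f
    # The two result files named in the output list.
    for f in substrate_prediction.tsv PUL_blast.out; do
        if [ ! -f "$out_dir/$f" ]; then
            echo "missing file: $out_dir/$f" >&2
            status=1
        fi
    done
    # The directory that holds the synteny plots.
    if [ ! -d "$out_dir/synteny_pdf" ]; then
        echo "missing directory: $out_dir/synteny_pdf" >&2
        status=1
    fi
    return "$status"
}
```

Calling check_substrate_outputs output_Xylhe1_faa returns 0 when everything is present and 1 otherwise, with one message per missing item on standard error. The check only tests for existence, so an empty results file still passes.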