CGC Substrate Prediction
Introduction
Identifying the target substrates of CAZyme gene clusters (CGCs) is crucial for understanding their biological functions. Since 2023, run_dbCAN has been able to predict substrates for identified CGCs using two complementary methods:
- Sequence Similarity Method: Compares CAZymes in the CGC against a reference database of characterized Polysaccharide Utilization Loci (dbCAN-PUL)
- Domain Composition Method: Analyzes the combination of CAZy domains present in the CGC

These methods are detailed in our dbCAN3 paper <https://academic.oup.com/nar/article/51/W1/W115/7147496> and provide predictions for various plant cell wall polysaccharides, including cellulose, xylan, pectin, and more.
Note
Substrate prediction now uses DIAMOND BLASTP instead of standard BLASTP for sequence similarity searches. This significantly improves performance without sacrificing accuracy.
Basic Usage
After running CGC identification, you can predict substrates for the identified CGCs:
run_dbcan substrate_prediction --output_dir <OUTPUT_DIRECTORY> --db_dir <DATABASE_DIRECTORY>
The command automatically processes the CGC data found in the output directory (generated by the previous steps) and generates substrate predictions.
Examples
run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_fna --db_dir db
run_dbcan substrate_prediction --output_dir output_EscheriaColiK12MG1655_faa --db_dir db
run_dbcan substrate_prediction --output_dir output_Xylhe1_faa --db_dir db
run_dbcan substrate_prediction --output_dir output_Xylona_heveae_TC161_faa --db_dir db
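When several genomes have already been processed, the same command can be applied to each output directory in turn. The following is a minimal sketch, assuming the directories follow the output_* naming used in the examples above and that the databases are in db. The helpers run_prediction and run_all and the DRY_RUN variable are hypothetical names introduced for illustration; they are not part of run_dbCAN. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: apply substrate prediction to every output_* directory.
# run_prediction, run_all and DRY_RUN are illustrative names, not run_dbCAN features.

DB_DIR="${DB_DIR:-db}"      # database directory (assumed location)
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print the commands, 0 = execute them

run_prediction() {
  # $1 = output directory of a previous CGC run, $2 = database directory
  if [ "$DRY_RUN" = "1" ]; then
    echo "run_dbcan substrate_prediction --output_dir $1 --db_dir $2"
  else
    run_dbcan substrate_prediction --output_dir "$1" --db_dir "$2"
  fi
}

run_all() {
  local dir
  for dir in output_*/; do
    [ -d "$dir" ] || continue        # skip when the glob matches nothing
    run_prediction "${dir%/}" "$DB_DIR"
  done
}

run_all
```

Setting DRY_RUN=0 executes the commands instead of printing them.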
Parameters
| Parameter | Description |
|---|---|
| --output_dir | Directory containing the CGC data; output files are also written here |
| --db_dir | Directory containing the substrate prediction reference databases |
For details on other parameters, see the API documentation in the left-hand navigation.
Output Files
Substrate prediction generates the following output files:
- substrate_prediction.tsv: Main output with predicted substrates for each CGC
- PUL_blast.out: Raw DIAMOND search results
- synteny_pdf/: Directory of synteny plots showing the gene cluster mappings between PULs and CGCs
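For a quick look at the results, the main table can be inspected with standard shell tools. The sketch below assumes that substrate_prediction.tsv is tab-separated with a single header line. The column names are not listed here, so the helpers (hypothetical names list_columns and count_rows) print whatever header is present rather than assuming specific columns.

```shell
#!/usr/bin/env bash
# Sketch: inspect substrate_prediction.tsv without assuming column names.
# list_columns and count_rows are illustrative helpers, not run_dbCAN commands.

list_columns() {
  # Print each header field of a tab-separated file on its own line.
  head -n 1 "$1" | tr '\t' '\n'
}

count_rows() {
  # Count the lines after the header (assumes exactly one header line).
  awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# The path is an assumption; point it at the file in your output directory.
TSV="${1:-substrate_prediction.tsv}"
if [ -f "$TSV" ]; then
  list_columns "$TSV"
  echo "data rows: $(count_rows "$TSV")"
else
  echo "file not found: $TSV" >&2
fi
```

Saved as a script, this can be called with the path to the table as its first argument, for example the substrate_prediction.tsv inside one of the output directories used above.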