CAZyme annotation comparison
This section compares the performance and results of run_dbCAN v5 with previous versions, highlighting improvements in accuracy, speed, and output formatting.
Annotation Results Comparison
We compared the annotation results of v4 and v5 on an identical input dataset (the E. coli K-12 MG1655 proteome). Core CAZyme predictions remain consistent between the two versions, indicating that accuracy has been maintained alongside the improvements to code structure and performance.
Figure 1: Comparison of annotation results: run_dbCAN v4
Figure 2: Comparison of annotation results: run_dbCAN v5
Key Improvements in Output Format
The new version (v5) provides several improvements in the output formatting:
More Precise Domain Boundaries
The v5 output now includes precise domain boundary information for both dbCAN-HMM and dbCAN-sub HMM:
    # v5 format with precise domain boundaries
    NP_414747.1 -|-|- GH23(101-244) GH23_e819(102-244)+CBM50_e338(344-384)+CBM50_e338(403-442) CBM50+GH23 3 GH23_e819|CBM50_e338|CBM50_e338

    # v4 format with limited domain information
    NP_414747.1 -|-|- GH23(101-244) GH23_e819+CBM50_e338+CBM50_e338 CBM50+GH23 3
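The `name(start-end)` notation in the v5 columns is straightforward to parse downstream. The following is a minimal sketch, not part of run_dbCAN itself: the function name `parse_domains` and the regular expression are assumptions based only on the example row shown above.

```python
import re

# Assumed shape of one v5 domain entry, e.g. "GH23_e819(102-244)".
DOMAIN = re.compile(r"^(?P<name>[^()]+)\((?P<start>\d+)-(?P<end>\d+)\)$")


def parse_domains(field):
    """Turn a '+'-joined v5 column value into (name, start, end) tuples.

    A value of "-" (no hit) yields an empty list.
    """
    if field == "-":
        return []
    domains = []
    for part in field.split("+"):
        match = DOMAIN.match(part)
        if match is None:
            raise ValueError(f"unexpected domain entry: {part!r}")
        domains.append((match["name"], int(match["start"]), int(match["end"])))
    return domains


# Value taken from the dbCAN_sub column of the v5 example row.
hits = parse_domains("GH23_e819(102-244)+CBM50_e338(344-384)+CBM50_e338(403-442)")
for name, start, end in hits:
    print(name, start, end, end - start + 1)
```

A v4-style value without coordinates (for example `GH23_e819+CBM50_e338`) raises `ValueError` in this sketch, which makes it easy to notice when a file from the older version is passed in by mistake.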
New “Recommend Results” Column
The v5 output adds a “Recommend Results” column containing the recommended annotation, which makes the results easier to interpret. The final result is chosen by the following priority rule:

CAZy-sub in dbCAN-HMM > dbCAN-subfam in dbCAN-sub-HMM > dbCAN-fam in dbCAN-HMM

For example:

| Gene ID | EC# | dbCAN_hmm | dbCAN_sub | DIAMOND | #ofTools | Recommend Results |
|---|---|---|---|---|---|---|
| NP_414632.1 | 2.4.1.227:11 | GT28(185-341) | GT28_e46(185-341) | GT28 | 3 | GT28_e46 |
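The priority rule can be illustrated with a short sketch. This is a simplified, protein-level approximation and not the run_dbCAN implementation: the function names are hypothetical, and the pattern used to recognise a CAZy subfamily label (family, underscore, number, as in `GH5_1`) is an assumption. The real tool may resolve the rule per domain rather than per protein.

```python
import re

# Assumed pattern for a CAZy subfamily label such as "GH5_1".
# dbCAN-sub labels such as "GT28_e46" do not match it.
CAZY_SUBFAMILY = re.compile(r"^[A-Z]+\d+_\d+$")


def split_domains(field):
    """Split a '+'-joined column value and drop '(start-end)' coordinates."""
    if field in ("", "-"):
        return []
    return [re.sub(r"\(\d+-\d+\)$", "", part) for part in field.split("+")]


def recommend(dbcan_hmm, dbcan_sub):
    """Apply: CAZy-sub in dbCAN-HMM > dbCAN-sub subfamily > dbCAN-HMM family."""
    hmm = split_domains(dbcan_hmm)
    sub = split_domains(dbcan_sub)
    if any(CAZY_SUBFAMILY.match(name) for name in hmm):
        chosen = hmm   # dbCAN-HMM already reports a CAZy subfamily
    elif sub:
        chosen = sub   # fall back to the dbCAN-sub subfamily
    elif hmm:
        chosen = hmm   # fall back to the dbCAN-HMM family
    else:
        return "-"     # neither HMM tool produced a hit
    return "|".join(chosen)


print(recommend("GT28(185-341)", "GT28_e46(185-341)"))
```

For the example row above, dbCAN-HMM reports only the family `GT28`, so the sketch falls through to the dbCAN-sub subfamily, matching the `GT28_e46` value in the “Recommend Results” column.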
Cleaner DIAMOND Results
The v5 version eliminates extraneous file paths from DIAMOND results, providing cleaner output:
    # v5 format with clean DIAMOND results
    NP_414555.1 - - - GT1 1 -

    # v4 format with file paths in results
    NP_414555.1 - - - Melli1_GeneCatalog_proteins_20150227.aa.fasta+GT1 1
Performance Comparison
The new version is substantially faster, owing to the adoption of pyHMMER and pyrodigal. The runtimes below were measured on 40 CPUs:
| Dataset | V4 Runtime | V5 Runtime |
|---|---|---|
| | 32 min 24 sec | 5 min 58 sec |
| Xylona heveae genome (~4.5 Mb) | 1 hr 02 min 02 sec | 5 min 50 sec |
Hint
Because I/O and compute capabilities differ between servers, these timings are for reference only.
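To put the runtimes above in relative terms, the speedup can be computed as the v4 runtime divided by the v5 runtime. The helper names below are hypothetical and the runtime parser only covers the `hr`/`min`/`sec` format used in the table; the first row is labelled generically because its dataset name is not shown.

```python
import re

# Matches strings such as "32 min 24 sec" or "1 hr 02 min 02 sec".
RUNTIME = re.compile(r"(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?\s*(?:(\d+)\s*sec)?")


def to_seconds(text):
    """Convert a runtime string from the table into seconds."""
    match = RUNTIME.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def speedup(v4, v5):
    """Return how many times faster v5 is than v4."""
    return to_seconds(v4) / to_seconds(v5)


rows = [
    ("first row (dataset name not shown)", "32 min 24 sec", "5 min 58 sec"),
    ("Xylona heveae genome", "1 hr 02 min 02 sec", "5 min 50 sec"),
]
for name, v4, v5 in rows:
    print(f"{name}: {speedup(v4, v5):.1f}x")
```

On these two measurements, v5 comes out roughly 5.4 and 10.6 times faster than v4, with the same caveat about hardware differences as above.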