CAZyme annotation comparison#

This section compares the performance and results of run_dbCAN v5 with previous versions, highlighting improvements in accuracy, speed, and output formatting.

Annotation Results Comparison#

We compared the annotation results between v4 and v5 using identical input datasets (E. coli K-12 MG1655 proteome). The comparison shows that core CAZyme predictions remain consistent between versions, confirming that the accuracy has been maintained while making significant improvements to code structure and performance.

Comparison of annotation results between v4 and v5

Figure 1: Comparison of annotation results:run_dbCAN v4#

Comparison of annotation results between v4 and v5

Figure 2: Comparison of annotation results: run_dbCAN v5#

Key Improvements in Output Format#

The new version (v5) provides several improvements in the output formatting:

  1. More Precise Domain Boundaries

    The v5 output now includes precise domain boundary information for both dbCAN-HMM and dbCAN-sub HMM:

    # v5 format with precise domain boundaries
    NP_414747.1  -|-|-  GH23(101-244)  GH23_e819(102-244)+CBM50_e338(344-384)+CBM50_e338(403-442)  CBM50+GH23  3  GH23_e819|CBM50_e338|CBM50_e338
    
    # v4 format with limited domain information
    NP_414747.1  -|-|-  GH23(101-244)  GH23_e819+CBM50_e338+CBM50_e338  CBM50+GH23  3
    
  2. New “Recommend Results” Column

    The v5 version adds a new column showing the recommended annotation results, making it easier for users to interpret findings. Now we follow the rule: CAZy-sub in dbCAN-HMM > dbCAN-subfam in dbCAN-sub-HMM > dbCAN-fam in dbCAN-HMM for the final results:

    Gene ID  EC#  dbCAN_hmm  dbCAN_sub  DIAMOND  #ofTools  Recommend Results
    NP_414632.1  2.4.1.227:11  GT28(185-341)  GT28_e46(185-341)  GT28  3  GT28_e46
    
  3. Cleaner DIAMOND Results

    The v5 version eliminates extraneous file paths from DIAMOND results, providing cleaner output:

    # v5 format with clean DIAMOND results
    NP_414555.1  -  -  -  GT1  1  -
    
    # v4 format with file paths in results
    NP_414555.1  -  -  -  Melli1_GeneCatalog_proteins_20150227.aa.fasta+GT1  1
    

Performance Comparison#

The new version shows significant performance improvements due to the implementation of pyHMMER and pyrodigal (tested on 40 CPUs):

Dataset

V4 Runtime

V5 Runtime

  1. coli genome (~4.5 Mb)

32 min 24 sec

5 min 58 sec

Xylona heveae genome (~4.5 Mb)

1 hr 02 min 02 sec

5 min 50 sec

Hint

Since the IO and computing capabilities of different server CPUs are different, this data is for reference only.