User Guide

Update: What’s New in run_dbCAN

The new version of run_dbCAN introduces several new features and significant performance improvements that make the pipeline easier to use and more efficient. We strongly recommend upgrading to this version. If you have any questions or suggestions, please feel free to contact us:

All conda environment dependencies are listed at the following link: run_dbCAN Conda Environments

Key Features and Improvements

  1. Simplified Database Downloading

    • Added a new database command for downloading database files, making the process simpler than before.

    • Supports downloading from both HTTP and AWS S3 sources (use --aws_s3 flag for faster and more stable downloads).

    • Use --cgc/--no-cgc option to control whether to download CGC-related databases.

  2. Enhanced Input Processing

    • Replaced prodigal with pyrodigal for input processing.

  3. Improved HMMER Performance

  4. Modular Code Structure

    • Reorganized the logic and structure of run_dbCAN by splitting functions into modules and following object-oriented programming principles.

    • Rewrote non-Python code in Python for improved readability.

    • Centralized parameter management using configuration files.

    • Leveraged the power of pandas for efficient data processing.

    • Added extensive logging and time reporting to make the pipeline more user-friendly.

  5. Enhanced dbCAN-sub and overview Features

    • Added coverage justifications and location information for dbCAN-sub.

    • Included CAZyme justification in the final results with an extra column called “Best Results.”

    • Final results now follow the priority rule CAZy-sub > dbCAN-sub > dbCAN-fam.

  6. Redesigned CGCFinder

    • Now supports JGI, NCBI, and Prodigal gff formats.

    • Directly searches eukaryotic genomes, including fungi (beta function).

    • Added a new function to visualize the CGCs on the genome (beta function).

  7. Faster Substrate Prediction

    • Replaced blastp with DIAMOND for substrate prediction, significantly improving speed and efficiency.

  8. Updated Metagenomic Protocols

  9. SignalP 6.0 and DeepTMHMM (optional topology)

    • SignalP 6.0: predicts signal peptides in the CAZyme_annotation command when --run_signalp is set. Results are merged into overview.tsv in a SignalP column. Set the organism class with --signalp_org (other / euk).

    • DeepTMHMM: predicts transmembrane topology when --run_deeptmhmm is set together with --deeptmhmm_dir (the directory containing the user-installed predict.py). Results are merged into overview.tsv in a DeepTMHMM column.

    • Neither tool is bundled with dbcan. Install and test them locally, then enable the flags. See SignalP 6.0 and DeepTMHMM (optional tools) and the SignalP 6.0 installation instructions.

  10. Global Logging System

    • Implemented a comprehensive logging system that is available for all commands.

    • Use --log-level to set logging level (DEBUG/INFO/WARNING/ERROR/CRITICAL, default: WARNING).

    • Use --log-file to write logs to a file in addition to console output.

    • Use the --verbose or -v flag for detailed debug logging (equivalent to --log-level DEBUG).
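The Global Logging System options listed above can be pictured with a small, self-contained sketch. This is not the run_dbCAN implementation: the helper names build_parser, resolve_level, and configure_logging are hypothetical, and the log format string is an assumption. It only shows how the three documented options (--log-level, --log-file, and --verbose/-v) could map onto Python's standard logging module, with --verbose acting like --log-level DEBUG and WARNING as the default.

```python
import argparse
import logging

# The five level names documented for --log-level.
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_parser():
    """Hypothetical parser exposing only the documented logging options."""
    parser = argparse.ArgumentParser(description="Sketch of the logging options")
    parser.add_argument("--log-level", choices=LEVELS, default="WARNING")
    parser.add_argument("--log-file", default=None)
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


def resolve_level(args):
    """Return the numeric logging level; --verbose behaves like --log-level DEBUG."""
    name = "DEBUG" if args.verbose else args.log_level
    return getattr(logging, name)


def configure_logging(args):
    """Log to the console, and additionally to a file when --log-file is given."""
    handlers = [logging.StreamHandler()]
    if args.log_file:
        handlers.append(logging.FileHandler(args.log_file))
    logging.basicConfig(
        level=resolve_level(args),
        handlers=handlers,
        format="%(asctime)s %(levelname)s %(message)s",  # assumed format
        force=True,  # replace any handlers configured earlier
    )
```

In this sketch, parsing an empty argument list yields the WARNING level, while passing -v yields DEBUG regardless of any --log-level value given alongside it.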
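The CAZy-sub > dbCAN-sub > dbCAN-fam rule listed above is a simple priority lookup, which the following sketch illustrates. It is an illustration under stated assumptions rather than the pipeline's own code: the function name best_result is hypothetical, the source labels are taken from the rule as written, the example annotations are made up, and treating "-" as "no hit" is an assumption about how empty results are marked.

```python
# Sources in descending priority, as stated in the rule.
PRIORITY = ("CAZy-sub", "dbCAN-sub", "dbCAN-fam")


def best_result(hits):
    """Pick the annotation from the highest-priority source that has a hit.

    `hits` maps a source name to its annotation string. Missing keys,
    empty strings, None, and "-" (assumed placeholder) count as no hit.
    Returns a (source, annotation) pair, or (None, None) if nothing matched.
    """
    for source in PRIORITY:
        value = hits.get(source)
        if value and value != "-":
            return source, value
    return None, None


# Illustrative input only: dbCAN-sub outranks dbCAN-fam when both have a hit.
example = {"dbCAN-fam": "GH5", "dbCAN-sub": "GH5_e1"}
chosen_source, chosen_annotation = best_result(example)
```

With the example input, the dbCAN-sub annotation is chosen because no CAZy-sub hit is present and dbCAN-sub ranks above dbCAN-fam.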

Hint

To run the pipeline from raw metagenomic reads, refer to: metagenomics_pipeline

Otherwise, refer to the instructions below. Please note that some precomputed results may have different names compared to the previous version.

Note

For detailed instructions, refer to the respective sections in the documentation.

Change logs

References

Contributors