API Documentation
run_dbcan Command Line Interface
run_dbcan
Use dbCAN tools to annotate and analyze CAZymes and CGCs.
Usage
run_dbcan [OPTIONS] COMMAND [ARGS]...
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
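The precedence between -v/--verbose and --log-level described above can be sketched in Python. This is a minimal illustration and not run_dbcan's own code; the helper name effective_log_level and the LEVELS table are hypothetical.

```python
import logging

# Level names accepted by --log-level, mapped to the standard logging constants.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def effective_log_level(verbose: bool = False, log_level: str = "WARNING") -> int:
    """Hypothetical helper: -v/--verbose wins over --log-level when both are given."""
    if verbose:
        return logging.DEBUG
    return LEVELS[log_level.upper()]


# Both flags passed: DEBUG is used. Only --log-level passed: that level is used.
print(logging.getLevelName(effective_log_level(verbose=True, log_level="ERROR")))
print(logging.getLevelName(effective_log_level(log_level="ERROR")))
```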
CAZyme_annotation
Annotate CAZymes with run_dbcan from prokaryotic, metagenomic, or protein sequences.
Usage
run_dbcan CAZyme_annotation [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --mode <mode>
Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).
- Default:
'prok'
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
- --input_raw_data <input_raw_data>
Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, protein when --mode=protein.
- --db_dir <db_dir>
Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).
- --methods <methods>
CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm
- Default:
'diamond,hmm,dbCANsub'
- --threads <threads>
Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).
- Default:
2
- --verbose_option
Pass --verbose to DIAMOND (more DIAMOND stderr output).
- Default:
False
- --e_value_threshold <e_value_threshold>
Maximum E-value for DIAMOND CAZy hits (float; smaller is stricter).
- Default:
1e-102
- --large_input_threshold_mb <large_input_threshold_mb>
If the input FASTA for the HMM step exceeds this size (MB), large mode is enabled automatically. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --large, --no-large
Streaming-safe pyhmmer mode for huge inputs (less preloading, lower out-of-memory risk). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
False
- --enable_memory_monitoring, --no-enable_memory_monitoring
Track RAM usage and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
True
- --max_retries <max_retries>
Number of retries after a MemoryError during pyhmmer HMM search; the batch size is halved on each retry. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
3
- --memory_safety_factor <memory_safety_factor>
Fraction of available RAM used when estimating the automatic batch_size (0.0–1.0; lower = smaller batches). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.5
- --max_memory_usage <max_memory_usage>
If the system memory usage ratio exceeds this value (0.0–1.0), emit warnings and tighten batching. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.8
- --batch_size <batch_size>
Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- --csv_buffer_size <csv_buffer_size>
Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --coverage_threshold_dbcan <coverage_threshold_dbcan>
Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbcan <e_value_threshold_dbcan>
Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).
- Default:
1e-15
- --large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>
dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).
- Default:
5000
- --large_dbsub, --no-large_dbsub
dbCAN-sub: force streaming-safe pyhmmer mode.
- Default:
False
- --enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub
dbCAN-sub: enable RAM monitoring and adaptive batching.
- Default:
True
- --max_retries_dbsub <max_retries_dbsub>
dbCAN-sub: max pyhmmer retries after MemoryError.
- Default:
3
- --memory_safety_factor_dbsub <memory_safety_factor_dbsub>
dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).
- Default:
0.5
- --max_memory_usage_dbsub <max_memory_usage_dbsub>
dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.
- Default:
0.8
- --batch_size_dbsub <batch_size_dbsub>
dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).
- --csv_buffer_size_dbsub <csv_buffer_size_dbsub>
dbCAN-sub: buffer this many HMM hit rows before flushing to disk.
- Default:
5000
- --coverage_threshold_dbsub <coverage_threshold_dbsub>
Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbsub <e_value_threshold_dbsub>
Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).
- Default:
1e-15
- --force_topology, --no-force_topology
Recompute topology columns even when the overview already contains predictions.
- Default:
False
- --signalp_org <signalp_org>
SignalP organism class: other (bacteria/archaea) or euk (eukaryotes).
- Default:
'other'
- Options:
other | euk
- --deeptmhmm_python <deeptmhmm_python>
Python interpreter used to launch DeepTMHMM predict.py.
- Default:
'python'
- --deeptmhmm_dir <deeptmhmm_dir>
Directory that contains DeepTMHMM predict.py (only used with --run_deeptmhmm).
- --run_deeptmhmm, --no-run_deeptmhmm
Run a user-installed DeepTMHMM predict.py and append transmembrane predictions to the overview.
- Default:
False
- --run_signalp, --no-run_signalp
Run SignalP 6 (BioLib) on translated proteins and append signal peptide columns to the overview.
- Default:
False
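As an illustration of how the required options above combine, the following Python sketch assembles an argument list for run_dbcan CAZyme_annotation without executing it. The helper name cazyme_annotation_argv and all paths are placeholders chosen for this example; only flags listed in this section are used.

```python
import shlex

VALID_MODES = {"prok", "meta", "protein"}
VALID_METHODS = {"diamond", "hmm", "dbCANsub"}


def cazyme_annotation_argv(input_raw_data, output_dir, db_dir,
                           mode="prok", methods=("diamond", "hmm", "dbCANsub"),
                           threads=2):
    """Build an argument list for run_dbcan CAZyme_annotation (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    unknown = set(methods) - VALID_METHODS
    if unknown:
        raise ValueError(f"unknown methods: {sorted(unknown)}")
    return [
        "run_dbcan", "CAZyme_annotation",
        "--mode", mode,
        "--input_raw_data", input_raw_data,
        "--output_dir", output_dir,
        "--db_dir", db_dir,
        "--methods", ",".join(methods),  # comma-separated, as documented
        "--threads", str(threads),
    ]


# Placeholder paths; replace them with real files and directories.
argv = cazyme_annotation_argv("proteins.faa", "dbcan_out", "dbCAN_databases",
                              mode="protein", methods=("diamond", "hmm"), threads=4)
print(shlex.join(argv))
```

To run the command, pass the list to subprocess.run(argv, check=True) on a machine where run_dbcan is installed.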
Pfam_null_cgc
Identify CAZyme Gene Clusters (CGCs).
Usage
run_dbcan Pfam_null_cgc [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --threads <threads>
Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).
- Default:
2
- --db_dir <db_dir>
Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
- --null_from_gff
Extract null-gene proteins from cgc.gff instead of cgc_standard_out.tsv.
- Default:
False
- --coverage_threshold_pfam <coverage_threshold_pfam>
Minimum HMM alignment coverage (0.0–1.0 fraction of HMM length) for Pfam hits.
- Default:
0.35
- --e_value_threshold_pfam <e_value_threshold_pfam>
Maximum domain-independent E-value for Pfam hits.
- Default:
0.0001
- --run_pfam
Run Pfam pyhmmer on null genes (Pfam_null_cgc / CGC null annotation).
- Default:
False
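The two Pfam thresholds above act together as a filter on hits. The Python sketch below shows one way such a filter could work. It is an assumption-laden illustration, not run_dbcan's implementation: the PfamHit record and its field names are hypothetical, coverage is assumed to be the aligned HMM span divided by HMM length, and both comparisons are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class PfamHit:
    """Hypothetical hit record; field names are illustrative only."""
    hmm_from: int      # 1-based start of the alignment on the HMM
    hmm_to: int        # 1-based end of the alignment on the HMM
    hmm_length: int    # total length of the HMM
    i_evalue: float    # domain-independent E-value


def passes_pfam_thresholds(hit, coverage_threshold=0.35, e_value_threshold=1e-4):
    """Keep a hit only if it covers enough of the HMM and is significant enough."""
    coverage = (hit.hmm_to - hit.hmm_from + 1) / hit.hmm_length
    return coverage >= coverage_threshold and hit.i_evalue <= e_value_threshold


hits = [
    PfamHit(1, 90, 100, 1e-20),   # coverage 0.90, strong E-value: kept
    PfamHit(10, 30, 100, 1e-20),  # coverage 0.21: rejected on coverage
    PfamHit(1, 90, 100, 0.5),     # E-value above the cutoff: rejected
]
kept = [h for h in hits if passes_pfam_thresholds(h)]
print(len(kept))
```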
cgc_circle_plot
Generate circular plots for CAZyme Gene Clusters (CGCs).
Usage
run_dbcan cgc_circle_plot [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
cgc_finder
Identify CAZyme Gene Clusters (CGCs).
Usage
run_dbcan cgc_finder [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
- --feature_type <feature_types>
GFF feature types to parse (repeat the option for multiple values, e.g. --feature_type CDS --feature_type gene).
- Default:
'CDS'
- --min_cluster_genes <min_cluster_genes>
Minimum total genes in a CGC locus.
- Default:
2
- --min_core_cazyme <min_core_cazyme>
Minimum core CAZyme count required to retain a CGC.
- Default:
1
- --extend_gene_count <extend_gene_count>
With --extend_mode=gene, extend each side by this many flanking genes.
- Default:
0
- --extend_bp <extend_bp>
With --extend_mode=bp, extend each side by this many base pairs.
- Default:
0
- --extend_mode <extend_mode>
After CGC detection, extend cluster bounds: 'bp' uses --extend_bp; 'gene' uses --extend_gene_count; 'none' disables extension.
- Default:
'none'
- Options:
none | bp | gene
- --use_distance
Also require signature genes to fall within --base_pair_distance bp.
- Default:
False
- --use_null_genes, --no-use_null_genes
Allow null genes between CGC signatures (--no-use_null_genes: disable).
- Default:
True
- --base_pair_distance <base_pair_distance>
Max distance (bp) between CGC signature genes when --use_distance is enabled.
- Default:
15000
- --num_null_gene <num_null_gene>
Max number of intervening non-signature (null) genes allowed between core CGC genes.
- Default:
2
- --additional_min_categories <additional_min_categories>
When --additional_logic=any, minimum number of distinct additional gene classes that must match.
- Default:
1
- --additional_logic <additional_logic>
How to combine --additional_genes: 'all' = every listed class must be present; 'any' = at least --additional_min_categories classes.
- Default:
'all'
- Options:
all | any
- --additional_genes <additional_genes>
Gene class tags required alongside CAZyme for CGC signatures. Repeat the option for multiple values, e.g. --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.
- Default:
'TC'
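The interaction of --additional_genes, --additional_logic, --additional_min_categories, --min_core_cazyme and --min_cluster_genes can be sketched as a single predicate over the class labels of a candidate cluster. This Python example is only a reading of the option descriptions above, not the cgc_finder algorithm itself; the function name satisfies_signature and the per-gene label list are hypothetical, and distance and null-gene rules are deliberately left out.

```python
def satisfies_signature(gene_classes, additional_genes=("TC",), additional_logic="all",
                        additional_min_categories=1, min_core_cazyme=1,
                        min_cluster_genes=2):
    """Hypothetical check of one candidate cluster.

    gene_classes holds one label per gene, e.g. ["CAZyme", "null", "TC"].
    """
    if len(gene_classes) < min_cluster_genes:
        return False
    if gene_classes.count("CAZyme") < min_core_cazyme:
        return False
    # Distinct additional classes that actually occur in the cluster.
    present = {c for c in additional_genes if c in gene_classes}
    if additional_logic == "all":
        return len(present) == len(set(additional_genes))
    if additional_logic == "any":
        return len(present) >= additional_min_categories
    raise ValueError(f"unknown additional_logic: {additional_logic}")


# Defaults: one CAZyme plus one TC is enough.
print(satisfies_signature(["CAZyme", "null", "TC"]))
# 'all' with TC and TF requested: TF alone is not enough.
print(satisfies_signature(["CAZyme", "TF"], additional_genes=("TC", "TF")))
# 'any' with the default minimum of 1: TF alone is enough.
print(satisfies_signature(["CAZyme", "TF"], additional_genes=("TC", "TF"),
                          additional_logic="any"))
```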
database
Download dbCAN databases.
Usage
run_dbcan database [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --verify-ssl, --no-verify-ssl
Verify TLS certificates for HTTPS downloads (--no-verify-ssl: insecure, not recommended).
- Default:
True
- --no-overwrite
If set, skip downloading files that already exist in db_dir.
- Default:
False
- --resume, --no-resume
Resume partial downloads when supported (--no-resume: always fetch from scratch).
- Default:
True
- --retries <retries>
Number of retries for transient HTTP/S3 download failures.
- Default:
3
- --timeout <timeout>
HTTP(S) request timeout in seconds per download attempt.
- Default:
30
- --aws_s3
Download from the pinned AWS S3 release; omit to use the HTTP db_current location (a moving snapshot).
- Default:
False
- --cgc, --no-cgc
With --cgc (default): download CGC-related DB assets. With --no-cgc: skip them (database subcommand only).
- Default:
True
- --db_dir <db_dir>
Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).
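The boolean switches of the database subcommand map naturally onto a small argument builder. The Python sketch below only assembles the command line from flags documented above; the helper name database_argv, its keyword names, and the directory name are placeholders for illustration.

```python
import shlex


def database_argv(db_dir, aws_s3=False, cgc=True, overwrite=True,
                  retries=3, timeout=30):
    """Build an argument list for run_dbcan database (hypothetical helper)."""
    argv = ["run_dbcan", "database", "--db_dir", db_dir,
            "--retries", str(retries), "--timeout", str(timeout)]
    if aws_s3:
        argv.append("--aws_s3")        # pinned S3 release instead of HTTP db_current
    if not cgc:
        argv.append("--no-cgc")        # skip CGC-related assets
    if not overwrite:
        argv.append("--no-overwrite")  # keep files already present in db_dir
    return argv


# Placeholder directory name; the command is printed, not executed.
print(shlex.join(database_argv("dbCAN_databases", aws_s3=True, overwrite=False)))
```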
easy_CGC
Perform complete CGC analysis: CAZyme annotation, GFF processing, and CGC identification in one step.
Usage
run_dbcan easy_CGC [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --mode <mode>
Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).
- Default:
'prok'
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
- --input_raw_data <input_raw_data>
Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, protein when --mode=protein.
- --db_dir <db_dir>
Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).
- --methods <methods>
CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm
- Default:
'diamond,hmm,dbCANsub'
- --threads <threads>
Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).
- Default:
2
- --verbose_option
Pass --verbose to DIAMOND (more DIAMOND stderr output).
- Default:
False
- --e_value_threshold <e_value_threshold>
Maximum E-value for DIAMOND CAZy hits (float; smaller is stricter).
- Default:
1e-102
- --large_input_threshold_mb <large_input_threshold_mb>
If the input FASTA for the HMM step exceeds this size (MB), large mode is enabled automatically. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --large, --no-large
Streaming-safe pyhmmer mode for huge inputs (less preloading, lower out-of-memory risk). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
False
- --enable_memory_monitoring, --no-enable_memory_monitoring
Track RAM usage and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
True
- --max_retries <max_retries>
Number of retries after a MemoryError during pyhmmer HMM search; the batch size is halved on each retry. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
3
- --memory_safety_factor <memory_safety_factor>
Fraction of available RAM used when estimating the automatic batch_size (0.0–1.0; lower = smaller batches). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.5
- --max_memory_usage <max_memory_usage>
If the system memory usage ratio exceeds this value (0.0–1.0), emit warnings and tighten batching. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.8
- --batch_size <batch_size>
Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- --csv_buffer_size <csv_buffer_size>
Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --coverage_threshold_dbcan <coverage_threshold_dbcan>
Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbcan <e_value_threshold_dbcan>
Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).
- Default:
1e-15
- --large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>
dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).
- Default:
5000
- --large_dbsub, --no-large_dbsub
dbCAN-sub: force streaming-safe pyhmmer mode.
- Default:
False
- --enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub
dbCAN-sub: enable RAM monitoring and adaptive batching.
- Default:
True
- --max_retries_dbsub <max_retries_dbsub>
dbCAN-sub: max pyhmmer retries after MemoryError.
- Default:
3
- --memory_safety_factor_dbsub <memory_safety_factor_dbsub>
dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).
- Default:
0.5
- --max_memory_usage_dbsub <max_memory_usage_dbsub>
dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.
- Default:
0.8
- --batch_size_dbsub <batch_size_dbsub>
dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).
- --csv_buffer_size_dbsub <csv_buffer_size_dbsub>
dbCAN-sub: buffer this many HMM hit rows before flushing to disk.
- Default:
5000
- --coverage_threshold_dbsub <coverage_threshold_dbsub>
Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbsub <e_value_threshold_dbsub>
Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).
- Default:
1e-15
- --coverage_threshold_stp <coverage_threshold_stp>
Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).
- Default:
0.35
- --e_value_threshold_stp <e_value_threshold_stp>
Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).
- Default:
0.0001
- --fungi, --no-fungi
Run the TF HMM search (--fungi); by default the TF pyhmmer step is skipped (prokaryotes use TF DIAMOND instead).
- Default:
False
- --coverage_threshold_tf <coverage_threshold_tf>
Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires --fungi).
- Default:
0.35
- --e_value_threshold_tf <e_value_threshold_tf>
Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).
- Default:
0.0001
- --prokaryotic, --no-prokaryotic
Run the prokaryotic TF DIAMOND step (--no-prokaryotic: skip it; use the fungi TF HMM with --fungi).
- Default:
True
- --coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>
DIAMOND --query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as --coverage_threshold_tc, default 35%).
- Default:
35
- --e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>
Maximum E-value for TF DIAMOND (prokaryotic TFDB path).
- Default:
0.0001
- --coverage_threshold_tc <coverage_threshold_tc>
DIAMOND --query-cover for TCDB search, as percent of query length (0–100; default 35 = 35%).
- Default:
35
- --e_value_threshold_tc <e_value_threshold_tc>
Maximum E-value for TC (transporter) DIAMOND vs TCDB.
- Default:
0.0001
- --prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>
When --prodigal-gff-streaming=auto, stream if the GFF size exceeds this value (decimal MB). Use 0 to stream any non-empty file.
- Default:
50
- --prodigal-gff-streaming <prodigal_gff_streaming>
Prodigal GFF annotation: 'auto' stream-annotates when the input exceeds the threshold (faster for huge GFF files); 'on' always uses streaming; 'off' always uses BCBio GFF.parse.
- Default:
'auto'
- Options:
auto | on | off
- --gff_type <gff_type>
GFF dialect label (e.g. prodigal, NCBI_prok). When --mode is not protein, defaults to 'prodigal' if unset.
- --input_gff <input_gff>
Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.
- --feature_type <feature_types>
GFF feature types to parse (repeat the option for multiple values, e.g. --feature_type CDS --feature_type gene).
- Default:
'CDS'
- --min_cluster_genes <min_cluster_genes>
Minimum total genes in a CGC locus.
- Default:
2
- --min_core_cazyme <min_core_cazyme>
Minimum core CAZyme count required to retain a CGC.
- Default:
1
- --extend_gene_count <extend_gene_count>
With --extend_mode=gene, extend each side by this many flanking genes.
- Default:
0
- --extend_bp <extend_bp>
With --extend_mode=bp, extend each side by this many base pairs.
- Default:
0
- --extend_mode <extend_mode>
After CGC detection, extend cluster bounds: 'bp' uses --extend_bp; 'gene' uses --extend_gene_count; 'none' disables extension.
- Default:
'none'
- Options:
none | bp | gene
- --use_distance
Also require signature genes to fall within --base_pair_distance bp.
- Default:
False
- --use_null_genes, --no-use_null_genes
Allow null genes between CGC signatures (--no-use_null_genes: disable).
- Default:
True
- --base_pair_distance <base_pair_distance>
Max distance (bp) between CGC signature genes when --use_distance is enabled.
- Default:
15000
- --num_null_gene <num_null_gene>
Max number of intervening non-signature (null) genes allowed between core CGC genes.
- Default:
2
- --additional_min_categories <additional_min_categories>
When --additional_logic=any, minimum number of distinct additional gene classes that must match.
- Default:
1
- --additional_logic <additional_logic>
How to combine --additional_genes: 'all' = every listed class must be present; 'any' = at least --additional_min_categories classes.
- Default:
'all'
- Options:
all | any
- --additional_genes <additional_genes>
Gene class tags required alongside CAZyme for CGC signatures. Repeat the option for multiple values, e.g. --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.
- Default:
'TC'
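The defaulting rules stated for --input_gff and --gff_type depend on --mode, which is easy to get wrong when scripting easy_CGC. The following Python sketch restates those rules as a function. It is an interpretation of the option descriptions above rather than run_dbcan's code; the name resolve_gff_inputs is hypothetical, and leaving --gff_type unset in protein mode is an assumption, since the text does not state a default for that case.

```python
import os


def resolve_gff_inputs(mode, output_dir, input_gff=None, gff_type=None):
    """Hypothetical resolution of --input_gff and --gff_type defaults."""
    if mode == "protein":
        # Protein input carries no coordinates, so a GFF must be supplied.
        if input_gff is None:
            raise ValueError("--input_gff is required when --mode=protein")
    else:
        if input_gff is None:
            input_gff = os.path.join(output_dir, "uniInput.gff")
        if gff_type is None:
            gff_type = "prodigal"
    return input_gff, gff_type


# Placeholder directory and file names.
print(resolve_gff_inputs("prok", "dbcan_out"))
print(resolve_gff_inputs("protein", "dbcan_out", "genome.gff", "NCBI_prok"))
```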
easy_substrate
Perform complete CGC analysis: CAZyme annotation, GFF processing, CGC identification, and substrate prediction in one step.
Usage
run_dbcan easy_substrate [OPTIONS]
Options
- -v, --verbose
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>
Log file path (the file is truncated on each run); mirrors console output when set.
- --log-level <log_level>
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- -subs, --substrate_scors <substrate_scors>
Minimum aggregated dbCAN-sub score (field name substrate_scors) to accept a substrate assignment.
- Default:
2.0
- -npsc, --num_of_protein_substrate_cutoff <num_of_protein_substrate_cutoff>
Minimum proteins supporting a substrate call within a CGC.
- Default:
2
- -ndsc, --num_of_domains_substrate_cutoff <num_of_domains_substrate_cutoff>
Minimum distinct substrate-associated domains required per prediction.
- Default:
2
- -hmmevalue, --hmmevalue <hmmevalue>
Maximum dbCAN-sub HMM E-value allowed for substrate evidence.
- Default:
0.01
- -hmmcov, --hmmcov <hmmcov>
Minimum dbCAN-sub HMM coverage (0.0–1.0) when scoring substrate evidence.
- Default:
0.0
- -evalue, --evalue_cutoff <evalue_cutoff>
Maximum BLAST E-value for PUL–CGC homology hits.
- Default:
0.01
- -bsc, --bitscore_cutoff <bitscore_cutoff>
Minimum BLAST bit score for PUL–CGC homology hits.
- Default:
50.0
- -cov, --coverage_cutoff <coverage_cutoff>
Minimum BLAST query coverage (0.0–1.0) for PUL–CGC homology hits.
- Default:
0.0
- -iden, --identity_cutoff <identity_cutoff>
Minimum BLAST identity (0.0–1.0) for PUL–CGC homology hits.
- Default:
0.0
- -eptn, --extra_pair_type_num <extra_pair_type_num>
Comma-separated counts matching --extra_pair_type entries (same order).
- Default:
'0'
- -ept, --extra_pair_type <extra_pair_type>
Optional comma-separated accessory gene pair types (advanced PUL matching).
- -tpn, --total_pair_num <total_pair_num>
Minimum total informative gene pairs (CAZyme + accessory) for a link.
- Default:
2
- -cpn, --CAZyme_pair_num <cazyme_pair_num>
Minimum CAZyme–CAZyme pairs required inside the CGC for homology scoring.
- Default:
1
- -uqcgn, --uniq_query_cgc_gene_num <uniq_query_cgc_gene_num>
Minimum unique CGC genes participating in a PUL–CGC link.
- Default:
2
- -upghn, --uniq_pul_gene_hit_num <uniq_pul_gene_hit_num>
Minimum unique PUL genes hit by BLAST for a valid PUL–CGC link.
- Default:
2
- --db_dir <db_dir>
Required. Database directory containing PUL/DIAMOND assets for substrate prediction.
- Default:
'./dbCAN_databases'
- -odbcanpul, --odbcanpul <odbcanpul>
Whether to export dbCAN-PUL homology tables (pass true/false).
- Default:
True
- -odbcan_sub, --odbcan_sub <odbcan_sub>
If set to true/false, force exporting extra dbCAN-sub tables; omit to use the package default.
- -env, --env <env>
Execution environment label for external wrappers (usually keep local).
- Default:
'local'
- -rerun, --rerun <rerun>
Re-run substrate prediction (pass true/false explicitly; rarely needed).
- Default:
False
- -w, --workdir <workdir>
Working directory for legacy substrate scripts (prefer --output_dir for run_dbcan).
- Default:
'.'
- -o, --out <out>
Legacy filename hint for standalone substrate tools (results still go under --output_dir).
- Default:
'substrate.out'
- --pul <pul>
Path to dbCAN-PUL PUL.faa when you need an explicit PUL database file.
- --mode <mode>
Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).
- Default:
'prok'
- --output_dir <output_dir>
Required. Output directory (created if needed). All run_dbcan result files are written here.
- --input_raw_data <input_raw_data>
Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, protein when --mode=protein.
- --methods <methods>
CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm
- Default:
'diamond,hmm,dbCANsub'
- --threads <threads>
Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).
- Default:
2
- --verbose_option
Pass --verbose to DIAMOND (more DIAMOND stderr output).
- Default:
False
- --e_value_threshold <e_value_threshold>
Maximum E-value for DIAMOND CAZy hits (float; smaller is stricter).
- Default:
1e-102
- --large_input_threshold_mb <large_input_threshold_mb>
If the input FASTA for the HMM step exceeds this size (MB), large mode is enabled automatically. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --large, --no-large
Streaming-safe pyhmmer mode for huge inputs (less preloading, lower out-of-memory risk). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
False
- --enable_memory_monitoring, --no-enable_memory_monitoring
Track RAM usage and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
True
- --max_retries <max_retries>
Number of retries after a MemoryError during pyhmmer HMM search; the batch size is halved on each retry. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
3
- --memory_safety_factor <memory_safety_factor>
Fraction of available RAM used when estimating the automatic batch_size (0.0–1.0; lower = smaller batches). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.5
- --max_memory_usage <max_memory_usage>
If the system memory usage ratio exceeds this value (0.0–1.0), emit warnings and tighten batching. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
0.8
- --batch_size <batch_size>
Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- --csv_buffer_size <csv_buffer_size>
Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.
- Default:
5000
- --coverage_threshold_dbcan <coverage_threshold_dbcan>
Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbcan <e_value_threshold_dbcan>
Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).
- Default:
1e-15
- --large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>
dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).
- Default:
5000
- --large_dbsub, --no-large_dbsub
dbCAN-sub: force streaming-safe pyhmmer mode.
- Default:
False
- --enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub
dbCAN-sub: enable RAM monitoring and adaptive batching.
- Default:
True
- --max_retries_dbsub <max_retries_dbsub>
dbCAN-sub: max pyhmmer retries after MemoryError.
- Default:
3
- --memory_safety_factor_dbsub <memory_safety_factor_dbsub>
dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).
- Default:
0.5
- --max_memory_usage_dbsub <max_memory_usage_dbsub>
dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.
- Default:
0.8
- --batch_size_dbsub <batch_size_dbsub>
dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).
- --csv_buffer_size_dbsub <csv_buffer_size_dbsub>
dbCAN-sub: buffer this many HMM hit rows before flushing to disk.
- Default:
5000
- --coverage_threshold_dbsub <coverage_threshold_dbsub>
Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).
- Default:
0.35
- --e_value_threshold_dbsub <e_value_threshold_dbsub>
Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).
- Default:
1e-15
- --coverage_threshold_stp <coverage_threshold_stp>#
Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).
- Default:
0.35
- --e_value_threshold_stp <e_value_threshold_stp>#
Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).
- Default:
0.0001
- --fungi, --no-fungi#
Run TF HMM search (–fungi); default skips TF pyhmmer (prokaryotes use TF DIAMOND instead).
- Default:
False
- --coverage_threshold_tf <coverage_threshold_tf>#
Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires –fungi).
- Default:
0.35
- --e_value_threshold_tf <e_value_threshold_tf>#
Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).
- Default:
0.0001
- --prokaryotic, --no-prokaryotic#
Run prokaryotic TF DIAMOND step (–no-prokaryotic: skip it; use fungi TF HMM with –fungi).
- Default:
True
- --coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>#
DIAMOND –query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as –coverage_threshold_tc, default 35%%).
- Default:
35
- --e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>#
Maximum E-value for TF DIAMOND (prokaryotic TFDB path).
- Default:
0.0001
- --coverage_threshold_tc <coverage_threshold_tc>#
DIAMOND –query-cover for TCDB search, as percent of query length (0–100; default 35 = 35%%).
- Default:
35
- --e_value_threshold_tc <e_value_threshold_tc>#
Maximum E-value for TC (transporter) DIAMOND vs TCDB.
- Default:
0.0001
- --prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>#
When –prodigal-gff-streaming=auto, stream if GFF size exceeds this (decimal MB). Use 0 to stream any non-empty file.
- Default:
50
- --prodigal-gff-streaming <prodigal_gff_streaming>#
Prodigal GFF annotation: ‘auto’ stream-annotate when input exceeds threshold (faster for huge GFF); ‘on’ always use streaming; ‘off’ always use BCBio GFF.parse.
- Default:
'auto'
- Options:
auto | on | off
- --gff_type <gff_type>#
GFF dialect label (e.g. prodigal, NCBI_prok). If unset, defaults to ‘prodigal’ when --mode is not protein.
- --input_gff <input_gff>#
Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.
- --feature_type <feature_types>#
GFF feature types to parse (repeat the option for multiple values, e.g. --feature_type CDS --feature_type gene).
- Default:
'CDS'
- --min_cluster_genes <min_cluster_genes>#
Minimum total genes in a CGC locus.
- Default:
2
- --min_core_cazyme <min_core_cazyme>#
Minimum core CAZyme count required to retain a CGC.
- Default:
1
- --extend_gene_count <extend_gene_count>#
With --extend_mode=gene, extend each side by this many flanking genes.
- Default:
0
- --extend_bp <extend_bp>#
With --extend_mode=bp, extend each side by this many base pairs.
- Default:
0
- --extend_mode <extend_mode>#
After CGC detection, extend cluster bounds: ‘bp’ uses --extend_bp; ‘gene’ uses --extend_gene_count; ‘none’ disables extension.
- Default:
'none'
- Options:
none | bp | gene
- --use_distance#
Also require signature genes to fall within --base_pair_distance bp of each other.
- Default:
False
- --use_null_genes, --no-use_null_genes#
Allow null genes between CGC signatures (--no-use_null_genes: disable).
- Default:
True
- --base_pair_distance <base_pair_distance>#
Maximum distance (bp) between CGC signature genes when --use_distance is enabled.
- Default:
15000
- --num_null_gene <num_null_gene>#
Max number of intervening non-signature (null) genes allowed between core CGC genes.
- Default:
2
- --additional_min_categories <additional_min_categories>#
When --additional_logic=any, minimum number of distinct additional gene classes that must match.
- Default:
1
- --additional_logic <additional_logic>#
How to combine --additional_genes: ‘all’ = every listed class must be present; ‘any’ = at least --additional_min_categories classes must be present.
- Default:
'all'
- Options:
all | any
- --additional_genes <additional_genes>#
Gene class tags required alongside CAZyme for CGC signatures. Repeat the option for multiple values: --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.
- Default:
'TC'
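The CGC signature options above (--min_cluster_genes, --min_core_cazyme, --additional_genes, --additional_logic, --additional_min_categories) can be read as a series of checks on a candidate locus. The following is a minimal sketch of that reading, not run_dbcan's actual implementation: the function name `passes_cgc_filters`, the `"CAZyme"` and `"null"` class labels, and the choice to count null genes toward the locus size are all assumptions made for illustration.

```python
from collections import Counter


def passes_cgc_filters(
    gene_classes,
    additional_genes=("TC",),
    additional_logic="all",
    additional_min_categories=1,
    min_core_cazyme=1,
    min_cluster_genes=2,
):
    """Return True if a candidate locus meets the documented thresholds.

    gene_classes is a list of class labels, one per gene in the locus
    (hypothetical labels: "CAZyme", "TC", "TF", "STP", "null").
    """
    counts = Counter(gene_classes)

    # --min_cluster_genes: minimum total genes in the locus
    # (assumption: every gene counts, including null genes).
    if len(gene_classes) < min_cluster_genes:
        return False

    # --min_core_cazyme: minimum number of core CAZymes.
    if counts["CAZyme"] < min_core_cazyme:
        return False

    required = set(additional_genes)
    present = {tag for tag in required if counts[tag] > 0}

    # --additional_logic=all: every listed class must be present.
    if additional_logic == "all":
        return present == required
    # --additional_logic=any: at least --additional_min_categories classes.
    if additional_logic == "any":
        return len(present) >= additional_min_categories
    raise ValueError(f"unknown additional_logic: {additional_logic!r}")


print(passes_cgc_filters(["CAZyme", "null", "TC"]))
print(passes_cgc_filters(["CAZyme", "TF"], additional_genes=("TC", "TF")))
```

With the defaults, a locus needs at least one CAZyme plus one TC gene; switching to `additional_logic="any"` with several `additional_genes` relaxes this to any sufficiently large subset of the listed classes.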
gff_process#
Generate the GFF used for CGC identification. Pass --input_gff when --input_raw_data is a protein FASTA; if --input_gff is omitted, it defaults to <output_dir>/uniInput.gff.
Usage
run_dbcan gff_process [OPTIONS]
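The --input_gff fallback described above can be sketched as a small path-resolution helper. This is an illustration under stated assumptions, not the package's code: `resolve_input_gff` and its `protein_input` flag are hypothetical names, and raising an error for protein input without a GFF is one plausible reading of "required".

```python
from pathlib import Path


def resolve_input_gff(output_dir, input_gff=None, protein_input=False):
    """Pick the GFF path to use, following the documented fallback rule."""
    # An explicit --input_gff always wins.
    if input_gff is not None:
        return Path(input_gff)
    # Protein input carries no gene coordinates, so a GFF must be supplied
    # (assumption: a missing GFF is treated as an error here).
    if protein_input:
        raise ValueError("--input_gff is required for protein input")
    # Otherwise fall back to the documented default location.
    return Path(output_dir) / "uniInput.gff"


print(resolve_input_gff("results"))
print(resolve_input_gff("results", input_gff="annotation.gff"))
```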
Options
- -v, --verbose#
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>#
Log file path (truncates each run); mirrors console when set.
- --log-level <log_level>#
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --db_dir <db_dir>#
Required Directory containing dbCAN database files (HMM, DIAMOND, etc.).
- --output_dir <output_dir>#
Required Output directory (created if needed). All run_dbcan result files are written here.
- --threads <threads>#
Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).
- Default:
2
- --coverage_threshold_stp <coverage_threshold_stp>#
Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).
- Default:
0.35
- --e_value_threshold_stp <e_value_threshold_stp>#
Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).
- Default:
0.0001
- --fungi, --no-fungi#
Run the TF HMM search (--fungi); by default the TF pyhmmer step is skipped (prokaryotes use TF DIAMOND instead).
- Default:
False
- --coverage_threshold_tf <coverage_threshold_tf>#
Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires --fungi).
- Default:
0.35
- --e_value_threshold_tf <e_value_threshold_tf>#
Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).
- Default:
0.0001
- --prokaryotic, --no-prokaryotic#
Run the prokaryotic TF DIAMOND step (--no-prokaryotic: skip it; use the fungi TF HMM with --fungi).
- Default:
True
- --coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>#
DIAMOND --query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as --coverage_threshold_tc; default 35%).
- Default:
35
- --e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>#
Maximum E-value for TF DIAMOND (prokaryotic TFDB path).
- Default:
0.0001
- --coverage_threshold_tc <coverage_threshold_tc>#
DIAMOND --query-cover for the TCDB search, as a percentage of query length (0–100; default 35 = 35%).
- Default:
35
- --e_value_threshold_tc <e_value_threshold_tc>#
Maximum E-value for TC (transporter) DIAMOND vs TCDB.
- Default:
0.0001
- --coverage_threshold_sulfatase <coverage_threshold_sulfatase>#
DIAMOND --query-cover for Sulfatase: minimum query coverage in percent (0–100).
- Default:
35
- --e_value_threshold_sulfatase <e_value_threshold_sulfatase>#
Maximum E-value for Sulfatase DIAMOND.
- Default:
0.0001
- --coverage_threshold_peptidase <coverage_threshold_peptidase>#
DIAMOND --query-cover for Peptidase: minimum query coverage in percent (0–100).
- Default:
35
- --e_value_threshold_peptidase <e_value_threshold_peptidase>#
Maximum E-value for Peptidase DIAMOND.
- Default:
0.0001
- --prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>#
When --prodigal-gff-streaming=auto, stream if the GFF size exceeds this value (decimal MB). Use 0 to stream any non-empty file.
- Default:
50
- --prodigal-gff-streaming <prodigal_gff_streaming>#
Prodigal GFF annotation: ‘auto’ stream-annotate when input exceeds threshold (faster for huge GFF); ‘on’ always use streaming; ‘off’ always use BCBio GFF.parse.
- Default:
'auto'
- Options:
auto | on | off
- --gff_type <gff_type>#
GFF dialect label (e.g. prodigal, NCBI_prok). If unset, defaults to ‘prodigal’ when --mode is not protein.
- --input_gff <input_gff>#
Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.
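The interaction between --prodigal-gff-streaming and --prodigal-streaming-threshold-mb documented above amounts to a three-way switch. The sketch below illustrates that decision only; `should_stream_gff` is a hypothetical helper, and reading "decimal MB" as 1,000,000 bytes with a strict "greater than" comparison is an assumption that happens to reproduce the documented behaviour for a threshold of 0.

```python
def should_stream_gff(size_bytes, streaming="auto", threshold_mb=50.0):
    """Decide whether to stream-annotate a Prodigal GFF of the given size."""
    if streaming == "on":
        return True   # always stream
    if streaming == "off":
        return False  # always use the non-streaming parser
    if streaming != "auto":
        raise ValueError(f"unknown streaming mode: {streaming!r}")
    # auto: stream only when the file is larger than the threshold.
    # Assumption: decimal MB means 1_000_000 bytes.
    return size_bytes > threshold_mb * 1_000_000


print(should_stream_gff(60_000_000))             # above the 50 MB default
print(should_stream_gff(1, threshold_mb=0))      # any non-empty file
print(should_stream_gff(0, threshold_mb=0))      # empty file
```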
substrate_prediction#
Usage
run_dbcan substrate_prediction [OPTIONS]
Options
- -v, --verbose#
Shortcut for --log-level DEBUG (overrides --log-level if both are passed).
- Default:
False
- --log-file <log_file>#
Log file path (truncates each run); mirrors console when set.
- --log-level <log_level>#
Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.
- Default:
'WARNING'
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- --db_dir <db_dir>#
Required Database directory containing PUL/DIAMOND assets for substrate prediction.
- Default:
'./dbCAN_databases'
- -odbcanpul, --odbcanpul <odbcanpul>#
Whether to export dbCAN-PUL homology tables (pass true/false).
- Default:
True
- -odbcan_sub, --odbcan_sub <odbcan_sub>#
If set to true/false, force exporting extra dbCAN-sub tables; omit to use package default.
- -env, --env <env>#
Execution environment label for external wrappers (usually keep local).
- Default:
'local'
- -rerun, --rerun <rerun>#
Re-run substrate prediction (pass true/false explicitly; rarely needed).
- Default:
False
- -w, --workdir <workdir>#
Working directory for legacy substrate scripts (prefer --output_dir for run_dbcan).
- Default:
'.'
- -o, --out <out>#
Legacy filename hint for standalone substrate tools (results still go under --output_dir).
- Default:
'substrate.out'
- --pul <pul>#
Path to dbCAN-PUL PUL.faa when you need an explicit PUL database file.
- --mode <mode>#
Required Input type: prok (prokaryote DNA), meta (metagenome DNA), protein (amino-acid FASTA).
- Default:
'prok'
- --output_dir <output_dir>#
Required Output directory (created if needed). All run_dbcan result files are written here.
- --input_raw_data <input_raw_data>#
Required Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, proteins when --mode=protein.
- -evalue, --evalue_cutoff <evalue_cutoff>#
Maximum BLAST E-value for PUL–CGC homology hits.
- Default:
0.01
- -bsc, --bitscore_cutoff <bitscore_cutoff>#
Minimum BLAST bit score for PUL–CGC homology hits.
- Default:
50.0
- -cov, --coverage_cutoff <coverage_cutoff>#
Minimum BLAST query coverage (0.0–1.0) for PUL–CGC homology hits.
- Default:
0.0
- -iden, --identity_cutoff <identity_cutoff>#
Minimum BLAST identity (0.0–1.0) for PUL–CGC homology hits.
- Default:
0.0
- -eptn, --extra_pair_type_num <extra_pair_type_num>#
Comma-separated counts matching --extra_pair_type entries (same order).
- Default:
'0'
- -ept, --extra_pair_type <extra_pair_type>#
Optional comma-separated accessory gene pair types (advanced PUL matching).
- -tpn, --total_pair_num <total_pair_num>#
Minimum total informative gene pairs (CAZyme + accessory) for a link.
- Default:
2
- -cpn, --CAZyme_pair_num <cazyme_pair_num>#
Minimum CAZyme–CAZyme pairs required inside the CGC for homology scoring.
- Default:
1
- -uqcgn, --uniq_query_cgc_gene_num <uniq_query_cgc_gene_num>#
Minimum unique CGC genes participating in a PUL–CGC link.
- Default:
2
- -upghn, --uniq_pul_gene_hit_num <uniq_pul_gene_hit_num>#
Minimum unique PUL genes hit by BLAST for a valid PUL–CGC link.
- Default:
2
- -subs, --substrate_scors <substrate_scors>#
Minimum aggregated dbCAN-sub score (field name substrate_scors) to accept a substrate assignment.
- Default:
2.0
- -npsc, --num_of_protein_substrate_cutoff <num_of_protein_substrate_cutoff>#
Minimum proteins supporting a substrate call within a CGC.
- Default:
2
- -ndsc, --num_of_domains_substrate_cutoff <num_of_domains_substrate_cutoff>#
Minimum distinct substrate-associated domains required per prediction.
- Default:
2
- -hmmevalue, --hmmevalue <hmmevalue>#
Maximum dbCAN-sub HMM E-value allowed for substrate evidence.
- Default:
0.01
- -hmmcov, --hmmcov <hmmcov>#
Minimum dbCAN-sub HMM coverage (0.0–1.0) when scoring substrate evidence.
- Default:
0.0
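The four BLAST cutoffs above (--evalue_cutoff, --bitscore_cutoff, --coverage_cutoff, --identity_cutoff) act together as a per-hit filter for PUL–CGC homology. The following is a rough sketch of how they might combine, assuming all four must hold at once and that the comparisons are inclusive; the `Hit` record and `keep_hit` function are hypothetical and do not mirror run_dbcan's internal data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit between a CGC gene and a PUL gene (illustrative only)."""
    evalue: float
    bitscore: float
    coverage: float  # query coverage as a fraction, 0.0-1.0
    identity: float  # identity as a fraction, 0.0-1.0


def keep_hit(
    hit,
    evalue_cutoff=0.01,
    bitscore_cutoff=50.0,
    coverage_cutoff=0.0,
    identity_cutoff=0.0,
):
    """Return True if the hit satisfies every cutoff (inclusive bounds assumed)."""
    return (
        hit.evalue <= evalue_cutoff
        and hit.bitscore >= bitscore_cutoff
        and hit.coverage >= coverage_cutoff
        and hit.identity >= identity_cutoff
    )


hits = [
    Hit(evalue=1e-20, bitscore=180.0, coverage=0.9, identity=0.6),
    Hit(evalue=0.5, bitscore=20.0, coverage=0.3, identity=0.25),
]
kept = [h for h in hits if keep_hit(h)]
print(len(kept))
```

Because the coverage and identity defaults are 0.0, only the E-value and bit score constrain hits unless those two cutoffs are raised explicitly.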
version#
Show version information.
Usage
run_dbcan version [OPTIONS]