API Documentation

run_dbcan Command Line Interface

run_dbcan

Use dbCAN tools to annotate and analyze CAZymes and CGCs.

Usage

run_dbcan [OPTIONS] COMMAND [ARGS]...

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL
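
The precedence between --verbose and --log-level described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not run_dbcan's own code: the function name resolve_log_level is hypothetical, and only the level names listed above are accepted.

```python
import logging

# Level names accepted by --log-level, as listed in the options above.
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_log_level(verbose: bool = False, log_level: str = "WARNING") -> int:
    """Return the numeric logging level; --verbose wins over --log-level."""
    if verbose:
        return logging.DEBUG
    name = log_level.upper()
    if name not in LEVELS:
        raise ValueError(f"unknown log level: {log_level!r}")
    return getattr(logging, name)
```

With both options given, for example resolve_log_level(True, "ERROR"), the sketch returns logging.DEBUG, matching the documented override.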

CAZyme_annotation

Annotate CAZymes with run_dbcan from prokaryotic, metagenomic, or protein sequences.

Usage

run_dbcan CAZyme_annotation [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--mode <mode>#

Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).

Default:

'prok'

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--input_raw_data <input_raw_data>#

Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, proteins when --mode=protein.

--db_dir <db_dir>#

Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).

--methods <methods>#

CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm

Default:

'diamond,hmm,dbCANsub'
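
As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for this subcommand. The helper name build_cazyme_annotation_cmd and the file names in the usage note are hypothetical; only flags documented in this section are used, and the value checks mirror the choices listed above rather than run_dbcan's own validation.

```python
import shlex

VALID_MODES = {"prok", "meta", "protein"}
VALID_METHODS = {"diamond", "hmm", "dbCANsub"}


def build_cazyme_annotation_cmd(input_raw_data, output_dir, db_dir,
                                mode="prok",
                                methods=("diamond", "hmm", "dbCANsub")):
    """Return an argv list for `run_dbcan CAZyme_annotation`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported --mode: {mode!r}")
    unknown = set(methods) - VALID_METHODS
    if unknown:
        raise ValueError(f"unsupported --methods entries: {sorted(unknown)}")
    return [
        "run_dbcan", "CAZyme_annotation",
        "--mode", mode,
        "--input_raw_data", str(input_raw_data),
        "--output_dir", str(output_dir),
        "--db_dir", str(db_dir),
        "--methods", ",".join(methods),  # comma-separated, as documented
    ]


if __name__ == "__main__":
    cmd = build_cazyme_annotation_cmd("genome.fna", "out", "db",
                                      methods=("diamond", "hmm"))
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once run_dbcan and its databases are installed.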

--threads <threads>#

Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).

Default:

2

--verbose_option#

Pass --verbose to DIAMOND (more DIAMOND stderr output).

Default:

False

--e_value_threshold <e_value_threshold>#

Maximum E-value for DIAMOND CAZy hits (float; stricter = smaller).

Default:

1e-102

--large_input_threshold_mb <large_input_threshold_mb>#

If the input FASTA for the HMM step exceeds this size (MB), enable large mode automatically. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--large, --no-large#

Streaming-safe pyhmmer mode for huge inputs (less preload, lower OOM risk). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

False

--enable_memory_monitoring, --no-enable_memory_monitoring#

Track RAM and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.

Default:

True

--max_retries <max_retries>#

Retries after MemoryError during pyhmmer HMM search, halving batch each retry. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

3
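
The retry behaviour described for --max_retries (halve the batch after each MemoryError) can be sketched as follows. This is a simplified illustration, not the run_dbcan implementation: search_batch is a hypothetical callable standing in for the pyhmmer search, and the sketch restarts the whole search after a failure rather than resuming from the failed batch.

```python
def search_with_retries(sequences, search_batch, batch_size, max_retries=3):
    """Run search_batch over sequences in batches.

    After each MemoryError the batch size is halved and the search is
    restarted, up to max_retries times.
    """
    attempts = 0
    while True:
        try:
            results = []
            for start in range(0, len(sequences), batch_size):
                results.extend(search_batch(sequences[start:start + batch_size]))
            return results
        except MemoryError:
            attempts += 1
            # Give up when retries are exhausted or the batch cannot shrink.
            if attempts > max_retries or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
```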

--memory_safety_factor <memory_safety_factor>#

Fraction of available RAM used when estimating automatic batch_size (0.0–1.0; lower = smaller batches). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.5
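
One way to picture the role of --memory_safety_factor is as a cap on the RAM budget used to size a batch. The sketch below is an assumption-laden illustration: the per-sequence memory estimate (bytes_per_sequence) and the function name are hypothetical, and the formula run_dbcan actually uses is not given in this reference.

```python
def estimate_batch_size(available_bytes, bytes_per_sequence,
                        memory_safety_factor=0.5, minimum=1):
    """Estimate how many sequences fit in the allowed share of free RAM."""
    if not 0.0 <= memory_safety_factor <= 1.0:
        raise ValueError("memory_safety_factor must be within 0.0-1.0")
    if bytes_per_sequence <= 0:
        raise ValueError("bytes_per_sequence must be positive")
    budget = available_bytes * memory_safety_factor  # RAM the batch may use
    return max(minimum, int(budget // bytes_per_sequence))
```

A lower factor shrinks the budget and therefore the batch, which matches the "lower = smaller batches" note above.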

--max_memory_usage <max_memory_usage>#

If system memory usage ratio exceeds this (0.0–1.0), emit warnings and tighten batching. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.8

--batch_size <batch_size>#

Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

--csv_buffer_size <csv_buffer_size>#

Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--coverage_threshold_dbcan <coverage_threshold_dbcan>#

Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbcan <e_value_threshold_dbcan>#

Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).

Default:

1e-15
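
The two dbCAN HMM thresholds above act together as a filter on hits. The sketch below shows one plausible reading, assuming coverage is the aligned span divided by the relevant sequence length and that both comparisons are inclusive; the Hit fields and the function name are hypothetical and are not taken from run_dbcan or pyhmmer.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    start: int       # 1-based, inclusive alignment start
    end: int         # 1-based, inclusive alignment end
    length: int      # length used as the coverage denominator
    i_evalue: float  # domain-independent E-value


def keep_hit(hit, coverage_threshold=0.35, e_value_threshold=1e-15):
    """Keep a hit only if it passes both the coverage and E-value cutoffs."""
    coverage = (hit.end - hit.start + 1) / hit.length
    return coverage >= coverage_threshold and hit.i_evalue <= e_value_threshold
```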

--large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>#

dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).

Default:

5000

--large_dbsub, --no-large_dbsub#

dbCAN-sub: force streaming-safe pyhmmer mode.

Default:

False

--enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub#

dbCAN-sub: enable RAM monitoring and adaptive batching.

Default:

True

--max_retries_dbsub <max_retries_dbsub>#

dbCAN-sub: max pyhmmer retries after MemoryError.

Default:

3

--memory_safety_factor_dbsub <memory_safety_factor_dbsub>#

dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).

Default:

0.5

--max_memory_usage_dbsub <max_memory_usage_dbsub>#

dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.

Default:

0.8

--batch_size_dbsub <batch_size_dbsub>#

dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).

--csv_buffer_size_dbsub <csv_buffer_size_dbsub>#

dbCAN-sub: buffer this many HMM hit rows before flushing to disk.

Default:

5000

--coverage_threshold_dbsub <coverage_threshold_dbsub>#

Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbsub <e_value_threshold_dbsub>#

Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).

Default:

1e-15

--force_topology, --no-force_topology#

Recompute topology columns even when overview already contains predictions.

Default:

False

--signalp_org <signalp_org>#

SignalP organism class: other (bacteria/archaea) or euk (eukaryotes).

Default:

'other'

Options:

other | euk

--deeptmhmm_python <deeptmhmm_python>#

Python interpreter used to launch DeepTMHMM predict.py.

Default:

'python'

--deeptmhmm_dir <deeptmhmm_dir>#

Directory that contains DeepTMHMM predict.py (only used with --run_deeptmhmm).

--run_deeptmhmm, --no-run_deeptmhmm#

Run a user-installed DeepTMHMM predict.py and append transmembrane predictions to overview.

Default:

False

--run_signalp, --no-run_signalp#

Run SignalP 6 (BioLib) on translated proteins and append signal peptide columns to overview.

Default:

False

Pfam_null_cgc

Identify CAZyme Gene Clusters (CGCs).

Usage

run_dbcan Pfam_null_cgc [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--threads <threads>#

Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).

Default:

2

--db_dir <db_dir>#

Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--null_from_gff#

Extract null-gene proteins from cgc.gff instead of cgc_standard_out.tsv.

Default:

False

--coverage_threshold_pfam <coverage_threshold_pfam>#

Minimum HMM alignment coverage (0.0–1.0 fraction of HMM length) for Pfam hits.

Default:

0.35

--e_value_threshold_pfam <e_value_threshold_pfam>#

Maximum domain-independent E-value for Pfam hits.

Default:

0.0001

--run_pfam#

Run Pfam pyhmmer on null genes (Pfam_null_cgc / CGC null annotation).

Default:

False

cgc_circle_plot

Generate circular plots for CAZyme Gene Clusters (CGCs).

Usage

run_dbcan cgc_circle_plot [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

cgc_finder

Identify CAZyme Gene Clusters (CGCs).

Usage

run_dbcan cgc_finder [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--feature_type <feature_types>#

GFF feature types to parse (repeat the option for multiple values, e.g. --feature_type CDS --feature_type gene).

Default:

'CDS'

--min_cluster_genes <min_cluster_genes>#

Minimum total genes in a CGC locus.

Default:

2

--min_core_cazyme <min_core_cazyme>#

Minimum core CAZyme count required to retain a CGC.

Default:

1

--extend_gene_count <extend_gene_count>#

With --extend_mode=gene, extend each side by this many flanking genes.

Default:

0

--extend_bp <extend_bp>#

With --extend_mode=bp, extend each side by this many base pairs.

Default:

0

--extend_mode <extend_mode>#

After CGC detection, extend cluster bounds: ‘bp’ uses --extend_bp; ‘gene’ uses --extend_gene_count; ‘none’ disables extension.

Default:

'none'

Options:

none | bp | gene
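
The three extension modes can be illustrated with a small Python sketch. It assumes genes are (start, end) tuples sorted along one contig and that, in bp mode, a flanking gene is added only when it lies entirely inside the widened window; both points are assumptions made for illustration, and extend_cluster is a hypothetical name.

```python
def extend_cluster(first_idx, last_idx, genes, extend_mode="none",
                   extend_bp=0, extend_gene_count=0):
    """Return (first_idx, last_idx) of a cluster after optional extension."""
    if extend_mode == "none":
        return first_idx, last_idx
    if extend_mode == "gene":
        # Add a fixed number of flanking genes on each side.
        return (max(0, first_idx - extend_gene_count),
                min(len(genes) - 1, last_idx + extend_gene_count))
    if extend_mode == "bp":
        left = genes[first_idx][0] - extend_bp
        right = genes[last_idx][1] + extend_bp
        while first_idx > 0 and genes[first_idx - 1][0] >= left:
            first_idx -= 1
        while last_idx < len(genes) - 1 and genes[last_idx + 1][1] <= right:
            last_idx += 1
        return first_idx, last_idx
    raise ValueError(f"unknown extend_mode: {extend_mode!r}")
```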

--use_distance#

Also require signature genes to fall within --base_pair_distance bp.

Default:

False

--use_null_genes, --no-use_null_genes#

Allow null genes between CGC signatures (--no-use_null_genes: disable).

Default:

True

--base_pair_distance <base_pair_distance>#

Max distance (bp) between CGC signature genes when --use_distance is enabled.

Default:

15000

--num_null_gene <num_null_gene>#

Max number of intervening non-signature (null) genes allowed between core CGC genes.

Default:

2
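
To show how --num_null_gene, --use_null_genes, --use_distance, --base_pair_distance, --min_core_cazyme and --min_cluster_genes might interact, here is a simplified clustering sketch. It is not the cgc_finder algorithm: the gene dictionaries, the function name, and the choice to count intervening null genes toward the cluster size are assumptions, and the --additional_genes requirement is deliberately left out.

```python
def find_clusters(genes, num_null_gene=2, use_null_genes=True,
                  use_distance=False, base_pair_distance=15000,
                  min_core_cazyme=1, min_cluster_genes=2):
    """Group signature genes on one contig into (first_idx, last_idx) spans.

    genes: position-sorted dicts with 'type' ('CAZyme', 'TC', ..., or 'null'),
    'start' and 'end'.
    """
    allowed_nulls = num_null_gene if use_null_genes else 0
    signature = [i for i, g in enumerate(genes) if g["type"] != "null"]
    clusters = []

    def flush(members):
        if not members:
            return
        n_cazyme = sum(genes[i]["type"] == "CAZyme" for i in members)
        total = members[-1] - members[0] + 1  # includes intervening nulls
        if n_cazyme >= min_core_cazyme and total >= min_cluster_genes:
            clusters.append((members[0], members[-1]))

    current = []
    for idx in signature:
        if current:
            prev = current[-1]
            linked = (idx - prev - 1) <= allowed_nulls
            if use_distance:
                gap = genes[idx]["start"] - genes[prev]["end"]
                linked = linked and gap <= base_pair_distance
            if not linked:
                flush(current)
                current = []
        current.append(idx)
    flush(current)
    return clusters
```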

--additional_min_categories <additional_min_categories>#

When --additional_logic=any, minimum number of distinct additional gene classes that must match.

Default:

1

--additional_logic <additional_logic>#

How to combine --additional_genes: ‘all’ = every listed class must be present; ‘any’ = at least --additional_min_categories classes.

Default:

'all'

Options:

all | any

--additional_genes <additional_genes>#

Gene class tags required alongside CAZyme for CGC signatures. Repeat: --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.

Default:

'TC'
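
The ‘all’ and ‘any’ rules for additional gene classes reduce to a small set comparison. The sketch below is illustrative only; satisfies_additional is a hypothetical name, and cluster_types stands for the gene class tags found in one candidate cluster.

```python
def satisfies_additional(cluster_types, additional_genes=("TC",),
                         additional_logic="all", additional_min_categories=1):
    """Check whether a cluster carries the required additional gene classes."""
    required = set(additional_genes)
    present = set(cluster_types) & required
    if additional_logic == "all":
        return present == required  # every listed class must be present
    if additional_logic == "any":
        return len(present) >= additional_min_categories
    raise ValueError(f"unknown additional_logic: {additional_logic!r}")
```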

database

Download dbCAN databases.

Usage

run_dbcan database [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--verify-ssl, --no-verify-ssl#

Verify TLS certificates for HTTPS downloads (--no-verify-ssl: insecure, not recommended).

Default:

True

--no-overwrite#

If set, skip downloading files that already exist in db_dir.

Default:

False

--resume, --no-resume#

Resume partial downloads when supported (--no-resume: always fetch from scratch).

Default:

True

--retries <retries>#

Retries for transient HTTP/S3 download failures.

Default:

3

--timeout <timeout>#

HTTP(S) request timeout in seconds per download attempt.

Default:

30

--aws_s3#

Download from the pinned AWS S3 release; omit for HTTP db_current (moving snapshot).

Default:

False

--cgc, --no-cgc#

With --cgc (default): download CGC-related DB assets. With --no-cgc: skip them (database subcommand only).

Default:

True

--db_dir <db_dir>#

Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).
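
A download invocation built from the options above might be assembled as in this sketch. The helper name build_database_cmd is hypothetical; it only emits flags documented in this section, and it adds the boolean flags only when they differ from the listed defaults.

```python
def build_database_cmd(db_dir, aws_s3=False, no_overwrite=False, cgc=True,
                       retries=3, timeout=30):
    """Return an argv list for `run_dbcan database`."""
    cmd = ["run_dbcan", "database", "--db_dir", str(db_dir),
           "--retries", str(retries), "--timeout", str(timeout)]
    if aws_s3:
        cmd.append("--aws_s3")        # pinned AWS S3 release
    if no_overwrite:
        cmd.append("--no-overwrite")  # skip files already in db_dir
    if not cgc:
        cmd.append("--no-cgc")        # skip CGC-related assets
    return cmd
```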

easy_CGC

Perform complete CGC analysis: CAZyme annotation, GFF processing, and CGC identification in one step.

Usage

run_dbcan easy_CGC [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--mode <mode>#

Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).

Default:

'prok'

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--input_raw_data <input_raw_data>#

Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, proteins when --mode=protein.

--db_dir <db_dir>#

Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).

--methods <methods>#

CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm

Default:

'diamond,hmm,dbCANsub'

--threads <threads>#

Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).

Default:

2

--verbose_option#

Pass --verbose to DIAMOND (more DIAMOND stderr output).

Default:

False

--e_value_threshold <e_value_threshold>#

Maximum E-value for DIAMOND CAZy hits (float; stricter = smaller).

Default:

1e-102

--large_input_threshold_mb <large_input_threshold_mb>#

If the input FASTA for the HMM step exceeds this size (MB), enable large mode automatically. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--large, --no-large#

Streaming-safe pyhmmer mode for huge inputs (less preload, lower OOM risk). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

False

--enable_memory_monitoring, --no-enable_memory_monitoring#

Track RAM and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.

Default:

True

--max_retries <max_retries>#

Retries after MemoryError during pyhmmer HMM search, halving batch each retry. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

3

--memory_safety_factor <memory_safety_factor>#

Fraction of available RAM used when estimating automatic batch_size (0.0–1.0; lower = smaller batches). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.5

--max_memory_usage <max_memory_usage>#

If system memory usage ratio exceeds this (0.0–1.0), emit warnings and tighten batching. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.8

--batch_size <batch_size>#

Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

--csv_buffer_size <csv_buffer_size>#

Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--coverage_threshold_dbcan <coverage_threshold_dbcan>#

Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbcan <e_value_threshold_dbcan>#

Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).

Default:

1e-15

--large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>#

dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).

Default:

5000

--large_dbsub, --no-large_dbsub#

dbCAN-sub: force streaming-safe pyhmmer mode.

Default:

False

--enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub#

dbCAN-sub: enable RAM monitoring and adaptive batching.

Default:

True

--max_retries_dbsub <max_retries_dbsub>#

dbCAN-sub: max pyhmmer retries after MemoryError.

Default:

3

--memory_safety_factor_dbsub <memory_safety_factor_dbsub>#

dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).

Default:

0.5

--max_memory_usage_dbsub <max_memory_usage_dbsub>#

dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.

Default:

0.8

--batch_size_dbsub <batch_size_dbsub>#

dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).

--csv_buffer_size_dbsub <csv_buffer_size_dbsub>#

dbCAN-sub: buffer this many HMM hit rows before flushing to disk.

Default:

5000

--coverage_threshold_dbsub <coverage_threshold_dbsub>#

Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbsub <e_value_threshold_dbsub>#

Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).

Default:

1e-15

--coverage_threshold_stp <coverage_threshold_stp>#

Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).

Default:

0.35

--e_value_threshold_stp <e_value_threshold_stp>#

Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).

Default:

0.0001

--fungi, --no-fungi#

Run TF HMM search (--fungi); default skips TF pyhmmer (prokaryotes use TF DIAMOND instead).

Default:

False

--coverage_threshold_tf <coverage_threshold_tf>#

Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires --fungi).

Default:

0.35

--e_value_threshold_tf <e_value_threshold_tf>#

Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).

Default:

0.0001

--prokaryotic, --no-prokaryotic#

Run prokaryotic TF DIAMOND step (--no-prokaryotic: skip it; use fungi TF HMM with --fungi).

Default:

True

--coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>#

DIAMOND --query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as --coverage_threshold_tc, default 35%).

Default:

35

--e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>#

Maximum E-value for TF DIAMOND (prokaryotic TFDB path).

Default:

0.0001

--coverage_threshold_tc <coverage_threshold_tc>#

DIAMOND --query-cover for TCDB search, as percent of query length (0–100; default 35 = 35%).

Default:

35

--e_value_threshold_tc <e_value_threshold_tc>#

Maximum E-value for TC (transporter) DIAMOND vs TCDB.

Default:

0.0001

--prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>#

When --prodigal-gff-streaming=auto, stream if GFF size exceeds this (decimal MB). Use 0 to stream any non-empty file.

Default:

50

--prodigal-gff-streaming <prodigal_gff_streaming>#

Prodigal GFF annotation: ‘auto’ stream-annotate when input exceeds threshold (faster for huge GFF); ‘on’ always use streaming; ‘off’ always use BCBio GFF.parse.

Default:

'auto'

Options:

auto | on | off
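
The interaction between --prodigal-gff-streaming and --prodigal-streaming-threshold-mb can be written as a small decision function. This is a sketch of the documented rule, assuming "decimal MB" means 1,000,000 bytes and that "exceeds" is a strict comparison; should_stream is a hypothetical name.

```python
def should_stream(gff_size_bytes, prodigal_gff_streaming="auto",
                  prodigal_streaming_threshold_mb=50):
    """Decide whether the Prodigal GFF should be stream-annotated."""
    if prodigal_gff_streaming == "on":
        return True
    if prodigal_gff_streaming == "off":
        return False
    if prodigal_gff_streaming == "auto":
        # With a threshold of 0, any non-empty file (size > 0) is streamed.
        return gff_size_bytes > prodigal_streaming_threshold_mb * 1_000_000
    raise ValueError(f"unknown streaming mode: {prodigal_gff_streaming!r}")
```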

--gff_type <gff_type>#

GFF dialect label (e.g. prodigal, NCBI_prok). When --mode is not protein, defaults to ‘prodigal’ if unset.

--input_gff <input_gff>#

Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.
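
The defaulting rules for the GFF path and GFF dialect can be summarised in a short sketch. It follows only what is stated in the two option descriptions above; resolve_gff_inputs is a hypothetical helper, and for protein mode the sketch leaves an unset GFF dialect as None because no default is documented for that case.

```python
from pathlib import Path


def resolve_gff_inputs(mode, output_dir, input_gff=None, gff_type=None):
    """Return (gff_path, gff_type) after applying the documented defaults."""
    if mode == "protein":
        if input_gff is None:
            raise ValueError("--input_gff is required when --mode=protein")
        return Path(input_gff), gff_type
    gff_path = Path(input_gff) if input_gff else Path(output_dir) / "uniInput.gff"
    return gff_path, gff_type or "prodigal"
```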

--feature_type <feature_types>#

GFF feature types to parse (repeat the option for multiple values, e.g. --feature_type CDS --feature_type gene).

Default:

'CDS'

--min_cluster_genes <min_cluster_genes>#

Minimum total genes in a CGC locus.

Default:

2

--min_core_cazyme <min_core_cazyme>#

Minimum core CAZyme count required to retain a CGC.

Default:

1

--extend_gene_count <extend_gene_count>#

With --extend_mode=gene, extend each side by this many flanking genes.

Default:

0

--extend_bp <extend_bp>#

With --extend_mode=bp, extend each side by this many base pairs.

Default:

0

--extend_mode <extend_mode>#

After CGC detection, extend cluster bounds: ‘bp’ uses --extend_bp; ‘gene’ uses --extend_gene_count; ‘none’ disables extension.

Default:

'none'

Options:

none | bp | gene

--use_distance#

Also require signature genes to fall within --base_pair_distance bp.

Default:

False

--use_null_genes, --no-use_null_genes#

Allow null genes between CGC signatures (--no-use_null_genes: disable).

Default:

True

--base_pair_distance <base_pair_distance>#

Max distance (bp) between CGC signature genes when --use_distance is enabled.

Default:

15000

--num_null_gene <num_null_gene>#

Max number of intervening non-signature (null) genes allowed between core CGC genes.

Default:

2

--additional_min_categories <additional_min_categories>#

When --additional_logic=any, minimum number of distinct additional gene classes that must match.

Default:

1

--additional_logic <additional_logic>#

How to combine --additional_genes: ‘all’ = every listed class must be present; ‘any’ = at least --additional_min_categories classes.

Default:

'all'

Options:

all | any

--additional_genes <additional_genes>#

Gene class tags required alongside CAZyme for CGC signatures. Repeat: --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.

Default:

'TC'

easy_substrate

Perform complete CGC analysis: CAZyme annotation, GFF processing, CGC identification, and substrate prediction in one step.

Usage

run_dbcan easy_substrate [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

-subs, --substrate_scors <substrate_scors>#

Minimum aggregated dbCAN-sub score (field name substrate_scors) to accept a substrate assignment.

Default:

2.0

-npsc, --num_of_protein_substrate_cutoff <num_of_protein_substrate_cutoff>#

Minimum proteins supporting a substrate call within a CGC.

Default:

2

-ndsc, --num_of_domains_substrate_cutoff <num_of_domains_substrate_cutoff>#

Minimum distinct substrate-associated domains required per prediction.

Default:

2

-hmmevalue, --hmmevalue <hmmevalue>#

Maximum dbCAN-sub HMM E-value allowed for substrate evidence.

Default:

0.01

-hmmcov, --hmmcov <hmmcov>#

Minimum dbCAN-sub HMM coverage (0.0–1.0) when scoring substrate evidence.

Default:

0.0

-evalue, --evalue_cutoff <evalue_cutoff>#

Maximum BLAST E-value for PUL–CGC homology hits.

Default:

0.01

-bsc, --bitscore_cutoff <bitscore_cutoff>#

Minimum BLAST bit score for PUL–CGC homology hits.

Default:

50.0

-cov, --coverage_cutoff <coverage_cutoff>#

Minimum BLAST query coverage (0.0–1.0) for PUL–CGC homology hits.

Default:

0.0

-iden, --identity_cutoff <identity_cutoff>#

Minimum BLAST identity (0.0–1.0) for PUL–CGC homology hits.

Default:

0.0
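
The four BLAST cutoffs above (E-value, bit score, coverage, identity) can be read as one combined filter on PUL–CGC homology hits. The sketch below assumes all comparisons are inclusive and that coverage and identity are fractions in the 0.0–1.0 range, as the descriptions state; keep_homology_hit is a hypothetical name.

```python
def keep_homology_hit(evalue, bitscore, coverage, identity,
                      evalue_cutoff=0.01, bitscore_cutoff=50.0,
                      coverage_cutoff=0.0, identity_cutoff=0.0):
    """Return True if a BLAST hit passes every PUL-CGC homology cutoff."""
    return (evalue <= evalue_cutoff
            and bitscore >= bitscore_cutoff
            and coverage >= coverage_cutoff
            and identity >= identity_cutoff)
```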

-eptn, --extra_pair_type_num <extra_pair_type_num>#

Comma-separated counts matching --extra_pair_type entries (same order).

Default:

'0'

-ept, --extra_pair_type <extra_pair_type>#

Optional comma-separated accessory gene pair types (advanced PUL matching).
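
Because the pair types and their counts are passed as two comma-separated lists in the same order, they can be zipped into a mapping. The sketch below illustrates that pairing; parse_extra_pairs is a hypothetical helper, and the values TC and TF in the usage note are placeholders rather than documented choices for this option.

```python
def parse_extra_pairs(extra_pair_type=None, extra_pair_type_num="0"):
    """Pair each accessory gene type with its count, preserving order."""
    if not extra_pair_type:
        return {}
    types = [t.strip() for t in extra_pair_type.split(",")]
    counts = [int(n) for n in extra_pair_type_num.split(",")]
    if len(types) != len(counts):
        raise ValueError("pair types and counts must have the same length")
    return dict(zip(types, counts))
```

For example, parse_extra_pairs("TC,TF", "1,2") pairs TC with 1 and TF with 2, and a length mismatch between the two lists raises ValueError.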

-tpn, --total_pair_num <total_pair_num>#

Minimum total informative gene pairs (CAZyme + accessory) for a link.

Default:

2

-cpn, --CAZyme_pair_num <cazyme_pair_num>#

Minimum CAZyme–CAZyme pairs required inside the CGC for homology scoring.

Default:

1

-uqcgn, --uniq_query_cgc_gene_num <uniq_query_cgc_gene_num>#

Minimum unique CGC genes participating in a PUL–CGC link.

Default:

2

-upghn, --uniq_pul_gene_hit_num <uniq_pul_gene_hit_num>#

Minimum unique PUL genes hit by BLAST for a valid PUL–CGC link.

Default:

2

--db_dir <db_dir>#

Required. Database directory containing PUL/DIAMOND assets for substrate prediction.

Default:

'./dbCAN_databases'

-odbcanpul, --odbcanpul <odbcanpul>#

Whether to export dbCAN-PUL homology tables (pass true/false).

Default:

True

-odbcan_sub, --odbcan_sub <odbcan_sub>#

If set to true/false, force exporting extra dbCAN-sub tables; omit to use package default.

-env, --env <env>#

Execution environment label for external wrappers (usually keep local).

Default:

'local'

-rerun, --rerun <rerun>#

Re-run substrate prediction (pass true/false explicitly; rarely needed).

Default:

False

-w, --workdir <workdir>#

Working directory for legacy substrate scripts (prefer --output_dir for run_dbcan).

Default:

'.'

-o, --out <out>#

Legacy filename hint for standalone substrate tools (results still go under --output_dir).

Default:

'substrate.out'

--pul <pul>#

Path to dbCAN-PUL PUL.faa when you need an explicit PUL database file.

--mode <mode>#

Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), or protein (amino-acid FASTA).

Default:

'prok'

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--input_raw_data <input_raw_data>#

Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, proteins when --mode=protein.

--methods <methods>#

CAZyme annotation modules for CAZyme_annotation / easy_* step 1, comma-separated. Choices: diamond (CAZy DIAMOND), hmm (dbCAN HMM), dbCANsub (dbCAN-sub HMM). Example: --methods diamond,hmm

Default:

'diamond,hmm,dbCANsub'

--threads <threads>#

Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).

Default:

2

--verbose_option#

Pass --verbose to DIAMOND (more DIAMOND stderr output).

Default:

False

--e_value_threshold <e_value_threshold>#

Maximum E-value for DIAMOND CAZy hits (float; stricter = smaller).

Default:

1e-102

--large_input_threshold_mb <large_input_threshold_mb>#

If the input FASTA for the HMM step exceeds this size (MB), enable large mode automatically. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--large, --no-large#

Streaming-safe pyhmmer mode for huge inputs (less preload, lower OOM risk). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

False

--enable_memory_monitoring, --no-enable_memory_monitoring#

Track RAM and adapt pyhmmer batching; use --no-enable_memory_monitoring to disable. Also applies to pyhmmer TF/STP in gff_process (easy_* step 2); there are no separate TF/STP memory options.

Default:

True

--max_retries <max_retries>#

Retries after MemoryError during pyhmmer HMM search, halving batch each retry. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

3

--memory_safety_factor <memory_safety_factor>#

Fraction of available RAM used when estimating automatic batch_size (0.0–1.0; lower = smaller batches). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.5

--max_memory_usage <max_memory_usage>#

If system memory usage ratio exceeds this (0.0–1.0), emit warnings and tighten batching. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

0.8

--batch_size <batch_size>#

Protein sequences per pyhmmer batch for dbCAN HMM; omit for automatic sizing from free RAM. Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

--csv_buffer_size <csv_buffer_size>#

Buffer this many HMM hit rows before flushing to TSV (larger = fewer writes, slightly more RAM). Applies to pyhmmer TF/STP in gff_process (easy_* step 2) as well—no separate TF/STP memory options.

Default:

5000

--coverage_threshold_dbcan <coverage_threshold_dbcan>#

Minimum hit coverage (0.0–1.0 fraction of query length) for dbCAN HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbcan <e_value_threshold_dbcan>#

Maximum domain-independent E-value to keep a dbCAN HMM hit (pyhmmer; CAZyme_annotation / easy_* step 1 only).

Default:

1e-15

--large_input_threshold_mb_dbsub <large_input_threshold_mb_dbsub>#

dbCAN-sub: auto large mode when input FASTA exceeds this size (MB).

Default:

5000

--large_dbsub, --no-large_dbsub#

dbCAN-sub: force streaming-safe pyhmmer mode.

Default:

False

--enable_memory_monitoring_dbsub, --no-enable_memory_monitoring_dbsub#

dbCAN-sub: enable RAM monitoring and adaptive batching.

Default:

True

--max_retries_dbsub <max_retries_dbsub>#

dbCAN-sub: max pyhmmer retries after MemoryError.

Default:

3

--memory_safety_factor_dbsub <memory_safety_factor_dbsub>#

dbCAN-sub: RAM fraction used in automatic batch_size estimate (0.0–1.0).

Default:

0.5

--max_memory_usage_dbsub <max_memory_usage_dbsub>#

dbCAN-sub: system memory usage ratio threshold (0.0–1.0) for warnings / throttling.

Default:

0.8

--batch_size_dbsub <batch_size_dbsub>#

dbCAN-sub: sequences per pyhmmer batch; omit for automatic sizing (independent of --batch_size).

--csv_buffer_size_dbsub <csv_buffer_size_dbsub>#

dbCAN-sub: buffer this many HMM hit rows before flushing to disk.

Default:

5000

--coverage_threshold_dbsub <coverage_threshold_dbsub>#

Minimum hit coverage (0.0–1.0) for dbCAN-sub HMM (step 1 only).

Default:

0.35

--e_value_threshold_dbsub <e_value_threshold_dbsub>#

Maximum E-value to keep a dbCAN-sub HMM hit (pyhmmer; step 1 only).

Default:

1e-15

--coverage_threshold_stp <coverage_threshold_stp>#

Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).

Default:

0.35

--e_value_threshold_stp <e_value_threshold_stp>#

Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).

Default:

0.0001

--fungi, --no-fungi#

Run TF HMM search (--fungi); default skips TF pyhmmer (prokaryotes use TF DIAMOND instead).

Default:

False

--coverage_threshold_tf <coverage_threshold_tf>#

Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires --fungi).

Default:

0.35

--e_value_threshold_tf <e_value_threshold_tf>#

Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).

Default:

0.0001

--prokaryotic, --no-prokaryotic#

Run prokaryotic TF DIAMOND step (--no-prokaryotic: skip it; use fungi TF HMM with --fungi).

Default:

True

--coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>#

DIAMOND --query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as --coverage_threshold_tc, default 35%).

Default:

35

--e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>#

Maximum E-value for TF DIAMOND (prokaryotic TFDB path).

Default:

0.0001

--coverage_threshold_tc <coverage_threshold_tc>#

DIAMOND --query-cover for TCDB search, as percent of query length (0–100; default 35 = 35%).

Default:

35

--e_value_threshold_tc <e_value_threshold_tc>#

Maximum E-value for TC (transporter) DIAMOND vs TCDB.

Default:

0.0001

--prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>#

When --prodigal-gff-streaming=auto, stream if GFF size exceeds this (decimal MB). Use 0 to stream any non-empty file.

Default:

50

--prodigal-gff-streaming <prodigal_gff_streaming>#

Prodigal GFF annotation: ‘auto’ stream-annotate when input exceeds threshold (faster for huge GFF); ‘on’ always use streaming; ‘off’ always use BCBio GFF.parse.

Default:

'auto'

Options:

auto | on | off
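The interaction between --prodigal-gff-streaming and --prodigal-streaming-threshold-mb can be sketched as a small decision function. This is a sketch based only on the option descriptions above, not the tool's actual code; the function name should_stream is hypothetical, and "decimal MB" is taken to mean 1,000,000 bytes.

```python
def should_stream(setting: str, gff_size_bytes: int, threshold_mb: float = 50) -> bool:
    """Decide whether a Prodigal GFF would be stream-annotated.

    setting mirrors --prodigal-gff-streaming (auto | on | off);
    threshold_mb mirrors --prodigal-streaming-threshold-mb.
    """
    if setting == "on":
        return True
    if setting == "off":
        return False
    if setting == "auto":
        # Assumption: 1 decimal MB = 1_000_000 bytes, strict "exceeds" comparison.
        return gff_size_bytes > threshold_mb * 1_000_000
    raise ValueError(f"unknown streaming setting: {setting!r}")
```

With the default threshold of 50, a 60 MB GFF streams and a 10 MB GFF does not; with a threshold of 0, any non-empty file streams, which matches the description of the threshold option.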

--gff_type <gff_type>#

GFF dialect label (e.g. prodigal, NCBI_prok). When --mode is not protein, defaults to ‘prodigal’ if unset.

--input_gff <input_gff>#

Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.

--feature_type <feature_types>#

GFF feature types to parse (repeat option for multiple values, e.g. --feature_type CDS --feature_type gene).

Default:

'CDS'

--min_cluster_genes <min_cluster_genes>#

Minimum total genes in a CGC locus.

Default:

2

--min_core_cazyme <min_core_cazyme>#

Minimum core CAZyme count required to retain a CGC.

Default:

1

--extend_gene_count <extend_gene_count>#

With --extend_mode=gene, extend each side by this many flanking genes.

Default:

0

--extend_bp <extend_bp>#

With --extend_mode=bp, extend each side by this many base pairs.

Default:

0

--extend_mode <extend_mode>#

After CGC detection, extend cluster bounds: ‘bp’ uses --extend_bp; ‘gene’ uses --extend_gene_count; ‘none’ disables.

Default:

'none'

Options:

none | bp | gene
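The three extension modes can be illustrated with a short Python sketch. It is an interpretation of the option descriptions above rather than the tool's implementation: the function name extend_cluster, the clamping of the start coordinate at 1, and the representation of genes as (start, end) tuples are all assumptions.

```python
from typing import Sequence, Tuple


def extend_cluster(
    start: int,
    end: int,
    mode: str = "none",
    extend_bp: int = 0,
    extend_gene_count: int = 0,
    genes: Sequence[Tuple[int, int]] = (),
) -> Tuple[int, int]:
    """Return extended (start, end) bounds for one cluster.

    genes lists (start, end) for every gene on the contig, sorted by start.
    """
    if mode == "none":
        return start, end
    if mode == "bp":
        # Assumption: coordinates are 1-based, so never go below 1.
        return max(1, start - extend_bp), end + extend_bp
    if mode == "gene":
        upstream = [g for g in genes if g[1] < start]
        downstream = [g for g in genes if g[0] > end]
        # Guard the zero case: list[-0:] would return the whole list.
        upstream = upstream[-extend_gene_count:] if extend_gene_count else []
        downstream = downstream[:extend_gene_count]
        new_start = min([start] + [g[0] for g in upstream])
        new_end = max([end] + [g[1] for g in downstream])
        return new_start, new_end
    raise ValueError(f"unknown extend mode: {mode!r}")
```

For a cluster spanning 500 to 900 on a contig with neighbouring genes at 300 to 400 and 1000 to 1100, mode ‘bp’ with 150 bp gives 350 to 1050, and mode ‘gene’ with one flanking gene gives 300 to 1100.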

--use_distance#

Also require signature genes to fall within --base_pair_distance bp.

Default:

False

--use_null_genes, --no-use_null_genes#

Allow null genes between CGC signatures (--no-use_null_genes: disable).

Default:

True

--base_pair_distance <base_pair_distance>#

Max distance (bp) between CGC signature genes when --use_distance is enabled.

Default:

15000

--num_null_gene <num_null_gene>#

Max number of intervening non-signature (null) genes allowed between core CGC genes.

Default:

2
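To show how --num_null_gene, --use_distance and --base_pair_distance could combine, here is a simplified Python sketch that groups signature genes on a single contig. It is not dbCAN's CGC finder: the function name, the gene tuple layout, and the way the gap is measured (end of the previous signature gene to the start of the next) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# (start, end, signature tag); the tag is None for a null (non-signature) gene.
Gene = Tuple[int, int, Optional[str]]


def group_signature_genes(
    genes: Sequence[Gene],
    num_null_gene: int = 2,
    use_distance: bool = False,
    base_pair_distance: int = 15000,
) -> List[List[int]]:
    """Group indices of signature genes into candidate clusters.

    genes must be sorted by position along the contig.
    """
    clusters: List[List[int]] = []
    current: List[int] = []
    for i, (start, _end, tag) in enumerate(genes):
        if tag is None:
            continue  # null genes only count as spacers
        if current:
            prev = current[-1]
            nulls_between = i - prev - 1
            gap = start - genes[prev][1]
            too_far = use_distance and gap > base_pair_distance
            if nulls_between > num_null_gene or too_far:
                clusters.append(current)
                current = []
        current.append(i)
    if current:
        clusters.append(current)
    return clusters
```

Candidate groups produced this way would still need to be filtered, for example by --min_cluster_genes and --min_core_cazyme.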

--additional_min_categories <additional_min_categories>#

When --additional_logic=any, minimum number of distinct additional gene classes that must match.

Default:

1

--additional_logic <additional_logic>#

How to combine --additional_genes: ‘all’ = every listed class must be present; ‘any’ = at least --additional_min_categories classes.

Default:

'all'

Options:

all | any

--additional_genes <additional_genes>#

Gene class tags required alongside CAZyme for CGC signatures. Repeat: --additional_genes TC --additional_genes TF. Choices include TC, TF, STP.

Default:

'TC'
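The ‘all’ and ‘any’ semantics of --additional_logic can be expressed as a small predicate. This Python sketch follows the option descriptions above; the function name meets_additional_requirement is hypothetical and the code is not taken from run_dbcan.

```python
from typing import Iterable


def meets_additional_requirement(
    present_classes: Iterable[str],
    additional_genes: Iterable[str] = ("TC",),
    additional_logic: str = "all",
    additional_min_categories: int = 1,
) -> bool:
    """Check whether a cluster carries the required non-CAZyme gene classes."""
    required = set(additional_genes)
    matched = required & set(present_classes)
    if additional_logic == "all":
        # Every listed class must be present.
        return matched == required
    if additional_logic == "any":
        # At least additional_min_categories distinct classes must be present.
        return len(matched) >= additional_min_categories
    raise ValueError(f"unknown additional_logic: {additional_logic!r}")
```

For a cluster containing CAZyme and TF genes with --additional_genes TC --additional_genes TF, ‘all’ rejects the cluster (TC is missing), while ‘any’ with --additional_min_categories 1 accepts it.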

gff_process#

Generate the GFF used for CGC identification. --input_gff is required when --input_raw_data is a protein sequence file. If --input_gff is not provided, it defaults to <output_dir>/uniInput.gff.

Usage

run_dbcan gff_process [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL
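The precedence between -v/--verbose and --log-level described above can be sketched with Python's standard logging module. The helper resolve_log_level is a hypothetical illustration of that rule, not a function exported by run_dbcan.

```python
import logging

# The choices documented for --log-level.
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_log_level(verbose: bool = False, log_level: str = "WARNING") -> int:
    """Return the numeric logging level implied by the two options."""
    if verbose:
        # -v/--verbose wins even if --log-level was also passed.
        return logging.DEBUG
    if log_level not in LEVELS:
        raise ValueError(f"unsupported log level: {log_level!r}")
    return getattr(logging, log_level)
```

For example, resolve_log_level(verbose=True, log_level="ERROR") yields logging.DEBUG, and calling it with no arguments yields logging.WARNING, the documented default.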

--db_dir <db_dir>#

Required. Directory containing dbCAN database files (HMM, DIAMOND, etc.).

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--threads <threads>#

Parallel worker count for supported tools (DIAMOND, pyhmmer CPU threads, etc.).

Default:

2

--coverage_threshold_stp <coverage_threshold_stp>#

Minimum hit coverage (0.0–1.0) for STP pyhmmer (step 2).

Default:

0.35

--e_value_threshold_stp <e_value_threshold_stp>#

Maximum E-value for STP pyhmmer hits (gff_process / easy_* step 2).

Default:

0.0001

--fungi, --no-fungi#

Run the TF HMM search (--fungi); by default the TF pyhmmer search is skipped (prokaryotes use TF DIAMOND instead).

Default:

False

--coverage_threshold_tf <coverage_threshold_tf>#

Minimum hit coverage (0.0–1.0) for TF pyhmmer (step 2; requires --fungi).

Default:

0.35

--e_value_threshold_tf <e_value_threshold_tf>#

Maximum E-value for TF pyhmmer hits (gff_process / easy_* step 2; fungi mode only).

Default:

0.0001

--prokaryotic, --no-prokaryotic#

Run the prokaryotic TF DIAMOND step (--no-prokaryotic skips it; use the fungi TF HMM search with --fungi).

Default:

True
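As a usage illustration, the following Python sketch assembles a gff_process command line that switches from the prokaryotic TF DIAMOND step to the fungi TF HMM search. It uses only options documented in this section. The helper name gff_process_argv and the example paths are hypothetical, and the sketch only builds the argument list without running anything.

```python
from typing import List


def gff_process_argv(
    db_dir: str,
    output_dir: str,
    threads: int = 2,
    fungi: bool = False,
    prokaryotic: bool = True,
) -> List[str]:
    """Build an argument list for `run_dbcan gff_process`."""
    argv = [
        "run_dbcan", "gff_process",
        "--db_dir", db_dir,
        "--output_dir", output_dir,
        "--threads", str(threads),
    ]
    # Both switches are paired on/off flags in the option list.
    argv.append("--fungi" if fungi else "--no-fungi")
    argv.append("--prokaryotic" if prokaryotic else "--no-prokaryotic")
    return argv


# Placeholder paths; substitute your own database and output directories.
fungal_run = gff_process_argv("./db", "./out", threads=4, fungi=True, prokaryotic=False)
```

The resulting list can be handed to subprocess.run(fungal_run, check=True) on a machine where run_dbcan is installed.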

--coverage_threshold_tf_diamond <coverage_threshold_tf_diamond>#

DIAMOND --query-cover for TF DIAMOND: minimum query coverage in percent (0–100; same semantics as --coverage_threshold_tc, default 35%).

Default:

35

--e_value_threshold_tf_diamond <e_value_threshold_tf_diamond>#

Maximum E-value for TF DIAMOND (prokaryotic TFDB path).

Default:

0.0001

--coverage_threshold_tc <coverage_threshold_tc>#

DIAMOND --query-cover for TCDB search, as percent of query length (0–100; default 35 = 35%).

Default:

35

--e_value_threshold_tc <e_value_threshold_tc>#

Maximum E-value for TC (transporter) DIAMOND vs TCDB.

Default:

0.0001

--coverage_threshold_sulfatase <coverage_threshold_sulfatase>#

DIAMOND --query-cover for Sulfatase: minimum query coverage in percent (0–100).

Default:

35

--e_value_threshold_sulfatase <e_value_threshold_sulfatase>#

Maximum E-value for Sulfatase DIAMOND.

Default:

0.0001

--coverage_threshold_peptidase <coverage_threshold_peptidase>#

DIAMOND --query-cover for Peptidase: minimum query coverage in percent (0–100).

Default:

35

--e_value_threshold_peptidase <e_value_threshold_peptidase>#

Maximum E-value for Peptidase DIAMOND.

Default:

0.0001

--prodigal-streaming-threshold-mb <prodigal_streaming_threshold_mb>#

When --prodigal-gff-streaming=auto, stream if GFF size exceeds this (decimal MB). Use 0 to stream any non-empty file.

Default:

50

--prodigal-gff-streaming <prodigal_gff_streaming>#

Prodigal GFF annotation: ‘auto’ stream-annotate when input exceeds threshold (faster for huge GFF); ‘on’ always use streaming; ‘off’ always use BCBio GFF.parse.

Default:

'auto'

Options:

auto | on | off

--gff_type <gff_type>#

GFF dialect label (e.g. prodigal, NCBI_prok). When --mode is not protein, defaults to ‘prodigal’ if unset.

--input_gff <input_gff>#

Annotation GFF. Required when --mode=protein; otherwise defaults to <output_dir>/uniInput.gff.

substrate_prediction#

Usage

run_dbcan substrate_prediction [OPTIONS]

Options

-v, --verbose#

Shortcut for --log-level DEBUG (overrides --log-level if both are passed).

Default:

False

--log-file <log_file>#

Log file path (truncates each run); mirrors console when set.

--log-level <log_level>#

Python logging level for run_dbcan and nested subcommands (including easy_CGC / easy_substrate steps). DEBUG shows diagnostic messages where implemented.

Default:

'WARNING'

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

--db_dir <db_dir>#

Required. Database directory containing PUL/DIAMOND assets for substrate prediction.

Default:

'./dbCAN_databases'

-odbcanpul, --odbcanpul <odbcanpul>#

Whether to export dbCAN-PUL homology tables (pass true/false).

Default:

True

-odbcan_sub, --odbcan_sub <odbcan_sub>#

If set to true/false, force exporting extra dbCAN-sub tables; omit to use package default.

-env, --env <env>#

Execution environment label for external wrappers (usually keep local).

Default:

'local'

-rerun, --rerun <rerun>#

Re-run substrate prediction (pass true/false explicitly; rarely needed).

Default:

False

-w, --workdir <workdir>#

Working directory for legacy substrate scripts (prefer --output_dir for run_dbcan).

Default:

'.'

-o, --out <out>#

Legacy filename hint for standalone substrate tools (results still go under --output_dir).

Default:

'substrate.out'

--pul <pul>#

Path to dbCAN-PUL PUL.faa when you need an explicit PUL database file.

--mode <mode>#

Required. Input type: prok (prokaryote DNA), meta (metagenome DNA), protein (amino-acid FASTA).

Default:

'prok'

--output_dir <output_dir>#

Required. Output directory (created if needed). All run_dbcan result files are written here.

--input_raw_data <input_raw_data>#

Required. Path to input sequences (FASTA/FASTA.gz): nucleotide for prok/meta modes, proteins when --mode=protein.

-evalue, --evalue_cutoff <evalue_cutoff>#

Maximum BLAST E-value for PUL–CGC homology hits.

Default:

0.01

-bsc, --bitscore_cutoff <bitscore_cutoff>#

Minimum BLAST bit score for PUL–CGC homology hits.

Default:

50.0

-cov, --coverage_cutoff <coverage_cutoff>#

Minimum BLAST query coverage (0.0–1.0) for PUL–CGC homology hits.

Default:

0.0

-iden, --identity_cutoff <identity_cutoff>#

Minimum BLAST identity (0.0–1.0) for PUL–CGC homology hits.

Default:

0.0
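The four per-hit cutoffs above (E-value, bit score, coverage, identity) can be read as one combined filter. The Python sketch below shows that reading with the documented defaults. The Hit dataclass and the passes_cutoffs function are hypothetical, and treating every bound as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One PUL–CGC homology hit (illustrative fields only)."""
    evalue: float
    bitscore: float
    coverage: float  # query coverage as a fraction, 0.0–1.0
    identity: float  # identity as a fraction, 0.0–1.0


def passes_cutoffs(
    hit: Hit,
    evalue_cutoff: float = 0.01,
    bitscore_cutoff: float = 50.0,
    coverage_cutoff: float = 0.0,
    identity_cutoff: float = 0.0,
) -> bool:
    """Return True if the hit satisfies all four cutoffs (inclusive bounds assumed)."""
    return (
        hit.evalue <= evalue_cutoff
        and hit.bitscore >= bitscore_cutoff
        and hit.coverage >= coverage_cutoff
        and hit.identity >= identity_cutoff
    )
```

With the defaults, coverage and identity do not filter anything (both cutoffs are 0.0), so only the E-value and bit score decide whether a hit is kept.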

-eptn, --extra_pair_type_num <extra_pair_type_num>#

Comma-separated counts matching --extra_pair_type entries (same order).

Default:

'0'

-ept, --extra_pair_type <extra_pair_type>#

Optional comma-separated accessory gene pair types (advanced PUL matching).
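Because --extra_pair_type and --extra_pair_type_num are matched by position, it can help to see the pairing spelled out. The Python sketch below is an illustration only: the function name parse_extra_pairs is hypothetical, and the type labels used in the usage note are placeholder values, since the accepted values are not listed in this section.

```python
from typing import Dict, Optional


def parse_extra_pairs(
    extra_pair_type: Optional[str] = None,
    extra_pair_type_num: str = "0",
) -> Dict[str, int]:
    """Pair each accessory type with its minimum count, by position."""
    if not extra_pair_type:
        return {}  # option omitted: no accessory pair requirements
    types = [t.strip() for t in extra_pair_type.split(",")]
    counts = [int(n) for n in extra_pair_type_num.split(",")]
    if len(types) != len(counts):
        raise ValueError(
            f"{len(types)} pair types but {len(counts)} counts; lists must align"
        )
    return dict(zip(types, counts))
```

For example, parse_extra_pairs("TC,STP", "1,2") pairs the first type with 1 and the second with 2, while a mismatched pair of lists raises an error instead of silently dropping entries.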

-tpn, --total_pair_num <total_pair_num>#

Minimum total informative gene pairs (CAZyme + accessory) for a link.

Default:

2

-cpn, --CAZyme_pair_num <cazyme_pair_num>#

Minimum CAZyme–CAZyme pairs required inside the CGC for homology scoring.

Default:

1

-uqcgn, --uniq_query_cgc_gene_num <uniq_query_cgc_gene_num>#

Minimum unique CGC genes participating in a PUL–CGC link.

Default:

2

-upghn, --uniq_pul_gene_hit_num <uniq_pul_gene_hit_num>#

Minimum unique PUL genes hit by BLAST for a valid PUL–CGC link.

Default:

2

-subs, --substrate_scors <substrate_scors>#

Minimum aggregated dbCAN-sub score (field name substrate_scors) to accept a substrate assignment.

Default:

2.0

-npsc, --num_of_protein_substrate_cutoff <num_of_protein_substrate_cutoff>#

Minimum proteins supporting a substrate call within a CGC.

Default:

2

-ndsc, --num_of_domains_substrate_cutoff <num_of_domains_substrate_cutoff>#

Minimum distinct substrate-associated domains required per prediction.

Default:

2

-hmmevalue, --hmmevalue <hmmevalue>#

Maximum dbCAN-sub HMM E-value allowed for substrate evidence.

Default:

0.01

-hmmcov, --hmmcov <hmmcov>#

Minimum dbCAN-sub HMM coverage (0.0–1.0) when scoring substrate evidence.

Default:

0.0

version#

Show version information.

Usage

run_dbcan version [OPTIONS]