Short Reads Analysis Mode
The short reads analysis mode (--type shortreads) is designed for processing Illumina short-read sequencing data. This mode performs assembly-based CAZyme annotation using MEGAHIT for metagenomic assembly, followed by gene prediction, CAZyme annotation, and abundance calculation.
Overview
The short reads mode is the default analysis mode and is optimized for Illumina paired-end or single-end sequencing data. It provides comprehensive CAZyme and CGC (CAZyme Gene Cluster) analysis with optional RNA-seq integration for expression analysis.
Workflow
The short reads workflow consists of the following main steps:
1. Quality Control (FastQC + TrimGalore)
   - FastQC quality assessment of raw sequencing reads
   - TrimGalore adapter trimming and quality filtering
2. Taxonomic Filtering (Kraken2, optional)
   - Taxonomic classification using Kraken2
   - Extraction of reads matching specified taxonomy (default: human reads, tax ID 9606)
   - Can be skipped with --skip_kraken_extraction
3. Read Processing (optional)
   - Subsampling: Downsample reads before assembly (see Short Reads: Subsampling Mode)
   - Co-assembly: Combine all samples for joint assembly (see Short Reads: Co-assembly Mode)
4. Assembly (MEGAHIT)
   - Metagenomic assembly using MEGAHIT
   - Default minimum contig length: 1000 bp
   - Memory usage limited to 50% of available memory
5. Gene Prediction (Pyrodigal)
   - Prodigal-based gene finding optimized for metagenomic data
   - Generates protein sequences (FAA) and gene annotations (GFF)
6. CAZyme Annotation (run_dbCAN)
   - CAZyme identification using dbCAN database
   - CGC (CAZyme Gene Cluster) detection
   - Substrate prediction
7. Read Mapping (BWA-MEM)
   - Mapping of DNA reads back to assembled contigs
   - Mapping of RNA reads (if provided) for expression analysis
   - Coverage calculation for genes and CGCs
8. Abundance Calculation
   - Gene-level abundance calculation based on read coverage
   - CGC abundance and visualization
   - Generation of bar plots and heatmaps
9. Report Generation (MultiQC)
   - Aggregated quality control and analysis reports
Usage
Note: Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.
Basic Usage
The simplest command to run short reads analysis (assuming you are in the dbcan-nf directory):
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
-profile docker \
--skip_kraken_extraction  # optional: skip Kraken2 read extraction if the Kraken2 database is too large for your system
With RNA-seq Data
When RNA-seq transcriptome data is provided in the samplesheet, the pipeline will automatically:
Process RNA reads through quality control
Map RNA reads to assembled contigs
Calculate RNA-based abundance for expression analysis
Generate separate DNA and RNA abundance plots
Note: RNA-seq processing is automatically disabled when --subsample or --coassembly is used.
Example with RNA-seq:
nextflow run main.nf \
--input samplesheet_with_rna.csv \
--outdir results \
--type shortreads \
-profile docker
Skipping Steps
You can skip certain steps if needed:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--skip_fastqc \
--skip_trimming \
--skip_kraken_extraction \
-profile docker
Advanced Options
Using custom databases:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--dbcan_db /path/to/dbcan_database \
--kraken_db /path/to/kraken_database \
-profile docker
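Before launching a long run with custom databases, it can help to confirm that the paths actually exist. The following is a minimal sketch of such a pre-flight check, not part of the pipeline. The function name `check_db_dir` is hypothetical, and the sketch assumes that both `--dbcan_db` and `--kraken_db` point to directories.

```shell
# Hypothetical pre-flight check (not part of the pipeline): confirm that a
# database directory exists and is non-empty before passing it to
# --dbcan_db or --kraken_db. Assumes the database path is a directory.
check_db_dir() {
    if [ ! -d "$1" ]; then
        echo "error: database directory not found: $1" >&2
        return 1
    fi
    # ls -A lists all entries except "." and ".."; empty output means no files.
    if [ -z "$(ls -A "$1")" ]; then
        echo "error: database directory is empty: $1" >&2
        return 1
    fi
}
```

A wrapper script could call `check_db_dir /path/to/dbcan_database && check_db_dir /path/to/kraken_database` and only then start `nextflow run`.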
Subsampling and Co-assembly
The short reads mode supports two special processing options:
Subsampling Mode: Downsample reads before assembly to reduce computational requirements
Co-assembly Mode: Combine all samples for joint assembly
These modes are mutually exclusive: enable either --subsample or --coassembly, not both.
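The exclusivity rule can be checked on the command line before a run starts. This is a minimal sketch of a wrapper-side check, not the pipeline's own validation. The function name `check_mode_flags` is hypothetical, and it only inspects the arguments you intend to pass to `nextflow run`.

```shell
# Hypothetical helper (not part of the pipeline): fail when both
# --subsample and --coassembly appear in the argument list.
check_mode_flags() {
    local has_subsample=0 has_coassembly=0 arg
    for arg in "$@"; do
        case "$arg" in
            --subsample|--subsample=*)   has_subsample=1 ;;
            --coassembly|--coassembly=*) has_coassembly=1 ;;
        esac
    done
    if [ "$has_subsample" -eq 1 ] && [ "$has_coassembly" -eq 1 ]; then
        echo "error: --subsample and --coassembly are mutually exclusive" >&2
        return 1
    fi
    return 0
}
```

In a wrapper script, `check_mode_flags "$@" && nextflow run main.nf "$@"` would stop early instead of failing after the pipeline has started.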
Output Files
The pipeline generates output files organized in the following directory structure:
Assembly Results
megahit/
- *_contigs.fa.gz: Assembled contigs in FASTA format (gzipped)
- Assembly statistics and reports
Gene Prediction
pyrodigal/
- *.faa.gz: Predicted protein sequences (gzipped)
- *.gff.gz: Gene annotations in GFF format (gzipped)
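As a quick sanity check on gene prediction, the number of predicted proteins can be counted from the gzipped FAA file, since each FASTA record begins with a ">" header line. The helper below is a hypothetical sketch and not part of the pipeline.

```shell
# Hypothetical helper: count protein records in a gzipped FASTA file by
# counting header lines (lines starting with ">").
count_proteins() {
    if [ ! -f "$1" ]; then
        echo "error: no such file: $1" >&2
        return 1
    fi
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function's exit status clean while the printed count stays correct.
    gzip -dc "$1" | grep -c '^>' || true
}
```

Assuming the output directory is `results` and the FAA files sit directly under `pyrodigal/`, a call would look like `count_proteins results/pyrodigal/sample1.faa.gz`, where `sample1` is a placeholder sample name.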
CAZyme Annotation
rundbcan/
- *_dbcan/: Directory containing all run_dbCAN results

Files in this directory:
- overview.tsv: CAZyme annotation overview
- dbCAN_hmm_results.tsv: HMM-based CAZyme predictions
- dbCANsub_hmm_results.tsv: Subfamily predictions
- diamond.out: DIAMOND search results
- *_cgc.gff: CGC annotations
- *_cgc_standard_out.tsv: CGC standard output
- *_substrate_prediction.tsv: Substrate predictions
- *_synteny_pdf/: Synteny plots for CGCs
Read Mapping
bwa/: BWA index files
bwa_index_mem/: BAM files from read mapping
- *.bam: Aligned reads
- *.bam.bai: BAM index files
Coverage and Abundance
dbcan_utils_cal_coverage/
- *_depth.txt: Gene coverage depth files

dbcan_utils_cal_abund/
- *_abund/: Abundance calculation results

Files in this directory:
- *_abund.txt: Gene abundance values
- *_cgc_abund.txt: CGC abundance values

dbcan_plot/
- *_pdf/: Visualization plots

Files in this directory:
- heatmap.pdf: Abundance heatmap
- ec.pdf: EC number distribution
- family.pdf: CAZyme family distribution
- subfamily.pdf: Subfamily distribution

cgc_depth_plot/
- *_cgc_depth.tsv: CGC depth coverage data
- *_cgc_depth.pdf: CGC depth plots
Quality Control
fastqc/: FastQC reports (if not skipped)
trimgalore/: TrimGalore reports (if not skipped)
multiqc/: MultiQC aggregated report
Pipeline Information
pipeline_info/: Execution reports, parameters, and software versions
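After a run finishes, the directory layout described above can be checked in one pass. The sketch below lists the expected subdirectories that are missing from an output directory. The function name `missing_outputs` is hypothetical, and the sketch assumes that all listed directories sit directly under the path given to `--outdir`. The optional fastqc/ and trimgalore/ directories are left out because they are absent when those steps are skipped.

```shell
# Hypothetical helper (not part of the pipeline): print each expected output
# subdirectory that does not exist under the given output directory.
# Assumes the subdirectories are direct children of the --outdir path.
missing_outputs() {
    local outdir="$1" d
    for d in megahit pyrodigal rundbcan bwa bwa_index_mem \
             dbcan_utils_cal_coverage dbcan_utils_cal_abund \
             dbcan_plot cgc_depth_plot multiqc pipeline_info; do
        [ -d "$outdir/$d" ] || echo "$d"
    done
}
```

Running `missing_outputs results` prints nothing when every expected directory is present. Runs that use subsampling or co-assembly may produce a different layout, so treat a non-empty result as a prompt to inspect the run and not as a definite failure.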
Key Features
Dual Analysis: Supports both DNA and RNA-seq data for comprehensive analysis
Flexible Processing: Optional subsampling and co-assembly modes
Comprehensive Annotation: CAZyme identification, CGC detection, and substrate prediction
Abundance Calculation: Gene-level and CGC-level abundance with visualization
Quality Control: Integrated QC pipeline with MultiQC reporting
Example Results
For example visualizations and results from short reads mode analysis, see Nextflow Pipeline: Results Examples.
See Also
Short Reads: Subsampling Mode - Subsampling mode details
Short Reads: Co-assembly Mode - Co-assembly mode details
Nextflow Pipeline: Parameters Reference - Complete parameter reference
nextflow-output - General output documentation
Nextflow Pipeline: Results Examples - Example results and visualizations