Long Reads Analysis Mode#
The long reads analysis mode (--type longreads) is designed for processing PacBio or Nanopore long-read sequencing data. This mode uses Flye for metagenomic assembly, which is optimized for long-read technologies.
Overview#
The long reads mode is optimized for third-generation sequencing technologies (PacBio and Nanopore) that produce longer reads compared to Illumina short-read sequencing. Longer reads enable better assembly of complex metagenomic communities and improved detection of complete gene clusters.
Workflow#
The long reads workflow consists of the following main steps:
1. Quality Control (FastQC, optional for DNA)
   - FastQC quality assessment (primarily for RNA-seq data)
   - DNA long reads typically skip QC/trimming steps as they are less affected by adapters
2. Taxonomic Filtering (Kraken2, optional)
   - Taxonomic classification using Kraken2 (applied to RNA-seq data)
   - Extraction of reads matching specified taxonomy
   - Can be skipped with --skip_kraken_extraction
3. Assembly (Flye)
   - Long-read metagenomic assembly using Flye
   - Supports multiple Flye modes for different sequencing technologies
   - Configurable via the --flye_mode parameter
4. Gene Prediction (Pyrodigal)
   - Prodigal-based gene finding optimized for metagenomic data
   - Generates protein sequences (FAA) and gene annotations (GFF)
5. CAZyme Annotation (run_dbCAN)
   - CAZyme identification using dbCAN database
   - CGC (CAZyme Gene Cluster) detection
   - Substrate prediction
6. Read Mapping (Minimap2 for DNA, BWA-MEM for RNA)
   - Mapping of long DNA reads back to assembled contigs using Minimap2
   - Mapping of RNA reads (if provided) using BWA-MEM for expression analysis
   - Coverage calculation for genes and CGCs
7. Abundance Calculation
   - Gene-level abundance calculation based on read coverage
   - CGC abundance and visualization
   - Generation of bar plots and heatmaps
8. Report Generation (MultiQC)
   - Aggregated quality control and analysis reports
Usage#
Note
Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.
Basic Usage#
The simplest command to run long reads analysis (assuming you are in the dbcan-nf directory):
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
-profile docker \
--skip_kraken_extraction # optional: skip Kraken2 read extraction, for example when the Kraken2 database is too large for the available resources
Flye Mode Selection#
The --flye_mode parameter allows you to specify the appropriate Flye mode for your sequencing technology:
# PacBio HiFi reads (default)
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
--flye_mode --pacbio-hifi \
-profile docker
# PacBio raw reads
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
--flye_mode --pacbio-raw \
-profile docker
# Nanopore raw reads
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
--flye_mode --nano-raw \
-profile docker
# Nanopore high-quality reads
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
--flye_mode --nano-hq \
-profile docker
With RNA-seq Data#
When RNA-seq transcriptome data is provided in the samplesheet, the pipeline will:
- Process RNA reads through quality control (FastQC + TrimGalore)
- Apply Kraken2 taxonomic filtering (if enabled)
- Map RNA reads to assembled contigs using BWA-MEM
- Calculate RNA-based abundance for expression analysis
- Generate separate DNA and RNA abundance plots
Example with RNA-seq:
nextflow run main.nf \
--input samplesheet_with_rna.csv \
--outdir results \
--type longreads \
--flye_mode --pacbio-hifi \
-profile docker
Skipping Steps#
You can skip certain steps if needed:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type longreads \
--skip_kraken_extraction \
-profile docker
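When several options are combined, it can help to assemble the command in a small wrapper and print it for review before running anything. The sketch below is only an illustration: `longreads_cmd` is a hypothetical helper, not part of the pipeline, and it uses only the options shown on this page.

```shell
# Hypothetical helper (not part of dbcan-nf): print a long reads command
# so it can be reviewed before it is run.
# Usage: longreads_cmd <samplesheet> <outdir> <flye_mode> [skip_kraken: yes|no]
longreads_cmd() {
    samplesheet=$1
    outdir=$2
    flye_mode=$3
    skip_kraken=${4:-no}

    cmd="nextflow run main.nf --input $samplesheet --outdir $outdir --type longreads --flye_mode $flye_mode -profile docker"
    # Append the skip flag only when explicitly requested
    if [ "$skip_kraken" = "yes" ]; then
        cmd="$cmd --skip_kraken_extraction"
    fi
    printf '%s\n' "$cmd"
}

# Print (do not execute) a command that skips Kraken2 extraction
longreads_cmd samplesheet.csv results --pacbio-hifi yes
```

The helper only prints the command; copy the printed line into the terminal once it looks right.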
Flye Mode Options#
The --flye_mode parameter accepts the following Flye assembly modes:
| Mode | Description |
|---|---|
| --pacbio-hifi | PacBio HiFi (high-fidelity) reads. Default mode. Best for high-quality PacBio data. |
| --pacbio-raw | PacBio raw (CLR) reads. Use for standard PacBio sequencing data. |
| --nano-raw | Nanopore raw reads. Use for standard Nanopore sequencing data. |
| --nano-hq | Nanopore high-quality reads. Use for Q20+ or similar high-quality Nanopore data. |
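A mistyped mode is easier to catch before the pipeline starts than after assembly fails. As a minimal sketch, the check below accepts only the four mode values documented on this page; `is_valid_flye_mode` is a hypothetical helper name, and the pipeline may perform its own validation.

```shell
# Hypothetical pre-flight check: accept only the four documented mode values.
is_valid_flye_mode() {
    case "$1" in
        --pacbio-hifi|--pacbio-raw|--nano-raw|--nano-hq) return 0 ;;
        *) return 1 ;;
    esac
}

mode="--nano-hq"
if is_valid_flye_mode "$mode"; then
    echo "ok: $mode"
else
    echo "unsupported --flye_mode value: $mode" >&2
fi
```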
Output Files#
The pipeline generates output files organized in the following directory structure:
Assembly Results#
flye/
- *_assembly.fasta.gz: Assembled contigs in FASTA format (gzipped)
- Assembly statistics and reports
Gene Prediction#
pyrodigal/
- *.faa.gz: Predicted protein sequences (gzipped)
- *.gff.gz: Gene annotations in GFF format (gzipped)
CAZyme Annotation#
rundbcan/
- *_dbcan/: Directory containing all run_dbCAN results. Files in this directory:
  - overview.tsv: CAZyme annotation overview
  - dbCAN_hmm_results.tsv: HMM-based CAZyme predictions
  - dbCANsub_hmm_results.tsv: Subfamily predictions
  - diamond.out: DIAMOND search results
  - *_cgc.gff: CGC annotations
  - *_cgc_standard_out.tsv: CGC standard output
  - *_substrate_prediction.tsv: Substrate predictions
  - *_synteny_pdf/: Synteny plots for CGCs
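For a quick look at the annotation results, the overview table can be summarised from the command line. The sketch below assumes that overview.tsv is tab-separated with a header row and that it contains a numeric column whose name you pass in; the column name `#ofTools` and the sample name `sample1` in the usage line are assumptions to check against your own files, and `count_min_value` is a hypothetical helper.

```shell
# count_min_value FILE COLUMN_NAME MIN
# Counts data rows whose COLUMN_NAME value is >= MIN in a tab-separated
# file with a header row. The column is looked up by name in the header.
count_min_value() {
    awk -F '\t' -v col="$2" -v min="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
            next
        }
        $idx + 0 >= min + 0 { n++ }
        END { if (idx) print n + 0 }
    ' "$1"
}

# Usage (paths and column name are assumptions):
# count_min_value results/rundbcan/sample1_dbcan/overview.tsv '#ofTools' 2
```

Looking the column up by name rather than by position keeps the sketch usable if the column order differs between run_dbCAN versions.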
Read Mapping#
minimap2/: Minimap2 alignment files for long DNA reads
- *.bam: Aligned reads
- *.bam.bai: BAM index files
bwa_index_mem/: BWA-MEM alignment files for RNA reads (if provided)
- *.bam: Aligned reads
- *.bam.bai: BAM index files
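Because each BAM file is expected to come with a .bam.bai index, a missing index is a cheap signal that a mapping step did not finish. The following sketch lists BAM files without an index; `list_unindexed_bams` is a hypothetical helper, and the directory in the usage line assumes the output layout described here.

```shell
# list_unindexed_bams DIR
# Prints every *.bam under DIR that has no matching *.bam.bai next to it.
list_unindexed_bams() {
    find "$1" -type f -name '*.bam' | while IFS= read -r bam; do
        [ -f "$bam.bai" ] || printf '%s\n' "$bam"
    done
}

# Usage (directory name assumed from the layout on this page):
# list_unindexed_bams results/minimap2
```

No output means every BAM file found has an index beside it.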
Coverage and Abundance#
dbcan_utils_cal_coverage/
- *_depth.txt: Gene coverage depth files
dbcan_utils_cal_abund/
- *_abund/: Abundance calculation results. Files in this directory:
  - *_abund.txt: Gene abundance values
  - *_cgc_abund.txt: CGC abundance values
dbcan_plot/
- *_pdf/: Visualization plots. Files in this directory:
  - heatmap.pdf: Abundance heatmap
  - ec.pdf: EC number distribution
  - family.pdf: CAZyme family distribution
  - subfamily.pdf: Subfamily distribution
cgc_depth_plot/
- *_cgc_depth.tsv: CGC depth coverage data
- *_cgc_depth.pdf: CGC depth plots
Quality Control#
fastqc/: FastQC reports (for RNA-seq data)
trimgalore/: TrimGalore reports (for RNA-seq data)
multiqc/: MultiQC aggregated report
Pipeline Information#
pipeline_info/: Execution reports, parameters, and software versions
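After a run, a quick check that the expected directories exist can show which step stopped early. This is a minimal sketch that assumes the directories named in this section sit directly under the --outdir path; the RNA-only directories (bwa_index_mem/, fastqc/, trimgalore/) are left out, and `missing_outputs` is a hypothetical helper.

```shell
# missing_outputs OUTDIR
# Prints each expected subdirectory that is absent from OUTDIR.
# Assumes the directories named on this page sit directly under --outdir.
missing_outputs() {
    for d in flye pyrodigal rundbcan minimap2 \
             dbcan_utils_cal_coverage dbcan_utils_cal_abund \
             dbcan_plot cgc_depth_plot multiqc pipeline_info; do
        [ -d "$1/$d" ] || printf '%s\n' "$d"
    done
}

# Usage:
# missing_outputs results
```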
Key Features#
Long-Read Optimized: Uses Flye assembler optimized for PacBio and Nanopore reads
Flexible Flye Modes: Supports multiple Flye modes for different sequencing technologies
Dual Mapping: Minimap2 for long DNA reads, BWA-MEM for RNA reads
Complete Gene Clusters: Longer reads enable better assembly of complete CGCs
RNA-seq Integration: Optional RNA-seq support for expression analysis
Advantages of Long Reads#
Better Contiguity: Reads that span repeats and other hard-to-assemble regions yield longer contigs
Complete Genes: More complete gene predictions, especially for large genes
Reduced Fragmentation: Fewer fragmented gene clusters
Repeat Resolution: Better resolution of repetitive regions
Structural Variants: Improved detection of structural variations
Considerations#
Computational Requirements: Long-read assembly typically requires more memory and time
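Error Rates: Raw PacBio (CLR) and raw Nanopore reads have higher per-base error rates than Illumina short reads; PacBio HiFi and Nanopore Q20+ reads are considerably more accurate
Coverage: Assembly generally needs less coverage than with short reads (see Best Practices)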
Cost: Long-read sequencing may be more expensive per base
Best Practices#
Choose Correct Mode: Select the appropriate --flye_mode for your sequencing technology
Quality Assessment: Review FastQC reports for RNA-seq data
Coverage Planning: Long reads require less coverage than short reads (typically 20-30x)
Resource Allocation: Ensure sufficient memory for Flye assembly
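For coverage planning, a rough estimate is total sequenced bases divided by the size of the genome of interest. The sketch below computes both values for an uncompressed FASTQ file with four lines per record; the helper names and the 5 Mb genome size are assumptions for illustration, and in a metagenome the coverage of any single genome also depends on its relative abundance.

```shell
# fastq_total_bases FILE: sum of read lengths in an uncompressed,
# 4-line-per-record FASTQ file (decompress .gz files first).
fastq_total_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# estimated_coverage TOTAL_BASES GENOME_SIZE: integer fold-coverage.
estimated_coverage() {
    echo $(( $1 / $2 ))
}

# Usage (file name and genome size are assumptions):
# bases=$(fastq_total_bases reads.fastq)
# estimated_coverage "$bases" 5000000
```

With 150,000,000 sequenced bases and a 5,000,000 bp genome, the estimate is 30x, the upper end of the range given above.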
When to Use Long Reads Mode#
Recommended:
- PacBio or Nanopore sequencing data
- When complete gene clusters are important
- When dealing with complex or repetitive regions
- When structural variation detection is needed
Not Recommended:
- Illumina short-read data (use --type shortreads instead)
- When computational resources are very limited
- When only basic CAZyme annotation is needed
Example Results#
For example visualizations from long reads mode, see Nextflow Pipeline: Results Examples.
See Also#
Short Reads Analysis Mode - Short reads mode documentation
Assembly-Free Analysis Mode - Assembly-free mode documentation
Nextflow Pipeline: Parameters Reference - Complete parameter reference
Nextflow Pipeline: Results Examples - Example results and visualizations
Flye Documentation - Flye assembler documentation