Short Reads: Co-assembly Mode

The co-assembly mode combines reads from all samples and performs a single joint assembly, improving contig continuity and enhancing detection of shared genomic features across samples.

Overview#

Co-assembly is particularly useful when:

You want to improve assembly quality by combining sequencing depth from multiple samples
You’re analyzing samples from similar environments or conditions
You want to detect shared CAZyme gene clusters across samples
You need longer, more complete contigs for better gene prediction

In co-assembly mode, all reads from all samples are combined (preserving paired-end or single-end structure) and assembled together using MEGAHIT. The resulting assembly is then used for all downstream analysis, but read mapping and abundance calculation are performed per-sample.
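As a rough illustration of how reads can be pooled while keeping paired-end structure, the sketch below builds MEGAHIT input arguments from a samplesheet. MEGAHIT accepts comma-separated file lists for `-1`, `-2`, and `-r`, so keeping both lists in the same sample order preserves the pairing. The helper name `megahit_args` is hypothetical, the column layout `sample,fastq_1,fastq_2` is taken from the samplesheet example on this page, and the pipeline itself may combine reads differently.

```shell
# Hypothetical helper (not part of the pipeline): print MEGAHIT input
# arguments for every sample in a samplesheet with the columns
# sample,fastq_1,fastq_2. Mixed single-end/paired-end sheets are not handled.
megahit_args() {
  awk -F',' '
    NR == 1 { next }                       # skip the header row
    $2 != "" {
      r1 = (r1 == "" ? $2 : r1 "," $2)     # forward reads, in sample order
      if ($3 != "")
        r2 = (r2 == "" ? $3 : r2 "," $3)   # reverse reads, same order
    }
    END {
      if (r2 != "") printf "-1 %s -2 %s\n", r1, r2
      else          printf "-r %s\n", r1   # single-end input
    }
  ' "$1"
}
```

A call such as `megahit $(megahit_args samplesheet.csv) -o coassembly_out` would then run one assembly over all samples; the output directory name here is only an example.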

Activation#

Note

Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.

To enable co-assembly mode, use the --coassembly flag along with --type shortreads (assuming you are in the dbcan-nf directory):

# --skip_kraken_extraction is optional: add it to skip the Kraken2 extraction step when the Kraken2 database is too large.
nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  --skip_kraken_extraction \
  -profile docker

Requirements#

Minimum Samples: At least 2 samples are required. The pipeline stops with an error if fewer are provided.
Compatible Data: For best results, all samples should come from similar sequencing runs or conditions.
Mutually Exclusive: --coassembly cannot be combined with --subsample.
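The requirements above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function names `check_min_samples` and `check_flag_conflict` are hypothetical, and counting distinct names in the first samplesheet column is an assumption about how samples are identified.

```shell
# Hypothetical pre-flight checks mirroring the listed requirements.

# Fail unless the samplesheet lists at least 2 distinct sample names
# (header row excluded, blank lines ignored).
check_min_samples() {
  n=$(awk -F',' 'NR > 1 && $1 != "" && !seen[$1]++ { count++ }
                 END { print count + 0 }' "$1")
  if [ "$n" -lt 2 ]; then
    echo "co-assembly needs at least 2 samples, found $n" >&2
    return 1
  fi
  echo "$n samples found"
}

# Fail if --coassembly and --subsample both appear in the argument list.
check_flag_conflict() {
  co=0; sub=0
  for arg in "$@"; do
    case "$arg" in
      --coassembly) co=1 ;;
      --subsample)  sub=1 ;;
    esac
  done
  if [ "$co" -eq 1 ] && [ "$sub" -eq 1 ]; then
    echo "--coassembly cannot be combined with --subsample" >&2
    return 1
  fi
}
```

For example, `check_min_samples samplesheet.csv && check_flag_conflict --type shortreads --coassembly` succeeds only when both conditions hold.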

Parameters#

Co-assembly Parameters#
Parameter	Type	Default	Description
`--coassembly`	boolean	`false`	Enable co-assembly mode. Must be used with `--type shortreads` and requires at least 2 samples.

Usage Examples#

Basic Co-assembly#

Co-assemble all samples in the samplesheet:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  -profile docker

Co-assembly with Multiple Samples#

The samplesheet must contain at least 2 samples, for example:

sample,fastq_1,fastq_2
CONTROL_REP1,control1_R1.fastq.gz,control1_R2.fastq.gz
CONTROL_REP2,control2_R1.fastq.gz,control2_R2.fastq.gz
TREATMENT_REP1,treatment1_R1.fastq.gz,treatment1_R2.fastq.gz
TREATMENT_REP2,treatment2_R1.fastq.gz,treatment2_R2.fastq.gz
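A samplesheet like the one above can also be generated from a directory of FASTQ files. The sketch below assumes paired files named `<name>_R1.fastq.gz` and `<name>_R2.fastq.gz` and uses the file prefix as the sample name; both the naming convention and the helper name `make_samplesheet` are assumptions for illustration, so adjust them to your own data.

```shell
# Hypothetical helper: write a samplesheet for every R1/R2 pair found
# in the directory given as the first argument.
make_samplesheet() {
  echo "sample,fastq_1,fastq_2"
  for r1 in "$1"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                 # glob matched nothing
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # expected mate file
    if [ ! -e "$r2" ]; then
      echo "skipping $r1: no matching R2 file" >&2
      continue
    fi
    echo "$(basename "$r1" _R1.fastq.gz),$r1,$r2"
  done
}
```

Running `make_samplesheet reads > samplesheet.csv` writes one row per complete pair and reports unpaired files on standard error.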

Co-assembly with Other Options#

Co-assembly can be combined with other pipeline parameters, for example --skip_kraken_extraction to skip the Kraken2 extraction step:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  --skip_kraken_extraction \
  -profile docker

Workflow Behavior#

Assembly Phase#

Read Combination: All reads from all samples are combined into a single set
Single Assembly: One MEGAHIT assembly is performed on the combined reads
Assembly Naming: The co-assembly is named coassembly in intermediate files

Annotation Phase#

Single Annotation: CAZyme annotation is performed once on the co-assembly
Result Replication: Annotation results are replicated to each original sample for downstream processing

Read Mapping Phase#

Per-Sample Mapping: Each sample’s reads are mapped back to the co-assembly contigs
Sample-Specific Coverage: Coverage and abundance are calculated per-sample
Preserved Sample Identity: All output files maintain original sample names
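To make the per-sample mapping step concrete, the sketch below prints (rather than runs) one plausible set of commands: index the co-assembly once, then map each sample's reads and write a BAM named after that sample. The use of `bwa mem` with `samtools sort` and `samtools index` and the `_dna.bam` suffix are inferred from the output directory names on this page; the exact commands and options used by the pipeline are not documented here, and `mapping_commands` is a hypothetical name.

```shell
# Hypothetical dry-run helper: print the mapping commands for each sample.
# $1 = co-assembly contigs (FASTA), $2 = samplesheet (sample,fastq_1,fastq_2)
mapping_commands() {
  echo "bwa index $1"                        # the shared reference is indexed once
  awk -F',' -v ref="$1" '
    NR > 1 && $1 != "" {
      # one alignment per sample, named after the original sample
      printf "bwa mem %s %s %s | samtools sort -o %s_dna.bam -\n", ref, $2, $3, $1
      printf "samtools index %s_dna.bam\n", $1
    }
  ' "$2"
}
```

Because every sample is aligned against the same contigs, the resulting coverage values refer to one shared set of contig names.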

Output Files#

Output Structure#

The co-assembly mode produces output files with a specific structure:

Co-assembly Files: Assembly and annotation files use coassembly as the sample name
Sample-Specific Files: Read mapping, coverage, and abundance files use original sample names
Replicated Annotations: CAZyme annotation results are available for each sample

Example Output Structure#

results/
├── megahit/
│   └── coassembly_contigs.fa.gz          # Single co-assembly
├── pyrodigal/
│   ├── coassembly.faa.gz                 # Genes from co-assembly
│   └── coassembly.gff.gz
├── rundbcan/
│   └── coassembly_dbcan/                 # Annotation from co-assembly
├── bwa_index_mem/
│   ├── sample1_dna.bam                   # Per-sample mapping
│   ├── sample1_dna.bam.bai
│   ├── sample2_dna.bam
│   └── sample2_dna.bam.bai
├── dbcan_utils_cal_abund/
│   ├── sample1_dna_abund/                # Per-sample abundance
│   └── sample2_dna_abund/
└── ...
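After a run, a quick check can confirm that the shared assembly and the per-sample outputs are all present. This sketch assumes the directory and file names shown in the example tree above; `list_missing_outputs` is a hypothetical helper, and real output names may differ between pipeline versions.

```shell
# Hypothetical helper: list expected outputs that are missing.
# $1 = output directory, $2 = samplesheet (sample,fastq_1,fastq_2)
list_missing_outputs() {
  # one shared assembly for all samples
  [ -e "$1/megahit/coassembly_contigs.fa.gz" ] || echo "megahit/coassembly_contigs.fa.gz"
  # per-sample mapping and abundance outputs, one set per distinct sample name
  awk -F',' 'NR > 1 && $1 != "" && !seen[$1]++ { print $1 }' "$2" |
  while IFS= read -r s; do
    [ -e "$1/bwa_index_mem/${s}_dna.bam" ] || echo "bwa_index_mem/${s}_dna.bam"
    [ -d "$1/dbcan_utils_cal_abund/${s}_dna_abund" ] || echo "dbcan_utils_cal_abund/${s}_dna_abund/"
  done
}
```

An empty result from `list_missing_outputs results samplesheet.csv` means every checked file was found.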

Key Differences from Standard Mode#

Single Assembly: One assembly instead of per-sample assemblies
Shared Contigs: All samples share the same contig set
Per-Sample Abundance: Abundance is still calculated per-sample
No RNA-seq: RNA-seq processing is automatically disabled

Advantages#

Improved Contiguity: Longer, more complete contigs due to increased sequencing depth
Better Gene Detection: Shared genes are more likely to be detected and fully assembled
Reduced Fragmentation: Fewer fragmented genes and gene clusters
Computational Efficiency: A single assembly run can be more efficient than running a separate assembly for each sample, although it may need more memory (see Considerations).

Considerations#

Sample Compatibility: Best results when samples are from similar environments
Heterogeneity: Highly diverse samples may produce less optimal co-assemblies
Abundance Comparison: Abundance values are comparable across samples because every sample is mapped to the same co-assembly contigs.
Memory Requirements: Co-assembly may require more memory than per-sample assembly

Best Practices#

Sample Selection: Use samples from similar conditions or environments
Quality Control: Ensure all samples have similar quality before co-assembly
Sample Size: 2-10 samples typically work well; very large numbers may be computationally intensive
Validation: Compare co-assembly results with per-sample assemblies to assess improvement
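One simple way to carry out the validation step is to compare contig N50 between the co-assembly and a per-sample assembly. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The helper below is a minimal sketch using only standard tools; the name `n50` is hypothetical, and N50 is only one of several contiguity metrics.

```shell
# Hypothetical helper: print the N50 of a FASTA file (plain or .gz).
n50() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac |
  awk '
    /^>/ { if (len > 0) print len; len = 0; next }  # header: emit previous contig length
    { len += length($0) }                           # sequence line: extend current contig
    END { if (len > 0) print len }
  ' |
  sort -rn |
  awk '
    { lens[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (run * 2 >= total) { print lens[i]; exit }  # half the total length reached
      }
    }
  '
}
```

For instance, `n50 results/megahit/coassembly_contigs.fa.gz` can be compared with the value for a contig file from a standard-mode run of the same samples; a higher N50 suggests improved contiguity.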

When to Use Co-assembly#

Recommended:

Samples from similar environments or conditions
When improved contig continuity is important
When detecting shared features across samples
When computational resources allow a single large assembly

Not Recommended:

Highly diverse or unrelated samples
When sample-specific assemblies are required
When RNA-seq analysis is needed (automatically disabled)
Single-sample analysis (requires at least 2 samples)

Example Results#

For example visualizations from co-assembly mode, see the co-assembly results section.