Short Reads: Co-assembly Mode#

The co-assembly mode combines reads from all samples and performs a single joint assembly, improving contig continuity and enhancing detection of shared genomic features across samples.

Overview#

Co-assembly is particularly useful when:

  • You want to improve assembly quality by combining sequencing depth from multiple samples

  • You’re analyzing samples from similar environments or conditions

  • You want to detect shared CAZyme gene clusters across samples

  • You need longer, more complete contigs for better gene prediction

In co-assembly mode, all reads from all samples are combined (preserving paired-end or single-end structure) and assembled together using MEGAHIT. The resulting assembly is then used for all downstream analysis, but read mapping and abundance calculation are performed per-sample.

Activation#

Note

Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.

To enable co-assembly mode, use the --coassembly flag along with --type shortreads (assuming you are in the dbcan-nf directory):

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  --skip_kraken_extraction \ # based on the database size of kraken2, you can skip this step if the database is too large.
  -profile docker

Requirements#

  • Minimum Samples: At least 2 samples are required. The pipeline will produce an error if fewer samples are provided.

  • Compatible Data: All samples should be from similar sequencing runs or conditions for best results

  • Mutually Exclusive: Cannot be used together with --subsample

Parameters#

Co-assembly Parameters#

Parameter

Type

Default

Description

--coassembly

boolean

false

Enable co-assembly mode. Must be used with --type shortreads and requires at least 2 samples.

Usage Examples#

Basic Co-assembly#

Co-assemble all samples in the samplesheet:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  -profile docker

Co-assembly with Multiple Samples#

The samplesheet should contain at least 2 samples:

sample,fastq_1,fastq_2
CONTROL_REP1,control1_R1.fastq.gz,control1_R2.fastq.gz
CONTROL_REP2,control2_R1.fastq.gz,control2_R2.fastq.gz
TREATMENT_REP1,treatment1_R1.fastq.gz,treatment1_R2.fastq.gz
TREATMENT_REP2,treatment2_R1.fastq.gz,treatment2_R2.fastq.gz

Co-assembly with Other Options#

Combine co-assembly with other parameters:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --coassembly \
  --skip_kraken_extraction \
  -profile docker

Workflow Behavior#

Assembly Phase#

  1. Read Combination: All reads from all samples are combined into a single set

  2. Single Assembly: One MEGAHIT assembly is performed on the combined reads

  3. Assembly Naming: The co-assembly is named coassembly in intermediate files

Annotation Phase#

  1. Single Annotation: CAZyme annotation is performed once on the co-assembly

  2. Result Replication: Annotation results are replicated to each original sample for downstream processing

Read Mapping Phase#

  1. Per-Sample Mapping: Each sample’s reads are mapped back to the co-assembly contigs

  2. Sample-Specific Coverage: Coverage and abundance are calculated per-sample

  3. Preserved Sample Identity: All output files maintain original sample names

Output Files#

Output Structure#

The co-assembly mode produces output files with a specific structure:

  • Co-assembly Files: Assembly and annotation files use coassembly as the sample name

  • Sample-Specific Files: Read mapping, coverage, and abundance files use original sample names

  • Replicated Annotations: CAZyme annotation results are available for each sample

Example Output Structure#

results/
├── megahit/
│   └── coassembly_contigs.fa.gz          # Single co-assembly
├── pyrodigal/
│   ├── coassembly.faa.gz                 # Genes from co-assembly
│   └── coassembly.gff.gz
├── rundbcan/
│   └── coassembly_dbcan/                 # Annotation from co-assembly
├── bwa_index_mem/
│   ├── sample1_dna.bam                   # Per-sample mapping
│   ├── sample1_dna.bam.bai
│   ├── sample2_dna.bam
│   └── sample2_dna.bam.bai
├── dbcan_utils_cal_abund/
│   ├── sample1_dna_abund/                # Per-sample abundance
│   └── sample2_dna_abund/
└── ...

Key Differences from Standard Mode#

  1. Single Assembly: One assembly instead of per-sample assemblies

  2. Shared Contigs: All samples share the same contig set

  3. Per-Sample Abundance: Abundance is still calculated per-sample

  4. No RNA-seq: RNA-seq processing is automatically disabled

Advantages#

  • Improved Contiguity: Longer, more complete contigs due to increased sequencing depth

  • Better Gene Detection: Shared genes are more likely to be detected and fully assembled

  • Reduced Fragmentation: Fewer fragmented genes and gene clusters

  • Computational Efficiency: Single assembly is more efficient than multiple per-sample assemblies

Considerations#

  • Sample Compatibility: Best results when samples are from similar environments

  • Heterogeneity: Highly diverse samples may produce less optimal co-assemblies

  • Abundance Comparison: Abundance values are comparable across samples as they use the same reference

  • Memory Requirements: Co-assembly may require more memory than per-sample assembly

Best Practices#

  1. Sample Selection: Use samples from similar conditions or environments

  2. Quality Control: Ensure all samples have similar quality before co-assembly

  3. Sample Size: 2-10 samples typically work well; very large numbers may be computationally intensive

  4. Validation: Compare co-assembly results with per-sample assemblies to assess improvement

When to Use Co-assembly#

Recommended: - Samples from similar environments or conditions - When improved contig continuity is important - When detecting shared features across samples - When computational resources allow single large assembly

Not Recommended: - Highly diverse or unrelated samples - When sample-specific assemblies are required - When RNA-seq analysis is needed (automatically disabled) - Single sample analysis (requires at least 2 samples)

Example Results#

For example visualizations from co-assembly mode, see the co-assembly results section.

See Also#