Short Reads: Co-assembly Mode#
The co-assembly mode combines reads from all samples and performs a single joint assembly, improving contig continuity and enhancing detection of shared genomic features across samples.
Overview#
Co-assembly is particularly useful when:
You want to improve assembly quality by combining sequencing depth from multiple samples
You’re analyzing samples from similar environments or conditions
You want to detect shared CAZyme gene clusters across samples
You need longer, more complete contigs for better gene prediction
In co-assembly mode, all reads from all samples are combined (preserving paired-end or single-end structure) and assembled together using MEGAHIT. The resulting assembly is then used for all downstream analysis, but read mapping and abundance calculation are performed per-sample.
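The read-pooling step described above can be pictured with a small shell sketch. It assumes a paired-end samplesheet with the column order sample,fastq_1,fastq_2, and it relies on MEGAHIT accepting comma-separated file lists for -1 and -2. The function names and the coassembly_out directory are hypothetical, and the pipeline's actual MEGAHIT invocation may differ; the command is only printed, never run.

```shell
#!/usr/bin/env bash
# Sketch only: pool reads from every sample into one MEGAHIT call.
# Function names and the output directory are illustrative assumptions.

# join_column FILE COLUMN -> comma-separated values of that CSV column,
# skipping the header row and empty cells.
join_column() {
    awk -F',' -v col="$2" '
        NR > 1 && $col != "" { printf "%s%s", sep, $col; sep = "," }
        END { print "" }
    ' "$1"
}

# print_megahit_cmd FILE -> prints (does not run) a paired-end MEGAHIT command
# that lists every R1 file after -1 and every R2 file after -2.
print_megahit_cmd() {
    local r1 r2
    r1=$(join_column "$1" 2)
    r2=$(join_column "$1" 3)
    echo "megahit -1 ${r1} -2 ${r2} -o coassembly_out"
}
```

Calling print_megahit_cmd samplesheet.csv shows which files would be assembled together, which can help confirm that the R1 and R2 lists stay in the same sample order.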
Activation#
Note
Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.
To enable co-assembly mode, use the --coassembly flag along with --type shortreads (assuming you are in the dbcan-nf directory):
# Optional: --skip_kraken_extraction skips the Kraken2 extraction step; consider it when the Kraken2 database is too large.
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--coassembly \
--skip_kraken_extraction \
-profile docker
Requirements#
Minimum Samples: At least 2 samples are required. The pipeline will produce an error if fewer samples are provided.
Compatible Data: All samples should be from similar sequencing runs or conditions for best results
Mutually Exclusive: Cannot be used together with --subsample
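A quick preflight check can catch the minimum-sample requirement before the pipeline is launched. The sketch below is not part of the pipeline: count_samples and check_coassembly_input are hypothetical helper names, and it assumes the sample name is the first CSV column and that rows sharing a name belong to the same sample.

```shell
#!/usr/bin/env bash
# Sketch only: verify that a samplesheet holds enough samples for co-assembly.

# count_samples FILE -> number of distinct, non-empty sample names
# (header row excluded).
count_samples() {
    awk -F',' 'NR > 1 && $1 != "" && !seen[$1]++ { n++ } END { print n + 0 }' "$1"
}

# check_coassembly_input FILE -> returns 0 when at least 2 samples are listed,
# otherwise prints a message to stderr and returns 1.
check_coassembly_input() {
    local n
    n=$(count_samples "$1")
    if [ "$n" -lt 2 ]; then
        echo "co-assembly needs at least 2 samples, found ${n}" >&2
        return 1
    fi
    echo "ok: ${n} samples"
}
```

One possible use is to chain it in front of the run, for example check_coassembly_input samplesheet.csv && nextflow run main.nf with the options shown earlier.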
Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| --coassembly | boolean | | Enable co-assembly mode. Must be used with --type shortreads |
Usage Examples#
Basic Co-assembly#
Co-assemble all samples in the samplesheet:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--coassembly \
-profile docker
Co-assembly with Multiple Samples#
The samplesheet should contain at least 2 samples:
sample,fastq_1,fastq_2
CONTROL_REP1,control1_R1.fastq.gz,control1_R2.fastq.gz
CONTROL_REP2,control2_R1.fastq.gz,control2_R2.fastq.gz
TREATMENT_REP1,treatment1_R1.fastq.gz,treatment1_R2.fastq.gz
TREATMENT_REP2,treatment2_R1.fastq.gz,treatment2_R2.fastq.gz
Co-assembly with Other Options#
Combine co-assembly with other parameters:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--coassembly \
--skip_kraken_extraction \
-profile docker
Workflow Behavior#
Assembly Phase#
Read Combination: All reads from all samples are combined into a single set
Single Assembly: One MEGAHIT assembly is performed on the combined reads
Assembly Naming: The co-assembly is named coassembly in intermediate files
Annotation Phase#
Single Annotation: CAZyme annotation is performed once on the co-assembly
Result Replication: Annotation results are replicated to each original sample for downstream processing
Read Mapping Phase#
Per-Sample Mapping: Each sample’s reads are mapped back to the co-assembly contigs
Sample-Specific Coverage: Coverage and abundance are calculated per-sample
Preserved Sample Identity: All output files maintain original sample names
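The per-sample mapping idea can be sketched as a dry run that prints one mapping command per sample against the shared contigs. This is an illustration under assumptions, not the pipeline's actual process: it assumes paired-end rows in sample,fastq_1,fastq_2 order, uses standard bwa and samtools subcommands, borrows the _dna.bam suffix from the example output structure, and print_mapping_cmds is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) per-sample mapping commands that all use
# the same co-assembly contigs as the reference.

# print_mapping_cmds SAMPLESHEET CONTIGS
print_mapping_cmds() {
    local sheet="$1" contigs="$2" sample r1 r2
    # The reference is indexed once and shared by every sample.
    echo "bwa index ${contigs}"
    # Skip the header; the "|| [ -n ... ]" keeps a last row without a newline.
    tail -n +2 "$sheet" | while IFS=',' read -r sample r1 r2 || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue
        echo "bwa mem ${contigs} ${r1} ${r2} | samtools sort -o ${sample}_dna.bam -"
        echo "samtools index ${sample}_dna.bam"
    done
}
```

Because every sample maps to the same reference, the resulting BAM files share contig names, which is what makes the per-sample coverage values line up.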
Output Files#
Output Structure#
The co-assembly mode produces output files with a specific structure:
Co-assembly Files: Assembly and annotation files use coassembly as the sample name
Sample-Specific Files: Read mapping, coverage, and abundance files use original sample names
Replicated Annotations: CAZyme annotation results are available for each sample
Example Output Structure#
results/
├── megahit/
│ └── coassembly_contigs.fa.gz # Single co-assembly
├── pyrodigal/
│ ├── coassembly.faa.gz # Genes from co-assembly
│ └── coassembly.gff.gz
├── rundbcan/
│ └── coassembly_dbcan/ # Annotation from co-assembly
├── bwa_index_mem/
│ ├── sample1_dna.bam # Per-sample mapping
│ ├── sample1_dna.bam.bai
│ ├── sample2_dna.bam
│ └── sample2_dna.bam.bai
├── dbcan_utils_cal_abund/
│ ├── sample1_dna_abund/ # Per-sample abundance
│ └── sample2_dna_abund/
└── ...
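As a rough way to confirm that a finished run matches the layout above, the following sketch checks for the shared coassembly files and for the per-sample files of each named sample. The paths are copied from the example tree and may differ between pipeline versions; check_coassembly_outputs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: report expected co-assembly outputs that are missing.

# check_coassembly_outputs OUTDIR SAMPLE... -> prints one "missing: <path>"
# line per absent file or directory; returns 1 if anything is missing.
check_coassembly_outputs() {
    local outdir="$1" missing=0 path sample
    shift
    # Shared files are named after the co-assembly, not after a sample.
    for path in megahit/coassembly_contigs.fa.gz \
                pyrodigal/coassembly.faa.gz \
                pyrodigal/coassembly.gff.gz; do
        [ -e "${outdir}/${path}" ] || { echo "missing: ${path}"; missing=1; }
    done
    # Mapping and abundance outputs keep the original sample names.
    for sample in "$@"; do
        for path in "bwa_index_mem/${sample}_dna.bam" \
                    "dbcan_utils_cal_abund/${sample}_dna_abund"; do
            [ -e "${outdir}/${path}" ] || { echo "missing: ${path}"; missing=1; }
        done
    done
    return "$missing"
}
```

For instance, check_coassembly_outputs results sample1 sample2 would list anything absent for those two samples.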
Key Differences from Standard Mode#
Single Assembly: One assembly instead of per-sample assemblies
Shared Contigs: All samples share the same contig set
Per-Sample Abundance: Abundance is still calculated per-sample
No RNA-seq: RNA-seq processing is automatically disabled
Advantages#
Improved Contiguity: Longer, more complete contigs due to increased sequencing depth
Better Gene Detection: Shared genes are more likely to be detected and fully assembled
Reduced Fragmentation: Fewer fragmented genes and gene clusters
Computational Efficiency: Single assembly is more efficient than multiple per-sample assemblies
Considerations#
Sample Compatibility: Best results when samples are from similar environments
Heterogeneity: Highly diverse samples may produce less optimal co-assemblies
Abundance Comparison: Abundance values are comparable across samples as they use the same reference
Memory Requirements: Co-assembly may require more memory than per-sample assembly
Best Practices#
Sample Selection: Use samples from similar conditions or environments
Quality Control: Ensure all samples have similar quality before co-assembly
Sample Size: 2-10 samples typically work well; very large numbers may be computationally intensive
Validation: Compare co-assembly results with per-sample assemblies to assess improvement
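For the validation step, one simple comparison is contig count, total length, and N50 of the co-assembly against each per-sample assembly. The sketch below computes these from an uncompressed FASTA file with awk and sort; assembly_stats is a hypothetical helper, and N50 is only one of several possible contiguity measures.

```shell
#!/usr/bin/env bash
# Sketch only: basic contiguity statistics for a FASTA assembly.

# assembly_stats FASTA -> prints "contigs=<n> total=<bp> n50=<bp>".
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # header: emit previous contig length
        { len += length($0) }                           # sequence line: accumulate length
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # N50: length of the contig at which the running sum of
            # lengths (longest first) reaches half of the total.
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%d n50=%d\n", NR, total, n50
        }
    '
}
```

Compressed contigs need to be decompressed first, for example with gzip -dc into a temporary file. A higher N50 for the co-assembly than for the per-sample assemblies would be consistent with improved contiguity, though it does not by itself show that the contigs are correct.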
When to Use Co-assembly#
Recommended:
Samples from similar environments or conditions
When improved contig continuity is important
When detecting shared features across samples
When computational resources allow a single large assembly
Not Recommended:
Highly diverse or unrelated samples
When sample-specific assemblies are required
When RNA-seq analysis is needed (automatically disabled)
Single sample analysis (requires at least 2 samples)
Example Results#
For example visualizations from co-assembly mode, see the co-assembly results section.
See Also#
Short Reads Analysis Mode - Main short reads mode documentation
Short Reads: Subsampling Mode - Subsampling mode (alternative to co-assembly)
Nextflow Pipeline: Parameters Reference - Complete parameter reference
Nextflow Pipeline: Results Examples - Example results and visualizations