Short Reads: Subsampling Mode
The subsampling mode allows you to downsample sequencing reads before assembly, reducing computational requirements and enabling quick pipeline validation on large datasets.
Overview
Subsampling mode is useful when:
You have very large datasets and want to reduce computational time and memory usage
You want to quickly validate the pipeline workflow before running on full datasets
You need to test different parameters on a subset of data
Computational resources are limited
When subsampling is enabled, the pipeline uses seqtk sample to randomly downsample reads from each input file before assembly. This happens per-sample, so each sample is independently subsampled.
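As a rough illustration of what per-file downsampling with seqtk sample looks like, the sketch below prints the commands for one paired-end sample instead of running them. The file names, the seed value, and the helper name print_subsample_cmd are assumptions made for illustration; the pipeline's actual invocation may differ.

```shell
# Dry-run sketch: print (do not run) seqtk commands for one paired-end sample.
# SEED, READS and all file names are illustrative assumptions, not pipeline values.
SEED=42
READS=20000000

print_subsample_cmd() {
  # $1 = input FASTQ, $2 = gzip-compressed output FASTQ
  printf 'seqtk sample -s %s %s %s | gzip > %s\n' "$SEED" "$1" "$READS" "$2"
}

# Using the same seed for R1 and R2 keeps the mates in sync.
print_subsample_cmd sample1_R1.fastq.gz sample1_subsample_R1.fastq.gz
print_subsample_cmd sample1_R2.fastq.gz sample1_subsample_R2.fastq.gz
```

Piping the printed lines to a shell would run them, provided seqtk is installed.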
Activation
Note
Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.
To enable subsampling mode, use the --subsample flag along with --type shortreads (assuming you are in the dbcan-nf directory):
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--subsample \
-profile docker \
--skip_kraken_extraction # optional: skip this step if the Kraken2 database is too large
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --subsample | boolean | false | Enable subsampling mode. Must be used with --type shortreads. |
| --subsample_size | integer | 20000000 | Number of reads per file to retain. Applied to each FASTQ file independently. |
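The same parameters can also be kept in a file and passed with Nextflow's -params-file option, which can be easier to track than a long command line. A minimal sketch, assuming the file name params.yaml and that the pipeline reads these keys exactly as it reads the corresponding command-line flags:

```shell
# Write a params file (the name params.yaml is an arbitrary choice).
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
type: shortreads
subsample: true
subsample_size: 5000000
EOF

# Print the matching launch command; run it from the dbcan-nf directory.
echo "nextflow run main.nf -params-file params.yaml -profile docker"
```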
Usage Examples
Basic Subsampling
Subsample to 20 million reads per file (default):
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--subsample \
-profile docker \
--skip_kraken_extraction # optional: skip this step if the Kraken2 database is too large
Custom Subsampling Size
Subsample to 5 million reads per file:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--subsample \
--subsample_size 5000000 \
-profile docker
Subsampling with Other Options
Combine subsampling with other parameters:
nextflow run main.nf \
--input samplesheet.csv \
--outdir results \
--type shortreads \
--subsample \
--subsample_size 10000000 \
--skip_kraken_extraction \
-profile docker
Behavior
How Subsampling Works
Per-File Subsampling: Each FASTQ file (R1 and R2 for paired-end) is independently subsampled
Random Sampling: Uses seqtk sample with a random seed for reproducible results
Preserves Pairing: For paired-end data, both R1 and R2 files are subsampled to maintain pairing
Sample Naming: Subsampled samples are renamed with the _subsample suffix (e.g., sample1_dna_subsample)
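One way to confirm that a subsampled pair is still consistent is to compare read counts and read IDs between R1 and R2. The helpers below are a sketch, not part of the pipeline; the function names are hypothetical, and they assume gzip-compressed FASTQ with four lines per record and optional /1 and /2 suffixes on read names.

```shell
# Count reads in a gzip-compressed FASTQ file (4 lines per record).
count_reads() {
  gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Print read IDs in file order, with any trailing /1 or /2 removed.
read_ids() {
  gzip -cd "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Succeed only if R1 and R2 list the same read IDs in the same order.
check_pairing() {
  ids1=$(mktemp)
  ids2=$(mktemp)
  read_ids "$1" > "$ids1"
  read_ids "$2" > "$ids2"
  status=0
  cmp -s "$ids1" "$ids2" || status=$?
  rm -f "$ids1" "$ids2"
  return "$status"
}

# Example (file names are hypothetical):
#   count_reads sample1_subsample_R1.fastq.gz
#   check_pairing sample1_subsample_R1.fastq.gz sample1_subsample_R2.fastq.gz && echo "pairing intact"
```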
Limitations
Mutually Exclusive: Cannot be used together with --coassembly
No RNA-seq: RNA-seq processing is automatically disabled when subsampling is enabled
Reduced Coverage: Subsampling reduces sequencing depth, which may affect assembly quality and gene detection
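The mutual-exclusion rule above can be caught before launching with a small wrapper check. This is a sketch under the assumption that the flags are passed literally as --subsample and --coassembly; check_flags is a hypothetical helper, not something the pipeline provides.

```shell
# Pre-flight check: refuse the flag combination that the limitations rule out.
# check_flags is a hypothetical helper, not part of the pipeline.
check_flags() {
  has_subsample=0
  has_coassembly=0
  for arg in "$@"; do
    case "$arg" in
      --subsample)  has_subsample=1 ;;
      --coassembly) has_coassembly=1 ;;
    esac
  done
  if [ "$has_subsample" -eq 1 ] && [ "$has_coassembly" -eq 1 ]; then
    echo "error: --subsample cannot be combined with --coassembly" >&2
    return 1
  fi
  return 0
}

# Example: only launch when the flags are compatible.
#   check_flags "$@" && nextflow run main.nf "$@"
```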
Output Files
Output files follow the same structure as standard short reads mode, but with modified sample names:
Sample IDs have the _subsample suffix appended
All downstream files (assembly, annotation, abundance) use the subsampled sample names
Output directory structure remains the same as Short Reads Analysis Mode
Example Output Structure
results/
├── megahit/
│ └── sample1_dna_subsample_contigs.fa.gz
├── pyrodigal/
│ ├── sample1_dna_subsample.faa.gz
│ └── sample1_dna_subsample.gff.gz
├── rundbcan/
│ └── sample1_dna_subsample_dbcan/
└── ...
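After a run, the naming scheme in the tree above can be used to check that the expected files exist. The sketch below only covers the four paths shown in the example; expected_outputs is a hypothetical helper, and real runs produce additional files.

```shell
# List the output paths expected for one sample, following the example tree.
# expected_outputs is a hypothetical helper; $1 = original sample ID, $2 = outdir.
expected_outputs() {
  id="${1}_subsample"
  out="${2:-results}"
  printf '%s\n' \
    "$out/megahit/${id}_contigs.fa.gz" \
    "$out/pyrodigal/${id}.faa.gz" \
    "$out/pyrodigal/${id}.gff.gz" \
    "$out/rundbcan/${id}_dbcan"
}

# Report any expected path that is missing.
expected_outputs sample1_dna results | while IFS= read -r path; do
  [ -e "$path" ] || echo "missing: $path"
done
```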
Best Practices
Start Small: Begin with a small subsample size (e.g., 5-10 million reads) to validate the pipeline
Scale Up: Gradually increase subsample size to find the optimal balance between quality and resources
Compare Results: Compare subsampled results with full dataset to assess impact on downstream analysis
Resource Planning: Use subsampling to estimate resource requirements for full dataset analysis
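The start-small-then-scale-up approach can be scripted as a series of runs with increasing subsample sizes. The sketch below prints one launch command per size instead of running it; the three sizes and the results_<size> output naming are illustrative assumptions.

```shell
# Dry-run sketch: print one launch command per subsample size.
# The sizes and the results_<size> directory names are illustrative assumptions.
print_scaleup_cmds() {
  for size in 5000000 10000000 20000000; do
    printf 'nextflow run main.nf --input samplesheet.csv --outdir results_%s --type shortreads --subsample --subsample_size %s -profile docker\n' "$size" "$size"
  done
}

print_scaleup_cmds
```

Writing each run to its own output directory keeps the results separate, which makes it easier to compare them against each other and against the full dataset.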
When to Use Subsampling
Recommended:
- Initial pipeline validation
- Parameter optimization
- Resource-limited environments
- Very large datasets (>100M reads per sample)
Not Recommended:
- Final production analysis (use full dataset)
- Low-coverage samples (subsampling may further reduce coverage)
- When maximum sensitivity is required
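To judge how much depth a given subsample size gives up, it can help to compute the fraction of reads retained. The helper below is a sketch with a hypothetical name; capping the result at 100% assumes that a file with fewer reads than the requested size is kept whole.

```shell
# retained_percent TOTAL_READS SUBSAMPLE_SIZE -> percentage of reads kept.
# Hypothetical helper; the cap at 100 assumes smaller files are kept whole.
retained_percent() {
  awk -v total="$1" -v n="$2" 'BEGIN {
    if (total <= 0) { print "0.0"; exit }
    p = 100 * n / total
    if (p > 100) p = 100
    printf "%.1f\n", p
  }'
}

retained_percent 150000000 20000000   # prints 13.3
```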
Example Results
For example visualizations from subsampling mode, see the subsampling results section.
See Also
Short Reads Analysis Mode - Main short reads mode documentation
Short Reads: Co-assembly Mode - Co-assembly mode (alternative to subsampling)
Nextflow Pipeline: Parameters Reference - Complete parameter reference
Nextflow Pipeline: Results Examples - Example results and visualizations