Short Reads: Subsampling Mode

Short Reads: Subsampling Mode#

The subsampling mode allows you to downsample sequencing reads before assembly, reducing computational requirements and enabling quick pipeline validation on large datasets.

Overview#

Subsampling mode is useful when:

You have very large datasets and want to reduce computational time and memory usage
You want to quickly validate the pipeline workflow before running on full datasets
You need to test different parameters on a subset of data
Computational resources are limited

When subsampling is enabled, the pipeline uses seqtk sample to randomly downsample reads from each input file before assembly. This happens per-sample, so each sample is independently subsampled.

Activation#

Note

Before running the pipeline, make sure you have cloned the repository. See Nextflow Pipeline: Usage for installation instructions.

To enable subsampling mode, use the --subsample flag along with --type shortreads (assuming you are in the dbcan-nf directory):

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --subsample \
  -profile docker
  --skip_kraken_extraction # based on the database size of kraken2, you can skip this step if the database is too large.

Parameters#

Subsampling Parameters#
Parameter	Type	Default	Description
`--subsample`	boolean	`false`	Enable subsampling mode. Must be used with `--type shortreads`.
`--subsample_size`	integer	`20000000`	Number of reads per file to retain. Applied to each FASTQ file independently.

Usage Examples#

Basic Subsampling#

Subsample to 20 million reads per file (default):

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --subsample \
  -profile docker \
  --skip_kraken_extraction # based on the database size of kraken2, you can skip this step if the database is too large.

Custom Subsampling Size#

Subsample to 5 million reads per file:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --subsample \
  --subsample_size 5000000 \
  -profile docker

Subsampling with Other Options#

Combine subsampling with other parameters:

nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --type shortreads \
  --subsample \
  --subsample_size 10000000 \
  --skip_kraken_extraction \
  -profile docker

Behavior#

How Subsampling Works#

Per-File Subsampling: Each FASTQ file (R1 and R2 for paired-end) is independently subsampled
Random Sampling: Uses seqtk sample with random seed for reproducible results
Preserves Pairing: For paired-end data, both R1 and R2 files are subsampled to maintain pairing
Sample Naming: Subsampled samples are renamed with _subsample suffix (e.g., sample1_dna_subsample)

Limitations#

Mutually Exclusive: Cannot be used together with --coassembly
No RNA-seq: RNA-seq processing is automatically disabled when subsampling is enabled
Reduced Coverage: Subsampling reduces sequencing depth, which may affect assembly quality and gene detection

Output Files#

Output files follow the same structure as standard short reads mode, but with modified sample names:

Sample IDs are appended with _subsample suffix
All downstream files (assembly, annotation, abundance) use the subsampled sample names
Output directory structure remains the same as Short Reads Analysis Mode

Example Output Structure#

results/
├── megahit/
│   └── sample1_dna_subsample_contigs.fa.gz
├── pyrodigal/
│   ├── sample1_dna_subsample.faa.gz
│   └── sample1_dna_subsample.gff.gz
├── rundbcan/
│   └── sample1_dna_subsample_dbcan/
└── ...

Best Practices#

Start Small: Begin with a small subsample size (e.g., 5-10 million reads) to validate the pipeline
Scale Up: Gradually increase subsample size to find the optimal balance between quality and resources
Compare Results: Compare subsampled results with full dataset to assess impact on downstream analysis
Resource Planning: Use subsampling to estimate resource requirements for full dataset analysis

When to Use Subsampling#

Recommended: - Initial pipeline validation - Parameter optimization - Resource-limited environments - Very large datasets (>100M reads per sample)

Not Recommended: - Final production analysis (use full dataset) - Low-coverage samples (subsampling may further reduce coverage) - When maximum sensitivity is required

Example Results#

For example visualizations from subsampling mode, see the subsampling results section.