Introduction

A simple application to split FASTQ files.

The algorithm is a reimplementation of biopet-fastqsplitter. Fastqsplitter reads a fastq file and splits the reads over the designated output files.

This application does NOT work with multiline fastq sequences.

Fastqsplitter uses the excellent xopen library by @marcelm. xopen determines from the file extension whether a file is compressed, and allows for very fast compression and decompression of gzip files.
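The idea of choosing a compression format from the file extension can be illustrated with the Python standard library alone. The helper open_by_extension below is hypothetical; it is not xopen's API or implementation, and it does not reproduce xopen's speed optimisations. It is only a minimal sketch of the extension lookup.

```python
import bz2
import gzip
import lzma
import os
import tempfile


def open_by_extension(path, mode="rt"):
    """Open path with a (de)compressor chosen from the file extension."""
    openers = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
    extension = os.path.splitext(path)[1]
    # Unknown extensions fall back to a plain, uncompressed file.
    opener = openers.get(extension, open)
    return opener(path, mode)


# Round trip through a temporary gzip file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "reads.fq.gz")
    with open_by_extension(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with open_by_extension(path, "rt") as handle:
        first_line = handle.readline()
    # Read the raw bytes to confirm the file really is gzip-compressed.
    with open(path, "rb") as raw:
        magic = raw.read(2)
```

The same call opens reads.fq, reads.fq.bz2 or reads.fq.xz without any change to the calling code, which is why the output extensions alone select the compression algorithm.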

Usage

usage: fastqsplitter [-h] -i INPUT -o OUTPUT [-c COMPRESSION_LEVEL]
                     [-t THREADS_PER_FILE]

Named Arguments

-i, --input The fastq file to be scattered.
-o, --output Scatter over these output files. Multiple -o flags can be used. The extensions determine which compression algorithm will be used. ‘.gz’ for gzip, ‘.bz2’ for bzip2, ‘.xz’ for xz. Other extensions will use no compression.
-c, --compression-level Only applicable when output files have a ‘.gz’ extension. Default: 1

-t, --threads-per-file Set the number of compression threads per output file. NOTE: more threads are only useful when using a compression level > 1. Default: 1

Note

Fastqsplitter uses one process for reading the input file, one for doing the splitting, and one separate process per output file. Fastqsplitter therefore always uses multiple CPU cores.

Example

With an input file input_fastq.gz of 2.3 GB:

fastqsplitter -i input_fastq.gz -o split.1.fq.gz -o split.2.fq.gz -o split.3.fq.gz

Fastqsplitter will read input_fastq.gz. The first 100 reads will go to split.1.fq.gz, reads 101-200 will go to split.2.fq.gz, reads 201-300 will go to split.3.fq.gz, reads 301-400 will go to split.1.fq.gz, etc.

This way the fastq reads are evenly distributed, with a maximum difference of 100 reads between output files, and no positional bias in each output file.
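The round-robin distribution described above can be sketched in Python as follows. This is a minimal illustration, not fastqsplitter's actual implementation: the function name split_fastq, the group_size parameter and the in-memory file objects are assumptions made for the example. The sketch relies on every record being exactly four lines long, which is also why multiline fastq sequences cannot be handled this way.

```python
import io
from itertools import cycle, islice


def split_fastq(input_handle, output_handles, group_size=100):
    """Distribute FASTQ records over output_handles in groups of group_size.

    Assumes each record is exactly four lines (no multiline sequences).
    """
    lines_per_group = 4 * group_size
    for output in cycle(output_handles):
        # Take the next group of records; an empty group means end of input.
        group = list(islice(input_handle, lines_per_group))
        if not group:
            break
        output.writelines(group)


# Hypothetical in-memory demo: 250 records split over 3 outputs.
records = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(250))
outputs = [io.StringIO() for _ in range(3)]
split_fastq(io.StringIO(records), outputs)
counts = [out.getvalue().count("\n") // 4 for out in outputs]
print(counts)  # [100, 100, 50]
```

Because whole groups are handed out in turn, the output files can never differ by more than one group of 100 reads, as the demo with 250 records shows.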

Performance comparisons

This section compares different modes of fastqsplitter with biopet-fastqsplitter. Biopet-fastqsplitter has only one mode: compression level 5, and an unknown number of threads per file.

Fastqsplitter runs with 1 thread per output file and compression level 1 by default. For a fair comparison with biopet-fastqsplitter, fastqsplitter was also run with 4 threads per file (the xopen default) and compression level 5. Since fastqsplitter starts several pigz processes and one gzip process, the memory usage of these processes is included in the results.

This test case was run with a gzipped input fastq file of 2.3 GB. This was split over 5 output files.

The test machine had 32 GB memory (2x16 GB, 2133 MHz), an Intel Core i7-6700 (4 cores, 8 threads) and a Sandisk X400 500 GB SSD.

measurement              fastqsplitter (defaults)  fastqsplitter -t 4 -c 5  biopet-fastqsplitter
real time                0m50.932s                 1m28.153s                1m41.385s
total cpu time           3m7.116s                  7m55.436s                8m20.304s
max mem                  24 MB                     32 MB                    400 MB
max vmem                 110 MB                    1.6 GB                   11.0 GB
output files total size  2290 MB                   2025 MB                  2025 MB

The outcomes of multiple runs were fairly consistent, with a maximum difference of ±3 seconds between runs.
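To make the wall-clock figures easier to compare, the relative speed of each mode can be computed from the reported real times. The snippet below only restates the measurements above; the helper to_seconds is a hypothetical name introduced for this example.

```python
def to_seconds(text):
    """Convert a time string such as '1m28.153s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Real (wall-clock) times copied from the measurements above.
real_times = {
    "fastqsplitter (defaults)": "0m50.932s",
    "fastqsplitter -t 4 -c 5": "1m28.153s",
    "biopet-fastqsplitter": "1m41.385s",
}
baseline = to_seconds(real_times["biopet-fastqsplitter"])
# Speedup of each mode relative to biopet-fastqsplitter.
speedups = {
    name: baseline / to_seconds(value) for name, value in real_times.items()
}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

With these values the default mode finishes in roughly half the wall-clock time of biopet-fastqsplitter, although at a lower compression level and with larger output files.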

Changelog

1.1.0

  • Enable the building of wheels for the project now that Cython extensions are used. Thanks to @marcelm for providing a working build script on https://github.com/marcelm/dnaio.

  • Cythonize the splitting algorithm. This reduces the overhead of the application by up to 50% compared to the fastest native Python implementation. Overhead is all the allocated CPU time that is not system time.

    This means splitting of uncompressed fastqs will be noticeably faster (30% faster was achieved during testing). When splitting compressed fastq files into compressed split fastq files, this change will not make much difference, since all the gzip processes run in separate threads. Still, when splitting a 2.3 GB gzipped fastq file into 3 gzipped split fastq files, the speedup over the fastest Python implementation was 14% in total CPU seconds. (Due to the multithreaded nature of the application, wall clock time was reduced by only 3%.)

1.0.0

  • Added documentation for fastqsplitter and set up readthedocs page.
  • Added tests for fastqsplitter.
  • Upstream contributions to xopen have improved fastqsplitter speed.
  • Initiated fastqsplitter.