FastDup
Identifies duplicate reads. This tool locates and tags duplicate reads in a coordinate-ordered SAM or BAM file.
It uses the same algorithm as Picard MarkDuplicates and produces identical results. Logging is handled by spdlog, and the default log level is 'info'.
Features
- Fast: with the same number of threads, FastDup is ~8X faster than GATK MarkDuplicatesSpark. FastDup also achieves a ~20X speedup over Picard MarkDuplicates.
- Generates output identical to Picard MarkDuplicates.
- Provides the same detailed metrics data as Picard MarkDuplicates.
- Processes all data in memory, with a low memory footprint even for large input files.
Limitations
- Although FastDup can detect the same sets of duplicates as Picard MarkDuplicates, the two tools may mark different reads as duplicates, because the read sorting algorithm in Picard MarkDuplicates is unstable. Consider a duplicate group with two reads, A and B, that have the same score, where A precedes B in the file. Picard MarkDuplicates may mark A as the duplicate, because B may end up in front of A after sorting. FastDup uses a stable sort algorithm and always marks B as the duplicate.
- In optical duplicate detection, Picard MarkDuplicates uses short (int16_t) as the data type when parsing the tile/region, x coordinate, and y coordinate from a read name. These integers may exceed the range of the short type and overflow. FastDup can fix this bug, but for consistency with Picard MarkDuplicates the bug is kept in the source code. To fix it, change the data type in the PhysicalLocation struct in the read_ends.h file.
- FastDup exploits the properties of coordinate-ordered SAM/BAM files to speed up duplicate detection, so the input must be sorted by coordinate in advance.
Requirements
Install the following tools and required libraries.
# install autoconf (for htslib), cmake, c++17 (gcc >= 8.1 or clang >= 7 should work), zlib, libbz2, liblzma, libcurl, libdeflate (optional)
sudo apt update
sudo apt install autoconf cmake gcc-8 g++-8 zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libdeflate-dev
Install
Download the distribution tarball FastDup.tar.gz or clone the source code from GitHub.
# build htslib
cd FastDup/ext/htslib
autoreconf -i
./configure
make
# build FastDup (continuing from FastDup/ext/htslib)
cd ../..
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
The generated binary fastdup will be in the build/bin folder.
Usage
Get help
./fastdup --help
Mark duplicates on an input BAM file using 8 threads
./fastdup --input in_test.bam --output out_md.bam --metrics stats.txt --num-threads 8