# FastDup Identifies duplicate reads. This tool locates and tags duplicate reads in a coordinate ordered SAM or BAM file. Use the same algorithm as picard MarkDuplicates and output identical results. Use spdlog as log tool and the default level is 'info'. ### Features * Fast - with the same number of threads `FastDup` is ~3.5X faster than GATK MarkDuplicatesSpark. And `FastDup` achives ~15X performance improvement than Picard MarkDuplicates. * Generate identical outputs compared to Picard MarkDuplicates. * The same detailed metrics data witch Picard MarkDuplicates. * All data processed in memory and low-memory footprint even for large input files. ### Limitations * Although `FastDup` can detecte all the same duplicates as Picard MarkDuplicates. They may mark different reads as duplicates because the reads sort algorithm in Picard MarkDuplicates is unstable. Considering there are 2 reads(A, B and A is in front of B in file) in a duplicate group and they have the same score, Picard Markduplicates may mark A as duplicate because B may be in front of A after sorting. While `FastDup` use stable sort algorithm and always mark B as duplicate. * In optical duplicates detection, Picard Markduplicates use short (int16_t) as data type in parsing tile/region, x coordinate and y coordinate from a read name, which may data overflow as these integers may exceed the range of short type. `FastDup` fixes this bug. But for consistency with Picard Markduplicates, we keep this bug in source codes. Just change the data type in PhysicalLocation struct in read_ends.h file to fix this bug. * `FastDup` use the data characteristics in coordinate ordered SAM/BAM files to improve the performance of detecting duplicates, thus the input should be ordered by coordinate in advance. ## Requirements ### Build tools * autoconf (for htslib) * cmake * c++17 (gcc >= 8.1 or clang >= 7 should work.) ### Libraries needed * zlib * libbz2 * liblzma * libcurl * libdeflate (optional) ## Install Download a distribution tarball `FastDup.tar.gz` or clone the source codes from github. ```bash # build htslib cd FastDup/ext/htslib autoreconf -i ./configure make # build FastDup cd FastDup mkdir build && cd build cmake .. -DCMAKE_BUILD_TYPE=Release make && sudo make install ``` ## Usage 1. get help ```bash fastdup --help ``` 2. mark duplicates on an input BAM file ```bash fastdup --input in.bam --output out.bam --metrics stats.txt --num-threads 8 ```