Fast duplicate marking; the input is a coordinate-sorted BAM/SAM file

FastDup

Identifies duplicate reads. This tool locates and tags duplicate reads in a coordinate-sorted SAM or BAM file.

FastDup uses the same algorithm as Picard MarkDuplicates and produces identical results, apart from the tie-breaking case described under Limitations. Logging is done with spdlog, and the default log level is 'info'.

Features

Fast: with the same number of threads, FastDup is ~3.5X faster than GATK MarkDuplicatesSpark and ~15X faster than Picard MarkDuplicates.
Generates output identical to that of Picard MarkDuplicates.
Reports the same detailed metrics as Picard MarkDuplicates.
Processes all data in memory while keeping a low memory footprint, even for large input files.

Limitations

FastDup detects the same duplicates as Picard MarkDuplicates, but the two tools may mark different reads within a group as duplicates, because the read sort in Picard MarkDuplicates is unstable. Consider two reads A and B in the same duplicate group with the same score, where A precedes B in the file. Picard MarkDuplicates may mark A as the duplicate, because B can end up ahead of A after sorting. FastDup uses a stable sort and always marks B as the duplicate.
In optical duplicate detection, Picard MarkDuplicates uses short (int16_t) to store the tile/region, x coordinate, and y coordinate parsed from a read name, so these values can overflow when they exceed the range of a short. For consistency with Picard MarkDuplicates, FastDup keeps this behavior in its source code by default. To fix the bug, change the data types in the PhysicalLocation struct in read_ends.h.
FastDup exploits the properties of coordinate-sorted SAM/BAM files to speed up duplicate detection, so the input must be sorted by coordinate in advance.
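
The first two limitations can be illustrated with a small standalone sketch. It is not FastDup's code: the Read struct and both function names are hypothetical, the score is an arbitrary integer, and the overflow helper assumes the 16-bit value wraps the way a two's-complement narrowing would (how Picard actually parses the fields is not checked here).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one read of a duplicate group (not FastDup's type).
struct Read {
    std::string name;
    int score;  // higher is better; ties are the interesting case
};

// Sorts the group by descending score with a stable sort, so reads with
// equal scores keep their file order. The first read is kept as the
// representative; the names of all other reads are returned as duplicates.
std::vector<std::string> duplicatesAfterStableSort(std::vector<Read> group) {
    assert(!group.empty());
    std::stable_sort(group.begin(), group.end(),
                     [](const Read& a, const Read& b) { return a.score > b.score; });
    std::vector<std::string> duplicates;
    for (std::size_t i = 1; i < group.size(); ++i) {
        duplicates.push_back(group[i].name);
    }
    return duplicates;
}

// Keeps only the low 16 bits of a value and reinterprets them as a signed
// 16-bit integer (assumed two's-complement wraparound). Written with
// explicit arithmetic so the result does not depend on the compiler.
std::int16_t wrapToInt16(std::int32_t value) {
    const std::uint32_t low = static_cast<std::uint32_t>(value) & 0xFFFFu;
    const std::int32_t wrapped = low >= 0x8000u
                                     ? static_cast<std::int32_t>(low) - 0x10000
                                     : static_cast<std::int32_t>(low);
    return static_cast<std::int16_t>(wrapped);
}
```

With a group of A and B that have equal scores and appear in that order, duplicatesAfterStableSort returns only B, matching the behavior described above. For the overflow case, wrapToInt16(70000) gives 4464, so under this assumption a read at x = 70000 becomes indistinguishable from one at x = 4464 once stored in a short.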

Requirements

Build tools

autoconf (for htslib)
cmake
C++17 compiler (gcc >= 8.1 or clang >= 7 should work)

Libraries needed

zlib
libbz2
liblzma
libcurl
libdeflate (optional)

Install

Download the distribution tarball FastDup.tar.gz or clone the source code from GitHub.

# build htslib
cd FastDup/ext/htslib
autoreconf -i
./configure
make

# build FastDup (run from the directory that contains FastDup/, not from ext/htslib)
cd FastDup
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make

The generated binary, fastdup, will be in the build/bin folder.

Usage

Get help

fastdup --help

Mark duplicates in an input BAM file using 8 threads

fastdup --input in.bam --output out.bam --metrics stats.txt --num-threads 8
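
Because the input must be coordinate-sorted, it can be useful to check the header before running the command above. The sketch below is a hypothetical helper, not part of FastDup: it takes the text of a SAM @HD header line (for example as printed by samtools view -H) and reports whether it declares SO:coordinate. It checks only what the header claims and does not verify the order of the records themselves.

```cpp
#include <sstream>
#include <string>

// Returns true if a SAM "@HD" header line declares SO:coordinate.
// Fields in a SAM header line are separated by tabs; the line is expected
// without a trailing newline or carriage return.
bool declaresCoordinateOrder(const std::string& headerLine) {
    std::istringstream fields(headerLine);
    std::string field;
    // The first field must be the record type "@HD".
    if (!std::getline(fields, field, '\t') || field != "@HD") {
        return false;
    }
    // Look for the sort-order tag among the remaining fields.
    while (std::getline(fields, field, '\t')) {
        if (field == "SO:coordinate") {
            return true;
        }
    }
    return false;
}
```

A line such as "@HD", "VN:1.6", "SO:coordinate" joined by tabs passes the check, while one with SO:queryname or SO:unsorted does not, and such a file would need to be sorted by coordinate first.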