Previously, bwa-mem waited for I/O. When the input data came from a slow source
(slow I/O or a pipe from a slow program), bwa-mem could spend a significant
amount of wall-clock time in single-threaded mode. The same could happen when
bwa-mem wrote to a slow target. This commit uses two sequence buffers, which
allows bwa-mem to map one buffer while filling or dumping the other. When
bwa-mem is run on 16 threads using the bwa.kit pipeline, the wall-clock time is
reduced by 30%.
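
The two-buffer scheme can be illustrated with a minimal sketch. This is not the
code in this commit: batch_t, io_t, run_pipeline() and BATCH_SIZE are
hypothetical names, integers stand in for sequences, in-memory arrays stand in
for the slow input and output streams, and doubling a value stands in for
mapping. While the main thread processes one buffer, a helper thread dumps the
results held in the other buffer and refills it with the next batch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BATCH_SIZE 4 /* hypothetical number of records per buffer */

typedef struct {
    int n;                /* records held; 0 means empty or end of input */
    int data[BATCH_SIZE]; /* integers stand in for sequences */
} batch_t;

typedef struct {
    const int *in;        /* stand-in for a slow input stream */
    int n_in, in_pos;
    int *out;             /* stand-in for a slow output stream */
    int out_pos;
    batch_t *buf;         /* buffer currently owned by the I/O thread */
} io_t;

/* Write the processed records of b to the output and mark b as empty. */
static void dump(io_t *io, batch_t *b)
{
    int i;
    for (i = 0; i < b->n; ++i) io->out[io->out_pos++] = b->data[i];
    b->n = 0;
}

/* Read up to BATCH_SIZE records into b; b->n stays 0 at end of input. */
static void fill(io_t *io, batch_t *b)
{
    b->n = 0;
    while (b->n < BATCH_SIZE && io->in_pos < io->n_in)
        b->data[b->n++] = io->in[io->in_pos++];
}

/* I/O thread body: dump the previous results, then load the next batch. */
static void *io_step(void *arg)
{
    io_t *io = (io_t *)arg;
    dump(io, io->buf);
    fill(io, io->buf);
    return NULL;
}

/* Stand-in for the mapping step (assumption: doubles each value). */
static void process(batch_t *b)
{
    int i;
    for (i = 0; i < b->n; ++i) b->data[i] *= 2;
}

/* Process n_in records from in[] into out[]; returns records written. */
int run_pipeline(const int *in, int n_in, int *out)
{
    batch_t buf[2] = { {0, {0}}, {0, {0}} };
    io_t io;
    int cur = 0;
    io.in = in; io.n_in = n_in; io.in_pos = 0;
    io.out = out; io.out_pos = 0;
    io.buf = NULL;
    fill(&io, &buf[0]); /* the first batch has nothing to overlap with */
    while (buf[cur].n > 0) {
        pthread_t tid;
        int rc;
        io.buf = &buf[cur ^ 1]; /* the I/O thread works on the other buffer */
        rc = pthread_create(&tid, NULL, io_step, &io);
        assert(rc == 0);
        (void)rc;
        process(&buf[cur]);      /* overlaps with the dump and fill above */
        pthread_join(tid, NULL);
        cur ^= 1;
    }
    dump(&io, &buf[cur ^ 1]);    /* the last processed batch is still pending */
    return io.out_pos;
}
```

Calling run_pipeline(in, n, out) on an array of n integers writes the doubled
values to out in input order and returns n. The two threads never touch the
same buffer at the same time, so pthread_create() and pthread_join() are the
only synchronization the sketch needs. In bwa-mem the processing step would
itself be multi-threaded; the sketch keeps it single-threaded for brevity.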