Many Intel intrinsics have a direct Neon equivalent.
Other cases are more interesting:
* Neon's vmaxvq intrinsics directly select the maximum entry in a vector,
so they can be used to implement both the __max_16/__max_8 macros
and the early loop exit that uses _mm_movemask_epi8 on Intel. Introduce
additional helper macros alongside __max_16/__max_8 so that the early
loop exit can similarly be implemented differently on the two platforms.
* Full-width shifts can be done via vextq. This is defined close to
the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it
implicitly uses one of their local variables.
* ksw_i16() uses saturating *signed* 16-bit operations apart from
_mm_subs_epu16; presumably the data is effectively still signed but
we wish to keep it non-negative. The ARM intrinsics are more careful
about type checking, so this requires an extra U16() helper macro.
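
The first two points above can be sketched as follows. This is a minimal
illustration and not the code from this change: the helper names hmax_u8,
all_zero_u8 and shift_up1_u8 are hypothetical, the SSE2 branch uses the
usual shift-and-max reduction, and a scalar fallback is included only so
the sketch builds on other platforms.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Horizontal maximum of 16 unsigned bytes: one across-vector reduction. */
static int hmax_u8(const uint8_t *p) { return vmaxvq_u8(vld1q_u8(p)); }

/* Early-exit test: every lane is zero exactly when the maximum is zero. */
static int all_zero_u8(const uint8_t *p) { return vmaxvq_u8(vld1q_u8(p)) == 0; }

/* Full-width shift up by one byte lane; lane 0 is filled from `zero`. */
static void shift_up1_u8(uint8_t *dst, const uint8_t *src)
{
    uint8x16_t zero = vdupq_n_u8(0);
    vst1q_u8(dst, vextq_u8(zero, vld1q_u8(src), 15));
}

#elif defined(__SSE2__)
#include <emmintrin.h>

/* Horizontal maximum via repeated byte shifts and pairwise max. */
static int hmax_u8(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return _mm_extract_epi16(v, 0) & 0xff;
}

/* Early-exit test: compare each lane with zero and check all 16 mask bits. */
static int all_zero_u8(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128())) == 0xffff;
}

/* Full-width shift up by one byte lane. */
static void shift_up1_u8(uint8_t *dst, const uint8_t *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_slli_si128(v, 1));
}

#else
/* Scalar fallback with the same semantics as the SIMD branches. */
static int hmax_u8(const uint8_t *p)
{
    int i, m = 0;
    for (i = 0; i < 16; ++i)
        if (p[i] > m) m = p[i];
    return m;
}

static int all_zero_u8(const uint8_t *p) { return hmax_u8(p) == 0; }

static void shift_up1_u8(uint8_t *dst, const uint8_t *src)
{
    int i;
    for (i = 15; i > 0; --i) dst[i] = src[i - 1];
    dst[0] = 0;
}
#endif
```

On AArch64 the same reduction serves both the maximum and the all-zero
test, whereas the SSE2 branch needs two different instruction sequences.
That difference is why a separate helper for the early exit is useful.
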
The previous code implicitly caused a load; change it so the load
intrinsic is explicitly invoked, as the others are. (This in fact
makes no difference to the generated code.)
The bwa makefile doesn't set CPPFLAGS or LDFLAGS itself, but the environment
or make command line might set any of CC/CPPFLAGS/CFLAGS/LDFLAGS/LIBS.
Use $(CPPFLAGS) when compiling and $(LDFLAGS) when linking so they can
be used to customise the build. Remove $(DFLAGS) from link commands as
these preprocessor options are irrelevant for linking.
In particular, this defines the output SAM to be unsorted (SO:unsorted)
but also query grouped (GO:query). The latter is very important to define
explicitly, so that downstream tools that don't make assumptions know that
reads from the same template are grouped.
Clarify that the -5 bwa mem option chooses the alignment that starts
earliest in the read, i.e. in read/sequencing order, not in genomic
coordinate order.
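
To illustrate the distinction, here is a minimal sketch of the selection
rule. The split_aln_t struct and pick_earliest_in_read() are hypothetical
and are not bwa's own types or functions; the sketch assumes qbeg is
already expressed in sequencing order for both strands.

```c
#include <stddef.h>

/* Hypothetical record for one alignment of a split read. */
typedef struct {
    int  qbeg;  /* 0-based start within the read, in sequencing order */
    long rbeg;  /* 0-based start on the reference (genomic coordinate) */
} split_aln_t;

/* Return the index of the alignment that starts earliest in the read,
 * or -1 if there are no alignments. The genomic coordinate is ignored. */
static int pick_earliest_in_read(const split_aln_t *a, size_t n)
{
    size_t i;
    int best = -1;
    for (i = 0; i < n; ++i)
        if (best < 0 || a[i].qbeg < a[best].qbeg)
            best = (int)i;
    return best;
}
```

For three alignments with (qbeg, rbeg) of (60, 1000), (0, 5000) and
(30, 200), this picks the second one, although the third has the smallest
genomic coordinate.
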