Many Intel intrinsics have a direct Neon equivalent.
Other cases are more interesting:
* Neon's vmaxvq directly selects the maximum entry in a vector, so
it can be used to implement both the __max_16/__max_8 macros and the
_mm_movemask_epi8 early loop exit. Introduce additional helper macros
alongside __max_16/__max_8 so that the early loop exit can likewise
be implemented differently on the two platforms.
* Full-width shifts can be done via vextq. The replacement shift macro
is defined close to the ksw_u8()/ksw_i16() functions (rather than in
neon_sse.h) as it implicitly uses one of their local variables.
* ksw_i16() uses saturating *signed* 16-bit operations apart from
_mm_subs_epu16; presumably the data is effectively still signed but
we wish to keep it non-negative. The ARM intrinsics are more careful
about type checking, so this requires an extra U16() helper macro.
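
For illustration only (this is not the code in this commit), the three
cases above can be sketched with one set of macros per platform. The
V_* names, this particular U16() definition and self_check() are
hypothetical, and the shift macro here builds its own zero vector
instead of borrowing a local variable. The Neon branch assumes AArch64,
since vmaxvq is an A64 intrinsic.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef uint8x16_t v_u8;
typedef int16x8_t  v_i16;
#define V_LOAD_U8(p)       vld1q_u8(p)
#define V_LOAD_I16(p)      vld1q_s16(p)
#define V_STORE_U8(p, v)   vst1q_u8((p), (v))
#define V_STORE_I16(p, v)  vst1q_s16((p), (v))
/* Horizontal maximum: one across-lanes instruction. */
#define V_MAX_U8(v)        vmaxvq_u8(v)
/* True when no lane of f exceeds h: the saturating difference is then
   zero in every lane, so its maximum is zero. */
#define V_ALL_LE_U8(f, h)  (vmaxvq_u8(vqsubq_u8((f), (h))) == 0)
/* Shift the whole vector up by one byte, shifting in a zero. */
#define V_SHL1_U8(v)       vextq_u8(vdupq_n_u8(0), (v), 15)
/* Unsigned saturating subtract on signed-typed vectors needs casts. */
#define U16(v)             vreinterpretq_u16_s16(v)
#define V_SUBS_U16(a, b)   vreinterpretq_s16_u16(vqsubq_u16(U16(a), U16(b)))

#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i v_u8;
typedef __m128i v_i16;
#define V_LOAD_U8(p)       _mm_loadu_si128((const __m128i *)(p))
#define V_LOAD_I16(p)      _mm_loadu_si128((const __m128i *)(p))
#define V_STORE_U8(p, v)   _mm_storeu_si128((__m128i *)(p), (v))
#define V_STORE_I16(p, v)  _mm_storeu_si128((__m128i *)(p), (v))
/* Horizontal maximum by repeated shift-and-max. */
static inline uint8_t v_max_u8_sse(__m128i v)
{
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return (uint8_t)(_mm_extract_epi16(v, 0) & 0xff);
}
#define V_MAX_U8(v)        v_max_u8_sse(v)
#define V_ALL_LE_U8(f, h) \
    (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_subs_epu8((f), (h)), \
                                      _mm_setzero_si128())) == 0xffff)
#define V_SHL1_U8(v)       _mm_slli_si128((v), 1)
/* __m128i is untyped, so no cast is needed here. */
#define V_SUBS_U16(a, b)   _mm_subs_epu16((a), (b))

#else
#error "this sketch needs AArch64 Neon or x86 SSE2"
#endif

/* Exercises each macro on small fixed inputs. */
static void self_check(void)
{
    uint8_t a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 3);
        b[i] = (uint8_t)(i * 3 + 1);
    }
    v_u8 va = V_LOAD_U8(a), vb = V_LOAD_U8(b);
    assert(V_MAX_U8(va) == 45);          /* a[15] is the largest lane */
    assert(V_ALL_LE_U8(va, vb));         /* every a[i] <= b[i] */
    assert(!V_ALL_LE_U8(vb, va));
    V_STORE_U8(out, V_SHL1_U8(vb));
    assert(out[0] == 0 && out[1] == b[0] && out[15] == b[14]);

    int16_t x[8] = {5, 0, 7, 100, 3, 3, 32767, 1};
    int16_t y[8] = {7, 0, 2, 1, 3, 4, 32767, 0};
    int16_t z[8];
    V_STORE_I16(z, V_SUBS_U16(V_LOAD_I16(x), V_LOAD_I16(y)));
    /* Differences that would go negative clamp to zero. */
    assert(z[0] == 0 && z[2] == 5 && z[3] == 99 && z[5] == 0 && z[7] == 1);
}
```

Calling self_check() from a test harness should pass on either
platform. Keeping the early-exit test behind its own macro lets the
Neon side avoid emulating _mm_movemask_epi8 altogether.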