pub fn avx_transpose128x128(in_out: &mut [__m256i; 64])
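Transposes a 128x128 bit matrix in place using AVX2 intrinsics. The matrix occupies the 64 `__m256i` values of `in_out` (64 × 256 bits = 128 × 128 bits).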
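A portable scalar transpose is handy as a test oracle for the SIMD version. The sketch below is not the AVX2 algorithm. It assumes a hypothetical row-major layout with one `u128` per row, where bit `j` of `rows[i]` is entry (i, j). The doc does not specify how `in_out` maps onto rows, so converting between this layout and `[__m256i; 64]` is left out.

```rust
/// Hypothetical layout (an assumption, not the crate's type):
/// bit `j` of `rows[i]` holds matrix entry (i, j).
type BitMatrix128 = [u128; 128];

/// Naive O(n^2) reference transpose: entry (i, j) of the input
/// becomes entry (j, i) of the output.
fn transpose128x128_scalar(rows: &BitMatrix128) -> BitMatrix128 {
    let mut out = [0u128; 128];
    for i in 0..128 {
        for j in 0..128 {
            // Extract entry (i, j) and store it as entry (j, i).
            let bit = (rows[i] >> j) & 1;
            out[j] |= bit << i;
        }
    }
    out
}

fn main() {
    let mut m = [0u128; 128];
    m[3] = 1u128 << 100; // set entry (3, 100)
    let t = transpose128x128_scalar(&m);
    assert_eq!(t[100], 1u128 << 3); // entry (100, 3) is now set
    println!("ok");
}
```

Transposing twice must return the original matrix, which makes a convenient property test when comparing against the AVX2 path.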
This function requires AVX2 to be enabled for the target, for example by compiling with `-C target-feature=+avx2`.
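The doc does not say whether "enabled" refers to a compile-time target feature or to runtime CPU support, so the sketch below shows both checks. `cfg!(target_feature = "avx2")` reports whether the crate was compiled with AVX2 enabled; `is_x86_feature_detected!` queries the running CPU. The helper name `avx2_available` is hypothetical.

```rust
/// Hypothetical helper: true if the running CPU supports AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On non-x86 targets AVX2 never exists.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

fn main() {
    // Compile-time: was this build compiled with the avx2 target feature?
    let compiled_with_avx2 = cfg!(target_feature = "avx2");
    // Runtime: does the current CPU support AVX2?
    let cpu_has_avx2 = avx2_available();
    println!("compiled with avx2: {compiled_with_avx2}, cpu has avx2: {cpu_has_avx2}");
}
```

A binary compiled with `+avx2` assumes every CPU it runs on supports AVX2; the runtime check matters only when choosing between an AVX2 path and a fallback at run time.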