This new algorithm uses a Scalar->Vector->Scalar iteration loop, which
requires no masking of incomplete data chunks.
Also, the width was reduced from 64 bytes to 32, as I found this
to be about as fast as the previous 64-byte x86 version.
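
For illustration, here is a minimal sketch of the Scalar->Vector->Scalar
shape, written in portable C rather than Odin. It is not the code from this
change: the byte-search task, the name `find_byte`, and the alignment of the
scalar head are all assumptions, and the inner 32-byte loop only stands in
for a real SIMD compare.

```c
#include <stddef.h>
#include <stdint.h>

enum { WIDTH = 32 }; /* block width in bytes, matching the 32-byte width above */

/* Hypothetical example: index of the first byte equal to c, or -1 if absent. */
ptrdiff_t find_byte(const unsigned char *data, size_t len, unsigned char c) {
    size_t i = 0;

    /* Scalar head: step byte by byte until the address is WIDTH-aligned
       (assumed here) or the data runs out. */
    while (i < len && ((uintptr_t)(data + i) % WIDTH) != 0) {
        if (data[i] == c) return (ptrdiff_t)i;
        i++;
    }

    /* Vector body: only whole WIDTH-byte blocks are visited, so no partial
       block ever has to be masked. The inner loop stands in for one SIMD
       compare over the block. */
    for (; len - i >= WIDTH; i += WIDTH) {
        unsigned hit = 0;
        for (size_t j = 0; j < WIDTH; j++) hit |= (data[i + j] == c);
        if (hit) {
            /* A match is somewhere in this block; locate it. */
            for (size_t j = 0; j < WIDTH; j++)
                if (data[i + j] == c) return (ptrdiff_t)(i + j);
        }
    }

    /* Scalar tail: fewer than WIDTH bytes remain. */
    for (; i < len; i++)
        if (data[i] == c) return (ptrdiff_t)i;

    return -1;
}
```

Because the leftover bytes on both ends go through the scalar loops, the
block loop never reads past the end of the data.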
It leaked, and tracking down how and where required adding
`loc := #caller_location` parameters to a substantial number of
procedures in parts of the core library.
The tests now run fine multi-threaded.
Returns the actual error if one is set, instead of swallowing it in favor
of the less descriptive negative error.
Also fixes an out-of-bounds slice error in `bufio.writer_write` because
it wasn't checking the returned `m`.
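
As a rough illustration of both points, here is a sketch in C rather than
Odin. It is not the actual `bufio.writer_write` code: `write_all`, the sink
callback type, the error codes, and the two demo sinks are hypothetical
names made up for this example. It validates the returned count before
using it as an offset, and reports the sink's own error ahead of the generic
bad-count error.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch. */
enum { ERR_NONE = 0, ERR_BAD_COUNT = 1, ERR_SINK_FAILED = 2 };

/* Hypothetical sink: returns the number of bytes consumed and may set *err.
   A misbehaving sink can return a negative or oversized count. */
typedef ptrdiff_t (*write_fn)(void *ctx, const unsigned char *p, size_t len, int *err);

/* Writes p to the sink, checking every returned count before advancing. */
size_t write_all(write_fn sink, void *ctx, const unsigned char *p, size_t len, int *err) {
    size_t n = 0;
    *err = ERR_NONE;
    while (n < len) {
        int werr = ERR_NONE;
        ptrdiff_t m = sink(ctx, p + n, len - n, &werr);

        /* Only use m as an offset once it is known to be in range. */
        bool in_range = m >= 0 && (size_t)m <= len - n;
        if (in_range) n += (size_t)m;

        /* Prefer the sink's own error over the generic bad-count error. */
        if (werr != ERR_NONE) { *err = werr; break; }
        if (!in_range || m == 0) { *err = ERR_BAD_COUNT; break; }
    }
    return n;
}

/* Demo sink: accepts at most 4 bytes per call and counts them in *ctx. */
ptrdiff_t chunk_sink(void *ctx, const unsigned char *p, size_t len, int *err) {
    size_t *total = ctx;
    (void)p; (void)err;
    size_t take = len < 4 ? len : 4;
    *total += take;
    return (ptrdiff_t)take;
}

/* Demo sink: always fails, returning a negative count and a specific error. */
ptrdiff_t failing_sink(void *ctx, const unsigned char *p, size_t len, int *err) {
    (void)ctx; (void)p; (void)len;
    *err = ERR_SINK_FAILED;
    return -1;
}
```

With `failing_sink`, the caller sees `ERR_SINK_FAILED` rather than
`ERR_BAD_COUNT`, and the negative count is never used to advance the
offset.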