Simplify and make simd_util cross-platform

The new algorithm uses a Scalar->Vector->Scalar iteration loop, which
never needs to mask off an incomplete chunk of data.
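
For illustration only, here is a minimal C sketch of what a
Scalar->Vector->Scalar byte search can look like. It is not the code from
this commit (which is written in Odin): the name index_byte_sketch, the
choice to run the leading scalar phase until the address is 32-byte
aligned, and the plain inner loop standing in for a real SIMD compare are
all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define WIDTH 32 /* chunk width in bytes; mirrors the 32-byte width mentioned in the message */

/* Hypothetical helper: index of the first byte equal to c, or -1 if absent. */
ptrdiff_t index_byte_sketch(const uint8_t *data, size_t len, uint8_t c) {
    size_t i = 0;

    /* Phase 1 (scalar): step one byte at a time until the address is
     * WIDTH-aligned. Assumption: the head exists for alignment. */
    while (i < len && ((uintptr_t)(data + i) % WIDTH) != 0) {
        if (data[i] == c) return (ptrdiff_t)i;
        i++;
    }

    /* Phase 2 ("vector"): consume only full WIDTH-byte chunks, so no
     * partial chunk is ever loaded and nothing has to be masked off.
     * The inner loops stand in for a SIMD compare plus reduction. */
    while (len - i >= WIDTH) {
        unsigned hit = 0;
        for (size_t j = 0; j < WIDTH; j++) {
            hit |= (unsigned)(data[i + j] == c);
        }
        if (hit) {
            /* A match is in this chunk; find its exact position. */
            for (size_t j = 0; j < WIDTH; j++) {
                if (data[i + j] == c) return (ptrdiff_t)(i + j);
            }
        }
        i += WIDTH;
    }

    /* Phase 3 (scalar): fewer than WIDTH bytes remain; finish byte by byte. */
    for (; i < len; i++) {
        if (data[i] == c) return (ptrdiff_t)i;
    }
    return -1;
}
```

Because the middle phase only ever sees whole chunks, the leftover bytes at
either end are handled by the scalar loops rather than by a masked vector
load.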

Also, the vector width was reduced from 64 bytes to 32, as I found this
to be about as fast as the previous 64-byte x86 version.
Feoramund
2024-08-09 17:39:19 -04:00
parent 793811b219
commit 12dd0cb72a
5 changed files with 101 additions and 151 deletions
@@ -1,4 +1,3 @@
//+build i386, amd64
package benchmark_simd_util
import "core:fmt"