Usually you can expect a ~4x speedup from SIMD code working on floats with 128-bit registers.
This is a 7x speedup. And no, the compiler wrote perfectly okay code.
Logic tells you this is impossible: after all, you're only doing the same operation, just on 4 values at the same time.
However, what's happening here is a matter of dependencies. And when every iteration is dependent on the previous one, redoing operations (necessary for this code to be SIMD'd) can save a lot of time. Math is cheap, IO is not.
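To make the dependency point concrete, here's a minimal sketch in C. It is not the code the speedup above was measured on; the function names and the gain ramp itself are made up for illustration. The first loop carries `gain` from one iteration to the next, so every multiply has to wait for the previous add. The second recomputes the gain from the index each time, which is more math per element, but nothing ties one iteration to the next, so 4 of them can sit in one 128-bit register.

```c
#include <stddef.h>

/* Hypothetical example: apply a linearly increasing gain to a buffer. */

/* Serial form: each iteration needs the gain left behind by the previous one. */
void ramp_serial(float *dst, const float *src, size_t n, float start, float step)
{
    float gain = start;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * gain;
        gain += step;               /* loop-carried dependency */
    }
}

/* Independent form: redo the gain computation from the index every time.
 * One extra multiply per element, but no iteration waits on another. */
void ramp_independent(float *dst, const float *src, size_t n, float start, float step)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * (start + (float)i * step);
}
```

The two forms round differently in general (repeated addition accumulates error, the recomputed form does not), so the outputs only match bit for bit when the steps are exactly representable.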