I don't have to care about MIPS SIMD assembly! I can probably get myself new RISC-V Loongson chips for free instead!

@luna Does using onomatopoeia count as words?

The double version is 13% faster than the AVX version.
But the AVX version exists for a good reason - it saves a ton of registers and nasty cross-lane shuffles (the double version needs a different register layout).


Got a 2xFFT8 working!
Price of 1xFFT8 = 20 instructions
Price of 2xFFT8 = 23 instructions
Buy one get one 85% off!
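For anyone curious what an FFT8 actually computes, here's a scalar Python sketch of a radix-2 8-point decimation-in-time transform - just an illustration of the butterfly structure, not the SIMD code the instruction counts above refer to.

```python
import cmath

def fft8(x):
    """8-point radix-2 decimation-in-time FFT (scalar sketch)."""
    assert len(x) == 8
    # Bit-reversal permutation for N=8: index b2b1b0 -> b0b1b2.
    x = [x[i] for i in (0, 4, 2, 6, 1, 5, 3, 7)]
    n = 8
    span = 1
    # Three stages (log2(8) = 3) of 4 butterflies each.
    while span < n:
        w_step = cmath.exp(-2j * cmath.pi / (2 * span))
        for start in range(0, n, 2 * span):
            w = 1.0
            for k in range(start, start + span):
                t = w * x[k + span]          # twiddle multiply
                x[k + span] = x[k] - t       # butterfly subtract
                x[k] = x[k] + t              # butterfly add
                w *= w_step
        span *= 2
    return x

def dft(x):
    """Naive O(n^2) DFT, used only to check fft8."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]
```

Each stage is 4 butterflies (an add, a subtract, and a twiddle multiply), and packing two independent transforms into the same vector registers is what gets the second FFT8 almost for free.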

Just saw a Twitch stream where someone streamed at 30Mbps! For 720p30!

@wolf480pl No, it's the other way around, they're callee-saved. So your function has to preserve them.
x86 has no such requirement. All vector registers are caller-saved, so you only save the ones whose values you actually need across a call.

The throughput is really a throwback to x86-32's register pressure issues. When you unroll enough to saturate all the units, you start to run out of registers.
ARM may have 32 vector registers, but 8 of them have to be preserved (well, strictly only their lower halves, which in practice usually means preserving them whole unless you resort to hacks), and one is a 0.
Which means that out of the ~23 freely usable registers, unrolling, say, 4x leaves you only around 6 registers per iteration. Which is kinda meh.

I strongly prefer x86's ABI, where all vector registers are scratch registers. What's the point of having access to registers you then have to preserve yourself? Using the stack to save them is burning cycles.


I think a lot of what's said about M1's architecture also applies to other ARM CPUs' timings in general, since I've seen exactly the same behavior on a Cortex-A53 (Odroid C2).

Anyway, this is sooooooooo helpful. I don't think I'll want to write SIMD without timing sheets anymore. Without one, you have no idea why your code is so slow; now that I have them, I understand perfectly.


1. LOOK AT THOSE THROUGHPUTS! LOOK AT THEM! 0.25 ON MOST INSTRUCTIONS! That's insane and tells you so much! A throughput of 0.25 means 4 such instructions can issue per cycle - they're full of units, arithmetic, shuffle, and load/store alike.
This means these CPUs hugely benefit from unrolling, in order to saturate all the units as much as possible.

2. The latencies are mostly on the meh side of things. 2-cycle latency on all shuffles (zip/unzip/rev64/etc). The exception is floating-point adds - those are barely 3 cycles! That's on par with the best x86 cores; most x86s have 4-cycle adds!

3. The CPU seems to have 3 load/store units. Same count as Zen 3, only here they're generic, whilst on Zen they're fixed as either 2 stores/1 load or vice versa.

4. The units are definitely split into float/integer domains. 7-cycle latency for cross-domain use is terrible, even by pre-combination-era x86 standards.

5. Integer SIMD performance is actually pretty bad. 2-cycle latency on adds, 3-cycle latency on multiplies. x86s will do pretty much all integer SIMD in one cycle.

Overall impression: UNROLL EVERYTHING TO THE MAX! BINARY SIZE BE DAMNED!
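The unrolling argument is just latency times issue rate: to keep a unit busy you need latency/throughput independent operations in flight. A tiny hypothetical helper (not from the timing tables, just the arithmetic) spells it out:

```python
def chains_needed(latency_cycles, throughput_cpi):
    """Independent dependency chains required to saturate a unit:
    a new op can issue every `throughput_cpi` cycles, but each op's
    result is unavailable for `latency_cycles` cycles."""
    return latency_cycles / throughput_cpi

# A 3-cycle FP add issuing every 0.25 cycles wants 12 independent
# chains in flight - hence the "unroll everything" conclusion.
print(chains_needed(3, 0.25))  # 12.0
```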


Someone reverse engineered the instruction timings of Apple's Firestorm/Icestorm CPUs found in the M1!
Will post my impressions next post!

dougallj.github.io/applecpu/fi

Looked through my music collection and found MOSAIC.WAV's Crossing Heart.

They wrote a track, マジカルラッキージャム ("Magical Lucky Jam"), playable both forward and in reverse! It sounds pretty decent! It's also the best track on the album!

soundcloud.com/mosaicwav/short

Lynne boosted

On the topic of SIMD, RISC-V is actually going to be less capable than SVE if most hardware ends up with a 4096-bit vector register file, since that's shared between all the vector registers. You'd likely run out of registers trying to do a 32-point transform.

It's not great - purely in terms of register size, RISC-V's SIMD will be worse than AVX2.
Hopefully implementations go higher, but I doubt it; most documentation/examples I find use 4096-bit register files.
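The size comparison, spelled out (assuming the 4096-bit file above is shared across the architecture's 32 vector registers):

```python
RISCV_FILE_BITS = 4096   # total vector register file, per the docs above
NUM_VREGS = 32           # architectural vector registers
AVX2_REG_BITS = 256      # one ymm register

# Each RISC-V vector register only gets its share of the file.
bits_per_vreg = RISCV_FILE_BITS // NUM_VREGS
print(bits_per_vreg)                   # 128
print(bits_per_vreg < AVX2_REG_BITS)   # True - narrower than a ymm
```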


Like, SVE allows vector sizes up to 2048 bits, which means you could handle a single 64-point transform in 2 registers (64 complex floats at 64 bits each is 4096 bits).
Which means that to create the most optimal split-radix FFT you'd need 3 handwritten transforms - a single 64-point, a 2-at-a-time 64-point and a 2-at-a-time 32-point.

FFTs are done in stages, but to optimize them you generally split the transform into even- and odd-indexed sections and optimize them all at once. Doing it stage by stage reduces the SIMD potential and means you end up cheating, with more permutations than actual arithmetic.
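That even/odd split is decimation in time. A minimal recursive sketch (plain radix-2 rather than split-radix, but the idea of recursing on even- and odd-indexed halves is the same):

```python
import cmath

def fft(x):
    """Recursive radix-2 DIT FFT: split into even/odd-indexed halves."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])   # even-indexed samples
    odd = fft(x[1::2])    # odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A split-radix version instead recurses into one even half and two odd quarters, which is what leads to the set of handwritten 64/64x2/32x2 kernels.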

I think I'd be crazy enough to do it but definitely not for money. No idea what would motivate me, but money wouldn't do it.


I don't think I'll like doing FFT assembly for ARM SVE or RISC-V.
When you have a limited vector size, you are bound to optimize the algorithm to it.
When you don't, a lot of the skill in optimizing goes away, since now it's all about how much patience you're willing to spend handwriting huge transforms (someone insane manually factored the 64-point FFT down to 1152 ops over a period of probably a month).
I probably wouldn't have the patience to do more than a 32-point handwritten non-compound transform.

@mia "60% of sRGB" stickers somehow don't carry the same marketing weight as "vivid color ᶦᶠ ʸᵒᵘ ᵃʳᵉ ᵒⁿ ᵃᶜᶦᵈ!"

@PawelK I don't think so, it doesn't really have it. "Drone" by itself would be. "Orchestrator" too, but that one's already taken by Microsoft.

Parsee

A Mastodon instance for people interested in multimedia, codecs, assembly, SIMD, and the occasional weeb stuff.