I don't have to care about MIPS SIMD assembly! I can probably get myself new RISC-V Loongson chips for free instead!

The double version is 13% faster than the AVX version.
But the AVX version exists for a good reason: it saves a ton of registers and nasty cross-lane shuffles (the double version needs a different register layout).

Show thread

Got a 2xFFT8 working!
Price of 1xFFT8 = 20 instructions
Price of 2xFFT8 = 23 instructions
Buy one get one 85% off!

Just saw a Twitch stream where someone streamed at 30Mbps! For 720p30!

The throughput is really a throwback to x86-32's register pressure issues. When you unroll everything this much, you start to run out of registers.
ARM may have 32 vector registers, but 8 of them have to be preserved (well, only their lower halves, which in practice means preserving them whole unless you resort to hacks), and one is a zero register.
Which means you only get around 6 registers to use per unrolled iteration. Which is kinda meh.

I much prefer x86's ABI, where all vector registers are scratch registers. What's the point of having registers you need to preserve half of? Saving them to the stack burns cycles.

Show thread

I think a lot of what's said about the M1's architecture also applies to other ARM CPUs' timings in general, since I've seen exactly the same behavior on a Cortex-A53 (Odroid C2).

Anyway, this is sooooooooo helpful. I don't think I'll want to write SIMD without seeing timing sheets now. Without one you have no idea why your code is so slow; now that I do, I understand perfectly.

Show thread

1. LOOK AT THOSE THROUGHPUTS! LOOK AT THEM! 0.25 ON MOST INSTRUCTIONS! That's insane and tells you so much! They're full of units: arithmetic, shuffle, and load/store.
This means these CPUs hugely benefit from unrolling in order to saturate all units as much as possible.

2. The timings are mostly on the meh side of things: 2-cycle latency on all shuffles (zip/unzip/rev64/etc.). Except for floating-point adds, which are barely 3 cycles! That's on par with the best x86; most x86s have 4-cycle adds!

3. The CPU seems to have 3 load/store units. Same count as Zen 3, only they're generic, whilst on Zen they were always either 2 stores/1 load or vice versa.

4. The units are definitely split into float/integer domains. 7-cycle latency for cross-domain use is terrible, even by the standards of x86s from before the units were combined.

5. Integer SIMD performance is actually pretty bad: 2-cycle latency on adds, 3-cycle latency on multiplies. x86s will do pretty much all integer SIMD in one cycle.


Show thread

Someone reverse-engineered the instruction timings of Apple's Firestorm/Icestorm CPUs found in the M1!
I'll post my impressions in the next post!


Looked through my music collection and found MOSAIC.WAV's Crossing Heart.

They wrote a track, マジカルラッキージャム (Magical Lucky Jam), that's playable both forward and in reverse! It sounds pretty decent! It's also the best track on the album!


Lynne boosted

On the topic of SIMD, RISC-V is actually going to be less capable than SVE if most hardware ships a 4096-bit vector register file, since it's shared between all the vectors. You'd likely run out of registers trying to do a 32-point transform.

It's not great; RISC-V's SIMD will be worse than AVX2 purely in terms of size.
Hopefully implementations go higher, but I doubt it: most documentation/examples I find use 4096-bit register files.

Show thread

Like, SVE allows sizes up to 2048 bits, which means you could handle a single 64-point transform in 2 registers.
Which means that to create the most optimal split-radix FFT you'd need 3 handwritten transforms: a single 64-point, a 2-at-a-time 64-point, and a 2-at-a-time 32-point.

FFTs are done in stages, but to optimize you generally split the transform into even- and odd-index sections and optimize them all at once. Doing it purely in stages reduces SIMD potential and means you end up cheating, with more permutations than actual arithmetic.

I think I'd be crazy enough to do it but definitely not for money. No idea what would motivate me, but money wouldn't do it.

Show thread

I don't think I'll like doing FFT assembly for ARM SVE or RISC-V.
When you have a limited vector size, you are bound to optimize the algorithm to it.
When you don't, a lot of optimization skill goes out the window, since now it's all about how much patience you're willing to spend handwriting huge transforms (someone insane manually factored the 64-point FFT down to 1152 ops over a period of probably a month).
I probably wouldn't have the patience to do more than a 32-point handwritten non-compound transform.

Either that or it's a very good name for a band that plays electronic dance music.

Show thread

"Generic Orchestral Drone" should be a music genre. It's the "whatever, just put some music on the game, I don't care" of video game music after all.

NASA: Hey, like, could you start wearing masks?
ISS: Why? We've all been screened, we've been on the station a month now breathing recirculated air, we haven't had a case ever.
NASA: Well, we all do here on Earth, and like, you're from Earth, so could you show a little solidarity?
ISS: No? It's not our fucking problem, is it? We're not even due to come back in months now.
NASA: But when you take pictures without masks on it's really uncomfortable to see them down here.
ISS: <fake coughs directly at camera> Did this make you uncomfortable?
NASA: Yeah.
ISS: <everyone bunched together>
NASA: Fuck you. I hope you choke on the incredibly foul smelling air up there. <hangs up>

It's only been 2 months and this is how I feel about 202(0)1 already.

Show thread

Rocket Lake benchmarks out!
RIP Intel is the best way I can put it. Shit IPC gains that you can make up for by just OCing your RAM.
"But isn't AVX-512 nice, Lynne?" No. It'll only become nice and "a thing" once AMD supports it.

Skylake was such a polished gem, and now, look at them. Still stuck on a now-ancient node.


A Mastodon instance for people interested in multimedia, codecs, assembly, SIMD, and the occasional weeb stuff.