There are three ways to write an ISA:
First is to just listen to suggestions from companies and programmers and smooth the cost through economies of scale. Everyone's happy. That's Intel.
Second is to design it, but still think about special things some programmers might need. That's ARM.
Third is to design by committee, with implementation cost and elegance in mind. That's RISC-V.
So even if a fast and powerful RISC-V CPU appears it'll be held back by a Fisher-Price ISA. A cheap Intel will beat it.

@lynne you probably write a lot of assembly to make codecs fast, so ISA matters a lot to you. But do you think AAA games or web browsers would be affected by peculiar ISA choices that nevertheless play well with compilers?

@wolf480pl Yes, even outside of codecs matrix ops are still very relevant and common, so you want fast, cheap shuffles, which you just don't get unless you can encode the pattern as an immediate. But you can't do that if you limit your instructions to always be 32-bit in length.
It makes for a cheap instruction decoder, but decoding already accounts for barely a few percent of the total on modern x86 CPUs, and a lot of that is very low overhead since the lookahead is massive.
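To make the immediate argument concrete, here's a rough Python sketch of a per-byte shuffle (pshufb-style semantics: low nibble selects a lane, high bit zeroes it) plus the bit arithmetic showing why the pattern can't fit in a fixed 32-bit instruction word — this is an illustration, not any real ISA's encoding:

```python
# Per-byte shuffle simulated on 16-byte vectors (pshufb-like semantics).
def byte_shuffle(vec, pattern):
    # Each pattern byte selects a source lane by its low 4 bits;
    # a set high bit zeroes the destination lane instead.
    return bytes(0 if p & 0x80 else vec[p & 0x0F] for p in pattern)

vec = bytes(range(16))                     # lanes 0..15
rev = bytes(15 - i for i in range(16))     # pattern that reverses the vector
print(list(byte_shuffle(vec, rev)))        # [15, 14, ..., 1, 0]

# Why this pattern can't be an immediate in a fixed 32-bit encoding:
lanes, index_bits = 16, 4
print(lanes * index_bits)                  # 64 bits of pattern alone
```

64 bits of pattern is already twice the entire instruction word, before spending a single bit on opcode or register fields — hence the pattern has to live in a register or in memory.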

@lynne don't you count reorder buffer, instruction scheduling, register renaming, microcode engine, and all that shit as part of Decode? o.O

@wolf480pl On x86 I think of decoding as only converting the variable length instructions to uops.
Though AFAIK different architectures still use uops internally, so it shouldn't be any different, apart from being simpler due to constant instruction length.
Everything else is separate.

@lynne if by uops you mean the control signals for all the muxes and flip-flops then yes, every CPU necessarily has those.

As for fixed-length vs variable-length instructions, I think it's more for simplifying Fetch than Decode, and to make it easier to avoid pipeline stalls.

I'm not sure how it works in out-of-order CPUs, I'll have to think about it more.
But in an in-order CPU, with separate L1I and L1D, having to read more than one word from L1I per instruction would cause pipeline stalls.

@lynne but do you really need that shuffle pattern to be an immediate?
You can read from L1D at the same time as you read from L1I, and a shuffle on vector registers isn't supposed to access any vectors in memory, right?

So if you have a free L1D cycle, you may as well read the shuffle pattern from there.
Now that I think of it, ARM's placement of constants at the end of a function makes a lot of sense.

@wolf480pl You waste a register on ARM to load the pattern for even the simplest ones. Contrary to popular belief, you don't have a lot of registers on ARM. Barely 23 or so, considering one is zero and 8 must be preserved (technically only half of their bits, but in practice all bits). And Apple likes being special, so they reserve v17.
For most things I've worked on you shuffle many, many times with unique patterns, so it really adds up.

ok, so you don't want to do

ld rN, =pattern1
shuffle rN, someVector

which in reality is:

ld rN, [pc+pattern1_offset]
shuffle someVector, rN

that's 2 L1I accesses, one L1D, and wastes a register.

But what if you could do sth like:

shuffle someVector, [pc+pattern1_offset]


though now that I think of it, it only makes sense if the offset takes fewer bits than the pattern.

How long are the vectors usually?
For 4, the pattern would be 8 bits, right?
And for 8, it'd be 24 bits?
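That arithmetic can be sanity-checked: each destination lane names one of n source lanes, which takes ceil(log2(n)) bits, so the full pattern is n·ceil(log2(n)) bits. A quick check in Python:

```python
from math import ceil, log2

def pattern_bits(lanes):
    # Each result lane picks one of `lanes` source lanes,
    # needing ceil(log2(lanes)) bits per lane.
    return lanes * ceil(log2(lanes))

for n in (4, 8, 16):
    print(n, pattern_bits(n))   # 4 -> 8, 8 -> 24, 16 -> 64
```

So 4 lanes fit in an 8-bit immediate, 8 lanes already need 24 bits, and 16 lanes need 64 — which is where the fixed-32-bit encoding runs out of room.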

@wolf480pl Vector shuffles on aarch64 and wasm use a pattern which is 1 vector register, and it's always done on a per-byte basis.
You can do this on x86 using pshufb m2, [byte_pattern], but you can also just do shufps m2, q3210.

@lynne uhh... so with 128-bit vectors you have 16 elements, that's 64 bits minimum for the shuffle pattern, way too much for an immediate IMO.

16 elements is a 4x4 matrix, right? Are bigger matrices, like 8x8 or 16x16 used much in every-day software?

Also, I wonder what makes matrix multiplication such a popular hot spot in today's software.

Obviously, there's a lot of that in 3D graphics, but that's on the GPU.
Then there are codecs, but we already knew that.
I'm guessing image processing algorithms, like ones GIMP offers, use matrices a lot, too.

But when using a bloated web browser and inefficient JS frontend to scroll through fedi's timelines, are there many matrices involved?

@lynne s/multiplication/operations/, dunno why I thought you mentioned multiplication specifically

I don't want per-byte shuffle immediates, just per-16/32/64-bit words, which x86s do have and ARMs don't.
8x8 and 16x16 are common codec block sizes. Obviously you can't fit that many coefficients in a single register (moreover they're usually at least 16 bits each), so you split them up.

@lynne oh, so big words, 4 at a time. Not even a whole row, ok. I could totally see an 8-bit pattern for shuffling 4 words around fitting in an immediate for an ISA with fixed-length instructions. It's only 3 bits more than a register number.

tl;dr no reason why you couldn't have shufps m2, q3210 in a fixed-length instruction set.
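For illustration, the q3210 immediate unpacks as two bits per destination lane. A single-source Python sketch of that imm8-controlled word shuffle (real shufps draws lanes 0-1 from the destination and 2-3 from the source; that detail is ignored here):

```python
# Single-source sketch of an imm8-controlled shuffle of four 32-bit words.
# Real shufps picks lanes 0-1 from dst and lanes 2-3 from src; this ignores that.
def shuf_imm8(vec, imm8):
    # Two bits per destination lane, lane 0 in the low bits,
    # so imm8 == 0b11_10_01_00 (i.e. q3210) is the identity.
    return [vec[(imm8 >> (2 * i)) & 0b11] for i in range(4)]

words = [10, 20, 30, 40]
print(shuf_imm8(words, 0b11100100))   # q3210 -> [10, 20, 30, 40]
print(shuf_imm8(words, 0b00011011))   # q0123 -> [40, 30, 20, 10]
```

The whole pattern is 8 bits — as noted above, only 3 bits more than a 5-bit register number, so it fits comfortably in a 32-bit instruction word.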

@lynne hm... looks like RISC-V RV32V extension, in the current draft, does the shuffle in a very similar way to ARM.

Also, unlike x86, it has all the instructions generalized across all vector sizes, and the vector size is not even encoded in the instruction. It's in a separate status register.

It also has instructions which take an immediate, but those only allow shift-like permutations. There's some support for masks, too. Though I guess it's still not what you need.
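Those "shift-like permutations" are along the lines of RVV's slide instructions; a rough Python model (semantics simplified — in the real spec, vslideup leaves the low destination lanes unchanged rather than filling them, and elements past the vector length are untouched):

```python
# Rough model of slide-down / slide-up style permutations: the whole
# vector shifts by an immediate offset, vacated lanes get a fill value.
# No arbitrary per-lane selection, only a uniform shift.
def slide_down(vec, offset, fill=0):
    return vec[offset:] + [fill] * min(offset, len(vec))

def slide_up(vec, offset, fill=0):
    return [fill] * min(offset, len(vec)) + vec[:len(vec) - offset]

v = [1, 2, 3, 4, 5, 6, 7, 8]
print(slide_down(v, 3))   # [4, 5, 6, 7, 8, 0, 0, 0]
print(slide_up(v, 3))     # [0, 0, 0, 1, 2, 3, 4, 5]
```

The immediate only needs log2(n) bits here (one offset, not one index per lane), which is why it fits where a full shuffle pattern doesn't.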

@wolf480pl No, by uops I meant micro operations.
That has to happen earlier in the pipeline, at least on x86, due to uop fusion, rather than later on.

@lynne oh, ok.
I don't know about other out-of-order CPUs (btw. if you have some good learning materials about those, pls link) but in in-order CPUs, I've seen Decode defined as converting ISA instructions to control signals.
