Most useless instruction: crc32.
It uses the Castagnoli polynomial (CRC-32C), which is mainly used in iSCSI and SCTP, where checksums are mostly generated by the interface, not the CPU.
The IEEE polynomial is the ubiquitous one, used pretty much everywhere else (Ethernet, zlib, PNG).
The microcode of the instruction has hardly been touched since, and why bother, when programs can roll their own with the same or better performance, and with an actually useful polynomial.
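
For reference, a minimal sketch of what "rolling your own" looks like, with the polynomial as a parameter. This is the slow bit-at-a-time form, meant only to show that the two CRCs differ in nothing but the constant; the function and macro names are made up for this example, and a version that competes with the instruction would use lookup tables or carry-less multiplication instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected forms of the two polynomials discussed above. */
#define CRC32_POLY_IEEE       0xEDB88320u /* reflected 0x04C11DB7 */
#define CRC32_POLY_CASTAGNOLI 0x82F63B78u /* reflected 0x1EDC6F41 */

/* Hypothetical helper: reflected CRC-32 with initial value and final XOR
 * of 0xFFFFFFFF, processing one bit per inner iteration. */
static uint32_t crc32_bitwise(uint32_t poly, const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial;
             * the mask is all ones or all zeros. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (poly & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Calling crc32_bitwise(CRC32_POLY_IEEE, buf, len) gives the zlib/PNG-style checksum; swapping in CRC32_POLY_CASTAGNOLI gives the value the crc32 instruction family computes.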

For per-component right shifts of 16-bit values in assembly you can use pmulhrsw with a { 0x3fff >> shift } vector, provided numbers ∈ [1:16384] and shift ∈ [0:14]. Shifts over 15 yield 1 instead of 0. Larger numbers are allowed with higher shifts.
Just tooting this for myself if I need it again in the future. I remember it was very useful for packing, despite the inconvenient requirements.
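
A scalar sketch of what pmulhrsw does per 16-bit lane, handy for checking the multiplier against whatever input range you care about before committing to it in assembly. The helper names are made up for this example, and the model assumes the compiler's right shift of a negative int32_t is arithmetic (true on the usual x86 compilers).

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmulhrsw:
 * 32-bit product, add the rounding constant 0x4000, keep bits 15..30. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)((product + 0x4000) >> 15);
}

/* Multiplier from the post: 0x3fff >> shift, for shift in [0, 14]. */
static int16_t shift_multiplier(unsigned shift)
{
    return (int16_t)(0x3fff >> shift);
}

/* Hypothetical helper: count the inputs in [lo, hi] for which the multiply
 * disagrees with a plain right shift by `bits`. */
static unsigned count_mismatches(unsigned shift, unsigned bits, int lo, int hi)
{
    int16_t mul = shift_multiplier(shift);
    unsigned mismatches = 0;

    for (int v = lo; v <= hi; v++) {
        if (pmulhrsw_lane((int16_t)v, mul) != (int16_t)(v >> bits))
            mismatches++;
    }
    return mismatches;
}
```

In this model the 0x3fff multiplier (shift = 0) matches a plain one-bit right shift for every input in [0, 16384]. Other shift values are worth running through count_mismatches over the range you need, since the rounding term can make the result differ from a truncating shift.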

