Most useless #x86 instruction: crc32.
It uses the Castagnoli polynomial (CRC-32C), which is mainly used in protocols like iSCSI and SCTP, where checksums are 99% of the time generated by the interface, not the CPU.
The IEEE polynomial is the ubiquitous one used pretty much everywhere else: Ethernet, zlib, PNG.
The microcode of the instruction has hardly been touched in years, and why bother when programs can roll their own with the same or better performance, and with an actually useful polynomial?
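For reference, a minimal sketch of what "rolling your own" looks like at its simplest. This is a bit-at-a-time version meant only to show that the two CRCs differ by nothing but the polynomial constant; it is not the fast variant (those use lookup tables or carry-less multiplication). The function and macro names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected (LSB-first) forms of the two polynomials. */
#define CRC32_POLY_IEEE       0xEDB88320u /* zlib, PNG, Ethernet FCS */
#define CRC32_POLY_CASTAGNOLI 0x82F63B78u /* CRC-32C, the one the crc32 instruction uses */

/* Bitwise reflected CRC-32 with the usual conventions:
 * initial value 0xFFFFFFFF and a final XOR with 0xFFFFFFFF. */
static uint32_t crc32_bitwise(uint32_t poly, const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With the standard check string "123456789" (9 bytes), the IEEE polynomial gives 0xCBF43926 and the Castagnoli one gives 0xE3069283.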
#x86 Pentium 4's hack to fix instruction scheduling
https://en.wikipedia.org/wiki/Replay_system
For per-component right shifts in #x86 assembly, you can use pmulhrsw with a { 0x3fff >> shift } vector, provided numbers ∈ [1:16384] and shift ∈ [0:14]. Shifts over 15 yield 1 instead of 0. Larger numbers are allowed with higher shifts.
Just tooting this for myself in case I need it again in the future. I remember it was very useful for packing, despite the inconvenient requirements.
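If I do need it again, a scalar model of a single pmulhrsw lane is handy for checking the input range before committing to asm. A sketch, following the instruction's documented arithmetic (multiply, shift right by 14, add 1, shift right by 1); the helper names are made up for this example, and it makes no claim about which shift counts come out exact. Since the instruction rounds, small inputs can differ from a truncating shift, so test the range you actually care about.

```c
#include <stdint.h>

/* One 16-bit lane of pmulhrsw: ((a * b) >> 14) + 1, then >> 1.
 * Assumes >> on a negative int32_t is an arithmetic shift, which holds
 * on mainstream compilers but is implementation-defined in C. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * (int32_t)b) >> 14;
    t += 1;
    return (int16_t)(t >> 1);
}

/* Multiplier from the note above: one entry of the { 0x3fff >> shift } vector. */
static int16_t shift_multiplier(unsigned shift)
{
    return (int16_t)(0x3fff >> shift);
}
```

For example, pmulhrsw_lane(1000, shift_multiplier(0)) gives 500, while pmulhrsw_lane(3, shift_multiplier(1)) gives 1 because of the rounding step.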
Codec researcher, x86 assembly and Vulkan expert. As expected.
Physicist. Unexpected.
Had nothing to do with x264. Most unexpected.