Hacker News

The author himself states that they are slow. I reckon that these instructions are translated into a whole slew of micro-ops. Rarely used instructions like these often are not well optimized.



I posted this before I looked up the timings. Some of the historically slow instructions like BT/BTS/BTR are now fast. I was hoping these might be among them, but unfortunately they are not.

For Haswell, Agner (http://www.agner.org/optimize/) says:

  AAA 2 uops, Latency 4
  AAS 2 uops, Latency 6
  AAD 3 uops, Latency 4
  AAM 8 uops, Latency 21


BT/BTS/BTR are also now widely used by compiler code generators.

With AAx instructions, it's the old chicken and egg problem. Compilers don't use them because they are slow, and they are slow because compilers don't use them.

I speculate that perhaps the BTx instructions got fast because the Intel compiler devs wanted to use them.


They're relatively slow for a single instruction, but if you compare the uops they generate with the number of uops needed for an equivalent sequence of multiple simple ("RISC-style") instructions, I'd bet they're the same or even slightly better - after all, an equivalent sequence of instructions would need to perform the same operations the same way, by generating uops for the execution units. The difference is that the RISC-style instructions take up more cache space and more fetch/decode bandwidth, whereas these CISC instructions get expanded into uops inside the decoder - so the speed of these instructions depends on how fast the decoder can emit the uops.

...and looking at the same Haswell instruction tables for the simpler sequence of instructions, we find that:

AAA performs a compare and decides whether or not to add some constant, along with some flag operations; 2 uops is what one compare and one add instruction would already generate, plus you'd have to take into account a (possibly mispredicted) conditional jump. If you find a way to do it using a CMOV, that alone is 2 uops with a latency of 2 cycles. AAS is similar except that it does a subtraction, but maybe there's something else that increases its latency by another 2 clocks...

AAD is an 8-bit multiply followed by an addition and clearing of a register. MUL/IMUL r8 generates 1 uop and has a latency of 3, ADD r, i is another uop with a latency of 1, and clearing the register is another uop (no latency due to register renaming, I'd guess.) This would be 3 uops and a latency of 4, exactly the same as the single instruction.

AAM is an 8-bit divide; a DIV r8 generates 9 uops and has a latency of 22-25, compared with 8 uops and a latency of 21 for AAM.

So it would appear that Intel has pretty much made these instructions as fast as they could for the microarchitecture, and glancing through the tables this appears to have been true since the Pentium II (with two exceptions - the Atom series, and the ill-fated NetBurst); e.g. in the Nehalem, we have

    AAA/AAS/DAA/DAS  1 uop, latency 3
    AAD 3 uops, latency 15(?)
    AAM 5 uops, latency 20
and the Pentium M has

    AAA/AAS/DAA/DAS 1 uop, no latency listed
    AAD 3 uops, latency 2
    AAM 4 uops, latency 15
The Atom is rather disappointing:

    AAA/AAS 13 uops, latency 16
    AAD 4 uops, latency 7
    AAM 10 uops, latency 24
The P4 is extremely disappointing:

    AAA/AAS 4+27 uops, latency 90
    AAD 4+10 uops, latency 22
    AAM 4+22 uops, latency 56
AMD has historically been worse on the complex instructions, and although they've improved a bit, they are still behind Intel's latest; e.g. for Steamroller the timings are

    AAA/AAS 10 uops, latency 6
    AAD 4 uops, latency 6
    AAM 10 uops, latency 15 (on par with 9 uops/latency 17-22 for DIV r8)
Edit: I benchmarked AAM vs DIV and AAD vs MUL+ADD (with a dependency chain, so the real latencies are being tested instead of being hidden by something else) on a Nehalem (i7 870); for 500000 iterations the totals were

    5250303 clock cycles for DIV
    5250258 clock cycles for AAM
    3818905 clock cycles for MUL+ADD
    3818907 clock cycles for AAD
So it's safe to say they're really just as fast.


Very interesting. One reasonable benchmark would be uppercasing a string as described in the article vs the usual way.


As great as BCD instructions may be, they can't possibly compete with vector instructions for that kind of operation.


Wow, that Agner link is a goldmine. Thanks.


Ah, I see he did. Oh well.



