Hacker News

The author himself states that they are slow. I reckon that these instructions are translated into a whole slew of micro-ops. Rarely used instructions like these often are not well optimized.



I posted this before I looked up the timings. Some of the historically slow instructions like BT/BTS/BTR are now fast. I was hoping these might be among them, but unfortunately they are not.

For Haswell, Agner (http://www.agner.org/optimize/) says:

  AAA 2 uops, Latency 4
  AAS 2 uops, Latency 6
  AAD 3 uops, Latency 4
  AAM 8 uops, Latency 21


BT/BTS/BTR are also now widely used by compiler code generators.

With AAx instructions, it's the old chicken and egg problem. Compilers don't use them because they are slow, and they are slow because compilers don't use them.

I speculate that perhaps the BTx instructions got fast because the Intel compiler devs wanted to use them.


They're relatively slow for a single instruction, but if you compare the uops they generate with the number of uops needed for an equivalent sequence of multiple simple ("RISC-style") instructions, I'd bet they're the same or even slightly better - after all, an equivalent sequence of instructions would need to perform the same operations the same way, by generating uops for the execution units. The difference is that the RISC-style instructions take up more cache space and more fetch/decode bandwidth, whereas these CISC instructions get expanded into uops inside the decoder - so the speed of these instructions depends on how fast the decoder can emit the uops.

...and looking at the same Haswell instruction tables for the simpler sequence of instructions, we find that:

AAA performs a compare and decides whether or not to add some constant, along with some flag operations; 2 uops is what one compare and one add instruction would already generate, plus you'd have to take into account a (possibly mispredicted) conditional jump. If you find a way to do it using a CMOV, that alone is 2 uops with a latency of 2 cycles. AAS is similar except that it does a subtraction, but maybe there's something else that increases its latency by another 2 clocks...

AAD is an 8-bit multiply followed by an addition and clearing of a register. MUL/IMUL r8 generates 1 uop and has a latency of 3, ADD r, i is another uop with a latency of 1, and clearing the register is another uop (no latency due to register renaming, I'd guess.) This would be 3 uops and a latency of 4, exactly the same as the single instruction.

AAM is an 8-bit divide; a DIV r8 generates 9 uops and has a latency of 22-25, compared with 8 uops and a latency of 21 for AAM.

So it would appear that Intel has pretty much made these instructions as fast as they could for the microarchitecture, and glancing through the tables this appears to have been true since the Pentium II (with two exceptions - the Atom series, and the ill-fated NetBurst); e.g. in the Nehalem, we have

    AAA/AAS/DAA/DAS  1 uop, latency 3
    AAD 3 uops, latency 15(?)
    AAM 5 uops, latency 20
and the Pentium M has

    AAA/AAS/DAA/DAS 1 uop, no latency listed
    AAD 3 uops, latency 2
    AAM 4 uops, latency 15
The Atom is rather disappointing:

    AAA/AAS 13 uops, latency 16
    AAD 4 uops, latency 7
    AAM 10 uops, latency 24
The P4 is extremely disappointing:

    AAA/AAS 4+27 uops, latency 90
    AAD 4+10 uops, latency 22
    AAM 4+22 uops, latency 56
AMD has historically been worse on the complex instructions, and although they've improved a bit, they are still behind Intel's latest; e.g. for Steamroller the timings are

    AAA/AAS 10 uops, latency 6
    AAD 4 uops, latency 6
    AAM 10 uops, latency 15 (on par with 9 uops/latency 17-22 for DIV r8)
Edit: I benchmarked AAM vs DIV and AAD vs MUL+ADD (with a dependency chain, so the real latencies are being tested instead of being hidden by something else) on a Nehalem (i7 870); for 500000 iterations the totals were

    5250303 clock cycles for DIV
    5250258 clock cycles for AAM
    3818905 clock cycles for MUL+ADD
    3818907 clock cycles for AAD
So it's safe to say they're really just as fast.


Very interesting. One reasonable benchmark would be uppercasing a string as described in the article vs the usual way.


As great as BCD instructions may be, they can't possibly compete with vector instructions for that kind of operation.


Wow, that Agner link is a goldmine. Thanks.


Ah, I see he did. Oh well.



