The Hidden Power of BCD Instructions

userbinator · on Oct 19, 2014

AAD is basically an 8-bit immediate multiply-accumulate, so it could be said that the 8088 had a MAC instruction - with self-modifying code you could do a real 8-bit multiply-accumulate. ;-)

There's also this 5-byte sequence to convert a nybble (0-F) in AL into the appropriate ASCII hex digit (30h-39h, 41h-46h):

    cmp al, 10
    sbb al, 69h
    das

x86 code can be insanely dense - check out the 256-byte and below categories in the demoscene, for example.

aktau · on Oct 19, 2014

That looks much cooler than the way I recently tried to print numbers in hex when teaching myself some more asm (https://gist.github.com/aktau/a85a925282fbe66d13b0). I wonder how it performs... (afaik, old rarely used instructions like that could get deprioritized by intel)

zxcdw · on Oct 19, 2014

This is good stuff. If anybody wants to dig deeper to articles like this, I have to mention the Hugi Coding Digest[1] (an executable "diskmag") from 2003 which contains all the articles related to programming from Hugi #11 to Hugi #27, including this one.

The topics of the articles are as follows: "Mathematics & Theoretical Computer Science", "General Programming Techniques", "Searching & Sorting", "Object-Orientated Programming", "File Formats", "Text Processing", "2D Graphics Programming", "3D Graphics Programming", "Windows Graphics Programming (GDI, DirectDraw, Direct3D)", "OpenGL", "Sound Programming", "Synchronization & Scripting for Demos", "Hardware-centered Programming", "Code Optimization, FPU", "Data Compression", "64k, 4k and even smaller intros", "Windows", "Linux", "Other Non-Wintel platforms", "Active Server Pages", "ActiveX", "Assembler", "C++", "Flash", "Java", "JavaScript", "PHP", "Other Programming Languages", "Miscellaneous".

Hell, it also has some nice tracker music on the background.

Obviously the format is a bit cumbersome -- but I think it's a good dive into the demoscene culture. Also most of the articles are written by hobbyists -- the real young hackers (oh and a few crackers too!) who just want to share what they have learned.

I think it should run natively on Windows and runs on Linux via Wine. Just launch the hugicode.exe -- of course with appropriate security caution, and if you trust me, Hugi and scene.org to have no malicious intent. :)

Why is the hacking culture like this dead? It was still somewhat well alive just 10 years ago, never mind 15 or 20 years ago. Even after so many years, it still saddens me to look back into gems like this Hugi Special Digest from a decade ago and see it forgotten and gone. Not just the contents or the release itself, but the computing culture which has died along with the demoscene.

[1]: https://www.scene.org/file.php?file=/mags/hugi/hugise01.zip&...

userbinator · on Oct 19, 2014

Why is the hacking culture like this dead?

The demoscene is very much alive, if you look at places like pouet.net there's plenty of new demos released even in the sub-1k categories. The newest ones there are from this month. However, you might be correct to say that it's become less known amongst general computer users and programmers, and I think the consumer-oriented nature of computers today (especially mobile devices) is mostly to blame; users are restrained and actively discouraged from tinkering with their machines software and hardware-wise, and isolated from knowledge by many layers of abstraction and complexity. There's a big movement against users sharing executables with each other and running them, and while the security concerns are real, I think it's also had a chilling effect on the hobbyists. The fact that antimalware software tends to detect packed demos as suspicious/infected (false positives) doesn't help either. In addition, many people probably found their way into demoscene via the warez scene that it grew from - and with the growing antipiracy concerns, that route is becoming narrower too.

While I don't think the demoscene is currently "dead" per se, it's certainly at risk of becoming even more of an obscure and fringe culture than it is now.

ChuckMcM · on Oct 19, 2014

I agree with this, I think the fraction of people who are looking at computers in this deep way is similar to what it has always been, but it is still a small fraction. And as such its activities are swamped in the noise of other things with the same name.

Perhaps part of the difference is that before (when RAM/CPU was expensive/slow) you were forced to do this to make something impressive and now we have an excess of compute and RAM. So to rekindle that challenge we set an artificial limit.

voltagex_ · on Oct 19, 2014

I did a little digging - there's a Linux version of Hugi's reader software (called Panorama) available at http://chris.dragan.name/old-projects/panorama/

GFK_of_xmaspast · on Oct 19, 2014

That's just the lifecycle of subcultures.

davidst · on Oct 19, 2014

This really brings back memories. 25 years ago I used (what was then) the undocumented variation in the itoa() routine for the Borland C run-time library. The purpose was to eliminate the need for a 16-byte table to generate hex codes when base-16 output was desired. itoa() was a part of the printf() library so this table became embedded in virtually every executable. Knocking that out was a meaningful size optimization in those days.

Tuna-Fish · on Oct 19, 2014

Note that all of the decimal arithmetic instructions are invalid in 64-bit mode.

They had to scavenge opcode space from somewhere, and the bcd were deemed unnecessary.

userbinator · on Oct 19, 2014

Unfortunately they (AMD) didn't reassign those opcodes for some other purpose - they've just become completely invalid.

Instead, a whole row of useful general-purpose increment and decrement instructions was replaced by 16 REX prefices. A bit odd if you consider that the number of BCD and segment prefix opcodes they made invalid would've been more than enough to be assigned to the new REXes, and still maintain a consistent encoding...

2510c39011c5 · on Oct 19, 2014

Since both AAD (opcode 0x37) and AAS (opcode 0x3F) are 1 byte long, and a sequential combination of them won't change the program semantics...another bonus is that ascii characters for 0x37 and 0x3F are printable characters...

This would make them good replacement for a sequence of NOP's...

WalterBright · on Oct 19, 2014

Anyone want to benchmark to see if these short sequences are faster?

AlyssaRowan · on Oct 19, 2014

They're slower, but things like this can be really useful if you're golfing for size. They're available on some other chips like the 6502, as well - although how they actually work varies! (And is mostly undocumented, too - anything that wants to use the Z flag after an ADC, for example.)

You can also abuse them as part of things like base change routines sometimes… which is sort of what they're intended for.

ksherlock · on Oct 19, 2014

The 6502 and variants (65c02, 65816, emulators, etc) have slightly different algorithms for BCD math. Adding illegal BCD numbers and checking the result is one way to identify the processor.

dtech · on Oct 19, 2014

The author himself states that they are slow. I reckon that at that these instructions are translated into a whole slew of micro-ops. Rarely used instructions like this often are not well optimized.

nkurz · on Oct 19, 2014

I posted this before I looked up the timings. Some of the historically slow instructions like BT/BTS/BTR are now fast. I was hoping these might be among them, but unfortunately they are not.

For Haswell, Agner (http://www.agner.org/optimize/) says:

  AAA 2 uops, Latency 4
  AAS 2 uops, Latency 6
  AAD 3 uops, Latency 4
  AAM 8 uops, Latency 21

WalterBright · on Oct 19, 2014

BT/BTS/BTR are also now well used by compiler code generators.

With AAx instructions, it's the old chicken and egg problem. Compilers don't use them because they are slow, and they are slow because compilers don't use them.

I speculate that perhaps the BTx instructions got fast because the Intel compiler devs wanted to use them.

userbinator · on Oct 19, 2014

They're relatively slow for a single instruction, but if you compare the uops they generate with the number of uops needed for an equivalent sequence of multiple simple ("RISC style") instructions, I'd bet they're the same or even slightly better - after all, an equivalent sequence of instructions would need to perform the same operations the same way - by generating uops for the execution units. The difference is that the RISC-style instructions take up more cache and fetch and decode bandwidth, whereas these CISC instructions get expanded into uops inside the decoder - so the speed of these instructions is dependent upon how fast the decoder can emit the uops.

...and looking at the same Haswell instruction tables for the simpler sequence of instructions, we find that:

AAA performs a compare and decides whether or not to add some constant, along with some flag operations; 2 uops is what one compare and one add instruction would already generate, plus you'd have to take into account a (possibly mispredicted) conditional jump. If you find out a way to do it using a CMOV, that alone is 2 uops and a latency of 2 cycles. AAS is similar except it's doing a subtraction, but maybe there's something else that increases its latency by another 2 clocks...

AAD is an 8-bit multiply followed by an addition and clearing of a register. MUL/IMUL r8 generates 1 uop and has a latency of 3, ADD r, i is another uop with a latency of 1, and clearing the register is another uop (no latency due to register renaming, I'd guess.) This would be 3 uops and a latency of 4, exactly the same as the single instruction.

AAM is an 8-bit divide; a DIV r8 generates 9 uops and has a latency of 22-25, compared with 8 uops and a latency of 21 for AAM.

So it would appear that Intel has pretty much made these instructions as fast as they could for the microarchitecture, and glancing through the tables this appears to have been true since the Pentium II (with two exceptions - the Atom series, and the ill-fated NetBurst); e.g. in the Nehalem, we have

    AAA/AAS/DAA/DAS  1 uop, latency 3
    AAD 3 uops, latency 15(?)
    AAM 5 uops, latency 20

and the Pentium M has

    AAA/AAS/DAA/DAS 1 uop, no latency listed
    AAD 3 uops, latency 2
    AAM 4 uops, latency 15

The Atom is rather disappointing:

    AAA/AAS 13 uops, latency 16
    AAD 4 uops, latency 7
    AAM 10 uops, latency 24

The P4 is extremely disappointing:

    AAA/AAS 4+27 uops, latency 90
    AAD 4+10 uops, latency 22
    AAM 4+22 uops, latency 56

AMD has historically been worse on the complex instructions, and although they've improved a bit, are still behind Intel's latest; e.g. for Steamroller the timings are

    AAA/AAS 10 uops, latency 6
    AAD 4 uops, latency 6
    AAM 10 uops, latency 15 (on par with 9 uops/latency 17-22 for DIV r8)

Edit: I benchmarked AAM vs DIV and AAD vs MUL+ADD (with a dependency chain, so the real latencies are being tested instead of being hidden by something else) on a Nehalem (i7 870) and for 500000 iterations,

    5250303 clock cycles for DIV
    5250258 clock cycles for AAM
    3818905 clock cycles for MUL+ADD
    3818907 clock cycles for AAD

So it's safe to say they're really just as fast.

WalterBright · on Oct 19, 2014

Very interesting. One reasonable benchmark would be uppercasing a string as described in the article vs the usual way.

pbsd · on Oct 19, 2014

As great as BCD instructions may be, they can't possibly compete with vector instructions for that kind of operation.

colanderman · on Oct 19, 2014

Wow, that Agner link is a goldmine. Thanks.

WalterBright · on Oct 19, 2014

Ah, I see he did. Oh well.