ARM immediate value encoding (mcdiarmid.org)
83 points by JoshTriplett on Sept 20, 2015 | 41 comments


> ARM, like other RISC architectures such as MIPS and PowerPC, has a fixed instruction size of 32 bits. This is a good design decision, and it makes instruction decode and pipeline management much easier than with the variable instruction size of x86 or 680x0.

Actually, it turned out not to be that good a decision. The generated code is far less compact than x86, and this has costs which outweigh the advantage of simpler instruction decode. CPU speeds increased far faster than memory, so the costs of storing and fetching more instruction data grew larger and larger.

As a result, ARM experimented with other more compact encoding schemes, e.g. Thumb.


Indeed. The rise and fall of RISC designs coincides with a period where CPU and memory speeds were much better matched, and there was an extreme premium on the number of gates consumed by your design, to fit it all in a single economically manufacturable (i.e. small) silicon die.

I say "fall" because ARM had an additional advantage, the market which they were targeting couldn't afford ceramic packages, only plastic, and they didn't have the simulation capability to confirm their 2nd design would hit their power dissipation target (they were in fact using chips from their 1st design), so they were very conservative and hit their target with a very large margin to spare, maybe a factor 10?

So with that extra advantage, and a good business model (better than any competitor's at the critical times? Certainly better than MIPS's), they got more and more design wins where power was a consideration. In 2001 at Lucent I worked on a power-constrained monster board that provisioned something like 300 modem lines. It had specialized ADI chips to do the heavy lifting, controlled by a bunch of ARM chips, plus one housekeeping MIPS chip that was standard for these boards. And they pretty much own the mobile market that isn't tied to x86 by an existing software base, or that needs more horsepower than they're currently offering.


Also, RISC began at a time when main memory was actually slightly faster than the CPU[1], and the bottleneck for the contemporary CISCs was instruction decoding. With memory bandwidth to spare, it made a lot of sense to simplify the decoders to make them faster, resulting in sparser encodings which used more storage and bandwidth. Had memory always been slower, perhaps things would've turned out very differently - multicycle decoders and more complex instruction sets might've become far more common, and caching less prevalent.

[1] If I remember correctly, I saw the graph showing this memory-faster-than-CPU in the famous Hennessy&Patterson book, which is somewhat ironic since it quite heavily preaches about the virtues of MIPS.


I remember Steve Furber told us in CS that the first time he tested the original ARM they got back from the fab, they forgot to connect the power lines. It worked anyway, just from the power on the I/O lines.


Hah! No need to meter the power used to know they'd achieved their goal by a large margin.

I found the interview of him I was making my comments from; here are the critical two paragraphs (http://queue.acm.org/detail.cfm?id=1716385):

SF The ARM was conceived as a processor for a tethered desktop computer, where ultimate low power was not a requirement. We wanted to keep it low cost, however, and at that time, keeping the chip low cost meant ensuring it would go in low-cost packaging, which meant plastic. In order to use plastic packaging, we had to keep the power dissipation below a watt—that was a hard limit. Anything above a watt would make the plastic packaging unsuitable, and the package would cost more than the chip itself.

We didn't have particularly good or dependable power-analysis tools in those days; they were all a bit approximate. We applied Victorian engineering margins, and in designing to ensure it came out under a watt, we missed, and it came out under a tenth of a watt—really low power.


It happens due to ESD protection diodes on the I/O lines. You can still see this happen in some of the current low power microcontrollers.


Interesting. It reminds me of how (simple) compression schemes save disk space and improve throughput on modern database systems. Moving data around the memory hierarchy seems to be the bottleneck in all systems, from registers, via cache, RAM, (soon x-point,) SSDs, all the way to spinning disks and indeed offsite storage. Some form of space optimisation seems to pay off both ways across this hierarchy, even if it is more compute intensive. It was not always so: in the 90s and earlier, compression tended to come with a speed tradeoff.


I agree, compression becoming 'free' or 'better than free' is counter-intuitive, and yet it can be a performance improvement. Perhaps we'll see another change in priorities in the future, though, as systems optimise more and more for power consumption over raw speed...

(There are SSDs that do transparent compression/decompression. I wonder if there are RAM chips out there that do the same thing for performance?)

To be fair to ARM, their original decision to stick to fixed size instructions was probably correct at the time, when the limiting factor in performance was entirely CPU, and memory speeds were not so much a bottleneck.


> Actually, it turned out not to be that good a decision. The generated code is far less compact than x86

That isn't true with x86-64. I've measured the average instruction size of x86-64 in the wild and it ends up being almost exactly 4 bytes. REX prefixes add up fast.
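
If you want to reproduce that kind of measurement, here's a rough sketch (it assumes a Linux system with GNU binutils and just counts the hex byte pairs objdump prints per instruction, so treat the result as approximate):

    # Rough sketch: average instruction length of an x86-64 binary, from
    # objdump's disassembly. Continuation lines (extra bytes of one long
    # instruction) add to the byte count but not to the instruction count.
    import subprocess
    import sys

    def average_insn_length(path):
        out = subprocess.run(["objdump", "-d", path], capture_output=True,
                             text=True, check=True).stdout
        total_bytes = total_insns = 0
        for line in out.splitlines():
            parts = line.split("\t")
            # Disassembly lines look like: "  401000:\t48 83 ec 08 \tsub $0x8,%rsp"
            if len(parts) >= 2 and parts[0].strip().endswith(":"):
                total_bytes += len(parts[1].split())
                if len(parts) >= 3:   # a mnemonic follows, so a new instruction starts
                    total_insns += 1
        return total_bytes / total_insns if total_insns else 0.0

    if __name__ == "__main__":
        print("average instruction length: %.2f bytes" % average_insn_length(sys.argv[1]))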


So across 64-bit ARM and x86-64, what is the ratio of instructions across the same program compiled by effectively the same compiler (modulo of course maturity in code selection and peephole optimization)?

My guess is that x86-64 still saves quite a few pages of memory over 64-bit ARM on a large program.


As an experiment, I built CLAPACK (v3.2) for both arm64 and x86_64 with an up-to-date clang.

x86_64 size: 5.2M
arm64 size: 4.4M

So on at least some large codebases, arm64 looks pretty efficient.
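
If anyone wants to repeat the comparison, one way is to sum the code (text) sections with GNU size; a minimal sketch, with placeholder paths standing in for wherever your two builds end up:

    # Minimal sketch: compare code size of two builds using GNU size.
    # The paths below are placeholders, not the actual build locations.
    import subprocess

    def text_size(path):
        out = subprocess.run(["size", path], capture_output=True,
                             text=True, check=True).stdout
        lines = out.splitlines()[1:]   # skip the "text data bss ..." header
        return sum(int(line.split()[0]) for line in lines if line.strip())

    for path in ("build-x86_64/liblapack.a", "build-arm64/liblapack.a"):
        print(path, text_size(path) / 1e6, "MB of code")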


This isn't a robust counter argument for the same reason that clock speed is not a robust measure of overall cpu performance: not all instructions (or cycles) achieve the same amount of work.


Sure, but the advantage in terms of expressiveness surely goes to 3-address instructions (ARM) rather than 2-address ones (x86). See the sibling comment for an example of ARM being more compact in practice. The only way x86 could possibly pull ahead with a less expressive ISA would be if it got smaller instructions out of it, but it doesn't (thanks to the bloated REX prefixes).


As a RISC architecture, ARM makes loads and stores explicit. x86 can add/multiply to a memory location in a single instruction. AFAIK in ARM that would be three instructions. I think you'd definitely need to measure this before being able to claim what would "surely" be true.


x86-64 somehow manages to be less compact than ARM while still being variable-length. Thumb is still much saner than x86/x86-64.


This is correct. Thumb-2 is the only actual compact instruction set in wide use. x86-64 doesn't even come close to being as efficient; between ARM and x86-64, it's a wash.

It's certainly possible to design a CISC ISA (or maybe even a RISC one) that's optimized for compactness. x86 isn't that, though. Maybe it was in the 16-bit era, in a hypothetical future where BCD instructions were used a lot.


You mean Thumb-2?


No, even vanilla ARM. x86-64 instructions are 3-5 bytes in size and compilers don't use complex instructions effectively.


Well… there are other issues besides just making code smaller; the idea of having a fixed-length instruction size is making the decoder simpler.

Decoding enough instructions to feed a wide issue machine is really hard on x86 and can require loads of power due to the ISA, while if you have fixed or semi-fixed size instructions (like thumb), it is much easier.

You can design ISAs that are made for wide issue, cheap decoding, and compact encoding at the same time, but unfortunately that required answers to questions that just weren't available to the MIPS/ARM/x86 designers. Out-of-order superscalar processors just hadn't been invented yet.


The problem is that a simpler decoder doesn't compensate for the extra instruction cache needed to achieve the same hit rates/levels of performance, and that is bad for power efficiency since L1 cache needs to run at full core speed and in modern CPUs there's vastly more transistor area in the cache than the decoder. The increased memory traffic from lower hit rates also doesn't help. This article shows that effect quite clearly:

http://www.extremetech.com/extreme/188396-the-final-isa-show...

The x86s have 32K of L1 icache, the ARMs 32K or 16K, and the MIPS Loongson has 64K. Also, the Loongson does not support MIPS16 whereas the ARMs all support Thumb. If you look at the total energy consumed, the MIPS is noticeably worse than x86 or ARM:

http://www.extremetech.com/wp-content/uploads/2014/08/Averag...

In fact, the cache takes so much power that Intel engineers have found it profitable to turn off parts of the cache when in low-power modes; this feature is called Dynamic Cache Sizing and appears in the later Atom series.


> that is bad for power efficiency since L1 cache needs to run at full core speed and in modern CPUs there's vastly more transistor area in the cache than the decoder

It's not that simple. Dynamic power depends on the toggle rate of the flip-flops and the electrical capacitance of the fan-out wires and gates, not on the number of transistors. In a cache, very few storage elements change their state in every cycle, while the decoder performs a lot of work in every cycle.
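
The textbook first-order model makes that concrete: P_dyn ≈ α · C · V² · f, where α is the activity factor (the fraction of nodes that toggle each cycle), C the switched capacitance, V the supply voltage and f the clock frequency. A cache has an enormous C but a tiny α, since only roughly the accessed line's worth of bits moves each cycle; a decoder has a far smaller C but an α close to 1.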


Something I came across recently said that on x86, 65% of the power cost of a complete cache miss was in the logic of the cache hierarchy.


It's even more complicated than that, since the cache doesn't have to store encoded instructions; it can store decoded instructions, and a few of the caches on a modern x86 CPU actually do that. For example, there's a loop cache after the decoders, so that small loops never have to be decoded more than once.


> The problem is that a simpler decoder doesn't compensate for the extra instruction cache needed to achieve the same hit rates/levels of performance

Except this isn't true for x86-64, because x86-64 instructions are just as large as ARM instructions in practice.


And the MIPS is based on a 90nm process vs the 32nm of the Sandy Bridge they tested; while that is relevant to what you can buy, it says nothing about the intrinsic properties of the design.

Intel has had a massive advantage in fabrication for a long time.


It was a good decision for about 10 years, maybe longer for arm due to power considerations. After that transistor budgets started to allow low cost brute force decoding of x86.


https://research.cs.wisc.edu/vertical/papers/2013/isa-power-...

Generally, the thought is that ISA doesn't really affect performance.


A recent HN discussion referenced a paper that analyzed architectures in terms of ... addressing modes, as I recall. The thesis as I recall was something like: the more difficult or delayed it was to pre-fetch data, the harder it was to make the ISA fast, and the rank-ordered ISAs were a good match for what I recall happened in the market. E.g. the VAX ISA was quite bad, and DEC indeed really struggled to make it fast, and didn't bother trying to make a 64-bit version of it.

If you're primarily comparing the two major survivors of this battle, as the interesting paper you linked to does, less competitive ISAs that didn't make the cut would already be removed from serious consideration.

Heck, the great success of both the Intel and ARM ISAs was more a matter of historical accident than intended design, e.g. the 8086 was designed to be compatible with earlier 8-bit designs and to fit in a 20,000 active gate budget. As I've described elsewhere, the ARM was designed to hit a power dissipation target (which it did by a huge margin), and since then the compact Thumb ISAs have been added to it.


Do you have a link to that discussion/paper? Sounds interesting.


Sorry, I haven't been able to find it.


Interesting paper, thanks for linking it in here.

I don't really understand their conclusions though. They show that the behaviour of x86 (converted to uOps) and ARM is similar (instructions in flight, etc), but I can't see how the end result (the i7 is vastly faster, sometimes 25x faster than the ARM A8/A9) can demonstrate that the ARM instruction set is not responsible for some of the performance difference.

The paper certainly demonstrates that there is no performance hit going the other way, i.e. the extra complexity of the x86 instruction is never a bottleneck, and once translated to uOps, the instructions are broadly comparable to RISC. I get that, but I don't get how they can also conclude that the ARM ISA is not a performance limiter. They don't investigate code size or memory/cache performance, for instance.


Also, synthetically scaling an i7 down to 1GHz and assuming the performance will be proportional is not very fair: because its core runs so much faster than memory (contrast this with the ARMs, which are closer), the performance decrease with decreasing core speed will be sublinear, whereas power consumption rises much faster than linearly with frequency:

http://i.stack.imgur.com/daciI.png

I'd expect actually running an i7 at 1GHz vs 3GHz to be far more power efficient than their scaling indicates.

There's some past HN discussion about that paper/an article related to it here:

https://news.ycombinator.com/item?id=8235120


Thumb is the 16-bit ISA. IIRC, the Thumb-2 encoding is variable-length (albeit still not as flexible as x86) and contains both 16- and 32-bit instructions.


This is the classic 32-bit ARM immediate encoding.

Modern 32-bit ARM also has movw and movt, which load the bottom and top 16 bits of a register. Combining the two, you can load any 32-bit value in two instructions. The classic 8-bit+rotation immediates are still used for arithmetic immediates, though.
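
To make the classic scheme concrete, here's a quick throwaway check for whether a value fits the 8-bit+even-rotation encoding (a sketch, not production code):

    # Sketch: can `value` be encoded as a classic 32-bit ARM data-processing
    # immediate, i.e. an 8-bit value rotated right by an even amount (0..30)?
    def rotate_left_32(value, amount):
        amount %= 32
        return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF

    def classic_arm_immediate(value):
        """Return (imm8, rotate_right) if encodable, else None."""
        for rotate in range(0, 32, 2):
            imm8 = rotate_left_32(value, rotate)   # undo the right rotation
            if imm8 <= 0xFF:
                return imm8, rotate
        return None

    # classic_arm_immediate(0x000000FF) -> (0xFF, 0)
    # classic_arm_immediate(0xFF000000) -> (0xFF, 8)
    # classic_arm_immediate(0x00000102) -> None (it would need an odd rotation)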

ARM64 is different. For arithmetic, you get a 12-bit value with an optional left-shift by 12. For bitwise instructions you get an arbitrary string of 1's at any position you want (including rotated), which turns out to be very useful. For loading values, much as in modern ARM-32, you load 16 bits at a time, using movz and movk to reach the higher parts of a register. You have to string together four of those to load a fully general 64-bit value, but that's pretty rare in practice.


In aarch64, the bitwise operation immediates are a bit more complicated. From the documentation:

Is the bitmask immediate. Such an immediate is a 32-bit or 64-bit pattern viewed as a vector of identical elements of size e = 2, 4, 8, 16, 32, or 64 bits. Each element contains the same sub-pattern: a single run of 1 to e-1 non-zero bits, rotated by 0 to e-1 bits. This mechanism can generate 5,334 unique 64-bit patterns (as 2,667 pairs of pattern and their bitwise inverse). Because the all-zeros and all-ones values cannot be described in this way, the assembler generates an error message.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

I feel like I have a vague memory of a single byte being repeated with some specific pattern for some instruction, but I can't find it anymore.
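
A quick brute-force sanity check of that description (a throwaway sketch that just enumerates element sizes, run lengths and rotations as the quoted text describes):

    # Sketch: enumerate AArch64 bitmask immediates as described above --
    # an element of e bits holding a rotated run of 1..e-1 ones, replicated
    # across 64 bits.
    def bitmask_immediates():
        patterns = set()
        for e in (2, 4, 8, 16, 32, 64):
            for ones in range(1, e):                 # run length 1..e-1
                run = (1 << ones) - 1
                for rot in range(e):                 # rotation 0..e-1
                    elem = ((run >> rot) | (run << (e - rot))) & ((1 << e) - 1)
                    value = 0
                    for _ in range(64 // e):         # replicate to 64 bits
                        value = (value << e) | elem
                    patterns.add(value)
        return patterns

    pats = bitmask_immediates()
    print(len(pats))                        # 5334, matching the documentation
    print(0 in pats, 2**64 - 1 in pats)     # False False: all-zeros/all-ones excluded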


In addition to those, there are PC-relative loads from a literal pool. This is what you end up using nine times out of ten to load a full register, rather than filling it in piecemeal with immediates.


If you want to see something truly wacky and inspired, look at how the Transputer loads constants. There's an instruction that does this, IIRC, four bits at a time, with rules for what happens to previously loaded bits in terms of shifting, sign extension and so forth. Basically you build up a constant 4 bits at a time. Most of the time you want things like 1, 2, 4, -1, 0 and so forth, so it made sense.

Of course, being a stack machine, this is easier (no register to address, just a top-of-stack value).
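
From memory (so the details may be off), the mechanism was roughly this: every instruction byte carries a 4-bit opcode and a 4-bit operand nibble; pfix shifts previously accumulated nibbles up, nfix complements them (for negative constants), and the final instruction, e.g. ldc, consumes the accumulated operand:

    # Sketch from memory of the Transputer operand register -- details may be off.
    class OperandRegister:
        def __init__(self):
            self.value = 0

        def pfix(self, nibble):   # prefix: accumulate the nibble, shift up
            self.value = ((self.value | nibble) << 4) & 0xFFFFFFFF

        def nfix(self, nibble):   # negative prefix: complement, then shift up
            self.value = (~(self.value | nibble) << 4) & 0xFFFFFFFF

        def ldc(self, nibble):    # load constant: final nibble, then clear
            result = self.value | nibble
            self.value = 0
            return result

    o = OperandRegister()
    print(o.ldc(0x4))                      # 4: small constants are one byte
    o.pfix(0x1); print(o.ldc(0x0))         # 16: pfix 1, ldc 0
    o.nfix(0x0); print(hex(o.ldc(0xF)))    # 0xffffffff, i.e. -1 in two bytes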

I don't know why the Transputer didn't take off. Probably Occam (which was interesting, but weird). I know of at least one TV set top box that used the T800, but don't know of any other design wins the chip had.


> I don't know why the Transputer didn't take off.

I think for two reasons.

1) "too weird". This eventually dooms all of the goofy architectures like Transputer, IA-64, Transmeta, STI Cell, etc. Nobody wants to deal with something radically different unless there are compelling long-term reasons to do so. Granted, that's a little bit of a chicken-and-egg argument. Without adoption there won't be any long-term existence of an architecture.

2) Support tools and infrastructure. Intel and ARM have an incredible variety of support software. Any new architecture is at a disadvantage unless and until all that supporting stuff is written and works well. This means compilers, IDEs, development hardware, etc. Normally the tools for a new architecture are laughably primitive compared to existing environments. Which puts early adopters at a serious disadvantage.

We're in an x86-64 / ARM duopoly now. Nothing else is really relevant. Do you see anything that could change that in the near future? I sure don't.


We could get a monopoly.

Maybe OpenRISC/RISC-V will rise. Who'da thunk in 1990 that Linux would rise?


I'd love to see an ISA that uses something like a modified Huffman coding. That way you could have jump instructions with 30-bit immediates.


SPARC. The CALL instruction takes a 30-bit immediate. http://www.cs.rochester.edu/~scott/456/local_only/sparcv7.pd... Also the Mill talks seem to indicate that their (variable-length) instruction encoding is machine-generated to be as efficient as possible.
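
The CALL encoding itself is about as simple as it gets: a 2-bit opcode and a 30-bit signed word displacement relative to the PC, so it reaches anywhere in a 32-bit address space. A quick sketch of my reading of the v8 format:

    # Sketch: SPARC v8 CALL -- opcode 01 in the top two bits, then a 30-bit
    # signed word displacement; the target is PC + 4 * disp30.
    def encode_sparc_call(pc, target):
        disp30 = ((target - pc) >> 2) & 0x3FFFFFFF
        return (0b01 << 30) | disp30

    print(hex(encode_sparc_call(0x10000, 0x11000)))   # 0x40000400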



