Addressing Criticism of RISC-V Microprocessors (erik-engheim.medium.com)
176 points by nnx on March 20, 2022 | 145 comments


People over-argue these minimal differences. Let's be honest: never in the history of ISAs has a slightly better code size been the primary reason for the success or failure of an instruction set.

Even if by <insert objective measurement> RISC-V is 10% worse than ARM, it wouldn't actually matter that much for adoption.

Adoption happens for business reasons, and what differentiates RISC-V far more than anything else is the change in license, governance and ecosystem.

RISC-V being better at hitting different verticals optimally because of its modularity is likely another thing that matters more overall than how perfectly it fits each vertical.


People argue over these minimal differences for good reasons.

If <insert objective measurement> = binary size, and I'm buying ROM in volume to hold that binary, +10% ROM address space can easily cost more than the ARM license.

That can matter quite a lot for adoption. Especially in the short term.

Obviously, priorities differ and change as a function of time but as the saying goes, the only thing worse than making a decision with benchmarks is making a decision without benchmarks.


In 64-bit land, RISC-V consistently has the smallest code size. That's RV64GC, not even using the new things in the B extension that will make code smaller still.

Some data from Ubuntu 21.10 for amd64, arm64, and riscv64:

https://www.reddit.com/r/RISCV/comments/tik718/addressing_cr...


I only brought up the binary size thing to give a concrete example based off the article and the parent's comment. I am totally sure the situation is fluid and changing.

My high level point is: changes in "objective measurement" have costs in the same way that license, governance and ecosystem have costs. And "objective measurement" can easily overwhelm the others, especially at scale, and therefore they should not be dismissed as unimportant.


Agreed, but I think the purpose of these kinds of criticisms is to "fix" RISC-V before it becomes yet another worse-is-better design locked in for 50 years.


I think you are really missing the point here. Of course RISC-V has negatives but most of those negatives exist for good reasons. It is a question of tradeoffs.

One of the most important goals of RISC-V is to make an architecture which can stand the test of time. In this space adding the wrong kind of instructions is a bigger problem than not adding particular instructions.

Whether you look at x86, HTML or just about anything, the problem is nearly always having to support old junk which no longer makes sense, or lacking the ability to grow. Remember "640K is enough for everyone"? RISC-V has a lot of room to grow.

If you want an architecture for the future you would want a minimalist one with room to grow a lot. By keeping the instruction count very low and building in a system for extensions they have made a future proof ISA. Okay we cannot know the future, but it is more likely to survive for decades than something like x86 or maybe even ARM.


Most of the complaints about RISC-V are extremely basic things like array indexing and conditional execution. These will never not be needed.


The problem is that those criticisms are almost always wrong. For instance the ARM proponents criticise RISC-V for having long instruction encodings and then they quote a few carefully cherry picked examples. But if you look at actual compiled programs RISC-V is nearly always 10-15% smaller than ARM.


I’m sure that’s what the team that invented segment registers said too.

The question is does it make sense to add these to the ISA long term? In the short term, given die density and how memory works today, it has advantages. But die density increases, making OoO cores cheaper, and memory technology changes. It’s not obvious that these are long term improvements.


It's not that the examples given are not correct, in isolation, it's that they are not common enough in real code to matter.


IANAE, but the article addresses why the arguments that assume these instructions need to be combined are usually not based on looking at the whole picture.


The only part of RISC-V that is "locked in" to any extent is the minimal set of basic integer instructions. Everything else is defined as part of standardized extensions, and can be superseded simply by defining new custom extensions. Actually even the minimal instruction set admits of some variation, such as the 'E' architectures that dispense with registers R16 to R31, thus saving area in the smallest implementations and potentially freeing up some bits in the encoding.


Things get locked in not by standards, but by usage. If your software depends on particular instructions being present you’re not going to buy a processor that has superseded those instructions, even if the new instructions conform to a theoretically cleaner design.

Everything being an extension (and thus removable) is a strength in some specific circumstances, but is a weakness in most.


That doesn't make any sense. Because RISC-V has kept instructions to such a bare minimum, there will be very little you will be dependent on and which new designs have to take into account. What you describe is a problem for x86 and ARM not for RISC-V. You got it all in reverse.

General-purpose CPUs for desktop computers and such will not be willy nilly adding extensions. They will standardize on something like RV64GC. However for specialized hardware which you only ever deal with through drivers, nobody will care whether extensions come and go. Are you worried that the processor in your keyboard or mouse isn't backwards compatible with the processor you used in your previous mouse or keyboard?

Anyway, code today can already check what extensions you have and generate different code paths. You can ship operating systems which have fallback code for most extensions so they can be removed in the future. You also have traps, so RISC-V can jump to a software implementation of unsupported instructions.
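To make the "different code paths" point concrete, here is a minimal sketch in C of the usual resolve-once, dispatch-through-a-pointer pattern. The probe function and both copy routines are stand-ins invented for illustration; a real build would back the probe with whatever the platform provides (hwcaps, /proc/cpuinfo, a vendor SDK) and supply a genuinely vectorised routine:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical probe: how the extension is detected is beside the
     * point here; assume it returns true iff the vector extension exists. */
    static bool cpu_has_vector_ext(void) { return false; /* placeholder */ }

    /* Portable fallback path, always available. */
    static void copy_scalar(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* In a real build this would live in a translation unit compiled with
     * the vector extension enabled; stubbed here so the sketch compiles. */
    static void copy_vector(uint8_t *dst, const uint8_t *src, size_t n) {
        copy_scalar(dst, src, n);
    }

    /* Resolve once at startup; afterwards every call goes through the
     * pointer, so the same binary runs whether the extension exists or not. */
    static void (*copy_best)(uint8_t *, const uint8_t *, size_t);

    void init_dispatch(void) {
        copy_best = cpu_has_vector_ext() ? copy_vector : copy_scalar;
    }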


> Are you worried that the processor in your keyboard or mouse isn't backwards compatible with the processor you used in your previous mouse or keyboard?

No, but the keyboard manufacturer might be, as might the people who are writing and optimising compilers targeting that CPU.


I'm heavily invested in RISC-V, both personally and professionally, and I think the story is much more complicated than this makes it out to be, but I'm not going to rehash the discussion yet again.

However, I do want to point out that a real issue (especially with legacy code) is the scaled address calculation with 32-bit unsigned values. Thankfully the Zba extension adds a number of instructions that help a lot, but it still would require fusion to get complete parity with Arm64.

For

    int update(int *base, unsigned index) { return base[index]++; }
We get

    update:
            sh2add.uw  a1,a1,a0
            lw         a0,0(a1)
            addiw      a5,a0,1
            sw         a5,0(a1)
            ret
Zba is included in the next Unix profile and will _likely_ be adopted eventually by all serious implementations.

EDIT: grammar and spacing


I'm guessing that your assembly code is RISC-V with the Zba extension. Is the non-Zba version worse than Arm64?

Compiling your function with Godbolt, I get:

  RISC-V (no Zba) Clang - 7 instructions - https://godbolt.org/z/7znnrzxKq
  Arm64 Clang -           7 instructions - https://godbolt.org/z/Trv8scxad
Annoyingly I can't see the code size for the Arm64 case because no output is generated if I tick the "Compile to binary" option in "Output". I have to use GCC instead:

  RISC-V (no Zba) Clang - 20 bytes - https://godbolt.org/z/eWfPaorcj
  Arm64 GCC             - 24 bytes - https://godbolt.org/z/bzsPzov5h


EDIT: Hmm, I seem to have picked a bad example. Try this one:

    int get(int *base, unsigned index) {return base[index];}
Arm64:

    get:
        ldr     w0, [x0, w1, uxtw 2]
        ret
RV64GC (vanilla):

    get:
        slli    a5,a1,32
        srli    a1,a5,30
        add     a0,a0,a1
        lw      a0,0(a0)
        ret

RV64GC+Zba:

    get:
        sh2add.uw  a0,a1,a0
        lw      a0,0(a0)
        ret
Arm64 is able to do some indexed loads in a single instruction that might take two in RISC-V w/Zba (and up to 4+ in regular RISC-V). However, calling that a win for Arm64 is not so clear as the more complicated addressing modes could become a critical timing path and/or require an extra pipeline stage. However, as a first approximation, for a superscalar dynamically scheduled implementation, fewer ops is better so I would say it's a slight win.

I don't understand the obsession with bytes. 25% fewer bytes has only a very marginal impact on a high-performance implementation, and the variable length encoding has some horrendous complications (which is probably why Arm64 _dropped_ variable length instructions). Including compressed instructions in the Unix profile was the biggest mistake RISC-V made and I'll die on that hill.

ADD: Don't forget that every 32-bit instruction is currently wasting its lower two bits to allow for compressed, thus any gain from compressed instructions must be offset against the 6.25% tax that is forced upon it.


Why use an unsigned? It is obvious here that RISC-V without Zba takes 4 instructions because it has to handle special cases related to unsigned.

If you use a simple int for index:

  slli  a1,a1,2
  add   a0,a0,a1
  lw    a0,0(a0)
And isolating this code in a small function puts constraints on register allocation, but if we remove this constraint then we can write:

  slli  a1,a1,2
  add   a1,a1,a0
  lw    a1,0(a1)
Which is very suitable for macro-op fusion and the C extension.

> Including compressed instruction in the Unix profile was the biggest mistake RISC-V did and I'll die on that hill.

This is so wrong. The C extension is one of the great strengths of RISC-V, it is easy to decode, very suitable for macro-op fusion, and it gives a huge boost in code density


Indeed why use unsigned? Go take a look at a lot of C code (hint, look at SPEC benchmarks). They do that. Decades of pointer == int == unsigned have led to a lot of horrific code. But we still compile it.

The sin of the original RISC-V was spending too much time looking at RV32 and not realizing how big a problem this is in practice. Zba (slipped in as it's not really "bit manipulation") fixes the worst of this.

ADD: The problem in this HN thread is the same reason we got compressed in the first place. The vast majority of people aren't doing high performance wide implementations, so the true cost isn't widely appreciated. The people holding the decision power certainly didn't understand it. I really think you have to live it to understand it.


> Go take a look at a lot of C code (hint, look at SPEC benchmarks). They do that.

This does not really justify isolating that snippet if you admit yourself it's a bad one.

> The sins of the original RISC-V was spending too much time looking at RV32 and not realizing how big a problem this is in practice.

But indeed, my previous message shows that even without Zba the problem is erased by a good register allocation and macro-op fusion.

I think you are trying too hard to find special cases that "trick" RISC-V, you didn't even pay attention to the use of unsigned which is non-optimal (unsigned has an undesirable overflow semantic here).


Iirc compressed instructions are the thing that costs 2 bits per 32 and was criticised as overfitted to naive compiler output. Am I thinking of something else?


Yes, but RISC-V still has a lot of encoding-space free and the benefit of C extension is huge. It's a trade-off.

I don't think RISC-V is perfect or universal, but on this point they do a pretty good job compared to other ISAs


You say that the benefit is 'huge' but why should I care about code density on a modern CPU with gigabytes of memory and large caches?

From a performance perspective what is the evidence that this actually provides an advantage?


To clarify, I'm not saying that RISC-V code density of C extension is a big advantage over its competitors, but it is a huge benefit for RISC-V.

You are right, code density is perhaps not that critical today.

And it is difficult to quantify its relevance as code density is always related to other variables such as instruction expressiveness, numbers of uop emitted, etc.

But I still think that code density is important for RISC-V, because the RISC-V philosophy for reaching high performance is to use very simple instructions that can be combined and take advantage of macro-op fusion. I think RISC-V without macro-op fusion can't reach the performance of other ISAs.

But RISC-V with all these simple and not very expressive instructions and without C extension has a pretty bad code density which could cost a lot because it is not at the competitors' level.

So if we think of RISC-V as a macro-op fusion oriented ISA, then the C extension becomes important to be competitive.

I don't know what is better between a "macro-fusion" oriented arch and a "complex-instruction" oriented arch; the future will tell.


Thanks for clarifying - interesting to see a different philosophy being tried.


Because the fastest cache levels are tiny, even in the largest and most advanced CPUs. There's plenty of evidence for the performance benefits of improved density and terseness in both code and data.


The M1 has a 192k instruction cache for performance cores which is not ‘tiny’.

If there is lots of evidence for the performance benefits of improved density vs the alternative of fixed instruction width in real world CPUs then I’m sure you’ll be able to cite it.


>The M1 has a 192k instruction cache for performance cores which is not ‘tiny’.

ARMv8 and ARMv9 have poor code density. These caches are large as a workaround for that.

This isn't free, as besides making the die larger (and thus lower yields), the L1's clock speed is limited due to its size.


32 bits, 32 registers, three-register code. So add r0 r1 r2 spends fifteen bits on identifying which registers to use, then another two on the compressed ISA. That's half the encoding space gone before identifying the op. Never thought I'd want fewer registers but here we are.

If the compressed extension is great in practice it might be a win. If the early criticism of overfit to gcc -O0 proves sound and in practice compilers don't emit it then it was an expensive experiment.


The encoding space is not the number of bits used to encode an instruction; the encoding space is the ratio of values that encode a valid instruction to the total number of values that could be encoded.

As an example, in an ISA with 8-bits fixed instruction length and 8 registers (reg index encoded on 3 bit):

If the top 2 bits are the opcode and we define 2 instructions (e.g. AND, XOR) that each manipulate 2 registers (3 bits + 3 bits), then instruction word values 0b00_000_000 to 0b01_111_111 encode these two instructions (the "_" are just separators).

Therefore, instruction word values from 0b10_000_000 to 0b11_111_111 remain free, which is half of the encoding space.

This means we still have room to put new instructions.

Similarly, RISC-V valid instructions use almost all the available bits, but there is still room in the encoding space because some opcodes remain free.
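As a toy illustration of that ratio, here is a brute-force count in C over the 8-bit example above (the layout is exactly the one described: top two bits select the opcode, two opcodes defined, low six bits name two 3-bit registers):

    #include <stdio.h>

    int main(void) {
        /* Toy 8-bit ISA from above: opcode 0b00 = AND and 0b01 = XOR are
         * defined, 0b10 and 0b11 are still free. */
        int used = 0;
        for (int word = 0; word < 256; word++) {
            int opcode = (word >> 6) & 0x3;
            if (opcode == 0 || opcode == 1)
                used++;
        }
        /* Prints: used 128/256 encodings, 128 remain free */
        printf("used %d/256 encodings, %d remain free\n", used, 256 - used);
        return 0;
    }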


> 25% fewer bytes has only very marginally impact on a high-performance implementation

Instruction cache doesn't come for free, and is usually pretty small on most shipping processors. It's not a big deal for smaller benchmarks, but in real-world programs this can become a problem.


I am obviously aware and I'm here to tell you that the overhead of variable length instructions matters more. Arm agrees. M1 has a 192 KiB I$ btw.

ADD: had RISC-V just disallowed instructions from spanning cache lines and disallowed jumping into the middle of instructions, then almost all of the issues would have gone away. Sigh.


I actually had Apple's chips in mind when talking about "most shipping processors" because they have historically invested heavily in their caches and reaped benefits from it. But not all the world's an M1, and also I'll have you know that Apple themselves cares very much about their code size, even with their large caches. Don't go wasting it for no reason!

(I should also note that I am pretty on board with you with regards to variable-length instructions, this is just independent of that.)


Variable instruction sizes have a cost, but with only 2 instruction sizes like current RISC-V that cost remains very low as long as we don't have to decode a very large number of instructions each cycle, and it gives a huge code density advantage.


The biggest issue is one instruction spanning two cache lines, and even two pages. This means a bunch of tricky cases that are a source of bugs and overheads.

It also means you cannot tell instruction boundaries until you directly fetch instructions, so you cannot do any predecode in the cache that would help you figure out dependencies, branch targets, etc. These things matter when you are trying to fetch 8+ instructions per cycle.


> The biggest issue is one instruction spanning two cache lines

Even with fixed (32-bit) instruction lengths aligned on 32 bits, when you have to decode a group of 8 instructions you face this kind of issue.

So you either have to cut the instruction group (and thus not take full advantage of the 8 way decoder) or you have to implement a more complex prefetch with a longer pipeline. And these special cases can be handled in these pipeline stages.

> It also means you cannot tell instruction boundaries until you directly fetch instructions

I mean, AMD does that on x86, with 14 instruction lengths.

It can be done for RISC-V, it's much cheaper than x86, and it takes significantly less surface area than a bigger cache to compensate.


Given the compression stuff is an extension (and, so far as I can tell, the 16-bit alignment for 32-bit instructions that can result in that sort of spanning is part of that extension), couldn't you implement said extension for tiny hardware where every byte counts, and then for hardware where you're wanting to fetch 8+ instructions per cycle just ... not implement it?

Wait (he says to himself, realising he's an idiot immediately -before- posting the comment for once). You said upthread the C extension is specified as part of the standard UNIX profile, so I guess people are effectively required to implement it currently?

If that was changed, would that be sufficient to dissolve the issues for people wanting to design high performance implementations, or are there other problems inherent to the extension having been specified at all? (apologies for the 101 level questions, the only processor I really understood was the ARM2 so my curiosity vastly exceeds my knowledge here)


Have the ARM AArch64 designers ever commented on this? They intentionally left out any kind of compressed instructions, and certainly Apple at least cares a lot about code size.


Try this at 34:30 - from Arm’s architecture lead Richard Grisenthwaite. Earlier he says that several leading micro architects think that mixing 16 bit and 32 bit instructions (Thumb2) was the worst thing that Arm ever did.

https://m.soundcloud.com/university-of-cambridge/a-history-o...


He explicitly specifies that those micro-architects are at companies OTHER than ARM.

His own opinion appears to be that the worst thing ARM ever did was T2EE, designed for JIT compilers and compilers for dynamic languages. He says that by the time the chips came out compiler technology had advanced to the point that it was no longer useful and no one else used it.

A couple of other points picked up in the talk:

- He reverses Hennessy and Patterson wrt SPARC and MIPS.

- A64 effort started in 2007. So it took 5 years to freeze/publishing, the same as RISC-V.

- A64 architects thought code density is no longer important. Some people definitely disagree with that. At the time they probably thought amd64 was the only competition and matching/beating that was good enough.

- he seems to be regretting the 2nd operand shift because it fell naturally out of the 1985 micro-architecture, but it's a burden now. And yet it was included in A64 -- presumably because the initial processor pipelines had it anyway, because they supported A32. But now we have A64-only CPUs.

- LL/SC was the wrong thing to do.


Not only that, but CPUs have a maximum number of instructions they can dispatch per cycle (typically 4 or 6). Even in microbenchmarks, the difference there could show up.


The bottleneck on that is data interdependency of your algorithm. If you break it in 6 or 10 instructions, the data dependency stays the same.

(Of course, you can add unnecessary dependencies with a badly designed ISA. But it's not a necessary condition.)


There's still a limit to how many instructions you can decode and dispatch every cycle, even with zero dependencies. There's also definitely dependencies in the example where you're computing a memory address to access a value.


Thank you for writing the obvious. Instruction Byte count is the wrong metric here 100%. Instruction Count (given reasonable decoding/timing constraints) is the thing to optimize for and indeed variable length encoding is very bad.


Instruction byte count matters quite a lot when you're buying ROM in volume. And today, the main commercial battleground for RISCV is in the microcontroller space where people care about these things.


For those of us without the expertise, could you elaborate on why that is?

On the one hand we have byte count, with its obvious effect on cache space used. But to those of us who don't know, why is instruction count so important?

There's macro-op fusion, which admittedly would burn transistors that could be used for other things. Could you elaborate why it's not sufficient?

And then the fact that modern x86 does the opposite to macro-op fusion, by actually splitting up CISC instructions into micro-ops. Why is it so bad if they were more micro-ops to start with, if Intel chooses to do this?


The conditional execution section makes no mention of the fact AArch64 doesn't have this feature either, and bizarrely lists a “64-bit” ARM code example that isn't. This doesn't inspire confidence in the author's understanding.


All of the ARM assembly is wrong. AArch64 uses “x” or “w” to identify general purpose registers, “r” isn’t a thing.


In the same section the part about the SiFive optimization is also misleading. The goal of the optimization is obviously to avoid interrupting the instruction fetch. But he makes it sound like the goal was to reduce instruction count by fusing two instructions to get a single monster op with five (!) register operands. That just doesn't make sense.


They don't fuse it. They have the branch and the following instruction both go down one integer pipeline just as they would if the branch was predicted "not taken". They are linked/tagged together such that if the branch actually happens then the result from the second instruction is simply not written back, instead of taking a branch mispredict.

So the four input operands and one output operand are not a problem because that's just what two pipelines do all the time anyway.


Yes, the author's defence for Myth #1 does not strike me as a correct defence from the RISC-V perspective. Going to compressed instructions and compressed macro-op fusion is way overkill for the very basic indexed load/store problem.


The actual hardware cost of this case of "macro-op fusion" is negligible.


Yes it is negligible. The problem is that indexed load/store is a non-issue. There is ready-to-use simulation data showing that indexed load/store has minimal impact on dynamic code size. If someone uses an overkill feature to explain a non-issue, then it seems like nonsense.

Imagine someone using quantum gravity to explain why Australians stand upside down on a globe.


macro-op fusion is not rocket science.


Then writing up an actually meaningful array processing function and translating it into asm is even less rocket science.

The author starts with an apparently meaningless `int x = a[i];`, but did not ask an obvious question before digging in: what usually goes before and after this statement, and together what will they produce? A formal RISC-V code analysis usually does not go like this.

`int x = a[i];` is the kind of expression that strikes you as useful at first glance, but then nothing. If you walk over an array, then on x64/ARM/RISC-V it is all compiled down to 2 instructions per iteration. Not 1 vs 2 vs 3 as suggested by the article. RISC-V may have one or two more instructions outside the loop but that's it.


I thought it was only supposed to be pseudo-assembly?


There are so many armchair specialists when it comes to criticizing RISC-V. I've seen people claim an ISA is better because it has branch delay slots... which seems clever to someone who knows enough technical details about CPUs to understand the benefit of that feature ("free" instruction execution for every branch taken), but it is a terrible idea for a truly scalable ISA (a huge PITA for out-of-order architectures, if I've understood correctly).

I'm sure there are some bad decisions in RISC-V, but I've yet to see one that isn't in the process of being remedied. There was a good argument that the lack of a POPCOUNT instruction is bad, but I think that's being added soon.


First you complain about "armchair specialists", then you make a blanket assertion about branch delay slots. Believe it or not, there are ISAs for applications where branch delay slots are useful, for example TMS320 DSPs with up to 5 delay slots.


My very much non-expert understanding was that branch delay slots can be a neat optimisation when enough details of the target processor design are known at ISA design time to pick the 'right' number of slots.

OTOH if one is designing an ISA that will have a bunch of different implementations - and this includes later implementations wanting to be ASM compatible with earlier ones - they tend to eventually become a footgun for the processor designers. (If I remember correctly and didn't completely misunderstand, MIPS' branch delay slots were absolutely a neat optimisation for the early models, but when they went to a deeper pipeline for later chips the slots required a bunch of extra design effort to maintain compatibility with, without being helpful anymore.)

(explicit disclaimer that I'm an armchair amateur here, so if you're a fellow non-expert reading this comment before it attracts better informed replies please default to joining me in the assumption that I've made at least one massive error in what I'm saying here)


You shouldn't assess RISC-V from the viewpoint of a single chip. It's an ISA first. Branch slots are highly target specific


Fairly lame article (not wrong, but stuff people following the topic have seen before), and I'd still like to hear about integer overflow detection. If the floating point extension is able to do IEEE 754 condition codes including overflow detection, why can't the integer unit do something similar?


This comes up a lot and I'm sympathetic to your plea, really (I enjoy fantasizing about a different reality where CPUs weren't just "machines to run C programs"), but in computer architecture, what really matters for one application or a class of applications might not be important when viewed across millions of programs.

The fact is that integer operations and floating point are two completely different beasts, so much so that we have different benchmark suites for each.

Integer operations are critically latency sensitive and tacking on extra semantics doesn't come for free, and for most code this would be a tax. The "overflow bit" represents an implicit result that would have to be threaded around (I'm assuming that you aren't asking for exceptions, which literally nobody wants). For FP we do that, but the cost and latency of FP ops is already high so it doesn't hurt quite as much.

The RISC-V spec [1] (which I assume you have seen) already discusses all these trade offs:

"We did not include special instruction-set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition:

    add t0, t1, t2
    bltu t0, t1, overflow
For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition:

     addi t0, t1, +imm
     blt t0, t1, overflow
This covers the common case of addition with an immediate operand. For general signed addition, three additional instructions after the addition are required, leveraging the observation that the sum should be less than one of the operands if and only if the other operand is negative.

     add t0, t1, t2
     slti t3, t2, 0
     slt t4, t0, t1
     bne t3, t4, overflow
In RV64I, checks of 32-bit signed additions can be optimized further by comparing the results of ADD and ADDW on the operands."

I do think that it might have been worth adding a single-instruction version for the last one (excluding the branch), but I'm not aware of it getting accepted.

[1] https://github.com/riscv/riscv-isa-manual
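For what it's worth, at the C level this kind of check is usually reached through __builtin_add_overflow (a GCC/Clang builtin), which the compiler lowers to whatever the target does best; on RISC-V that means branch sequences roughly along the lines quoted from the spec above:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true on success, false if the signed addition overflowed.
     * GCC and Clang expand __builtin_add_overflow into the target's
     * cheapest overflow-check idiom. */
    bool checked_add(int64_t a, int64_t b, int64_t *sum) {
        return !__builtin_add_overflow(a, b, sum);
    }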


Yes I've seen that reasoning: they propose bloating 1 integer instruction into 4 instructions in the usual case where the operands are unknown. Ouch. In reality they expect programs to normally run without checking like they did in the 1980s. So this is more fuel for the criticism that RISC-V is a 1980s design with new paint. Do GCC and Clang currently support -ftrapv for RISC-V, and what happens to the code size and speed when it is enabled? Yes, IEEE FP uses sticky overflow bits and the idea is that integer operations could do the same thing. Integer overflow is one of those things like null pointer dereferences, which originally went unchecked but now really should always be checked. (C itself is also deficient in not having checkable unsigned int types).


Generating the overflow bit and storing it adds a completely negligible cost to a 64-bit adder, so touting this as a cost saving measure is just a lie, even if indeed this claim has always been present in the RISC-V documentation.

Most real cases of overflow checking are of the last type. Tripling the number of instructions over a bad ISA that lacks overflow exceptions, like unfortunately almost all currently popular ISAs are, or quadrupling the number of instructions over a traditional ISA with overflow exceptions is a totally unacceptable cost.

The claim that providing overflow exceptions for integer addition might be too expensive can be easily countered by the fact that generating exceptions on each instruction is not the only way to guarantee that overflows do not happen.

It is enough to store 2 overflow flags, 1 flag with the result of the last operation and 1 sticky flag that is set by any overflow and is reset only by a special instruction. Having the sticky flag allows zero-overhead overflow checking for most arithmetic instructions, because it can be tested only once after many operations, e.g. at a function exit.
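To spell out the usage pattern being proposed (the intrinsics below are hypothetical, no such sticky-flag instructions exist in any ratified RISC-V extension; they are stubbed with a plain variable purely so the sketch compiles):

    #include <stdint.h>

    /* Hypothetical software model of the proposed sticky overflow flag. */
    static int sticky;
    static void clear_sticky_overflow(void) { sticky = 0; }
    static int  read_sticky_overflow(void)  { return sticky; }

    /* The point of the sticky flag: do an arbitrary amount of arithmetic
     * with zero per-operation checking, then test once on the way out. */
    int64_t sum(const int64_t *a, int n, int *overflowed) {
        clear_sticky_overflow();
        int64_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += a[i];                      /* hardware would set the flag */
        *overflowed = read_sticky_overflow(); /* single test at function exit */
        return acc;
    }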

The cost of implementing the 2 overflow bits is absolutely negligible, 2 gates and 2 flip-flops. Much more extra hardware is needed for decoding a few additional instructions for flag testing and clearing, but even that is a negligible cost compared with a typical complete RISC-V implementation.

Not providing such a means of reliable and cheap overflow detection is just stupid and it is an example of hardware design disconnected from the software design for the same device.

The early RISC theory was to select the features that need to be implemented in hardware by carefully examining the code generated by compilers for representative useful programs.

The choices made for the RISC-V ISA, e.g. the omission of both the most frequently required addressing modes and of overflow checking, prove that the ISA designers either have never applied the RISC methodology, or they have studied only examples of toy programs, which are allowed to provide erroneous results.


The extra expense is not the generation of the overflow bit, but the infrastructure needed to support a flags register, or for every instruction to be able to generate an exception.

On a simple processor like a microcontroller this doesn't cost much, but it severely hampers a superscalar or out-of-order processor, as it can't work out very easily which instructions can be run in parallel or out of order.

The clean solution from a micro architectural point of view would be to have an overflow bit (or whatever flags you wanted) in every integer register. But that's an expense most don't want to pay.


> it severely hampers a superscalar or out of order processor, as it can't work out very easily which instructions can be run in parallel or out of order

This is an old myth, endlessly parroted. On current x86, the status register is renamed just like other registers, as could easily have been done in a better RISC-V design. Lack of status flags will be RISC-V's equivalent of delay slots, that once felt like an optimization but has already aged badly.

The unreliable presence of POPCNT and ROT instructions was a worse failing, apparently mitigated lately.


One must not forget that on any non-toy CPU, any instruction may generate exceptions, e.g. invalid opcode exceptions or breakpoint exceptions.

In every 4-5 instructions, one is a load or store, which may generate a multitude of exceptions.

Allowing exceptions does not slow down a CPU. However they create the problem that a CPU must be able to restore the state previous to the exception, so the instruction results must not be committed to permanent storage before it becomes certain that they could not have generated an exception.

Allowing overflow exceptions on all integer arithmetic instructions, would increase the number of instructions that cannot be committed yet at any given time.

This would increase the size of various internal queues, so it would increase indeed the cost of a CPU.

That is why I have explained that overflow exceptions can be avoided while still having zero-overhead overflow checking, by using sticky overflow flags.

On a microcontroller with a target price under 50 cents, which may lack a floating-point unit, the infrastructure to support a flags register may be missing, so it may be argued that it is an additional cost, even if the truth is that the cost is negligible. Such an infrastructure existed in 8-bit CPUs with much less than 10 thousand transistors, so arguing that it is too expensive in 32-bit or 64-bit CPUs is BS.

On the other hand, any CPU that includes the floating-point unit must have a status register for the FPU and means of testing and setting its flags, so that infrastructure already exists.

It is enough to allocate some of the unused bits of the FPU status register to the integer overflow flags.

So, no, there are absolutely no valid arguments that may justify the failure to provide means for overflow checking.

I have no idea why they happened to make this choice, but the reasons are not those stated publicly. All this talk about "costs" is BS made up to justify an already taken decision.

For a didactic CPU, as RISC-V was actually designed, lacking support for overflow checking or for indexed addressing is completely irrelevant. RISC-V is a perfect target for student implementation projects.

The problem appears only when an ISA like RISC-V is taken outside its right domain of application and forced into industrial or general-purpose applications by managers who have no idea about its real advantages and disadvantages. After that, the design engineers must spend extra efforts into workarounds for the ISA shortcomings.

Moreover, the claim that overflow checking may have any influence upon the parallel execution of instructions is incorrect.

For a sticky overflow bit, the order in which it is updated by instructions does not matter. For an overflow bit that shows the last operation, the bit updates must be reordered, but that is also true for absolutely all the registers in a CPU. Even if 4 previous instructions that were executed in parallel had the same destination register, you must ensure that the result stored in the register is the result corresponding to the last instruction in program order. One more bit along hundreds of other bits does not matter.


> After that, the design engineers must spend extra efforts into workarounds for the ISA shortcomings.

That is too optimistic. Programs will keep running unchecked and we'll keep getting CVE's from overflow bugs.


> The clean solution from a micro architectural point of view would be to have an overflow bit (or whatever flags you wanted) in every integer register.

That's what the Mill CPU does. Each "register" also had the other usual flags, and even some new ones like Not a Result, which helps with vector operations and access protection.


> The cost of implementing the 2 overflow bits is absolutely negligible, 2 gates and 2 flip-flops. Much more extra hardware is needed for decoding a few additional instructions for flag testing and clearing, but even that is a negligible cost compared with a typical complete RISC-V implementation.

That's understating things considerably.

ARMv8-A has PSTATE, which includes the overflow bit. This explicit state must be saved / restored upon any context switch.

And there isn't just a single PSTATE for an OOO SuperScalar, there are several.

Everything has a cost.


The article said that it's not the generation and write back of the overflow bits that are the problem, this is basically free, it's the shared state that a status register represents that is the problem. Every single arithmetical/logical operation writes to it, out-of-order execution is already headache-inducing as it is without an implicit argument to most instructions.


The typical overhead of overflow checking in compiled languages (which, as a reminder, is in the low single-digit %'s at most) has nothing to do with the lack of hardware-specific extensions. It's a consistent pattern of missing optimization opportunities, because the compiler now needs to preserve the exact state of intermediate results after some operation fails with an overflow. Adding these new opcodes to your preferred ISA would barely change anything. (If they help at all it's in executing highly dynamic languages as opposed to compiled ones, which makes them a natural target for the in-progress 'J' extension.)


I don't understand what you are saying with this comment. Every general purpose ISA that I know of other than RISC-V has a means of detecting overflow, such as a carry bit that you can branch on, and usually an ADDC instruction. RISC-V decided to have no method at all, other than expanding 1 instruction to 4 instructions. Integer overflow in C is undefined behaviour and we get a lot of CVEs from overflow bugs because of that. Ideally there should be a (hardware or software) overflow trap, as safer compiled languages like Ada require. Dynamic languages with bignums are another matter, though the instruction mix is likely a lot different.


> I'm assuming that you aren't asking for exceptions which literally nobody wants

I want exceptions. Why would they be a bad idea? Besides the fact that software doesn't utilize them today (because they're not implemented, chicken and egg problem)? IMO they would be as big a security win as many other complex features CPU designers are adding in the name of security, e.g. pointer authentication.


> add t0, t1, t2 bltu t0, t1

How does this work? Isn't `bltu` simply a branch that is taken if `t0 < t1`? How does that detect addition overflow?

EDIT: Ah, because the operands are `t1` and `t2`. `t0` is the result. Quack.


> Every 32-bit word in the instruction cache will contain either a 32-bit uncompressed instruction or two 16-bit compressed instructions. Thus everything lines up nicely.

This is not really accurate AIUI, since the RISC-V C extension allows 32-bit insns to be 16-bit aligned. (This would also happen if 48-bit insns were enabled by some other future extension). It's nonetheless a lot simpler than whatever x86 has to do, since insn length is given by a few well-defined bits in the insn word.


I get "To keep reading this story, get the free app or log in. (With Facebook or Google)" on mobile.

No thanks, Medium. These dark patterns crop up everywhere lately...


That’s a shame…it was the most annoying thing about Quora to me.

As a Medium writer, I’m annoyed now! They already stopped paying me my ~0-10$ per month because I refused to beg everyone to get to their new minimum 100 followers requirement for getting paid.



It is there for authors like me writing on Medium to have a way of getting paid. There is a need for both paid and free content. But the reality is that you cannot produce quality content if everything has to be free. Advertising is one solution, but not one without its own serious drawbacks.

Medium is like a magazine with a very large number of journalists whom it pays to write for it. Naturally it needs to charge subscribers to make an income.



12ft.io got past that for me.


It is amusing and sobering to get a glimpse of some of the complexities going on inside a processor and how design philosophies may affect them. Those are things the user or even the average programmer seldom thinks about.



I have difficulty following the points the author is trying to make.

- Even with instruction compression, the type of code they present will take more space than, say, AArch64.

- The entire section on conditional execution doesn't make any sense. Conditional execution is bad, we know it, and that's why modern ARM does not have conditional execution. Overall, the author's insistence on comparing RISC-V to the practically obsolete ARMv7 when ARMv8 has been available for over a decade is... odd.

- Regarding SIMD... it's a very complex topic, but personally, I don't see any fundamental problem with a vector-style ISA. I think it's a great way of allowing scalable software. But a vector ISA does not replace basic SIMD, as they solve different problems. Vector stuff is great for throughput, SIMD is great for latency. There are many tasks such as geometry processing, modern data structures etc. where fixed-size 128-bit SIMD is an excellent building block. That's why ARM has both NEON and SVE2; the latter does not make the former obsolete. And that bit about GPUs and how they are not good for vector processing... not even sure how to comment on it. Also, at the end of the day, specialised devices will vastly outperform any general-purpose CPU solution. That's why we see, say, Apple M1 matrix accelerators delivering matmul performance on par with workstation CPU solutions, despite using a fraction of the power.

Overall, my impression is that the article is grasping at straws, ignores modern technology and ultimately fails to deliver. I also remain unconvinced by the initial premise that RISC-V follows the principle "not painting yourself into a corner due to choices which have short term benefit". I do think that choices like keeping instructions as simple as possible (even though it makes expression of common patterns verbose), avoiding flags registers, disregarding SIMD etc. could be characterised as "painting oneself into a corner".

A usual disclaimer: I do think that RISC-V is a great architecture for many domains. Simple low-power/low-cost controllers, specialised hardware, maybe even GPUs (with extensions) — the simplicity and openness of RISC-V makes it a great point of entry for basically anyone and invites experimentation. I just don't see much merit of RISC-V in the general-purpose high-performance consumer computing (laptop/desktop). In this space RISC-V does not have any notable advantages, it does have potential disadvantages (e.g. code density and lack of standard SIMD — yet). Most importantly, the CPU microarchitecture becomes the decisive factor, and designing a fast general-purpose CPU requires a lot of expertise and resources. It's not something that a small group of motivated folk can realistically pull off. So all the great things about RISC-V simply do not apply here.


Author here: I have tried to clarify this better in the update. The point is that I am talking about AArch32 and AArch64 in the article. Yes, everybody has been moving away from conditional instructions, because they don't work well in out-of-order superscalar processors, and they are pointless when you have good branch predictors.

HOWEVER, an argument in the ARM camp is that they are very useful when dealing with smaller chips. Remember ARM and RISC-V compete in the low range as well as the higher range. AArch32 is not obsolete. It still has uses. There have been ARM fans claiming that conditional instructions make ARM superior for simple chips. The argument here was that RISC-V has a way of dealing with simple in-order chips as well.


> There has been ARM fans claiming that conditional instructions make ARM superior for simple chips.

For those following this only from the sidelines, it would help strengthen the article if the article has links to such claims. I couldn’t easily find them, and would be curious as to their age, given that, reading https://en.wikipedia.org/wiki/Predication_(computer_architec..., ARM has made substantial changes to conditional execution a few times since 1994 (over 25 years ago); Thumb (1994) dropped them, Thumb-2 (2003) replaced them by, if I understand it correctly, an instruction “skip the next 4 instructions depending on flags”, and ARMv8 replaced them by conditional select.

(In general, providing links to articles claiming each proclaimed myth to be true would strengthen this article. I think I’ve only ever read about #1, and not with as strong a wording as “bloats”)


If there were some good articles to point to I would. However I don't want to single out people ranting against RISC-V. This is more about opinions which keep popping up here on Hacker News, Twitter, Quora and other places. I don't want this discussion to be turned personal.

It should be possible to discuss these opinions without singling out anyone.

I am however talking about claims put forth after ARMv8. The argument here has basically been this: both ARM and RISC-V aim to cover both the low end and the high end. Some ARM fans think that by not including conditional instructions, RISC-V really only works for high-end CPUs. The idea here is that AArch32 would be better than RV32 for lower-end chips.


Thanks for replying, much appreciated. I admit that my interest primarily goes towards desktop applications and my comment was made from that angle. Regarding conditional instructions on in-order microarchitectures: I neither have a strong opinion nor am I familiar with the discussion. Intuitively at least, conditional instructions do seem like a useful tool for simple in-order CPUs, but as you point out, instruction fusion can be used to solve a lot of things. It's a question of tradeoffs, and it is not obvious to me what cost these tradeoffs bring. It does seem that RISC-V will need to rely more heavily on operation fusion to achieve better performance and efficiency (as opposed to something like ARM where some patterns are already "pre-fused" in the ISA). Time will tell. I don't think that any of these are trivial.


Conditional execution (CMOV) in modern x86 (or any modern target) enables 2x better sort performance specifically in places where branch prediction utterly fails because the input data is not predictable. If RISC-V lacks this capability (like Arm64?), then congratulations, RISC-V has closed off the possibility of a 2x performance improvement in important algorithms.

This is also a place where RISC-V displays an ABI choice that is nonsensical: a bit of historical awareness would have made the boolean "true" value ~0, or -1, as indeed it effectively is in AVX2 and GPUs.
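The sorting case in question is typically written as a branchless compare-and-swap; with integer min/max ternaries like the ones below, compilers generally emit conditional selects (CMOV on x86, CSEL on AArch64) instead of branches, which is what avoids the mispredict penalty on unpredictable data. A sketch, not taken from any particular sort implementation:

    #include <stdint.h>

    /* Branchless compare-and-swap: with conditional moves this compiles to
     * straight-line code with no branch to mispredict. */
    static inline void cswap(int64_t *a, int64_t *b) {
        int64_t lo = *a < *b ? *a : *b;   /* typically a conditional select */
        int64_t hi = *a < *b ? *b : *a;
        *a = lo;
        *b = hi;
    }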


> And that bit about GPUs and how they are not good for vector processing... not even sure how to comment on it. Also, at the end of the day, specialised devices will vastly outperform any general-purpose CPU solution. That's why we see, say, Apple M1 matrix accelerators delivering matmul performance on par with workstation CPU solutions, despite using a fraction of power.

Of course GPUs are good for vector processing compared to a general-purpose CPU. That was not the point at all. The point is that unlike older architectures such as Cray, they were not designed specifically for general-purpose vector processing but for graphics processing. That is why solutions such as SOC-1, built specifically for general-purpose vector processing, can compete with graphics cards made by giants like Nvidia.

The article is talking about adding vector processing both to RISC-V chips aimed at general purpose processing as well as to specialized RISC-V cores which are primarily designed for vector-processing. SOC-1 is an example of this. It has 4 general purpose RISC-V cores called ET-Maxion, while also having 1088 small ET-Minion cores made for vector processing. However these are still RISC-V cores, rather than some graphics card SM core.

I don't get your argument about SIMD being great for latency. RISC-V requires that vector registers are at minimum 128-bit, so you can use RVV as a SIMD instruction set with 128-bit registers if you want.


ET-Minion can compete with GPUs because it is essentially a GPU. Most GPUs are in-order RISC machines with very wide SIMD ALUs and huge register banks. Whether it runs RISC-V or some proprietary ISA doesn't really matter. How you build things does. And anyway, the RISC-V vector ISA seems to be closely inspired by GPU ISAs to begin with.

Regarding SIMD: maybe you are right. I don’t know. Performance of RISC-V vectors here is an unknown factor. If VSETVL is zero-cost and the CPU can rename vector partitions without added latency, sure. But I remain skeptical until proven otherwise. The very design of RISC-V vector stuff screams amortized processing of large data blocks, not flexible low-latency operations on small data blocks. And notably, I could not find basic SIMD shuffle/interleave operations in the spec. That’s perfectly fine for large data vectors where you can work with multiple instruction, data-parallel vector registers and blends, but it would kill performance for many algorithms that work on limited number of lanes.


Why do you list code density as a potential disadvantage? With changes to the ISA that has already been approved, RISC-V will have the best code density of any significant ISA in real-world code.

The article touched on this briefly so it's odd that you would claim this without a source for the claim. I know there's some outdated benchmarks where it's slightly worse than Thumb for instance. But then Thumb isn't relevant for desktop CPUs.

The downside for RISC-V for high end desktop/laptop is the lack of a large commercial backer (someone like Apple could pull it off, but they've clearly bet on ARM, which was the right choice since RISC-V was far from ready). The lack of the huge legacy of tool chains and software built around x86 and ARM is also obviously a huge disadvantage.

But you could have said the same about ARM back in the day. The thing is, I'm not sure if the advantages for RISC-V are big enough to take over all of ARM's markets the way ARM has the potential to with x86.


No, you are right. It’s just frustrating that there is almost no information on this topic. The only data I was able to find is a very short paper from 2017 that compares code density for some simple use cases. The results suggest that Aarch64 and compressed RISC-V are in the same ballpark. I would like to see this looked at over larger and more complex code bases.


It isn't odd at all, this is the kind of usual narrative when selling stuff to an audience that only has a passing knowledge of all issues.

So anyone that isn't deep into ARM architecture will indeed buy into the arguments being made, as they can't assert otherwise.


> In RISC-V the equivalent would require a whole 3 different instructions

I'm surprised that it wasn't pointed out that this should probably be eliminated by a compiler transformation. Rather than loading from r0+r1<<2 and incrementing r1 by one every loop iteration, surely it might be possible to load from just r0 and increment it by 4 every loop iteration?
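A minimal C sketch of that strength-reduction transformation, which optimizing compilers generally perform on their own; the two functions below compute the same thing:

    #include <stddef.h>

    /* Indexed form: each iteration conceptually computes base + (i << 2). */
    long sum_indexed(const int *base, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += base[i];
        return s;
    }

    /* After strength reduction: walk a pointer instead, so each iteration
     * is a plain load plus an add of 4 to the pointer, and no scaled or
     * indexed addressing mode is needed. */
    long sum_reduced(const int *base, size_t n) {
        long s = 0;
        for (const int *p = base; p != base + n; p++)
            s += *p;
        return s;
    }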


We don't need indexed loads only for loops. But even within loops, you may need indexed load.

One simple example: when you do loop unrolling, you must access elements n+4, n+8 and n+12 etc.


True, there are other uses. However some people seem to be pointing out that the RISC-V code is still smaller on average.

As for unrolling, isn't this a job for RV64V?


It’s nice to have an open ISA, don’t get me wrong.

However, trade offs matter. Compressing instructions may improve density, but it makes them variable length. This is a big barrier to decoding in parallel, which is very important to high performance cores.


This is really not a big deal with RISC-V's 2 instruction lengths and the encoding they use.

If decoding 32 bytes of code (256 bits, somewhere between 8 and 16 instructions), you can figure out where all the actual instructions start (yes, even the 16th instruction) with 2 layers of LUT6s.

You can then use those outputs to mux two possible starting positions for 8 decoders that do 16 or 32 bit instructions, plus 8 decoders that will only ever do 16 bit instructions from fixed start positions (and might output a NOP or in some other way indicate they don't have an input).

OR you can use those outputs to mux the outputs of 8 decoders that only do 32 bit instructions and 8 decoders that do 16 or 32 (all with fixed starting positions), plus again 8 decoders that only do 16 bit instructions from fixed start positions (possibly not used / NOP).

The first option uses less hardware but has higher latency.

That, again, is for decoding between 8 and 16 instructions per cycle, with an average on real code of close to 12.

That is more than is actually useful on normally branchy code.

In short: not a problem. Unlike x86 decoding.
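For anyone who wants the boundary-finding rule itself: with only the two instruction lengths currently defined, a 16-bit parcel starts a 32-bit instruction exactly when its low two bits are 0b11, and a 16-bit compressed instruction otherwise. Below is a sequential software model in C of what the parallel LUT network above computes:

    #include <stddef.h>
    #include <stdint.h>

    /* Mark which 16-bit parcels in a fetch window start an instruction.
     * Low two bits == 0b11 -> 32-bit instruction, anything else -> 16-bit
     * compressed. (Sequential model of what the decoder does in parallel.) */
    static void find_starts(const uint16_t *parcels, size_t n, uint8_t *is_start) {
        size_t i = 0;
        while (i < n) {
            is_start[i] = 1;
            if ((parcels[i] & 0x3) == 0x3) {   /* 32-bit instruction */
                if (i + 1 < n)
                    is_start[i + 1] = 0;       /* second half, not a start */
                i += 2;
            } else {                           /* 16-bit compressed */
                i += 1;
            }
        }
    }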


What would one do if they had half an instruction falling off the end of that window? Save it then deal with it when you have the other half next cycle?

I’m assuming the decode window is aligned, but that may be a bad assumption.


You are right but RISC-V variable instruction size is indeed a good trade-off.

Unlike x86 where instructions can range from 1 up to 15 bytes, the current RISC-V ISA only has 2 instruction sizes.

Today x86 decoding is limiting because we want to decode more than ~4 instructions each cycle; for RISC-V to cause the same decoding difficulty you would probably have to decode more than ~20 instructions each cycle.


I don’t know that it’s a good tradeoff. ARM64 has fixed length instructions with decent density and has proven to allow highly parallel decoding.


I've implemented RISC-V in verilog for fun and it has a parallel decoder. The ISA's cleverly designed to make it pretty easy. Honestly it was no big deal.


Regular base instructions are always 32-bit, compressed are always 16-bit, and they're always aligned. I don't think there's a problem decoding them in parallel. You always know where the opcodes will be located in a 32-bit word - or set of 32-bit words - you're trying to decode.

What I've been wondering is how difficult it is to fuse instructions when the compressed instructions you're trying to fuse isn't aligned to a 32-bit word.


> You always know where the opcodes will be located in a 32-bit word - or set of 32-bit words - you're trying to decode.

Uh? In RISC-V with the C extension, 32bit instructions are 16-bits aligned, so no you don't.


I am so bored with people criticising RISC-V based on tiny code snippets of things that basically never happen in real code.

A function that does nothing but return an array element from an array base address and index passed to it? Really? Do you actually write junk like that? And if you do write it, does your compiler really not inline it? Why? Do you like big slow code? Once it's inlined, it's probably in a loop, and strength-reduced.

It's very easy to verify that in the real-world RISC-V code is more compact than amd64 and arm64. Just download the same version of Ubuntu or Fedora (etc) for each one and run the "size" command on the binaries. The RISC-V ones are consistently significantly smaller.

You can also, with quite a bit more work, count the number of µops each ISA executes. RISC-V executes slightly more instructions, but they are each simple and don't need expanding. Lots of x86 instructions get expanded into multiple µops and many 64 bit ARM instructions do too. In the end the number of µops executed by each is very similar.

Trying to judge the goodness of a modern ISA by looking at two or three instruction snippets is as silly as using Dhrystone as your only benchmark program.


I think the biggest issue is the lack of arithmetic with overflow checking, especially with a variant that calls a location in a control register on overflow.

This makes it very inefficient to compile languages that would like overflow checks on all arithmetic.


A comment elsewhere here pointed out that RISC-V can do it with two fused compressed instructions for the most common operations. So it seems like they made the right trade-off to me.
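As a rough illustration (assuming GCC or Clang; the exact instruction sequence depends on the backend), compilers already lower the overflow builtins to a short branch-based sequence rather than needing a trapping instruction:

    #include <stdbool.h>
    #include <stdint.h>

    /* Each returns true and stores the sum if no overflow occurred. */

    bool checked_addu(uint64_t a, uint64_t b, uint64_t *out) {
        /* unsigned: typically an add plus a single compare-and-branch */
        return !__builtin_add_overflow(a, b, out);
    }

    bool checked_adds(int64_t a, int64_t b, int64_t *out) {
        /* signed: a slightly longer compare sequence, still branch-based */
        return !__builtin_add_overflow(a, b, out);
    }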


What are these r registers for AArch64? (AArch64 has x and w registers, not r.) The author likely hasn't bothered to run any of the examples through an actual 64-bit ARM assembler. Dubious.


I am personally a fan of RISC-V, and I have written low level code for it before.


> Easily out-perform ARM in code density

> No data [that I can see at least]


Some data from Ubuntu 21.10 for amd64, arm64, and riscv64:

https://www.reddit.com/r/RISCV/comments/tik718/addressing_cr...


RISC-V is not technically bad enough to justify selecting arm64 or x86_64 over it, since those have beyond-toxic IP tied to them.

From what I read in the comments, I don't expect compressed instructions to be present on future high-performance desktop/server RISC-V CPU cores.


It doesn't matter how convincing the sales pitch is when the product is not actually for sale.

One thing ARM and x86 got right that SPARC and POWER got wrong is widely-available machines available at reasonable prices. All the 'being right' in the world won't help if developers need a five-figure hardware budget to port to your platform. VMs don't cut it for bringup.


$17 64 bit RISC-V https://linuxgizmos.com/17-sbc-runs-linux-on-allwinner-d1-ri...

$29 64 bit RISC-V in the same form factor as an RPi CM3 https://www.clockworkpi.com/product-page/copy-of-clockworkpi...

If you never look for it, you will believe it doesn’t exist.

I was very skeptical of ARM back in the day thinking that it was great for crappy little iTrinkets and Androids but not for “real computing”. I was clearly wrong. I was very skeptical of RISC-V until I recently heard Jim Keller explain why RISC-V has a bright future. He was rather convincing. This is especially true given his track record of straight-up magical results. Looking at different RISC-V machines, I think that the greatest advantage is that it is simple and can therefore be optimized more easily than complex designs, and due to being open, it has very low cost which will encourage more eyes trying more and different optimizations.

EDIT: Link to Jim Keller interview https://www.anandtech.com/show/16762/an-anandtech-interview-...


I agree mostly with Keller's take, but I think he left out one key factor: the quality of the software toolchain.

The x86 toolchains are amazing. They're practically black magic in the kinds of optimizations they can do. Honestly, I think they're a lot of what is keeping Intel competitive in performance. ARM toolchains are also very good. I think they're a lot of the reason why ARM can beat RISC-V in code space and performance on equivalent-class hardware, because honestly, like Keller says, the ISAs are not all that different for common-case software. But frankly, x86 and ARM toolchains should dominate RISC-V when we just consider the amount of person-hours that have been devoted to these tools.

So for me the real question is: where are the resources that make RISC-V toolchains competitive going to come from (and keep in mind x86 and ARM have open-source toolchains too)? And will these optimizations be made available to the public?

If we see significant investment in the toolchains from the likes of Google, Apple, Nvidia, or even Intel, ARM needs to be really worried.


I don't know that such a heavy investment in the toolchains for RISC-V is actually needed.

If you look at generated code, it seems fairly straightforward. There aren't a lot of tricks or anything.


It's not so much "tricks" that one needs to look out for.

The compiler has just tons of internal heuristics on when and when not to apply various code transformations. Those heuristics, first off, may not even be applicable to your platform of choice, and even if they are, their magic numbers aren't necessarily tuned well to the platform and application at hand.

Here is a well-written and concise case study, albeit somewhat old (2010), that illustrates what I am talking about. The results of various measurements will have changed since then, but the overall high-level situation hasn't. If you read the paper, in your mind just replace every instance of x86 with ARM and every instance of ARM with RISC-V and you'll get the idea.

https://ctuning.org/dissemination/grow10-03.pdf
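A toy example of what those heuristics decide (my own, not from the paper): whether a loop like this gets unrolled, vectorized, or left alone is set by target-specific cost models and threshold constants, none of which are visible in the source, and that tuning is exactly where the x86 and ARM backends have years more of accumulated work.

    /* Whether this gets unrolled, vectorized, or emitted as-is depends on the
       backend's cost model (unroll factors, vector widths, register-pressure
       estimates), which is where most of the person-hours of tuning live. */
    void saxpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }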


I think the serious investment will be from Intel, Apple (with LLVM), and possibly Microsoft (into the GCC/Linux ecosystem).


> I recently heard Jim Keller explain why RISC-V has a bright future

Would like to hear it too. Can you share a link?


Updated my response to include a link to the transcription. The audio/video is here: https://www.youtube.com/watch?v=AFVDZeg4RVY

It’s actually important (if you’re not an engineer) to listen to the whole thing, because he drops knowledge all over the place.


Watching that was my best use of time in the past month.

I watched at 1.5x speed.


ARM proved its place for real computing on the Newton OS and the Acorn Archimedes; no need to prove it again on crappy little iTrinkets and Androids.

Where is RISC-V doing “real computing” on an Acorn Archimedes-like personal computer?


There's lots of RISC-V hardware these days, from embedded RV32 chips up to machines you can run Linux on. It's nothing at all like SPARC/POWER.


At least PowerPC machines were available for reasonable prices from Apple for about a decade - and Linux was quite well supported in addition to OS X. But with Motorola’s loss of interest in the PC and server market and IBM’s focus on processors for consoles, there was no future for Apple in the growing mobile market. After all, we’re still waiting for the G5 Powerbook :).


Dubious. How is "you have to use this magic combination of instructions that compress & execute well" better than having a dedicated instruction?

Also no mention of the binary compatibility issues - which `-march` do you compile your code for? On x86 you have a choice of 3. For RISC-V as far as I can tell there are 96 valid targets.


Because dedicated instructions suck up valuable encoding space, and the more instructions you have, the more instructions you have which potentially become obsolete with new advances in microarchitecture.

Not to mention that by sticking with simple single purpose instructions you make the CPU easier to teach to students. That is after all one of the goals of RISC-V in addition to creating a good ISA for industry.

Have we learned nothing about why we abandoned CISC in the first place? Those CPUs got riddled with instructions that never got used much.


With every node shrink those legacy instructions take up less space.


Encoding space, not die space.


Encoding space is almost infinite on Intel though.


> Encoding space is almost infinite on Intel though.

Yes... if you're willing to tolerate ever more complex and slow instruction decoding. This is a very bad tradeoff.


No. Firstly, -march (or similar, e.g. -mcpu in LLVM land) should target a chip, not individual instruction sets.

Secondly, AVX-512 alone has a handful of different extensions. There are a bunch of different SSE variants, and similarly instructions are still being added to the VEX prefix (normal AVX).

There is more potential for getting it wrong with RISC-V, but 64-bit implies a number of extensions too, so it's not too far off what amd64 originally meant for x86 (e.g. it implies SSE2).
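One mitigation, at least when you build from source, is that GCC and Clang expose the selected extensions as predefined macros (names per the RISC-V C API conventions; exact macro coverage varies between toolchain versions, so treat this as a sketch), so portable code can adapt instead of hard-coding one of the many -march combinations:

    #include <stdio.h>

    /* Report, at compile time, which RISC-V features this translation unit
       was built for, based on macros the compiler defines from -march. */
    void print_build_target(void) {
    #if defined(__riscv)
        printf("RISC-V, XLEN=%d\n", __riscv_xlen);
    #  if defined(__riscv_compressed)
        printf("compressed (C) instructions enabled\n");
    #  endif
    #  if defined(__riscv_vector)
        printf("vector (V) extension enabled\n");
    #  endif
    #else
        printf("not a RISC-V build\n");
    #endif
    }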


What do you mean "no"? My comment was entirely factual.

> Firstly -march (or similar, e.g. -mcpu in LLVM land) should target a chip not individual instruction sets.

LLVM still uses -march. And no you shouldn't target a specific chip unless you know your code will only run on that chip. That's the whole point I'm making. Sometimes you do know that (in embedded situations) but often you don't. Desktop apps aren't compiled for specific chips.

> Secondly, AVX-512 alone has a handful of different extensions.

Yes but these are generally linear - if an x86 chip supports extension N it will support extension N-1 too. Not true for RISC-V.


The LLVM tools (like llc) use -mcpu. Clang mimics GCC. My point about the specific chip is that you have to know it anyway if you're planning on targeting a combination of extensions so you might as well use it.

As for linearity, the "generally" bit will apply to RISC-V by the time we have real desktop-class chips using the ISA. We still can't assume AVX support for most programs; I don't view this as any different to RISC-V extensions. Just this ~year Intel added VEX-coded AI NN acceleration instructions; I assume RISC-V has similar plans.


LLVM uses -march and -mcpu. It seems to be a bit of a mess which one you should use, and it also depends on the architecture.

Time will tell if there's a de facto minimum set of extensions for desktop RISC-V. Let's hope so, but it isn't guaranteed.


> Yes but these are generally linear - if an x86 chip supports extension N it will support extension N-1 too. Not true for RISC-V.

Not if you include AMD and Intel cores in that.



That list isn't really an accurate picture of the world, but a vague attempt to make sense of the madness.

There's plenty of cores that don't follow that versioning scheme, and it's not an Intel or AMD construct.


People should zoom right out and think about the whole RISC-V project. When our phones have billions of transistors, are we seriously supposed to believe that RISC philosophy still matters? Personally I greatly prefer the user-programmable 68000 family of processors. The marketing of RISC-V is perhaps the most impressive thing about it. Each to their own; I can see why giant SSD manufacturers want to use a license-free design and share the cost of compiler development. Is there really anything else?


I loved the 68k. It was what got me started with Assembly. But one of the key reasons I have an interest in RISC-V today is for its educational potential. I know how x86 killed my interest in assembly coding. ARM honestly isn't all that much better.

RISC-V gives people a way to learn and understand what a modern CPU is like. Remember Donald Knuth's books: he teaches algorithms on an imaginary CPU (MIX), and as CISC architectures got superseded by RISC, he switched to an imaginary RISC CPU (MMIX) in his teaching.

His point is that people implementing stuff need to have some sense of how the hardware works to understand tradeoffs. RISC-V is in my view a great CPU arch to give that kind of understanding for somebody who is not necessarily interested in writing assemblers, compilers or what not.

Beyond that, RISC-V really fits well with the heterogeneous computing trend we are moving towards, where specialized hardware is increasingly doing more and more of our tasks. I would say it is an advantage that these different specialized chips have some commonality between them. RISC-V is giving people a way to create a whole ecosystem of chips for a variety of purposes which share a lot of instructions, debuggers, profilers, compilers and other tools.

There is no way x86 could be part of that revolution. x86 is stuck as a general purpose CPU. RISC-V on the other hand will power desktop computers, smart phones, micro-controllers, AI accelerator cards, super-computers and just about anything.


> When our phones have billions of transistors, are we seriously supposed to believe that RISC philosophy still matters?

The point isn't just saving gates because it's cheaper. Fewer gates means a shorter critical path, which means less power consumption and/or higher overall performance when compared apples to apples.


Yeah, absolutely. Personally, when I zoom out, and look at the trends of engineering in general: simpler modular systems that compose well together vs bespoke solutions, RISC-V precisely follows the trend. Reduce global state. Make it easier (for humans and algos) to reason about control flow. Have a simple core with optional extensions. This all makes building multi-core solutions way simpler. We are fast running out of transistor density improvements. But we are getting way better at building coprocessors. There's clear value in "doing more simple things in parallel".


> RISC philosophy still matters

What matters is not 'RISC philosophy' but that it is an Open Standard that allows for Open implementation.



