The biggest issue is that one instruction can span two cache lines, or even two pages. That creates a bunch of tricky cases that are a source of bugs and overhead.
It also means you cannot tell instruction boundaries until you directly fetch instructions, so you cannot do any predecode in the cache that would help you figure out dependencies, branch targets, etc. These things matter when you are trying to fetch 8+ instructions per cycle.
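To make the boundary problem concrete, here is a minimal sketch of how instruction starts get found in a stream that mixes 16-bit and 32-bit encodings. It assumes the standard RISC-V rule that a 16-bit parcel whose low two bits are `11` begins a 32-bit instruction and anything else is a 16-bit one, and it ignores the longer (48-bit and up) formats. The function name and the sample parcels are mine, purely for illustration.

```python
def instruction_starts(parcels):
    """Return the byte offsets at which instructions start.

    `parcels` is a list of 16-bit values in fetch order, and parcel 0 is
    assumed to be a real instruction start.
    """
    starts = []
    i = 0
    while i < len(parcels):
        starts.append(i * 2)
        # Low two bits == 0b11 means a 32-bit instruction (two parcels);
        # anything else is a 16-bit compressed instruction (one parcel).
        i += 2 if (parcels[i] & 0b11) == 0b11 else 1
    return starts


# A 16-bit instruction, a 32-bit instruction (low parcel first), a 16-bit one.
sample = [0x4501, 0x0513, 0x0010, 0x8082]
print(instruction_starts(sample))
```

The point is the loop: where instruction N+1 starts depends on the length of instruction N, so you cannot just slice the line at fixed offsets.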
> The biggest issue is one instruction spanning two cache lines
Even with fixed-length (32-bit) instructions aligned on 32-bit boundaries, you face this kind of issue as soon as you have to decode a group of 8 instructions, because the group itself can straddle a cache line.
So you either have to cut the instruction group short (and thus not take full advantage of the 8-way decoder) or you have to implement a more complex prefetch with a longer pipeline.
The special cases can then be handled in those extra pipeline stages.
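A quick back-of-the-envelope sketch of that point, assuming 64-byte cache lines (my assumption, nothing above fixes the line size) and a fetch group of eight 4-byte instructions:

```python
def cache_lines_touched(pc, n_instr=8, instr_bytes=4, line_bytes=64):
    """Number of cache lines covered by a fetch group starting at `pc`."""
    first_byte = pc
    last_byte = pc + n_instr * instr_bytes - 1
    return last_byte // line_bytes - first_byte // line_bytes + 1


for pc in (0x1000, 0x1020, 0x1024, 0x103C):
    print(hex(pc), cache_lines_touched(pc))
```

With these numbers, a group starting in the first half of a line fits in one line, and any 4-byte-aligned start past the halfway point touches two, even though no single instruction ever crosses a line.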
> It also means you cannot tell instruction boundaries until you directly fetch instructions
I mean, AMD does that on x86, where an instruction can be anywhere from 1 to 15 bytes long.
It can be done for RISC-V too: it's much cheaper than on x86, and it takes significantly less die area than the bigger cache you'd otherwise need to compensate.
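For a rough feel of why this is cheap on RISC-V, here is a sketch of boundary predecode as one "start" bit per 16-bit parcel. This is my own illustration, not a description of any real core: the one-bit-per-parcel layout, the 64-byte line, and the function names are all assumptions, and it only handles 16-bit and 32-bit encodings.

```python
def predecode_start_bits(parcels, first_is_start=True):
    """Mark which 16-bit parcels begin an instruction."""
    # Step 1: each parcel is examined independently of the others, which is
    # the part hardware could do for all parcels at once.
    spans = [2 if (p & 0b11) == 0b11 else 1 for p in parcels]
    # Step 2: follow the chain from a known start to pick the real boundaries.
    start_bits = [False] * len(parcels)
    i = 0 if first_is_start else 1
    while i < len(parcels):
        start_bits[i] = True
        i += spans[i]
    return start_bits


def predecode_bits_per_line(line_bytes=64):
    """Extra bits stored per cache line at one bit per 16-bit parcel."""
    return line_bytes // 2


print(predecode_start_bits([0x4501, 0x0513, 0x0010, 0x8082]))
print(predecode_bits_per_line(), "bits per 64-byte line")
```

Under those assumptions the stored marks cost 32 bits on top of a 512-bit line, and the length check is a two-bit compare per parcel, versus x86 where finding the length means parsing prefixes, opcode, and addressing bytes.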
The compression stuff is an extension, and so far as I can tell the 16-bit alignment of 32-bit instructions that causes this sort of spanning is part of that extension. So couldn't you implement the extension for tiny hardware where every byte counts, and then for hardware where you want to fetch 8+ instructions per cycle just ... not implement it?
Wait (he says to himself, realising he's an idiot immediately -before- posting the comment for once). You said upthread the C extension is specified as part of the standard UNIX profile, so I guess people are effectively required to implement it currently?
If that was changed, would that be sufficient to dissolve the issues for people wanting to design high performance implementations, or are there other problems inherent to the extension having been specified at all? (apologies for the 101 level questions, the only processor I really understood was the ARM2 so my curiosity vastly exceeds my knowledge here)