This is really not a big deal with RISC-V's 2 instruction lengths and the encoding they use.
If you're decoding 32 bytes of code (256 bits, somewhere between 8 and 16 instructions), you can figure out where all the actual instructions start (yes, even the 16th instruction) with 2 layers of LUT6.
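To make the "where do the instructions start" step concrete, here is a minimal software sketch of what that logic computes. It relies on the RISC-V rule that a 16-bit parcel with low bits 0b11 opens a 32-bit instruction, and anything else is a 16-bit instruction. The function names are made up for illustration, and the loop is a serial reference model only; real hardware would evaluate all 16 positions in parallel rather than rippling like this.

```python
def is_32bit(parcel: int) -> bool:
    """True if this 16-bit parcel would open a 32-bit instruction (low bits 0b11)."""
    return (parcel & 0b11) == 0b11


def find_starts(parcels: list[int], first_is_start: bool = True) -> list[bool]:
    """Mark which 16-bit parcels in a fetch window begin an instruction.

    Assumption: `first_is_start` says whether parcel 0 is an instruction
    boundary (it is not if a 32-bit instruction straddled in from the
    previous window).
    """
    starts = []
    is_start = first_is_start
    for parcel in parcels:
        starts.append(is_start)
        # The next parcel is a start unless this one opens a 32-bit instruction.
        is_start = not (is_start and is_32bit(parcel))
    return starts


# c.nop, then a 32-bit nop (0x00000013) split into two parcels, then c.nop
window = [0x0001, 0x0013, 0x0000, 0x0001]
starts = find_starts(window)
```

A 32-byte window is just 16 such parcels, so `find_starts` would be called with a 16-element list.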
You can then use those outputs to mux between two possible starting positions for 8 decoders that handle 16- or 32-bit instructions, plus 8 decoders that will only ever handle 16-bit instructions from fixed start positions (and might output a NOP or in some other way indicate they don't have an input).
OR you can use those outputs to mux the outputs of 8 decoders that only do 32-bit instructions and 8 decoders that do 16 or 32 (all with fixed starting positions), plus again 8 decoders that only do 16-bit instructions from fixed start positions (possibly not used / NOP).
The first option uses less hardware but has higher latency.
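Here is a rough sketch of how I read the first option, so take the details as my assumption rather than a spec. The window is treated as 8 pairs of 16-bit parcels. Each pair gets one full decoder whose input is muxed between the even and the odd parcel, and one 16-bit-only decoder fixed at the even parcel. `assign_decoders` and its `starts` argument (one boolean per parcel, true where an instruction begins) are hypothetical names.

```python
def is_32bit(parcel: int) -> bool:
    """True if this 16-bit parcel would open a 32-bit instruction (low bits 0b11)."""
    return (parcel & 0b11) == 0b11


def assign_decoders(parcels: list[int], starts: list[bool]):
    """For each parcel pair return (full_decoder_pos, short_decoder_pos).

    short_decoder_pos is None when the 16-bit-only decoder has nothing
    to do (it would emit a NOP or an 'invalid' flag).
    """
    plan = []
    for k in range(len(parcels) // 2):
        even, odd = 2 * k, 2 * k + 1
        if starts[even] and is_32bit(parcels[even]):
            # A 32-bit instruction fills the whole pair.
            plan.append((even, None))
        elif starts[even]:
            # 16-bit instruction at the even parcel, another start at the odd one.
            plan.append((odd, even))
        else:
            # Even parcel is the tail of a 32-bit instruction; odd one is a start.
            plan.append((odd, None))
    return plan


# c.nop, 32-bit nop split over two parcels, c.nop
plan = assign_decoders([0x0001, 0x0013, 0x0000, 0x0001],
                       [True, True, False, True])
```

In this reading the full decoder in every pair always has something to decode, which is where the minimum of 8 instructions per window comes from; the 16-bit-only decoders supply the rest.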
That, again, is for decoding between 8 and 16 instructions per cycle, with an average on real code of close to 12.
That is more than is actually useful on normally branchy code.
What would one do if they had half an instruction falling off the end of that window? Save it then deal with it when you have the other half next cycle?
I’m assuming the decode window is aligned, but that may be a bad assumption.
In short: not a problem. Unlike x86 decoding.