1. The biggest chip market is laptops and getting 15% better performance for 80%...

clamchowder · on Aug 11, 2024

1. Yeah I agree, both X Elite and many Intel/AMD chips clock well past their efficiency sweet spot at stock. There is a cost to extra pipeline stages, but no one is designing anything like Tejas/Jayhawk, or even earlier P4 variants these days. Also P4 had worse problems (like not being able to cancel bogus ops until retirement) than just a long pipeline.

Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in proir generations.

2. Size/area is different from power consumption. Also the decoder is far from the only change. The BTBs were changed from 2 to 3 level, and that can help efficiency (could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember page walks are memory accesses and the paper earlier showed data transfers count for a large percentage of dynamic power.

4. IMO no one is really RISC or CISC these days

8. Sure you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative. "feeling weird" does not make for solid analysis.

Most x86 instructions are not microcode only. Again, check your data with performance counters. Microcoded instructions are in the extreme minority. Maybe microcoded instructions were more common in 1978 with the 8086, but a few things have changed between then and now. Also microcoded instructions do not cost thousands of cycles, have you checked? i.e. a gather is ~22 micro ops on Haswell, from https://uops.info/table.html Golden Cove does it in 5-7 uops.

ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.

If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD Terascale, the HD 2xxx to 6xxx generations). Static instruction scheduling is used in Nvidia GPUs since Kepler. In CPUs ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.