I'm just looking at the motherboard layout at the very top of the article, and i...

phire · on June 26, 2020

The xbox (like many consoles: n64, gamecube, xbox360, wii, wii u, xbox one, ps4, switch, ps5, XSX) has unified memory, as in the CPU and GPU share the same sdram.

The only way to do this is to have one chip (It's always the GPU. The GPU needs more memory bandwidth) connected directly to the dram, and the second chip (CPU) has to send memory requests to the first chip.
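As a rough sketch of that arrangement (a toy model, not the Xbox's actual hardware): one chip owns the only DRAM controller and the other can only forward requests to it. Every class and method name below is made up for illustration.

```python
from collections import deque


class Dram:
    """Toy DRAM: a flat bytearray with a single access port."""

    def __init__(self, size):
        self.cells = bytearray(size)


class Gpu:
    """Hypothetical GPU that owns the only DRAM controller."""

    def __init__(self, dram):
        self.dram = dram
        self.cpu_queue = deque()  # requests forwarded by the CPU over its bus

    def read(self, addr):
        # The GPU's own accesses go straight to DRAM.
        return self.dram.cells[addr]

    def write(self, addr, value):
        self.dram.cells[addr] = value

    def submit_cpu_request(self, op, addr, value=None):
        # The CPU cannot touch DRAM; it can only enqueue a request here.
        self.cpu_queue.append((op, addr, value))

    def service_cpu_requests(self):
        # Drain queued CPU requests in order and collect the read results.
        results = []
        while self.cpu_queue:
            op, addr, value = self.cpu_queue.popleft()
            if op == "write":
                self.write(addr, value)
            else:
                results.append(self.read(addr))
        return results


class Cpu:
    """Hypothetical CPU with no DRAM pins of its own."""

    def __init__(self, gpu):
        self.gpu = gpu

    def store(self, addr, value):
        self.gpu.submit_cpu_request("write", addr, value)

    def load(self, addr):
        self.gpu.submit_cpu_request("read", addr)


dram = Dram(16)
gpu = Gpu(dram)
cpu = Cpu(gpu)

cpu.store(3, 42)      # queued, not yet in DRAM
cpu.load(3)           # queued behind the store
results = gpu.service_cpu_requests()
shared = gpu.read(3)  # the GPU sees the CPU's data without any copy
```

The point of the toy is only that both chips end up looking at the same cells, with one of them acting as gatekeeper.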

Though, this console dates to a time when CPUs didn't typically have dram controllers onboard. PCs usually relied on a northbridge chip to hold the dram controllers, along with the routing to all peripherals (PCI/AGP), and to present a nice tidy front-side bus that the CPU understands. In the case of the xbox, the GPU is acting as a combined Northbridge/GPU (a design that was common at the time in low-cost desktops and laptops).

Unified memory has a large number of advantages for consoles. It lowers cost. It gets rid of copying delays between GPU and CPU memory and it allows the game developer to dynamically allocate memory to the GPU or CPU depending on their needs.

wtallis · on June 27, 2020

> In the case of the xbox, the GPU is acting as a combined Northbridge/GPU (a design that was common at the time in low-cost desktops and laptops)

As I recall, the earliest Intel integrated graphics added almost nothing to the production cost of the chipset. The die needed a certain perimeter to support all the IO connections, and that left quite a bit of unused silicon in the middle. Putting GPU logic in that space was almost free (minus R&D), and only slightly increased the total pin count. Intel got to capture slightly more revenue per PC and deny a lot of revenue to competing chip companies making discrete GPUs.

The situation today is very different, with GPU and CPU on the same die and the GPU blocks taking up far more space than the CPU cores. The integrated GPU is an important part of the chip cost, and that means desktop processors often have worse (smaller) iGPUs than laptop chips.

monocasa · on June 27, 2020

Interestingly, you see one of the major compelling reasons for the north bridge functionality to be on a separate chip. That large fixed area cost component (the pads have a lot of analog components that don't shrink like logic does) can be manufactured on whatever tried and true older process node gets amazing yields for the area cost rather than the cost per gate (like you want your CPUs on).

It's a lot like what you see today with AMD's chiplets at 7nm connected to a central I/O die on 14nm. The classic systems were just integrated at the board level instead of on a special interposer.

EDIT: I wonder if the switch to serdes links instead of parallel buses (infinity fabric instead of hypertransport) is a large part of what made this idea useful again, reducing the number of off chip signals again for the CPU dies. I wonder if we'll therefore see a replacement for Intel's QPI if they switch to chiplets too.

phire · on June 27, 2020

Infinity fabric isn't serial.

Wikichip [1] says there are two versions of infinity fabric (which is a super-set of HyperTransport), one optimised for on-package communication that is 32 bits wide, and one optimised for inter-socket communication that is 16 bits wide.

I'm not a hardware person, just a software person who dabbles in hardware, so I don't know if the term "SerDes" strongly implies serial and AMD have misused it here, or if SerDes is generic enough to apply to any SERialisation/DESerialisation.

But still, I think you are on the right track. The investment and development into low power, high-speed, and low-latency on-package links is probably what has enabled chiplets to become relevant now.

Because multi-chip modules have always existed.

[1] https://en.wikichip.org/wiki/amd/infinity_fabric

dfox · on June 27, 2020

Probably all modern fast interconnects work by interleaving some kind of "physical frames" (often bytes with some 8B10B encoding) across multiple distinct serial links that are synchronized as if these frames were bits of parallel interface.
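A minimal sketch of that kind of striping, assuming plain round-robin byte interleaving across lanes. It leaves out line coding such as 8b/10b and any lane alignment, and the function names are invented for the example.

```python
def stripe(data, lanes):
    """Deal bytes round-robin across `lanes` independent serial links."""
    return [data[i::lanes] for i in range(lanes)]


def unstripe(lane_data, total_len):
    """Reassemble the original byte stream from the per-lane streams."""
    lanes = len(lane_data)
    out = bytearray(total_len)
    for i, chunk in enumerate(lane_data):
        # Lane i carried bytes i, i + lanes, i + 2 * lanes, ...
        out[i::lanes] = chunk
    return bytes(out)


payload = b"ABCDEFGHIJ"
per_lane = stripe(payload, 4)
restored = unstripe(per_lane, len(payload))
```

Each lane is a serial stream on its own; only the receiver's reassembly step makes the group behave like one wide interface.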

The primary reason for this design is that at current frequencies it is essentially impossible to manufacture the physical parallel interface with equal-enough wire lengths. Interestingly, memory interfaces (like DDR4) use the opposite approach: the interface is still mostly parallel, but the memory controller measures the delays and mismatches of the physical wires and compensates for that in its timing.
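A heavily simplified sketch of the compensation idea, assuming the controller has already measured a flight time per data line. Real training sequences are far more involved; the line names and picosecond figures here are hypothetical.

```python
def compensation_delays(measured_ps):
    """Return the extra delay (in ps) to add to each line so all arrive together."""
    slowest = max(measured_ps.values())
    return {line: slowest - delay for line, delay in measured_ps.items()}


# Hypothetical per-line flight times found during a training step.
measured = {"DQ0": 120, "DQ1": 95, "DQ2": 140}
extra = compensation_delays(measured)

# After compensation every line should line up with the slowest one.
aligned = {line: measured[line] + extra[line] for line in measured}
```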

cogman10 · on June 27, 2020

> the interface is still mostly parallel

Really? That's crazy! I thought DDR was serial connections. In fact, I thought parallel connections had mostly gone the way of the dodo. Serial is just so much less complicated.

temac · on June 27, 2020

IFIS multiplexes with PCIe so it probably qualifies as quite serial. IFOP a little less, but that's blurry at this point.

wtallis · on June 27, 2020

> I wonder if we'll therefore see a replacement for Intel's QPI if they switch to chiplets too.

I think Intel's trying to ensure they have the advanced packaging/interposer/bridge tech to handle wide parallel connections between chiplets. If it works out, they might even end up moving in the opposite direction—toward wider interconnects rather than narrower.

vvanders · on June 27, 2020

One notable exception from that list is the PS3. That thing was a real bastard to write for but if you managed to vectorize things properly it did really go at a good clip.

One fun thing was optimizing for the SPUs meant that your code was really cache coherent and usually saw significant gains on all platforms. Of course most people at that time wrote for PC+360 first and entered a world of pain when PS3 came along.

If there's one lesson to take away it's always build for your most constrained platforms first. There are still a few funky architectures out there (RPi I'm looking at you[1]) so it's always worth understanding where the hardware constraints in your system can come back to cause havoc.

[1] https://www.raspberrypi.org/documentation/configuration/conf...

Narishma · on June 27, 2020

> cache coherent

What do you mean by this?

fulafel · on June 27, 2020

Why would it be impossible to have >1 chip access the SDRAM? Like Amiga did.

Especially if the GPU already had custom silicon for it.

Whether it would be good engineering (cost, time to market, risks) is of course another issue.

phire · on June 27, 2020

It works for the Amiga because the Amiga has simple DRAM.

It gets harder and harder to do such external muxing as the ram gets more and more complex. With multiple banks, row open delays, bursts and more complex signalling (faster and faster double data rate at lower and lower voltages) it's near impossible to control modern DRAM without a proper controller.

And that controller has to live inside a single chip. It would be insanity to try and have two different dram controllers multiplexing the same DRAM chips.
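To make the bookkeeping concrete, here is a toy timing model of a controller tracking which row is open in each bank. The cycle counts are made up rather than taken from any datasheet, and the class is purely illustrative.

```python
class DramBankModel:
    """Toy controller state: one open row per bank, made-up cycle costs."""

    T_CAS = 3  # column access on an already-open row
    T_RCD = 3  # activate (open) a row
    T_RP = 3   # precharge (close) the currently open row

    def __init__(self, num_banks):
        self.open_row = [None] * num_banks

    def access(self, bank, row):
        current = self.open_row[bank]
        if current == row:
            cost = self.T_CAS                           # row hit
        elif current is None:
            cost = self.T_RCD + self.T_CAS              # bank idle: activate, then read
        else:
            cost = self.T_RP + self.T_RCD + self.T_CAS  # conflict: precharge first
        self.open_row[bank] = row
        return cost


ctrl = DramBankModel(num_banks=4)
costs = [
    ctrl.access(0, 10),  # first touch of bank 0
    ctrl.access(0, 10),  # same row again
    ctrl.access(0, 11),  # different row in the same bank
    ctrl.access(1, 10),  # a different, idle bank
]
```

A second, independent controller would have no idea which row the first one left open, which is one way to see why this state wants to live in a single place.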

rasz · on June 27, 2020

Amiga worked exactly the same way. All memory access to the so-called "Chip" ram had to go through Agnus. CPU address lines didn't touch ram chips directly. CPU was just a passenger riding on the back of a powerful GPU.

https://www.pmsoft.nl/amiga/A500-block-diagram.jpg

fulafel · on June 28, 2020

Notice the data bus is shared but the address bus is not in the pic. So there was arbitration by Agnus but the data didn't go through it.

rasz · on June 28, 2020

Low level implementation detail. Agnus is the memory controller here, handling refresh and addressing. The tristate latch in the block diagram (74LS244 & 74LS373 in real hardware) should be considered part of the chipset (controlled by Gary). Take away Agnus (or even Gary) and the CPU can't do anything, so you can't really say there is any ">1 chip access the SDRAM" here. We would have to go back all the way to the C64 to say the cpu and graphics chip share the same ram bus ~equally.

zozbot234 · on June 27, 2020

The Amiga "chip" memory was actually quite slow to use, precisely due to it being accessed by both the CPU and chipset - the solution was to add "fast", CPU-only RAM. The address space was unified between "chip" and "fast" memory, but the underlying hardware arrangement was different.

fulafel · on June 28, 2020

The chip mem slowness only came in the later Amiga models with faster CPUs with 32 bit memory buses I think? Part of the constant trend of custom chipset falling behind, after the first machine. (With a few increasingly laggard refreshes)

MrBuddyCasino · on June 27, 2020

> The only way to do this is to have one chip (It's always the GPU. The GPU needs more memory bandwidth) connected directly to the dram, and the second chip (CPU) has to send memory requests to the first chip.

I don't think that's true, in embedded architectures it's not uncommon to have dual-port RAM.

phire · on June 27, 2020

Inside chips it's common to have dual port ram. But it's really expensive.

I'm not aware of any designs which have large amounts of dual-port RAM as main system memory.

MrBuddyCasino · on June 27, 2020

Fair enough, I should have been more precise. I wanted to point out that it is not a hard technical limitation, but should have added that in that specific use case it is economically unfeasible.

wtallis · on June 26, 2020

CPUs didn't have DRAM controllers on-die back then. The GPU is performing that function for the system, and the link between the CPU and GPU is the CPU's front side bus rather than PCI(e). The CPU and GPU are also close together so they can both be cooled effectively by the same fan.

henryfjordan · on June 26, 2020

It looks like the CPU and GPU are really close, which makes sense because it looks like there's a ton of pins shared between the two.

Ultimately designing a circuit board layout is an optimization problem. You usually have some constraints, like how close chips can be before they start interfering with each other magnetically, where the I/O will be, and where you need holes to mount the board. Then you either try to be a pathing optimizer yourself or you run a program that will lay out your board for you.

I'm not sure about the XBox, but game consoles sometimes have faster Memory->GPU pipelines than normal PCs to speed up render times, which might be why the GPU is the most central component.

drivebyubnt · on June 26, 2020

A 3 chip solution was very common for that era. See https://en.wikipedia.org/wiki/Northbridge_(computing)

The only unusual thing here is that the GPU and Northbridge are the same chip.

easde · on June 26, 2020

The architecture here is similar to old PCs (roughly before 2010) that had integrated graphics in the northbridge. The memory controller also resides in the northbridge. The CPU communicates with the northbridge through the front-side bus. Incidentally the northbridge also used to be responsible for high-speed I/O such as PCIe, so even if you had a discrete GPU it would not be connected directly to the CPU.

Over time CPUs have integrated all those features on-die, resulting in today's SoC-like processors where the "chipset" is merely an I/O expander connected over a PCIe-like link.

noisem4ker · on June 28, 2020

>high-speed I/O such as PCIe

Back in the years we're talking about, that would be AGP (https://en.m.wikipedia.org/wiki/Accelerated_Graphics_Port).

flipacholas · on June 26, 2020

That design is called Unified Memory Architecture or 'UMA' and can save a lot of production costs at the expense of greater memory latency. The Xbox is not the first one to implement it (the N64 is another good example).