
> An Opus 4.7/GPT-5.5 class model is 5 trillion parameters[1].

You could run it on a cluster of nodes that each do some mix of fetching parameters from disk and caching them in RAM. Use pipeline parallelism to minimize network bandwidth requirements given the huge model size. Time to first token may be a bit slow, but sustained inference should achieve enough throughput for a single user. That's a costly setup, of course, but it doesn't cost $900k.
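A rough sketch of what I mean, with toy sizes and made-up names (PipelineStage, run_pipeline), not any real framework: each stage owns a slice of the layers, keeps what it can in a RAM cache, and fakes streaming the rest from local SSD. Only small activations cross the network between stages, which is the point of pipeline parallelism here.

    import numpy as np

    LAYERS_PER_STAGE = 4
    HIDDEN = 1024          # toy size; a 5T-parameter model is vastly larger

    class PipelineStage:
        def __init__(self, stage_id, ram_budget_layers=2):
            self.stage_id = stage_id
            self.ram_budget = ram_budget_layers
            self.ram_cache = {}                    # layer_id -> weights kept in RAM

        def load_layer(self, layer_id):
            # Cache hit: weights already resident in RAM.
            if layer_id in self.ram_cache:
                return self.ram_cache[layer_id]
            # Cache miss: stand-in for streaming the layer off local SSD.
            weights = np.random.randn(HIDDEN, HIDDEN).astype(np.float32) * 0.01
            if len(self.ram_cache) < self.ram_budget:
                self.ram_cache[layer_id] = weights  # keep hot layers resident
            return weights

        def forward(self, activations):
            for i in range(LAYERS_PER_STAGE):
                w = self.load_layer(i)
                activations = np.tanh(activations @ w)
            return activations                     # would be shipped to the next node

    def run_pipeline(stages, tokens):
        # Only activations move between stages, so network bandwidth stays modest.
        acts = tokens
        for stage in stages:
            acts = stage.forward(acts)
        return acts

    if __name__ == "__main__":
        stages = [PipelineStage(i) for i in range(4)]
        out = run_pipeline(stages, np.random.randn(1, HIDDEN).astype(np.float32))
        print("output shape:", out.shape)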



> You could run it on a cluster of nodes

Not sure this is an MBP either.


Not even a cluster of Mac Pros could run a dense 5T-parameter model with RDMA, to my knowledge.


SOTA models are reportedly MoE, not dense.


A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.
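Back-of-envelope version of that bottleneck, with assumed numbers (active parameter count and quantization are guesses, not published figures):

    active_params = 50e9        # assumed active parameters per token for a 5T MoE
    bytes_per_param = 1         # assumed 8-bit quantization
    ssd_gbps = 7e9              # one PCIe 4.0 NVMe drive, sequential read

    bytes_per_token = active_params * bytes_per_param
    print(f"tokens/s from one SSD: {ssd_gbps / bytes_per_token:.2f}")
    # => ~0.14 tokens/s if every active weight has to come off a single SSD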


True, but a cluster built on pipeline parallelism naturally streams from multiple SSDs in parallel, since each stage only reads its own slice of the weights. That probably makes offload somewhat more effective. And RAM caching is also available as a natural option.
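Extending the same assumed arithmetic as above: aggregate streaming bandwidth scales with the number of stages, since each node reads only its own layers from its local SSD.

    n_stages = 8                # assumed cluster size
    ssd_gbps = 7e9
    active_params = 50e9
    bytes_per_param = 1

    aggregate_gbps = n_stages * ssd_gbps
    tokens_per_s = aggregate_gbps / (active_params * bytes_per_param)
    print(f"tokens/s across {n_stages} SSDs: {tokens_per_s:.2f}")
    # => ~1.1 tokens/s before any RAM caching helps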


You won't be RAM-caching much of anything when the experts amount to 220B parameters' worth of layers.
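Quick check on that objection, using the 220B figure above and assumed precisions:

    expert_params = 220e9
    for bytes_per_param in (1, 2):          # int8 vs bf16
        gb = expert_params * bytes_per_param / 1e9
        print(f"{bytes_per_param} byte/param -> {gb:.0f} GB of RAM just for experts")
    # => 220 GB (int8) or 440 GB (bf16), well beyond a typical single node's RAM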



