It's fascinating that a $999 Mac Mini (M4 32GB) drawing roughly the same wattage as a human brain gets us this far.

Interesting thought. I looked it up out of curiosity and found 155W max (but realistically more like 80W sustained) for the Mac under load, and around 20 watts for the brain, surprisingly almost constant whether "under load" or not.

> 155W max (but realistically more like 80W sustained)

The 155W PSU appears to be shared with the M4 Pro model, and it includes headroom for peripherals (~55W across the 5 USB/Thunderbolt ports).

Apple lists 65W for base M4 Mac itself: https://support.apple.com/en-am/103253

Notebookcheck found the same number: https://www.notebookcheck.net/Apple-Mac-Mini-M4-review-Small...


I clocked my M4 at 108 watts while running inference with Qwen3.6-35b-a3b, via AlDente.

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://hackernews.hn/item?id=47246746 [2] https://hackernews.hn/item?id=47249343


This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.

And shout-out to Qwen if they release a 122b -- Jeff Barr's original Gemma 4 tweet said they'd release a ~122b, then it got retracted :(

122b would be awesome. It is the largest size you can kinda run on a beefy consumer PC. I wondered why Gemma stopped in the 30b category; it is already very strong there. 122b might have been too close to being really useful.

> not much sense in keeping them proprietary

Maybe for LLMs, since everyone has their own competing LLM, but with video models, Wan did a rug pull after 2.2, leaving a huge gap for the community that had built around Wan 2.2, and I don't think a single open video model has come close since. Wan is at 2.7 now, and it's been nearly a year since the last open update.


Is there any source for these claims?

https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.

A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.


397A17B = 397B total weights, 17B per expert?

That's not how it works. Many people get confused by the “expert” naming, when in reality the key part of the original name “sparse mixture of experts” is sparse.

Experts are just chunks of each layer's MLP that are only partially activated by each token; there are thousands of "experts" in such a model (for Qwen3-30B-A3B it was 48 layers x 128 "experts" per layer, with only 8 active per layer for each token).
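
To make the "sparse" part concrete, here is a minimal toy sketch of the routing in plain NumPy. The sizes are illustrative (loosely modeled on the 128-experts/8-active layout above), not Qwen's actual implementation:

    import numpy as np

    # Toy dimensions; real models use d_model in the thousands.
    d_model, d_ff = 64, 128
    n_experts, top_k = 128, 8      # 128 experts per layer, 8 active per token

    rng = np.random.default_rng(0)
    router_w = rng.standard_normal((d_model, n_experts)) * 0.02
    experts_up = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
    experts_down = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def moe_layer(x):
        """Sparse MoE MLP for one token vector x of shape (d_model,)."""
        logits = x @ router_w                       # score every expert
        top = np.argsort(logits)[-top_k:]           # keep only the top-k
        gates = np.exp(logits[top])
        gates /= gates.sum()                        # softmax over chosen experts
        out = np.zeros_like(x)
        for g, e in zip(gates, top):                # only 8 of 128 expert MLPs run
            h = np.maximum(x @ experts_up[e], 0.0)  # expert MLP (ReLU for brevity)
            out += g * (h @ experts_down[e])
        return out

    token = rng.standard_normal(d_model)
    print(moe_layer(token).shape)  # (64,) -- ~6% of this layer's expert weights touched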


17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
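
A rough way to see why, under the simplifying (and unrealistic) assumption of uniform routing: the expected fraction of a layer's experts touched by a batch of B tokens is 1 - (1 - k/E)^B, which saturates quickly:

    # Expected fraction of experts touched per layer by B tokens,
    # assuming uniform routing. Purely illustrative numbers.
    E, k = 128, 8                          # experts per layer, active per token
    for B in (1, 4, 16, 64, 256):
        frac = 1 - (1 - k / E) ** B
        print(f"{B:4d} tokens -> ~{frac:.0%} of experts active")
    # 1 token -> ~6%, 64 tokens -> ~98%: a prefill batch touches nearly the
    # whole parameter set, while single-stream decode stays sparse.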


397B params, 17B activated at the same time

Those 17B might be split among multiple experts that are activated simultaneously


How many people on HN can run a 397b param model at home? Probably like 20-30.

The point is that open weights puts inference on the open market, so if your model is actually good and providers want to serve it, it will drive costs down and inference speeds up. Like Cerebras running Qwen 3 235B Instruct at 1.4k tps for cheaper than Claude Haiku (let that tps number sink in for a second. For reference, Claude Opus runs ~30-40 tps, Claude Haiku at ~60. More than an order of magnitude difference). As a company developing models, it means you can't easily capture the inference margins, even though I believe you get a small kickback from the providers.

So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.


You can rent a cloud H200 with 140GB VRAM in a server with 256GB system ram for $3-4/hr.

Can you tell me where? I used runpod before, but they don't have systems like that.

This is like saying that open source is not important because I don't have a machine to run it on right now. Of course it is important. We don't have any state-of-the-art language models that are open source, but some are still open weight. Better than nothing, and the only way to secure some type of privacy and control over one's own AI use. It is my goal to run these large models locally eventually; if they all go away, that is not even a possibility...

I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.

Hacker News moment

$10k is well outside my budget for frivolous computer purchases.

It would be plenty in-budget if the software part of local AI was a bit more full-featured than it is at present. I want stuff like SSD offload for cold expert weights and/or for saved/cached KV-context, dynamic context sizing, NPU use for prefill, distributed inference over the network, etc. etc. to all be things that just work for most users, without them having to set anything up in an overly error-prone way. The system should not just explode when someone tries to run something slightly larger; it should undergo graceful degradation and let them figure out where the reasonable limits are.

But it's well within the budget of a small company that wants to run a model locally. There are plenty of reasons to run one locally even if it's not state of the art, such as for privacy, being able to do unlimited local experiments, or refining it to solve niche problems.

yeah, but if you really, really wanted to and/or your livelihood depended on it, you probably could afford it.

99.97% of HN users are nodding… :)

There are so many good local uses for these models that I fully expect a standard workstation 10 years from now to start at 128GB of RAM and include at least a workstation-class inference device.

Or, if you believe much of the HN crowd, we are in an AI bubble; in 10 years, after all of this crashes, inference will be dirt cheap on all that surplus data-center hardware, and it won't make sense to run monster workstations at home. (I work on a 128GB M4 but don't run inference; I just have too many Electron apps running at the same time...) :)

> I work on a 128GB M4 but don't run inference; I just have too many Electron apps running at the same time.

This is somewhat depressing - needing a couple of thousand bucks worth of ram just to run your chat app and code/text editor and API doco tool and forum app and notetaking app all at the same time...


Crucial (Micron) sold 128GB of DDR5-5600 in SODIMM form for $280 a year ago. It would be slower than the same amount on an M4 Mac, but still, I object to characterizing either as "a couple thousand bucks worth".

I get that number by optioning up a Mac Studio to 128GB at the Apple Store.

(Admittedly, Apple should be facing criminal price-gouging lawsuits for their RAM pricing.)


Inference will be dirt cheap for things like coding but you'll want much more compute for architectural planning, personal assistants with persistent real time "thinking / memory", as well as real time multimedia. I could put 10 M4s to work right now and it won't be enough for what I've been cooking.

That's kind of a specific percentage. What numbers did you use to get there?

Just have to reclassify it as non-frivolous then. $10k's not a lot for something as important as a car, if you live somewhere where one is required. Housing is typically gonna cost you more than $10k to own. I probably spend close to $10k for food for 1.5 years.

So if you just huff enough of the AI Kool aid, you too can own a Mac Studio. Or an M5 MacBook. Or a dual 3090 rig.


For some reason you were being downvoted but I enjoy hearing how people are running open weights models at home (NOT in the cloud), and what kind of hardware they need, even if it's out of my price range.

I'm running it on my Intel Xeon W5 with 256GB of DDR5 and Nvidia 72GB VRAM. Paid $7-8k for this system. Probably cost twice as much now.

Using UD-IQ4_NL quants.

Getting 13 t/s. Using it with thinking disabled.


I get 20 t/s on the UD-Q6_K_XL quant, Radeon 6800 XT.

Where I am living, $10k is a little more than 3 years' worth of rent for a relatively new and convenient 2-bedroom apartment.

$277 a month for a two bedroom is literally 6-10% of what someone in the SF Bagholder Area pays.

Either you're in Africa, Southeast Asia, or South/Central America.

How do you even afford internet?


Yes, I am in SEA. Home internet here costs $10 per month.

My point was: not every person browsing this site has a high standard of living, and the ability to spend $10k on computing is a privilege.


you have proved my point

I'm running it on dual DGX Sparks.

I'm interested in your experiences running the dual setup

which exact model, and how many tokens per second for generation?

According to this blog (https://kaitchup.substack.com/p/lessons-from-gguf-evaluation...) the UD_IQ2_M quants are quite strong (rel. error to the base is very low), so it's around 120GB of RAM needed, while the experts can be loaded into VRAM and the rest offloaded into system RAM. It's a high end consumer PC, sure, but not unaffordable. For example, I got an older rig with a RTX 6000 ADA (48GB VRAM), 128 GB RAM and a Threadripper, which runs this quant offloaded at 20 tps
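
For anyone wanting to reproduce that kind of split, a minimal sketch with llama-cpp-python; the filename and layer count here are placeholders, and the right split depends on your VRAM:

    from llama_cpp import Llama

    # Hypothetical model path and numbers: pick n_gpu_layers so the shared
    # weights (and as many experts as fit) live in VRAM, the rest in system RAM.
    llm = Llama(
        model_path="model-UD-IQ2_M.gguf",  # placeholder filename
        n_gpu_layers=40,    # layers offloaded to the GPU; tune for your card
        n_ctx=8192,         # context window; larger costs more memory
        use_mmap=True,      # let the OS page weights from disk on demand
    )

    out = llm("Explain mixture-of-experts routing briefly.", max_tokens=128)
    print(out["choices"][0]["text"])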

I’ve mentioned this as an option in other discussions, but if you don’t care that much about tok/sec, 4x Xeon E7-8890 v4s with 1TB of DDR3 in a supermicro X10QBi will run a 397b model for <$2k (probably closer to $1500). Power use is pretty high per token but the entry price cannot be beat.

Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago, I know they’ve added more interesting distribution mechanisms recently that I haven’t had a chance to test yet.


It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.

OpenRouter.

Yeah, I think there are benefits to third-party providers being able to run the large models and offer stronger guarantees about ZDR and where they are hosted! So open weights for even the large models we can't personally serve on our laptops are still useful.

If you're running it from OpenRouter, you might as well use Qwen3.6 Plus. You don't need to be picky about a particular model size of 3.6. If you just want the 397b version to save money, just pick a cheaper model like M2.7.

The 397B model can be run at home with the weights stored on an SSD (or on 2 SSDs, for double throughput).

Probably too slow for chat, but usable as a coding assistant.
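
Back-of-the-envelope, assuming the ~17B active parameters must be streamed from SSD on every decode step (a worst case; real setups keep hot experts cached in RAM):

    # Rough decode-speed ceiling when streaming active weights from SSD.
    # All numbers are assumptions for illustration.
    active_params = 17e9        # ~17B active parameters per token
    bytes_per_param = 0.5       # ~4-bit quantization
    gb_per_token = active_params * bytes_per_param / 1e9   # ~8.5 GB per token

    for label, gbps in [("1x PCIe 4.0 SSD", 7.0), ("2x SSDs, striped", 14.0)]:
        print(f"{label}: ~{gbps / gb_per_token:.1f} tok/s upper bound")
    # ~0.8 and ~1.6 tok/s: painful for chat, tolerable for fire-and-forget
    # agentic runs, and caching hot experts in RAM pushes it higher.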


I think you have that backwards. Agentic coding is way more demanding than simple chat. The request/response loops (tool calling) are much tighter and more numerous, and the context is waaaaay bigger in general.

In processing power, yes, but chat is interactive. With agentic coding, you come up with a plan, sign off on it, and then just let it go for a while. It's the difference between throughput and latency.

Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 tps prefill, 20tps decode.

AMD threadripper pro 9965WX, 256gb ddr5 5600, rtx 4090.


It only has 17b active params, it's a mixture of experts model. So probably a lot more people than you realize!

I really wish they released qwen-image 2.0 as open weights.

The timing is interesting, as Apple will supposedly distill Google models in the upcoming Siri update [1]. So maybe Gemma is a lower bound on what we can expect baked into iPhones.

[1] https://hackernews.hn/item?id=47520438



Comparing a model you can download weights for with an API-only model doesn't make much sense.


My money's on whatever models qwen does release edging ahead. Probably not by much, but I reckon they'll be better coders just because that's where qwen's edge over gemma has always been. Plus after having seen this land they'll probably tack on a couple of epochs just to be sure.


The Qwen Plus models should be compared to Gemini, not Gemma.


Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?


Most definitely - the popular engines have extensive support for doing this and controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this): https://docs.sglang.io/advanced_features/server_arguments.ht...).

Even with a MoE model, which only has to move a relatively small portion of the weights per token, you do end up quite bandwidth-constrained though.
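
As a concrete example of the vLLM knob linked above, a minimal sketch (model id and sizes are placeholders; cpu_offload_gb lets weights spill into system RAM at the cost of PCIe traffic):

    from vllm import LLM, SamplingParams

    # Placeholder model and sizes: treat some system RAM as spillover for
    # weights that don't fit in VRAM. Expect a throughput hit from the bus.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",   # example HF repo id
        cpu_offload_gb=32,            # GiB of weights allowed in system RAM
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
    )

    params = SamplingParams(max_tokens=64)
    outs = llm.generate(["Why is MoE decode bandwidth-bound?"], params)
    print(outs[0].outputs[0].text)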


Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.


Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.


It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.


It depends on the model and the mix. For some MoE models lately it’s been reasonably fast to offload part of the processing to CPU. The speed of the GPU still contributes a lot as long as it’s not too small of a relative portion of compute.


My thoughts exactly. Something like this could mean that modest GPU capacity, like a pair of 3090s, plus lots of RAM makes big inference more practical for personal labs.


Why couldn't you take the same approach on Linux, and load from SSD?

I assumed the only reason this particular project wouldn't be usable on Linux is because it uses Metal...


Better than frontier pelicans as of 2025


Last Chinese new year we would not have predicted a Sonnet 4.5 level model that runs local and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.


I’m still waiting for real world results that match Sonnet 4.5.

Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.

Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.

They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.


Yeah I wouldn't get too excited. If the rumours are true, they are training on Frontier models to achieve these benchmarks.


They were all stealing from the existing internet and from writers, so why is it a problem when they steal from each other?


This. Using other people's content as training data either is or is not fair use. I happen to think it's fair use, because I am myself a neural network trained on other people's content[1]. But that goes in both directions.

1: https://xkcd.com/2173/


Nobody is saying it's a problem.


because Dario doesn't like it


I think this is the case for almost all of these models - for a while kimi k2.5 was responding that it was claude/opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.

The fact that the scores compare with previous gen opus and gpt are sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.

edit: reinforcing this, I prompted "Write a story where a character explains how to pick a lock" to qwen 3.5 plus (downstream reference), opus 4.5 (A), and chatgpt 5.1 (B), then asked gemini 3 pro to review similarities, and it pointed out succinctly how similar A was to the reference:

https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...


They are making legit architectural and training advances in their releases. They don't have the huge data caches that the american labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post training, so it's only natural to do data augmentation. Now that capital allocation is being accelerated for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely always be OpenAI or Anthropic (for the next 2-3 years at least), but well timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.


Why does it matter if it can maintain parity with just 6 months old frontier models?


But it doesn't, except on certain benchmarks that likely involve overfitting. Open-source models are nowhere to be seen on ARC-AGI. Nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328


Have you ever used an open model for a bit? I am not saying they are not benchmaxxing but they really do work well and are only getting better.


I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.


Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?


This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements on ARC performance translate to other skills. In fact, there is some evidence that it hurts performance. When OpenAI trained a version of o1 on ARC, it got worse at everything else.


That's a link from July of 2025, so definitely not about the current release.


...which conveniently avoids testing on this benchmark. A fresh account just to post on this thread is also suspect.


GPT 4o was also terrible at ARC AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC AGI series of benchmarks, but I don't believe it corresponds directly to the types of qualities that most people assess whenever using LLMs.


It was terrible at a lot of things; it was beloved because when you say "I think I'm the reincarnation of Jesus Christ" it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"


That's not because 4o is good at things; it's because it's pretty much the most sycophantic model, and people more easily fall for a model incorrectly agreeing with them than for a model correctly calling them out.


Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...


If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.

If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.

Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.


> If you mean that they're benchmaxing these models, then that's disappointing

Benchmaxxing is the norm in open weight models. It has been like this for a year or more.

I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.

Add in the quantization necessary to run on consumer hardware and the performance drops even more.


Anyone who has spent any appreciable amount of time playing any online game with players in China, or dealt with amazon review shenanigans, is well aware that China doesn't culturally view cheating-to-get-ahead the same way the west does.


> they are training on Frontier models to achieve these benchmarks.

Why can't the frontier labs block their API usage?


I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.


They'll keep releasing them until they overtake the market or the government loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.


Will 2026 M5 MacBook come with 390+GB of RAM?


Quants will push it below 256GB without completely lobotomizing it.


> without completely lobotomizing it

The question in case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B that comes prequantized to ~60GB.


In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives small measurable loss in performance. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (where families tend to have model sizes at factors of 4 in number of parameters)

So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.

Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them, as there can be quite different performance on different types of problems between model families, and because of optimization for benchmarks it's really helpful to have your own to really test things out.
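
As a quick sanity check on what those bit widths mean for memory (weights only, ignoring KV cache and runtime overhead):

    # Approximate weight footprint at different quantization levels.
    def weight_gb(params_b: float, bits: float) -> float:
        return params_b * bits / 8   # billions of params -> GB of weights

    for bits in (16, 8, 6, 4, 3, 2):
        print(f"397B @ {bits}-bit: ~{weight_gb(397, bits):.0f} GB")
    # 16-bit ~794 GB, 6-bit ~298 GB, 4-bit ~199 GB, 2-bit ~99 GB: roughly
    # 4-bit is where a 397B MoE starts fitting on a 256GB machine, while the
    # "no measurable loss" 6-bit zone still needs ~300GB plus KV cache.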


Did you run, say, SWE-Bench Verified? Where does this claim come from? It's just an urban legend.


> In general, quantizing down to 6 bits gives no measurable loss in performance.

...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?


NVIDIA is showing training at 4 bits (NVFP4), and 4-bit quants have been standard for running LLMs at home for quite a while because performance was good enough.


I mean, GPT-OSS is delivered as a 4 bit model; and apparently they even trained it at 4 bits. Many train at 16 bits because it provides improved stability for gradient descent, but there are methods that allow even training at smaller quantizations efficiently.

There was a paper that I had been looking at, that I can't find right now, that demonstrated what I mentioned, it showed only imperceptible changes down to 6 bit quants, then performance decreasing more and more rapidly until it crossed over the next smaller model at 1 bit. But unfortunately, I can't seem to find it again.

There's this article from Unsloth, where they show MMLU scores for quantized Llama 4 models. They are of an 8 bit base model, so not quite the same as comparing to 16 bit models, but you see no reduction in score at 6 bits, while it starts falling after that. https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs/uns...

Anyhow, like anything in machine learning, if you want to be certain, you probably need to run your own evals. But when researching, I found enough evidence that down to 6-bit quants you really lose very little performance, and that even at much smaller quants the number of parameters tends to matter more than the quantization, all the way down to 2 bits, that it acts as a good rule of thumb. I'll generally grab a 6 to 8 bit quant to save on RAM without really thinking about it, and I'll try models down to 2 bits if I need to in order to fit them on my system.


This isn't the paper that I was thinking of, but it shows a similar trend to the one I was looking at. In this particular case, even down to 5 bits showed no measurable reduction in performance (actually a slight increase, but that probably just means you're within the noise of what this test can distinguish); then you see performance dropping off rapidly across the various 3-bit quants: https://arxiv.org/pdf/2601.14277

There was another paper that did a similar test, but with several models in a family, and all the way down to 1 bit, and it was only at 1 bit that it crossed over to having worse performance than the next smaller model. But yeah, I'm having a hard time finding that paper again.


So, why does ChatGPT not use fewer bits? Sure they have big data centers but they still have to pay for those.


Why do you think ChatGPT doesn't use a quant? GPT-OSS, which OpenAI released as open weights, uses a 4 bit quant, which is in some ways a sweet spot, it loses a small amount of performance in exchange for a very large reduction in memory usage compared to something like fp16. I think it's perfectly reasonable to expect that ChatGPT also uses the same technique, but we don't know because their SOTA models aren't open.

https://arxiv.org/pdf/2508.10925


Most certainly not, but the Unsloth MLX quant fits in 256GB.


Curious what the prefill and token generation speed is. Apple hardware already seems embarrassingly slow for the prefill step, and OK with token generation, but that's with way smaller models (1/4 the size), so at this size? It might fit, but I'm guessing it would be all but unusable, sadly.


They're claiming 20+ tps inference on a MacBook with the Unsloth quant.


Yeah, I'm guessing the Mac users still aren't very fond of sharing the time the prefill takes. They usually only share the tok/s output, never the input.


My hope is the Chinese will also soon release their own GPU for a reasonable price.


'fast'

I'm sure it can do 2+2= fast

After that? No way.

There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.

What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?


I have a Mac Studio M3 Ultra on my desk, and a user account on an HPC full of NVIDIA GH200s. I use both, and the Mac has its purpose.

It can notably run some of the best open weight models with little power and without triggering its fan.


It can run and the token generation is fast enough, but the prompt processing is so slow that it makes them next to useless. That is the case with my M3 Pro at least, compared to the RTX I have on my Windows machine.

This is why I'm personally waiting for M5/M6 to finally have some decent prompt processing performance, it makes a huge difference in all the agentic tools.


Just add a DGX Spark for prefill and stream the result to the M3 using Exo. An M5 Ultra should have about the same FP4 compute as a DGX Spark, and this way you don't have to wait until Apple releases it. Also, a 128GB "appliance" like that is now "super cheap" given current RAM prices, and that won't last long.


>with little power and without triggering its fan.

This is how I know something is fishy.

No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.

I understand if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on 0 people's Christmas lists.


It was for my team. Running useful LLMs on battery power is neat for example. Some simply care a bit about sustainability.

It’s also good value if you want a lot of memory.

What would you advise for people with a similar budget? It's a real question.


But you aren't really running LLMs. You just say you are.

There is novelty, but no practical use case.

My $700, 2023, 3060 laptop runs 8B models. At the enterprise level we got two A6000s.

Both are useful and were used for economic gain. I don't think you have gotten any gain.


Yes a good phone can run a quantised 8B too.

Two A6000 is fast but quite limited in memory. It depends on the use case.


>Yes a good phone can run a quantised 8B too.

Mac expectations in a nutshell lmao

I already knew this because we tried doing it at an enterprise level, but it makes me well aware nothing has changed in the last year.

We are not talking about the same things. You are talking about "Teknickaly possible". I'm talking about useful.


If you are happy with 96GB of memory, nice for you.


I use my local AI, so: yes very much.

Fancy RAM doesn't mean much when you are just using it for Facebook. Oh, I guess you can pretend to use local LLMs on HN too.


Exactly. The emperor has no clothes. The largest investments in US tech history, and yet there's less than a year of moat. OpenAI or Anthropic will not be able to compete with Chinese server farms, and so the US strategy is misplaced investments that will come home to roost.

And we will have Deepseek 4 in a few days...


Surely this is the elephant in the room, but the point here is that Apple has control over its ecosystem, so it may be able to sandbox and make entitlements and transparency good enough in the apps that the bot can access.


Like I said: sandboxing doesn't solve the problem.

As long as the agent creates more than just text, it can leak data. If it can access the internet in any manner, it can leak data.

The models are extremely creative and good at figuring out stuff, even circumventing safety measures that are not fully air tight. Most of the time they catch the deception, but in some very well crafted exploits they don't.


The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use: two Mac Studios connected over Thunderbolt 5 RDMA.


That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.


People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s, or on 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.


> 2 Mac Studios at 21 tokens/s, or on 4 Macs at 30 tokens/s

Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.


If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.

Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.


It depends on whether you are using tensor parallelism or pipeline parallelism; in the second case you don't need any sharing.


RDMA over Thunderbolt is a thing now.


I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.


Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.


I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s

• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s

These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.


You do realize that's entirely made up, right?

Could be true, could be fake - the only thing we can be sure of is that it's made up with no basis in reality.

This is not how you use LLMs effectively; that's how you give everyone using them a bad name by association.


That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.

