senko · 2026-06-11T23:20:29 1781220029

The post mainly talks about coding from security point of view. Fair enough.

In my own (limited) testing so far, Fable is the most capable model (for coding in general), and the most expensive.

It pretty much saturated my "LLMCraft" benchmark to implement a mini RTS: https://senko.net/vibecode-bench/2026/rts-fable-5.html (prompt and results for other models here: https://senko.net/vibecode-bench/ )

That said, combined with workflows and high thinking effort, it burns through tokens (and money) at an alarming rate.

It may be too good (and too expensive) for most tasks - using it alongside cheaper models for grunt work is probably the winning strategy.

senko · 2026-06-06T09:03:01 1780736581

LinkedIn in particular is quite aggressively blocking any automated attempts to read or navigate through it.

I post quite a lot there and wanted to have a copy of my posts on my blog[0] to preserve them. For a few months I was able to use a headless browser + claude code, then LI wised up and started logging it out, so I had to use a regular Chrome, log in manually and then tell the LLM to take over and slowly go through my feed.

If you're accessing sites which are not actively blocking bots, or - gasp - have an API, it's much better.

[0] example: https://blog.senko.net/may-quick-takes

darksim905 · 2026-06-10T14:52:04 1781103124

useful, thanks!

senko · 2026-06-04T22:50:07 1780613407

Kagi's only office/hub is in Belgrade, which may not be EU, but it's (literally) close enough. Employees are remote.

Freediver (founder) is US based and Kagi is a US entity, so it must comply with any warrants there.

But I guess they could set up a Serbian subsidiary?

senko · 2026-06-03T18:20:59 1780510859

I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...

The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.

So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)

I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not fast enough for interactive coding, but a fairly capable model.

To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).

Lists of various models I tested: https://senko.net/vibecode-bench/

0xbadcafebee · 2026-06-03T21:10:07 1780521007

It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.

For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.

dirkg · 2026-06-04T05:25:15 1780550715

> For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

you can run qwen 3.6 35BA3B on a 12-16GB vram gpu and it works pretty well.

https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s

even the 27B in some quants can fit.

https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...

qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.

Gemma family is better for almost all other tasks you'd use a local llm for.

ricardobayes · 2026-06-04T13:28:49 1780579729

You can run it, however those low quantized models (iQ2, iQ4, Q2) will very likely underperform the 9B versions at Q6/Q8.
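For a feel of why both options fit on a 12-16GB card, here is a back-of-the-envelope sketch of weight memory at different quantization levels. The bits-per-weight values are nominal assumptions (real GGUF quants carry some extra bits for scales), `weight_memory_gb` is a made-up helper name, and KV cache and runtime overhead are ignored.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB.

    params_billions * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB.
    Ignores KV cache, activations and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


# Nominal bit widths, assumed for illustration only.
candidates = [
    ("35B at ~2.5 bits", 35, 2.5),
    ("35B at ~4 bits", 35, 4),
    ("9B at ~6 bits", 9, 6),
    ("9B at ~8 bits", 9, 8),
]

for label, params, bits in candidates:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
```

This only says what fits, not which one gives better output; that part still has to be tested per model.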

kanemcgrath · 2026-06-04T21:33:14 1780608794

Something about qwen models makes them hold up really well even at low quants. for most other models anything under q5 is cooked, but on 35B-A3B I can get a lot of things done even at q3_xl. It is definitely better than full precision 9B

selicos · 2026-06-04T16:29:48 1780590588

I want to try a hybrid setup of Gemma 4 E4B with lots of context for general, then Qwen 3.5 9B or larger for coding. Strix Halo set up this weekend, which may enable even larger Qwen models with tons of context.

dofm · 2026-06-04T14:29:13 1780583353

The larger Gemma models are quite good at PHP. I would not be surprised if that was a training objective — it's one of the more consumer-focussed programming languages. They have very good knowledge of wordpress hooks.

dotancohen · 2026-06-03T23:34:55 1780529695

  > For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

You seem like the guy to ask. For a laptop with 12GB VRAM (RTX 5070) and 32 GB system RAM, what is a good multilingual (English, Hebrew, Greek) model for conversing with personal notes in Org mode format? I don't care how long updating the model or rag takes, and even inference can be reasonably slow, but the results of the query as they relate to my personal notes are important. I don't care about general knowledge, for those questions I can use e.g. ChatGPT.

Thanks

akmarinov · 2026-06-04T05:14:31 1780550071

Join us over on Reddit at r/LocalLlaMA to get 10 different opinions on that

dotancohen · 2026-06-04T09:51:44 1780566704

I read there regularly. I find little value there between the memes. I was hoping to ask a knowledgeable person here.

alfiedotwtf · 2026-06-04T11:03:31 1780571011

/r/localllama for a while now seems to prefer Gemma 4 E4B for creative writing (especially the uncensored GGUFs).

plagiarist · 2026-06-04T18:23:42 1780597422

Do they prefer E4B over the larger models or is it a matter of what fits their machine? I assume 4B isn't large enough to get interesting writing but I don't know anything about it.

KludgeShySir · 2026-06-05T18:14:39 1780683279

Gemma4-31b seems to be very highly regarded for creative writing, especially its finetunes. As for comparison to E4B, I can't say.

Creative writing is not my focus, so this is only secondhand information. r/LocalLlama tends to focus more on the technical side; if you want more creative side check out r/SillyTavern as well (but type of info does bleed over between them).

nl · 2026-06-04T11:32:24 1780572744

Qwen 3.5 35B A3

Qwen models are always good. The 35B A3 model is a MoE model which means it has higher performance in RAM constrained environments compared to the 27B dense model (which is better at coding).

I don't have the experience to rate its Hebrew or Greek performance but apparently it's not bad.

sourcecodeplz · 2026-06-04T02:21:58 1780539718

Any Gemma 4 model, they are great at translations, multilingual

silversmith · 2026-06-04T05:26:39 1780550799

For the biggest languages, Spanish, French, maybe.

For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.

It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.

dotancohen · 2026-06-04T08:24:28 1780561468

While Gemini 4 seems fine, Gemma 4 does not do Hebrew well. I've replaced it with Aya Expanse and am getting much better results, but there is still much improvement to be had.

I'm not doing translations, rather querying Hebrew text with a Hebrew prompt.

emmelaich · 2026-06-04T03:03:19 1780542199

You may like https://www.llmfit.org/

(not recommendation, I've not used it .. yet)

hypfer · 2026-06-04T08:06:56 1780560416

Just tried it and honestly it's a terrible experience lacking any sort of intent or reason.

Which is unsurprising in the AI space.

You get a wall of text showing you various random fine-tuned models by random people, and that is basically it.

Actual sane default requirements like "just give me the normal AI labs", "please filter for dense only" and "I want this exact context size at this quant" are not part of the tool, apparently. Neither is "compare these quants for me for the same model".

Or maybe it's just hidden enough that I did not find them before I've stopped caring.

Conway's law is at it again.

____

Edit:

I have since then had qwen3.6 ponder the codebase and think about my complaints.

Seems to require a major data model overhaul to actually fix those, so they're legit. Which I didn't doubt, but nice to have some extra fabricated confirmation after it initially refused and said "nooooo the readme says otherwise nooo hypfer is just a hater noo"

___

Edit 2:

It gets worse the longer I stare at it. This could've been a web calculator.

hypfer · 2026-06-04T13:55:40 1780581340

Done:

https://github.com/Hypfer/will-it-fit-llama-cpp

https://hypfer.github.io/will-it-fit-llama-cpp/

hparadiz · 2026-06-04T10:27:28 1780568848

We need benchmarks by engine, cli switch sets, and device with filters by cpu, gpu, and type. And if someone could please aggregate that in a way where people can upload results and just automatically see the best of any model for their device that would be a killer app.

alfiedotwtf · 2026-06-04T11:05:25 1780571125

I've wanted to vibe code a tuning app, that pumps data through your CPU-GPU-RAM to try and determine the best parameters for each model, but I think it's just too much work compared to manually running by hand a one-liner and changing things here and there.

dofm · 2026-06-04T14:32:54 1780583574

I have found these things to be fully exasperating, to be honest, even though I am seeking information about a pretty "known" machine — a 64GB M1 Max MBP.

(Honestly I think Apple's "AI push" could do worse than just focus on a curated model library, a couple of Apple-standard Gemini distillations, an OS-level model manager and some sort of tweak of their containers system to do what Docker's sbx does. They could demystify a lot of this shit.)

tacomagick · 2026-06-04T05:09:34 1780549774

Gemma 4 26A4B

kajecounterhack · 2026-06-03T21:48:11 1780523291

Have you found Gemma 4 31B better than Qwen 3.6 27B Q8? I just started using Qwen + Pi agent and it's great, but "which model works best" is still totally crowdsourced and I was going off of peoples' opinions on reddit. Would love to hear more opinions if people have them.

embedding-shape · 2026-06-03T22:37:50 1780526270

> Have you found Gemma 4 31B better than Qwen 3.6 27B Q8?

Which quant of Gemma? For coding Qwen seems to be pretty far ahead, but generally Gemma seems to have a "vaster" set of knowledge, but armed with a search tool it doesn't really matter, and Qwen 3.6 been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.

> I was going off of peoples' opinions on reddit

It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.

xenophonf · 2026-06-03T23:06:13 1780527973

It took me way too long to realize you were referring to r/localllama.

MoonWalk · 2026-06-03T23:41:03 1780530063

Why the obfuscation in the first place?

embedding-shape · 2026-06-04T10:24:11 1780568651

Just a bit of flair. Also, a bunch of people have "keyword watchers" set up for various terms, so when you mention certain things on HN, reddit and elsewhere, you get commentators who enter the conversation not because of the context or larger conversation, but because the single term/thing they care deeply about was mentioned, and it just gets very boring to read the whole attackers/defenders comments over and over again. But ultimately I just did it like that because it was more fun to write it like that.

MoonWalk · 2026-06-08T16:56:52 1780937812

But it renders the comment baffling to those who have never heard of that forum. I'm on here and Reddit quite a bit, and never heard of it.

zozbot234 · 2026-06-04T00:10:01 1780531801

I'm not sure that GP is correct, many people in that forum tend to hate Qwen for closing up many of their more recent models and leaving the whole local inference community 'stranded' on their older releases.

julianlam · 2026-06-04T04:34:43 1780547683

Are you sure? Prior to today the sub seems to be pretty partial to Qwen.

kajecounterhack · 2026-06-04T04:08:40 1780546120

That was definitely not the subreddit where I got my info.

thangalin · 2026-06-03T22:35:55 1780526155

Yes. I'm using Gemma-4 31B (gemma-4-31B-it-assistant.Q4_K_M.gguf) with llama.cpp to attribute quotations throughout chapters of my sci-fi novel. I started with Qwen3, but couldn't get it to work. Qwen3 TTS Voice Design, on the other hand, is incredible (Qwen3-TTS-12Hz-1.7B-VoiceDesign). I'm using both for an audiobook generator that produces a variety of voices.

Screens:

* https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)

* https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)

khimaros · 2026-06-04T17:55:41 1780595741

building something similar: https://github.com/khimaros/autiobook

qingcharles · 2026-06-04T06:36:29 1780554989

Gemma 4 31B is enormously impressive. You get 1000 requests/day for free on Google's API and another 1000/day off OpenRouter. Only problem is you get 503 like crazy.

iso1631 · 2026-06-03T23:48:24 1780530504

I find ram crazy. My thinkpad has 32G of ram, it's a t470 that's nearly a decade old

Why do people with modern laptops have such little amounts of ram?

willy_k · 2026-06-04T00:28:58 1780532938

The ram that’s important for LLMs is gpu-accessible memory, meaning either systems with unified ram or VRAM, the latter of which is tied to the caliber of GPU one has.

alfiedotwtf · 2026-06-04T11:08:17 1780571297

8GB was the standard for a long time (before Apple went Silicon) because, from what I understood, SDRAM needs to constantly refresh its cells or the bits will fade, so having more RAM meant your battery would last a little less... this was around the time when 3 hours of charge was unheard of, so every little bit helped.

Probably doesn't matter these days with all-day batteries, but now the demand-supply curve is lopsided.

doubled112 · 2026-06-04T00:28:17 1780532897

My job still issues 16GB laptops as standard. You need a business reason to get more. This has been going on since before the price hikes.

I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.

Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.

Close some Chrome tabs?

SturgeonsLaw · 2026-06-04T09:36:11 1780565771

Unified memory is soldered to the motherboard and needs to be ordered with the new laptop, for prices that are well above what the equivalent amount of SODIMM would cost.

Fine if work's paying, but for personal devices (that might have been purchased before local models got good), people have what they have.

AshleyGrant · 2026-06-05T17:57:28 1780682248

It doesn't have to be soldered to the motherboard. I've got a Minisforum PC that has unified memory installed via dual SODIMM slots. I put 64 gigs of DDR5 sticks that cost me over $600 and can determine the split between the system and VRAM in the BIOS.

senko · 2026-06-03T21:45:38 1780523138

Yeah, I agree 24B-36B sizes are better in general.

I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.

jmpeax · 2026-06-04T02:26:43 1780540003

> nowhere in the announcement is coding mentioned

It's right there in the middle benchmark bar "LiveCode Bench" 72%.

ricardobayes · 2026-06-04T13:23:06 1780579386

Qwen 3.5 9B is great for coding, but somehow, based on a few hours of subjective tests, the Gemma 4 12B seems even better.

mark_l_watson · 2026-06-05T13:05:13 1780664713

I had odd Gemma 4 12B results: it was ‘almost excellent’ for writing code in a variety of languages if I was using a detailed one-shot prompt describing new code to write.

I had horrible luck with Gemma 4 12B with a variety of coding harnesses - but as usual Qwen 3.5 9B did OK.

EDIT: CORRECTION: I pulled a fresh copy of Gemma 4 12B and inference code and the tool use problems in my test harnesses are fixed. Gemma 4 12B is slow on my 16GB MacBook Air, but produces OK results.

dofm · 2026-06-04T14:27:41 1780583261

It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. I would guess it has beginner-friendly knowledge of Python, too?

(Though it is gaslighting me about PHP anonymous functions.)

I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of agentic coding tutorial environment.

I test these models with simple things. My favourite mini test is asking an AI to write a "last login" tracker facility for wordpress with a sortable admin column, which is trivial code — only a few lines -- but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.

It can write the code. Not tested it but I am sure it works. It's not as elegant.

It is not as good at understanding nuanced instructions as either the 26B or the sparse Qwen 3.6. There are concise things you can say in a prompt to Qwen 3.6 that have it draw logical conclusions that fully impress me.

I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.

(I say this as a sort of qualified cynic; I think much of the AI circus is a farce. But if these things are to ever be useful for teaching without making people dependent on some cloud "intelligence tap", this is progress)

sgt101 · 2026-06-04T19:46:36 1780602396

31B won't run in 48GB for me - it needs 54.

yassa9 · 2026-06-04T20:51:56 1780606316

what quantization did u try ? u can use Q4 quantization, im pretty sure that 48GB would be enough

sgt101 · 2026-06-06T07:05:55 1780729555

8bits is fine.... I was talking full bore.

superkuh · 2026-06-03T21:20:34 1780521634

>consumer-grade card with 12G of VRAM and got 5t/s

That speed for token output indicates to me that it is somehow running in hybrid mode and involving CPU + system RAM. ~5tk/s is about what DDR4 RAM bandwidth gives you for a model that size at 4bit. Any consumer GPU with 12 GB like a nvidia rtx 2080 or rtx 3060 should be doing 20+ tk/s with llama.cpp and the CUDA backend.
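A rough sanity check of that reasoning, as a sketch: if decoding is memory-bandwidth bound, each generated token reads roughly all the weights once, so observed speed times weight size approximates the effective bandwidth. The ~4.5 bits per weight is an assumption for a typical 4-bit GGUF quant, and `implied_bandwidth_gb_s` is a name made up for illustration.

```python
def implied_bandwidth_gb_s(params_billions: float,
                           bits_per_weight: float,
                           tokens_per_s: float) -> float:
    """Effective memory bandwidth implied by an observed decode speed.

    Assumes a dense model where every token reads all weights once;
    KV cache reads and other overhead are ignored.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * tokens_per_s


# 12B params, ~4.5 bits/weight (assumed), 5 tokens/s as reported above.
print(implied_bandwidth_gb_s(12, 4.5, 5))  # 33.75
```

Roughly 34 GB/s is below the 51.2 GB/s theoretical peak of dual-channel DDR4-3200 and far from the hundreds of GB/s a discrete GPU's VRAM provides, which is consistent with the weights being read from system RAM.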

senko · 2026-06-03T21:42:49 1780522969

Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).

I should play a bit more with llama.cpp options and see what happened there. Thanks!

superkuh · 2026-06-04T00:10:45 1780531845

I've had it happen in the past with llama.cpp on linux that the CPU will present itself as a vulkan device GPU1 with "PHYSICAL_DEVICE_TYPE_CPU" and had a mix-up. Might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.

pja · 2026-06-04T20:30:53 1780605053

The 8 bit quant runs at 36tps using Vulkan on my AMD rx9070.

zigzag312 · 2026-06-03T20:12:47 1780517567

> It roughly compares with GPT-4.1 (!!), released 14 months ago

I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to bigger size.

mdp2021 · 2026-06-03T21:17:29 1780521449

> I suspect ... still wins in general world knowledge due to bigger size

Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.

I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.

(Corroboration: we can't delirate, and neither can the machine...)

bitexploder · 2026-06-03T22:23:51 1780525431

Don't LLMs work on attention though? The closer in their hyperdimensional space you can land your problem to their inherent understanding, the better they are at understanding your problem domain. RAG loops can be very slow and agents may simply lack the knowledge to use them correctly.

mdp2021 · 2026-06-06T08:26:26 1780734386

But, in short, the ability to manage information, to process it properly, is more important in this regard than just having the information. "Having" more knowledge is not a guarantee to "using" it better.

And to improve reliability, if the machine can check, it will have to check. "Costly" cannot be an excuse.

zigzag312 · 2026-06-10T14:39:01 1781102341

Understanding of a specific problem space can be a prerequisite to be able to form a proper query (i.e. to ask the correct question).

Model doesn't know what it doesn't know.

mdp2021 · 2026-06-11T06:53:24 1781160804

Your suggestion is not clear: yes we reason and define relevant details (maybe through further information retrieval) to better construct queries - that is what Analytical school of thought taught and insisted on -, and even more crucial is that the subsequent delegated steps, of constructing replies, imply reasoning and information retrieval.

Said abilities - intellectual strength - are immensely more important than notions. The relation between network size and intellectual strength, vs network size and notions (original topic in this branch), is presumably not yet that clear. Intelligent models may not necessarily be embedded with explicit information of everything, though they will have to have ways to reach that upon contingent necessity (to solve specific problems). Like us.

zigzag312 · 2026-06-11T12:59:13 1781182753

I agree with what you said. I just wanted to add that intelligent models probably need to have some notions embedded (but not everything), as some information retrieval is not trivial. Too few embedded notions will hurt its ability to solve problems but from some point onward you'll get diminishing returns (where it starts to make sense to rely just on information retrieval).

For example, say you instruct a model to create a decoder for some data type users will upload to your website. The intelligent model without notions will retrieve information about that data type and build a working decoder, but it might miss from context that users uploading to a website means untrusted input and thus won't even try to gather information about what needs to be done to securely handle such uploaded data.

Or if you give it a task to translate text to a language it didn't encounter during training. You can provide it with grammar rules and a dictionary for information retrieval, but I guess it won't perform as well as an intelligent model that already has some fundamental notions of that language and only needs a dictionary to expand its vocabulary.

GPT-4.1 only knows a lot of patterns, but doesn't have reasoning intelligence that would help it properly use that knowledge. So, a small reasoning model can easily beat it in a lot of tasks. The question is how will, 14 months from now, new small reasoning models compare to current big reasoning models.

How much information needs to be embedded is not yet clear, but currently, bigger reasoning models are still better at complex tasks than small reasoning models. Either the sweet spot of embedded notions is higher than what current small models have or information retrieval ability needs to improve.

pu_pe · 2026-06-04T11:28:41 1780572521

I agree with you in general, but depending on the task I also find that a certain level of encyclopedic knowledge can be very valuable. For example, if you use it for coding, the model will likely not resort to search or RAGs when deciding whether to use a particular package or stack.

coldcity_again · 2026-06-03T21:40:06 1780522806

A great position to take. Strong opinions, weakly held.

McGlockenshire · 2026-06-03T20:45:59 1780519559

> my consumer-grade card with 12G of VRAM and got 5t/s for output

Thank you for giving me hope!

UncleOxidant · 2026-06-04T15:44:49 1780587889

I've heard the assertion that the Gemma 4 models don't do well with lower quantization. I wonder if the "bizarre/trivial" syntax errors would go away at Q8?

amelius · 2026-06-04T20:26:24 1780604784

> it would do an extra closing bracket or paren a few times

I had this with Gemini: in the middle of a C++ program it once said RParen instead of using )

It was easy to fix of course, but it makes you question what is going on inside its head.

pja · 2026-06-04T20:34:10 1780605250

The Unsloth 8bit quant seems to manage this task without any syntax errors.

frikk · 2026-06-03T19:05:48 1780513548

Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.

profunctor · 2026-06-03T19:32:18 1780515138

With a harness you could feed the code to a linter and if there are errors feed that to a model automatically. It’s amazing that the models are good enough that I haven’t bothered doing this
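A minimal sketch of what such a loop could look like, assuming the generated code is Python so that `ast.parse` can stand in for a real linter. `generate` is a hypothetical callable wrapping whatever model you use (prompt in, code out); nothing here is a specific harness's API.

```python
import ast
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short description if `source` is not valid Python, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def generate_with_repair(generate: Callable[[str], str],
                         prompt: str,
                         max_rounds: int = 3) -> str:
    """Ask the model for code, then feed syntax errors back until it parses."""
    code = generate(prompt)
    for _ in range(max_rounds):
        error = syntax_error(code)
        if error is None:
            return code
        # Hand the checker output back to the model and ask for a fix.
        code = generate(
            f"{prompt}\n\nYour previous answer had a syntax error ({error}).\n"
            f"Previous answer:\n{code}\n\nReturn only the corrected code."
        )
    error = syntax_error(code)
    if error is not None:
        raise RuntimeError(f"still broken after {max_rounds} repair rounds: {error}")
    return code


# Fake "model" for demonstration: first answer is broken, second is fixed.
answers = iter(["def f(:\n    return 1\n", "def f():\n    return 1\n"])
print(generate_with_repair(lambda _prompt: next(answers), "write f"))
```

For JavaScript or other languages you would swap `syntax_error` for a call to the relevant linter or compiler; the loop itself stays the same.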

DeathArrow · 2026-06-04T08:03:59 1780560239

>The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually

Can you instruct it to use a lsp?

borissk · 2026-06-04T14:34:10 1780583650

We are really getting close to singularity - the pace of LLM improvement is constantly accelerating.

pseudosavant · 2026-06-03T22:22:17 1780525337

Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.

Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.

pseudollm · 2026-06-03T23:52:43 1780530763

> usefulness of the RTX Spark

Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what is the memory bandwidth. It's going to be dog-slow unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow
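The rule of thumb above, written out as a small sketch. It is an upper bound that assumes decoding is purely memory-bandwidth bound; the 300GB/s figure is the rumour quoted above, and the 3GB case is a hypothetical MoE with few active weights added for contrast.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed: every token reads all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


print(decode_tokens_per_s(300, 30))  # 10.0 (the example above)
print(decode_tokens_per_s(300, 3))   # 100.0 (hypothetical MoE, ~3GB active)
```

Real throughput lands below this bound, since KV cache reads and compute also cost time, but it shows why active weights matter more than total model size.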

SwellJoe · 2026-06-04T00:04:34 1780531474

Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.

The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.

zozbot234 · 2026-06-04T00:17:31 1780532251

Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.

int_19h · 2026-06-04T03:49:32 1780544972

I can't speak for others but IMO the only reason to run models locally right now is privacy - i.e. you don't trust any of the cloud providers to not look at your prompts. Price-wise the market is extremely competitive and cheap model serving favors large scale so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.

SwellJoe · 2026-06-04T00:29:10 1780532950

Even batched it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it as it would prevent me from doing other stuff with it during that time.

Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.

Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.

hedgehog · 2026-06-04T01:18:08 1780535888

The 6-bit versions + 8-bit KV cache seems to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (way fewer tokens to arrive at an answer) even though it is much larger.

SwellJoe · 2026-06-04T02:03:01 1780538581

Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense model (for the case I have data about...I've been benchmarking models for security vulnerability scanning, and 27B is notably better on hard bugs).

The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.

I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.

hedgehog · 2026-06-04T03:09:42 1780542582

Same chip, with a 6 bit 35B and 8 bit KV cache I see about 500 prefill and 55 decode at 30k into the context window. MiniMax seemed a bit lower token rate but much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model to do most execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs so all I care about is overall throughput).

milch · 2026-06-04T05:44:19 1780551859

What hardware do you need to run MiniMax M2.7 230B locally?

hedgehog · 2026-06-04T08:17:13 1780561033

Ryzen 395 is what I'm using, anything with 128GB+ of RAM accessible to the GPU should work fine for a 4 bit version of the model (so Spark or Mac Studio should be ok too).

dirkg · 2026-06-04T05:38:33 1780551513

The RTX/DGX Spark, Mac Ultras with 128GB unified ram are all ~$5k. It's still an expensive toy for rich people, it might as well be an H100 for 99.9% of the population (not devs with high paying jobs, of course).

the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.

there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.

green7ea · 2026-06-04T13:28:55 1780579735

I'm using a Strix Halo laptop (~3k, 64GiB) and with Gemma 4 and Qwen 3.6, both at 8 bits, I'm seeing very impressive results.

As a work tool, this is reasonably priced. You can save a bit of money by opting for a non-laptop form factor.

organsnyder · 2026-06-04T14:05:49 1780581949

My Framework Desktop with 128GB was about half that. I did luck out by buying before RAM prices went crazy, though.

I'm looking forward to the fallout when the data center bubble bursts. There's a good possibility we'll see a glut of hardware, either on the used market or from manufacturers that no longer have massive orders from OpenAI and the like.

zozbot234 · 2026-06-04T00:05:42 1780531542

RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.

senko · 2026-05-30T16:39:23 1780159163

Yeah, it's weird, nobody's saying "we should make all the data centres use closed loop cooling even if it's more expensive for them!", but a lot of voices are yelling "AI uses water!", referring to the same thing.

I mean, email and Hacker News and Netflix use water, too.

skeledrew · 2026-05-30T17:09:54 1780160994

Something that I've started looking into and I think could become an interesting metric is resource usage comparison of # of average-request prompts against minutes of audio/video streaming. Then we can start to say things like "you know, watching a 10-minute YouTube video uses roughly the same amount of resources as 60 prompts" and hopefully have a more down-to-earth conversation surrounding our ecological impact and how we assign value.

senko · 2026-05-30T16:34:54 1780158894

This post, like many others, confuses AI with Big Tech (or maybe that's intentional).

I can wholeheartedly agree with everything said there, if I mentally replace "AI" with "big tech profitmaxxing using this new tech".

I, however, don't want to throw the baby out with the bathwater: https://blog.senko.net/how-i-want-to-use-ai

wuhhh · 2026-05-30T16:54:43 1780160083

That the frontier models and subs are all big tech is probably what bothers me most about “AI” right now, but I’m bullish on advancements in the capabilities of local models. I suspect and hope that, in time, the field will level and we will have very capable local, offline models and the landscape will be much as it is now with subscription compute in the cloud for enterprise and self host / local first for indies / hackers etc.

wesapien · 2026-05-30T20:28:04 1780172884

I agree. It's a finance capitalism issue. The people who build data centers get free rein on the use of energy, land, water and the atmosphere while everyone will bear the cost. It's another upward transfer of wealth. Another wealth pump as Peter Turchin describes it. This could end whatever democratic process is left. The West will be just like China but worse. Slower trains and more violent. At least in China, political power is in control of capital.

morislz · 2026-05-30T17:00:08 1780160408

Well, but the data centers needed for AI are on a much different scale than what "big tech profitmaxxing" used to need. I also agree with the author and you. Morally, I also cannot support the toll it takes on the environment, workers, and society in general. However, what's the option? Either be part of it or get laid off. Build an AI startup or be employed in one and get that money or, well, I really cannot imagine a third path that's both financially viable and keeps you relevant in the next decade.

sambuccid · 2026-05-30T20:40:54 1780173654

I guess the third path could be to use it as little as possible, hopefully finding a job that doesn't enforce its use. Still learn how to use it in case that becomes your only viable option in a few years' time. And don't forget how to program by hand, for the case where in a few years' time AI didn't improve as much (or just costs too much) and we discover there is a lot of messy AI code to fix. In that case you might be able to keep your moral stance and still get paid reasonably.

senko · 2026-05-30T16:24:22 1780158262

I do believe there's going to be a lot of left-behinds, as a sort-of digital rust belt. Even though, as an industry, we've always been in the business of automating and replacing ourselves, the shift will hit too quick.

I lay none of the blame on AI the technology, and all the blame on AI as a mindset and excuse.

Layoffs are not due to AI, but it's a convenient excuse: "more productive, don't need people, we're firing on all cylinders and yeah, firing 20% of the workforce while we're at it". Everything else being equal, the "more productive so we earn 20%" counterfactual makes more sense - but of course, not everything else is equal.

Treadmill speed will increase, no doubt. We haven't lowered working hours from 40 to 20 when computers 2x'd us all, we for sure won't lower them now.

We'll manage the nondeterministic imperfections, but boy, will there be bumps on the road.

What I fear most: AI will give us all more power. This includes profitmaxxing no-holds-barred corpos, from preseed startups hustling 996-style to big multinationals. Even now, with locked-down devices and subscriptions for everything and owning things replaced with "owning a limited nontransferable revocable end-user license", it's not good. AI is going to multiply that.

Damn, now I need a drink.

xerox13ster · 2026-05-31T15:43:34 1780242214

Corruption belt

senko · 2026-05-30T11:16:12 1780139772

> And things like AC and clothes dryers are taken for granted.

Not sure where you get your impression of Europe, but if you feel amenities like these are not standard, it’s a few decades out of date.

North Europeans traditionally didn’t need AC, but everywhere where it gets hot - which is everywhere now - they got them installed. Very few buildings with integrated HVAC systems for the entire building tho, mostly independent units.

Markoff · 2026-05-30T12:34:24 1780144464

yeah it's funny, actually in poor Bulgaria pretty much every apartment has AC, so AC is certainly not anything to brag about, even the poorest people have it

clothes dryers are just a plain stupid waste of space and consume a lot of energy/money. I've had a washing machine with a dryer and pretty much never used the dryer after seeing how long it takes to dry the clothes while wasting electricity; the new washing machine I bought is without a dryer

senko · 2026-05-30T11:07:40 1780139260

European here. Yes, houses are smaller, apartments can be comparatively tiny. Street parking can be a challenge.

However: I got stores, cinemas, cafes, restaurants within walking distance. My kids can roam around in the neighbourhood without someone calling social services on me. I can walk anywhere in the city at any hour day or night without someone robbing me. I can cheaply purchase free range eggs and organic vegetables. Tap water is fine, actually excellent. 30hour commute is considered too long. Coast is mere 3h away, people come from all over the planet to enjoy it, I spend 5 weeks a year there, just chillin and enjoying life. I get fast, cheap internet, and order groceries, do my taxes and doctor's appointments online.

Tell me again how I’m suffering without poorly insulated detached houses, HSA, spam calls, an SUV to drive myself to the bakery, school shooting drills, healthcare bills, homeless people rejected by society, and that circus you have for a government right now?

siren2026 · 2026-05-31T07:06:01 1780211161

Agree with you. Americans have been brainwashed in thinking they absolutely need a 3000 sqft house, 2 SUVs and pay for the absolute best private schools for kids. It's the ultimate rat race.

As someone that has lived in both Europe and America, the quality of life you get in America for the amount of money you spend is hilariously bad. It is easy to make money though.

In Europe the quality of life you get for cheap is by default excellent. It is so difficult to make money though.

drnick1 · 2026-05-31T17:45:21 1780249521

> In Europe the quality of life you get for cheap is by default excellent.

I have been to Europe on many occasions and homes (actual homes, not cramped apartments) are not cheap at all relative to incomes. I am sure the peace of mind provided by universal healthcare and generous welfare programs is nice, but that's not how you build a strong economy. Incentives are distorted when you don't need a (good) job to live well enough. You get mediocrity, lackluster growth, poor customer service, and the other things Europe is known for. That's why you see people from all over the world come to America to build their businesses.

senko · 2026-05-31T20:35:29 1780259729

Europe definitely has a lot of problems! Some are similar, others are different than in the US.

Just comparing income, even on a purchasing-power-parity basis, doesn't cut it.

senko · 2026-05-28T22:46:37 1780008397

> Looking back it feels like GOOG, FB, TSLA etc. all went IPO at reasonable valuations

Yeah, looking back. At the time, I distinctly remember people were going batshit over the insane FB valuation. It wasn't at all obvious it was justified.

Hindsight is 20/20.