Hacker News | rahimnathwani's comments

There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

I'm curious which one you're using.


Unsloth Dynamic. Don't bother with anything else.

For anyone else trying to run this on a Mac with 32GB unified RAM, this is what worked for me:

First, make sure enough memory is allocated to the GPU:

  sudo sysctl -w iogpu.wired_limit_mb=24000
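Note that this setting doesn't persist across reboots. A quick sketch for checking and restoring it (on my understanding, `0` means "use the system default"; verify on your macOS version):

```shell
# Read the current GPU wired-memory limit (0 = system default)
sysctl iogpu.wired_limit_mb

# Restore the default when you're done running large models
sudo sysctl -w iogpu.wired_limit_mb=0
```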
Then run llama.cpp, but reduce RAM needs by limiting the context window and turning off vision support. (Also turn off reasoning for now, as it's not needed for simple queries.)

  llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --no-mmproj \
    --no-warmup \
    -np 1 \
    -c 8192 \
    -b 512 \
    --chat-template-kwargs '{"enable_thinking": false}'
You can also enable/disable thinking on a per-request basis:

  curl 'http://localhost:8080/v1/chat/completions' \
  --data-raw '{"messages":[{"role":"user","content":"hello"}],"stream":false,"return_progress":false,"reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"chat_template_kwargs": { "enable_thinking": true }}'|jq .
If anyone has any better suggestions, please comment :)

Shouldn't you be using MLX because it's optimised for Apple Silicon?

Many user benchmarks report up to 30% lower memory usage and up to 50% higher token generation speed:

https://reddit.com/r/LocalLLaMA/comments/1fz6z79/lm_studio_s...

As the post says, LM Studio has an MLX backend which makes it easy to use.
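If you'd rather stay on the command line than use LM Studio, mlx-lm also has CLI entry points, including an OpenAI-compatible server. A sketch, assuming the mlx-community repo name below is the quant you want (it's the one linked elsewhere in this thread; check Hugging Face for alternatives):

```shell
pip install mlx-lm

# One-off generation:
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "hello" --max-tokens 64

# Or serve an OpenAI-compatible endpoint, similar to llama-server:
mlx_lm.server --model mlx-community/Qwen3.5-35B-A3B-4bit --port 8080
```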

If you still want to stick with llama-server and GGUF, look at llama-swap which allows you to run one frontend which provides a list of models and dynamically starts a llama-server process with the right model:

https://github.com/mostlygeek/llama-swap

(actually you could run any OpenAI-compatible server process with llama-swap)


I didn't know about llama-swap until yesterday. Apparently you can set it up so that it offers different 'model' choices which are actually the same model with different parameters. So, e.g., you can have 'thinking high', 'thinking medium' and 'no reasoning' versions of the same model, but only one copy of the model weights is loaded into llama-server's RAM at a time.
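A sketch of what that might look like, assuming llama-swap's `models`/`cmd` YAML schema and its `${PORT}` macro (the model names here are made up, and the exact keys may differ — check the llama-swap README):

```yaml
models:
  "qwen-thinking":
    cmd: >
      llama-server --port ${PORT}
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL
      --jinja --no-mmproj -c 8192
      --chat-template-kwargs '{"enable_thinking": true}'
  "qwen-no-think":
    cmd: >
      llama-server --port ${PORT}
      -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL
      --jinja --no-mmproj -c 8192
      --chat-template-kwargs '{"enable_thinking": false}'
```

Because llama-swap only runs one backend process at a time, switching between these entries swaps which flags are live without ever holding two copies of the weights in RAM.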

Regarding mlx, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...


FYI, the UD quants of 3.5-35B-A3B are broken; use the bartowski or AesSedai ones.

They've uploaded the fix. If those are still broken, something bad has happened.

UD-Q4_K_XL?

"When you're experiencing the story, in the back of your mind, you always know that there is someone who created the story to tell you some kind of message."

I might know that, but I usually don't care.


Used books in France appear extremely cheap because new ones can't be discounted. By law (I think), retailers are only allowed to offer a maximum of 5% off.

Used books are exempt from the law entirely, so they're priced by pure market forces.

In countries like the US or UK, a recently-published book might already be 40% off list price, so used copies may not be as much of a bargain.


Right but everyone else is talking about latency, not throughput.

You could try minimax 2.5 via openrouter.

MiniMax has an incredibly affordable coding plan for $10/month. It has a rolling five hour limit of 100 prompts. 100 prompts doesn't sound like much, but in typical AI company accounting fashion, 1 prompt is not really 1 prompt. I have yet to come even close to hitting the limit with heavy use.

Hugging Face now provides instructions for using local models in Pi:

https://x.com/victormustar/status/2026380984866710002


I saw this today on X. It's built on top of the pi coding agent (same one used by openclaw). It uses pi-web-ui ( https://github.com/badlogic/pi-mono/tree/main/packages%2Fweb...).

What surprised me was that it works with OAuth tokens from consumer AI subscriptions, and even free ones like Antigravity. Some of this is inherited from pi, but the author did some extra work to build a proxy so that auth works even from within the Excel sidebar.

I'm not sure how it works and what the role of the backend is. But I hope to test it side by side with 'Claude for Excel', which I also happened to install today.

(I tried both on something basic just to check they were installed correctly.)


> and even free ones like Antigravity

48 hours later on hn: "Google just banned my 27-year-old account for absolutely NO REASON and the customer service is useless!!!"


Is pi better than opencode?

opencode is great, but I'm a big pi fan - pi has a minimal core that is hackable+extensible by design and is aware of its own docs. So your coding agent can mod your coding agent exactly as you like. https://pi.dev/

I'm not sure we always want 'works'. Sometimes different 'expressions' of the same work are different enough that they don't have the same value.

For example, compare the most recent edition of 'Straight and Crooked Thinking' with the one published in 1930.


I don't know that work, but I agree with you in general because of forewords etc. Or even appendices. And translations by different translators.

I "grew up with" a specific translation of Lord of the Rings into Norwegian, for example. There are two. They are very different. But the editions also differ in whether they include the appendices, whose illustrations are used, and more.


>They are very different.

Are we talking material plot or characterisation changes?


No, but many of the names are different, and stylistically they diverge a lot. Depending on whether a translation tries to be fairly literal, or tries to sound as if it were originally written in the target language, the result can feel very different.

An example is the name Bilbo Baggins. In the "canonical" Norwegian translation, he's become Bilbo Lommelun. "Lomme" means pocket, and "lun" means snug, warm, or comfortable. It's not literal, but it fits the nature of hobbits well while referencing the "bag" in "Baggins", and the connotations come through immediately in Norwegian without the reader having to deconstruct the name.

In this case, I think the newer "canonical" translation is generally considered unambiguously the best, but people often have favourite translations. E.g. my favourite Scandinavian translation of Walt Whitman's Leaves of Grass isn't even Norwegian, but an old Danish translation which sounds much "softer" (it's hard to explain).


I know Norwegian also has two different written standards, found an example that demonstrates it:

>English: I will not tell anyone the secret.

>Bokmål: Jeg skal ikke fortelle hemmeligheten til noen.

>Nynorsk: Eg skal ikkje fortelja løyndomen til nokon.

Source: https://www.visitnorway.com/typically-norwegian/norwegian-la...


Yeah, so really there are at least 3 translations of Lord of the Rings, to continue that example, and I was being a typical Bokmål user and ignored Nynorsk.

The title differences are also a good illustration of how different it can be:

Bokmål: Kampen om Ringen, Ringenes Herre (the first one is literally "the battle for the ring")

Nynorsk: Ringdrotten


The most obvious example of this is the innumerable[0] versions of the Christian bible.

[0] Before anyone says it, I'm sure some bible nerd has numbered them, it's hyperbole.


I think the point is, you want a single work when searching.

Then click on the item and drill down into editions sorted by year, or whatever.

But when you're doing a search, it's terrible UX to flood the results with dozens of editions mixed in with other things with similar titles.


Here's another Alpha parent responding to that review: https://naimoli.com/peter/posts/2xlearning/

Summary: the '2× learning' claims are overblown.


Whether a link requires login or not is irrelevant when everyone has the same password: https://x.com/RahimNathwani/status/1900199324279333115
