Hacker News | throwdbaaway's comments

I don't quite get the low temperature coupled with the high penalty. We get thinking loops because of the low temperature, and then counter them with a high penalty. That seems backwards.

For Qwen3.5 27B, I got good results with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2, without any penalty. That lets the model explore (temp, top-p, top-k) without going off the rails (min-p) during reasoning. No loops so far.
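For anyone unfamiliar with how min-p reins in the exploration, here's a toy sketch, assuming the usual definition: keep only tokens whose probability is at least min_p times the top token's probability, then renormalize.

```python
import numpy as np

def min_p_filter(probs, min_p=0.2):
    # Keep tokens whose probability is >= min_p * (top token's probability),
    # zero out the rest, and renormalize the survivors.
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
# threshold = 0.2 * 0.50 = 0.10, so the last two tokens are dropped
print(min_p_filter(probs))
```

The higher the model's confidence in its top token, the fewer alternatives survive, which is why you can afford a high temperature without the sampler wandering off.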


The guidelines are a little hard to interpret. At https://huggingface.co/Qwen/Qwen3.5-27B Qwen says to use temp 0.6, pres 0.0, rep 1.0 for "thinking mode for precise coding tasks" and temp 1.0, pres 1.5, rep 1.0 for "thinking mode for general tasks." Those parameters swing wildly, and I don't know whether printing "potato" 100 times counts as a "precise coding task" or a "general task."

When setting up the batch file for some earlier tests, I decided to split the difference between 0.6 and 1.0 for temperature and use the larger recommended values for presence and repetition. For this prompt, discouraging repetition probably isn't a good idea. But the existing parameters worked well enough, so I didn't mess with them.


We are all reasonable people here, and while you are (mostly) correct, I think we can all agree that Anthropic's documentation sucks. What I have to infer from the docs:

* Haiku 4.5 by default doesn't think, i.e. it has a default thinking budget of 0.

* By setting a non-zero thinking budget, Haiku 4.5 can think. My guess is that Claude Code may set this differently for different tasks, e.g. thinking for Explore, no thinking for Compact.

* This hybrid thinking is different from the adaptive thinking introduced in Opus 4.6, which, when enabled, automatically adjusts the thinking level based on task difficulty.
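To make the second point concrete, here is a minimal request-payload sketch. The `thinking` field with `budget_tokens` follows Anthropic's Messages API for extended thinking; the model id and budget values are assumptions.

```python
# Hypothetical Messages API payload: a model that defaults to a
# thinking budget of 0 starts thinking once an explicit budget is set.
request = {
    "model": "claude-haiku-4-5",   # assumed model id
    "max_tokens": 4096,
    "thinking": {
        "type": "enabled",         # omit this field entirely -> no thinking
        "budget_tokens": 2048,     # assumed budget; a harness could vary this per task
    },
    "messages": [{"role": "user", "content": "..."}],
}
```

If the guess about Claude Code is right, it would amount to swapping this one field in and out depending on whether the task is Explore or Compact.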


For 27B, just get a used 3090 and hop onto r/LocalLLaMA. You can run a 4bpw quant at full context with a Q8 KV cache.
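The rough arithmetic behind fitting a 4bpw quant on a 24 GB card (illustrative figures only, ignoring the small overhead of mixed-precision quant formats):

```python
# Back-of-envelope weight memory for a 27B model at 4 bits per weight.
params = 27e9
bpw = 4.0
weights_gb = params * bpw / 8 / 1e9  # bits -> bytes -> GB
print(weights_gb)  # 13.5, leaving ~10 GB of a 3090 for KV cache and activations
```

The Q8 KV cache matters because it halves cache memory versus FP16, which is what makes long contexts viable in the leftover ~10 GB.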

I would say 27B matches Sonnet 4.0, while 397B A17B matches Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting a 262144-token context at good speed on modest hardware is huge for local inference.

Will check your updated ranking on Monday.


Can you describe a bit more how this works? I suppose the speed remains about the same, while the experience is more pleasant?

(Big fan of SQLAlchemy)


Not the user you're responding to, but I feel like I do something similar.

I describe what I want roughly at the level where I could still code it by hand, down to telling Claude to create specific methods, functions, and classes (and reminding it to use them, because models love pointless repetition).

Is it faster? Sure, and being this specific has the added benefit of greatly reduced hallucinations (it still depends on the model; Gemini is more prone to doing extra things, even when uncalled for).

I also don't need to fine-comb everything. Logic and interactions I'll check, but basic everyday stuff is usually already well explained in the repo, and the model usually picks up on it.


From quick testing on simple tasks, adaptive thinking with Sonnet 4.6 uses about 50% more reasoning tokens than Opus 4.6.

Let's see how long it will take for DeepSeek to crack this.


If you ask someone knowledgeable on r/LocalLLaMA about an inference configuration that can increase TG by *up to* 2.5x, particularly for a sample prompt that reads "*Refactor* this module to use dependency injection", then the answer is of course speculative decoding.

You don't have to work for a frontier lab to know that. You just have to be GPU poor.
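For the curious, a toy sketch of the idea (greedy verification only; `draft_next` and `target_next` are hypothetical next-token functions standing in for a small draft model and the big target model):

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft_tokens = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft_tokens.append(t)
        ctx.append(t)

    # 2. The target model verifies the proposals (in practice in one
    #    batched forward pass; simulated sequentially here). Accept
    #    until the first mismatch, then substitute the target's token.
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # fix the mismatch and stop
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

A refactoring prompt is close to the best case: most of the output is a verbatim copy of the input, so the draft's acceptance rate is very high and the target commits several tokens per verification pass.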


As mentioned by the sibling comment from godelski, it is about the lack of precision, not the lack of determinism. After all, we already got https://thinkingmachines.ai/blog/defeating-nondeterminism-in..., and nondeterminism is not even an issue for single-user local inference.

Question: Have you tried using LLM as a compiler?

Well, I sort of did, as a fun exercise. I came up with a very elaborate ~5000-token prompt, such that when fed a ~500-token function, I get back a ~600-token rewritten function.

The prompt contains 10+ examples, so the model learns the steps from the context. The model starts by going through a series of yes/no questions to decide which rewrite pattern to apply. The tricky part here is the lack of precision: the "else" clause has to be reserved for the condition that is hardest to communicate clearly in English. It then extracts the part that needs to be rewritten and introspects the formatting, again with a series of simple questions. Lastly, it proceeds, confidently, with the rewrite.

With this, I tested 50+ randomly chosen functions and got back the exact same rewritten functions from about 20 models that are good at coding, down to the newlines and indentation. With a strong model, there might be only 1~2 output tokens in the whole test whose probability was below 80%, so the lack of batch invariance wasn't even a problem. (temperature=0 usually messes up logprobs; go with top_k=1 or top_p=0.01)
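A minimal sketch of the sampling setup the parenthetical suggests, as a completions-style request payload (endpoint, model name, and token counts are assumptions):

```python
# Near-greedy decoding that still returns meaningful per-token logprobs:
# prefer top_p ~ 0 (or top_k = 1) over temperature = 0.
request = {
    "model": "any-coding-model",      # assumed model name
    "prompt": "<rewrite prompt + input function>",
    "top_p": 0.01,                    # effectively argmax at each step
    "logprobs": 5,                    # inspect the model's confidence per token
    "max_tokens": 700,
}
```

With top_p that low, only the single most likely token typically survives the nucleus cut, while the returned logprobs stay interpretable for spotting the rare low-confidence steps.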

So input + English = output, works for multiple models from multiple companies.

But what's the point of writing so much English, in the hope that it leaves no room for ambiguity? For now, I will stick with mitchellh's style of (occasional) LLM-assisted programming, jumping in to write the code when precision is needed.


Not using Hot Aisle for inference?


We're literally full. Just a few 1x GPUs available right now.

So far, I haven't been happy with any of the smaller coding models, they just don't compare to claude/codex.


I call this the Groundhog Day loop


That's a strange name, why? It's more like an "iterate and improve" loop; "Groundhog Day" to me would imply "the same thing over and over," but then you're really doing something wrong if that's your experience. You need to iterate on the initial prompt if you want something better/different.


I thought "iterate and improve" was exactly what Phil did.

