Hacker News | cpldcpu's comments

Great to see this!

Worth mentioning that Huggingface already offers a similar service. And they are also European:

https://huggingface.co/docs/inference-providers/index

https://huggingface.co/inference/models


Huggingface is about as European as Google, IBM, or Visa: they all have some offices in Europe.

Huggingface isn't European just because they have offices there.

Nice, but I couldn't find a pay-as-you-go plan.

They only offer pay-as-you-go. The $9/month plan includes $2 in credits, and beyond that it's PAYG without markup.

https://huggingface.co/docs/inference-providers/pricing

It's well buried though. Does not seem to be a focus of theirs.


I had to give it a try.

Claude, the ol' cheater, recognized what the file was, downloaded the PSID from the web, found a WASM SID player and built a website around it:

https://claude.ai/public/artifacts/df6cdcae-08dc-452b-ba19-f...

https://claude.ai/share/4dd36c16-bc62-445a-b423-ad4637f06432

GPT-5.5 built a lot of Python scripts to extract the music data. The Strudel implementation failed, but I then asked it to build a website:

https://ubiquitous-vacherin-8e7993.netlify.app/

This is a translation of the music into JavaScript, based on the assembler source.

Really impressive on both counts. Some iterations were required for both.


Yes, marks of AI all over the place. Also the SVGs.

>No solution written, 100% score.

It's weird. It turns out that the hardest problem for LLMs to really tackle is long-form text.


Maybe in one shot.

In theory I would expect them to be able to ingest the corpus of the New Yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.

The harder part seems to be synthesizing new connections from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.


Most of the good major models are already very capable of changing their writing style.

Just give them the right writing prompt. "You are a writer for the Economist, you need to write in the house style, following the house style rules, writing for print, with no emoji .." etc etc.

The large models have already ingested plenty of New Yorker, NYT, The Times, FT, The Economist etc articles, you just need to get them away from their system prompt quirks.


I think that should be true, but it doesn't hold up in practice.

I work with a good editor from a respected political outlet. I've tried hard to get current models to match his style: filling the context with previous stories, classic style guides and endless references to Strunk & White. The LLM always ends up writing something filtered through tropes, so I inevitably have to edit quite heavily, before my editor takes another pass.

It feels like LLMs have a layperson's view of writing and editing. They believe it's about tweaking sentence structure or switching in a synonym, rather than thinking hard about what you want to say, and what is worth saying.

I also don't think LLMs' writing capabilities have improved much over the last year or so, whereas coding has come on leaps and bounds. Given that good writing is a matter of taste which is beyond the direct expertise of most AI researchers (unlike coding), I doubt they'll improve much in the near future.


You're ignoring what I said. They work better when you turn it into a two-step process. Step 1: create a template. Step 2: execute the template.

>The large models have already ingested plenty of New Yorker, NYT, The Times, FT, The Economist etc articles

And that ends up diluting them. Going back and doing another pass on only a subset would give them a stronger voice. At some threshold, ingesting more material just averages everything out and regresses to the mean instead of adding information. It's a giant table of word associations; it can regress.


No, the failure is the human-written prompt.


You know, after a while this excuse is not valid anymore.


If they're that hard to prompt maybe it's easier just to write the blog posts yourself.


Someone here mentioned a while ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.


But it's odd that these characteristics are the same across models from different labs. I find it hard to believe that researchers across competing companies are coordinating on something like that.


Love it! Great idea for the dataset.


+1


They mentioned that they are using strong quantization (IIRC 3-bit) and that the model was degraded by that. Also, they don't have to use transistors to store the bits.


I think they are talking about the transistors that apply the weights to the inputs.


gpt-oss is FP4. They're saying they'll next try a mid-size one, I'm guessing gpt-oss-20b, and then a large one, I'm guessing gpt-oss-120b, since their hardware is FP4-friendly.


I wonder how well this works with MoE architectures?

For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.

With MoE, it is more like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you are suddenly forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain from using a highly optimized memory process for the memory instead of mask ROM.

At that point we are back to a chiplet approach...
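For a rough feel of the mismatch, a back-of-the-envelope sketch (the llama-3.1-8B figure is real; the MoE numbers are made up purely to illustrate the ratio):

    # How many of the stored weights each token actually exercises.
    # Dense model: every stored weight is multiplied for every token.
    dense_params = 8e9            # llama-3.1-8B, roughly

    # Hypothetical MoE: big total parameter count, small active slice per token.
    moe_total_params  = 120e9     # illustrative, not a real model
    moe_active_params = 5e9       # weights actually used per token

    dense_util = dense_params / dense_params           # 1.0: MACs sit next to weights they always need
    moe_util   = moe_active_params / moe_total_params  # ~0.04: most stored weights idle on any given token

    print(f"dense weight utilization per token: {dense_util:.2f}")
    print(f"MoE   weight utilization per token: {moe_util:.2f}")
    # The lower this ratio, the more the chip looks like "big memory + small MAC",
    # which is where a dedicated memory process (or chiplet) starts to beat mask ROM.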


For comparison, I wanted to write about how Google handles MoE architectures with TPUv4.

They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.

The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses.

Of course this is a datacenter-level system, not something on a chip for your PC, but I just want to convey the scale here.

*ed: SpareCores to SparseCores
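Rough arithmetic on that scale, using the numbers above (the 4x4x4 cube shape is from Google's public TPUv4 material; treat the framing as my reading):

    # TPUv4 pod topology, back of the envelope.
    chips_per_cube = 4 * 4 * 4                      # a 4x4x4 block = 64 chips, locally meshed
    pod_chips      = 4096                           # one pod
    cubes_per_pod  = pod_chips // chips_per_cube    # 64 cubes per pod

    ocs_neighbors_per_cube = 6                      # each cube face goes out through the OCS fabric
    print(f"{cubes_per_pod} cubes of {chips_per_cube} chips, "
          f"each with {ocs_neighbors_per_cube} OCS-switched neighbors")
    # The MEMS-mirror OCS layer rewires which cube faces connect, so the same pod
    # can be carved into different torus shapes to match a given MoE's communication pattern.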


If each of the expert models were etched in silicon, it would still give a massive speed boost, wouldn't it?

I feel printing the ASIC is the main blocker here.


It could simply be bit-serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
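A toy sketch of what I mean, assuming unsigned 4-bit weights (real hardware would handle sign and run many of these units in parallel):

    def bit_serial_mac(inputs, weights4):
        # Multiply-accumulate where each 4-bit weight is consumed one bit at a time:
        # four shift-and-add steps per weight instead of a full parallel multiplier.
        acc = 0
        for x, w in zip(inputs, weights4):
            for bit in range(4):              # 4-bit weight -> 4 serial steps
                if (w >> bit) & 1:
                    acc += x << bit           # add the input shifted by the bit position
        return acc

    # The weights would come straight out of the nearby ROM in the scheme discussed above.
    print(bit_serial_mac([3, 5], [0b0110, 0b0011]))   # 3*6 + 5*3 = 33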


It may be worth pointing out that this is not the first residual-connection innovation to reach production.

Gemma 3n is also using a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

Seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64
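For anyone curious, here is my rough reading of the LAuReL-LR update as a PyTorch-style sketch. The exact parameterization and scalar placement may differ from the paper, and D=2048 / R=64 are just the values I saw in the Gemma 3n weights, so treat the details as assumptions:

    import torch
    import torch.nn as nn

    class LaurelLRResidual(nn.Module):
        # Sketch of a LAuReL-LR style residual: x_{l+1} = alpha * f(x_l) + x_l + B(A(x_l)),
        # i.e. the identity skip is augmented by a learned low-rank map (rank R << D).
        def __init__(self, f, d_model=2048, rank=64):
            super().__init__()
            self.f = f                                     # the usual sub-block (attention or MLP)
            self.alpha = nn.Parameter(torch.ones(()))      # learned scale on the block output
            self.A = nn.Linear(d_model, rank, bias=False)  # D -> R
            self.B = nn.Linear(rank, d_model, bias=False)  # R -> D

        def forward(self, x):
            return self.alpha * self.f(x) + x + self.B(self.A(x))

    # Usage sketch with a stand-in block:
    layer = LaurelLRResidual(nn.Linear(2048, 2048))
    y = layer(torch.randn(1, 2048))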


This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.

It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.

I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.

Thanks for the paper link, definitely reading that tonight.


Thanks! Would be quite interesting to see how this fares compared to mHC.

I noted that LAuReL is cited in the mHC paper, but they refer to it as "expanding the width of the residual stream", which is rather odd.


All of them use ASML lithography, including CXMT.

They are, of course, a bit slower in EUV adoption. But it's already there:

https://www.tomshardware.com/pc-components/dram/micron-sampl...

https://www.techinsights.com/blog/samsung-d1z-lpddr5-dram-eu...

