
averne_ · 2026-05-29T16:49:53 1780073393

The blog makes it clear that "standard" GPU here is in opposition to purpose-built hardware like Cerebras. The selling point is reaching the same order of magnitude in generative speed as those approaches.

averne_ · 2026-05-29T16:47:05 1780073225

I tried with some simple prompts (fibonacci, linked list manipulation) and it worked nicely.

averne_ · 2026-03-20T18:14:33 1774030473

I wrote the Vulkan ProRes backend. The bitstream decoder was implemented from scratch, for a number of reasons.

First, the original code was reverse-engineered before Apple published an SMPTE document describing the bitstream syntax. Second, I did my best to optimize the code for GPU hardware. And finally, I wanted to take the learning opportunity :)

And to answer the parent's question, the shaders are written in pure GLSL. For instance, this is the ProRes bitstream decoder in question: https://code.ffmpeg.org/FFmpeg/FFmpeg/src/branch/master/liba...

sylware · 2026-03-20T22:33:15 1774045995

GLSL: this is the really bad part, as this is a definite no-no.

Should have been a plain and simple C-coded generator of SPIR-V bytecode.

averne_ · 2025-12-24T23:07:16 1766617636

Not really. https://codecs.multimedia.cx/2022/12/ffhistory-fabrice-bella...

>Fabrice won International Obfuscated C Code Contest three times and you need a certain mindset to create code like that—which creeps into your other work. So despite his implementation of FFmpeg was fast-working, it was not very nice to debug or refactor, especially if you’re not Fabrice

averne_ · 2025-12-02T09:15:01 1764666901

Not OP, but I also often listen to ambient while programming. A couple of recommendations would be "Music for Nine Post Cards" and other works by Hiroshi Yoshimura, and "Music for 18 Musicians" and others by Steve Reich.

In fact, the use of loops described in this article reminded me of what Reich called "phases", basically the same concept of emerging/shifting melodic patterns between different samples.

averne_ · 2025-10-07T12:01:35 1759838495

New physics in this context means previously unknown effects or mechanisms, or even a new theory/framework for an already understood phenomenon. Using "physics" in this way is common amongst academics.

IAmBroom · 2025-10-07T17:33:01 1759858381

Do you have two aliases on HN, or are you simply presuming to speak for the OP?

averne_ · 2025-09-30T19:53:17 1759261997

The main reason a wafer-scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.

averne_ · 2025-09-14T10:16:39 1757844999

The NVIDIA driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it makes the userland code significantly simpler or more complex: basically, a driver has to keep track of the same things it would've submitted to an ioctl. If anything, there are some subtleties that require careful consideration.

The major upside is removing the context switch on a submission. The idea is that an application only talks to the kernel for queue setup/teardown, everything else happens in userland.
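
To illustrate the split described above, here is a minimal toy model in Python. It is only a sketch under assumptions: the `SubmissionQueue` class, its fields, and the "doorbell" naming are hypothetical and do not reflect NVIDIA's actual interface. The point it shows is that, once the ring is set up, a submission is just a couple of memory writes.

```python
class SubmissionQueue:
    """Toy model of a user-mapped command ring (hypothetical, not a real driver API)."""

    def __init__(self, size):
        # In a real driver, the ring and doorbell would be mapped into the
        # process once, through a kernel call made at queue setup time.
        self.ring = [None] * size
        self.size = size
        self.write_ptr = 0  # advanced by the application (producer)
        self.read_ptr = 0   # advanced by the simulated GPU (consumer)
        self.doorbell = 0   # last write pointer published to the simulated GPU

    def submit(self, command):
        """Queue a command without entering the kernel."""
        if self.write_ptr - self.read_ptr == self.size:
            raise BufferError("ring full")
        self.ring[self.write_ptr % self.size] = command
        self.write_ptr += 1
        # Ringing the doorbell is modeled as a plain memory write, not an ioctl.
        self.doorbell = self.write_ptr

    def gpu_consume(self):
        """Stand-in for the hardware fetching every command up to the doorbell."""
        done = []
        while self.read_ptr < self.doorbell:
            done.append(self.ring[self.read_ptr % self.size])
            self.read_ptr += 1
        return done
```

Note that the bookkeeping (pointers, wrap-around, full-ring handling) is roughly what a driver would track for an ioctl-based path as well, which is why the userland code does not necessarily get simpler.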

sylware · 2025-09-15T08:18:35 1757924315

Yep. Future of GPU hardware programming? The one we will have to "standard"-ize à la RISC-V for CPUs?

The thing is the Vulkan "fences", namely the GPU-to-CPU notifications. These are probably hardware interrupts which will have to be forwarded by the kernel to userland through an event ring buffer (probably a specific event file descriptor). There are alternatives though: we could think of userland polling/spinning on some CPU-mapped device memory content for notification, or we could go one "expensive" step further, which would "efficiently" remove the kernel for good here but would lock a CPU core (should be fine nowadays with our many-core CPUs): something along the lines of a MONITOR machine instruction. Basically, a CPU core would halt until some memory content is written, with the possibility for another CPU core to un-halt it (namely, spurious un-halting is expected).

Does nvidia handle their GPU to CPU notifications without the kernel too?

sylware · 2025-09-15T12:07:02 1757938022

eewww... my bad, we would need a timeout on the CPU core locking to go back to the kernel.

Well, polling? erk... I guess an event file descriptor is in order, and that nvidia is doing the same.
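
The polling-with-timeout idea from the two comments above can be sketched as follows. This is only an illustration under assumptions: the `Fence` class and `wait_fence` function are hypothetical names, a Python attribute stands in for a CPU-mapped word that the GPU would write, and no claim is made about how any real driver implements its notifications.

```python
import threading
import time


class Fence:
    """Toy fence: a monotonically increasing counter in shared memory."""

    def __init__(self):
        self.value = 0  # stands in for a CPU-mapped word the GPU writes to

    def signal(self, value):
        self.value = value


def wait_fence(fence, target, timeout_s):
    """Spin until fence.value >= target; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while fence.value < target:
        if time.monotonic() >= deadline:
            # A caller would fall back to a blocking kernel wait here,
            # for example on an event file descriptor.
            return False
        time.sleep(0)  # yield the thread; a native spin loop would use a CPU pause hint
    return True
```

For example, starting `threading.Timer(0.01, fence.signal, args=(1,))` and then calling `wait_fence(fence, 1, timeout_s=1.0)` returns `True` once the timer thread has written the fence value, without any kernel notification being modeled.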

averne_ · 2025-08-26T12:30:43 1756211443

It actually doesn't make much difference: https://chipsandcheese.com/i/138977378/decoder-differences-a...

chasil · 2025-08-26T14:22:19 1756218139

I had not realized that Apple did not implement any of the 32-bit ARM environment, but that cuts the legs out of this argument in the article:

"In Anandtech’s interview, Jim Keller noted that both x86 and ARM both added features over time as software demands evolved. Both got cleaned up a bit when they went 64-bit, but remain old instruction sets that have seen years of iteration."

I still say that x86 must run two FPUs all the time, and that has to cost some power (AMD must run three - it also has 3dNow).

Intel really couldn't resist adding instructions with each new chip (MMX, PAE for 32-bit, many more on this shorthand list that I don't know), which are now mostly baggage.

theevilsharpie · 2025-08-26T19:04:32 1756235072

> I still say that x86 must run two FPUs all the time, and that has to cost some power (AMD must run three - it also has 3dNow).

Legacy floating-point and SIMD instructions exposed by the ISA (and extensions to it) don't have any bearing on how the hardware works internally.

Additionally, AMD processors haven't supported 3DNow! in over a decade -- K10 was the last processor family to support it.

chasil · 2025-08-31T02:27:36 1756607256

80-bit x87 has no bearing on SSE implementation.

Right. Not.

daeken · 2025-08-26T13:13:07 1756213987

Oh wow, I need to dig way deeper into this but wonderful resource - thanks!

averne_ · 2025-08-22T19:29:33 1755890973

Do you have a link for that? I'm the guy working on the Vulkan ProRes decoder mentioned as "in review" in this changelog, as part of a GSoC project.

I'm curious wrt how a WebGPU implementation would differ from Vulkan. Here's mine if you're interested: https://github.com/averne/FFmpeg/tree/vk-proresdec

dtf · 2025-08-22T19:39:51 1755891591

I don't have a link to hand right now, but I'll try to put one up for you this weekend. I'm very interested in your implementation - thanks, will take a good look!

Initially this was just a vehicle for me to get stuck in and learn some WebGPU, so no doubt I'm missing lots of opportunities for optimisation - but it's been fun as much as frustrating. I leaned heavily on the SMPTE specification document and the FFMPEG proresdec.c implementation to understand and debug.

averne_ · 2025-08-22T19:46:03 1755891963

No problem, just be aware there's a bunch of optimizations I haven't had time to implement yet. In particular, I'd like to remove the reset kernel, fuse the VLD/IDCT ones, and try different strategies and hw-dependent specializations for the IDCT routine (AAN algorithm, packed FP16, cooperative matrices).
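
For readers unfamiliar with the IDCT step mentioned above, here is a plain floating-point reference for a separable 8x8 inverse DCT in Python. It is a sketch under assumptions: it uses the textbook orthonormal definition, it is not the AAN algorithm (which is a faster factorization of the same transform), and it does not model ProRes-specific dequantization, scaling, or bit depth. The function names are hypothetical and not taken from FFmpeg.

```python
import math

N = 8  # block size


def idct_1d(coeffs):
    """Orthonormal 8-point inverse DCT (DCT-III), direct O(N^2) form."""
    out = []
    for n in range(N):
        acc = 0.0
        for k in range(N):
            scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            acc += scale * coeffs[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(acc)
    return out


def idct_2d(block):
    """Separable 2D IDCT: transform each row, then each column."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # cols[c][r] holds the sample at row r, column c; transpose back.
    return [[cols[c][r] for c in range(N)] for r in range(N)]
```

A reference like this is mostly useful as a correctness baseline: a faster variant (AAN, packed FP16, or a matrix-multiply formulation) can be compared against it on the same coefficient blocks, within a rounding tolerance.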