The blog makes it clear that "standard" GPU here is in opposition to purpose-built hardware like Cerebras. The selling point is reaching the same order of magnitude in generative speed as those approaches.
I wrote the Vulkan ProRes backend. The bitstream decoder was implemented from scratch, for a number of reasons.
First, the original code was reverse-engineered, before Apple published an SMPTE document describing the bitstream syntax. Second, I tried my best at optimizing the code for GPU hardware. And finally, I wanted take the learning opportunity :)
>Fabrice won International Obfuscated C Code Contest three times and you need a certain mindset to create code like that—which creeps into your other work. So despite his implementation of FFmpeg was fast-working, it was not very nice to debug or refactor, especially if you’re not Fabrice
Not OP but I also often to listen to ambient while programming. A couple recommendations would be "Music for Nine Post Cards" and other works by Hiroshi Yoshimura, and "Music for 18 musicians" and others by Steve Reich.
In fact, the use of loops described in this article reminded me of what Reich called "phases", basically the same concept of emerging/shifting melodic patterns between different samples.
New physics in this context means previously unknown effects or mechanisms, or even a new theory/framework for an already understood phenomenon. Using "physics" in this way is common amongst academics.
The main reason a wafer scale chip works there is because their cores are extremely tiny, and silicon area that gets fused off in the event of a defect is much lower than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.
The NVidia driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it leads to a significant simplification or not of the userland code, basically a driver has to keep track of the same thing it would've submitted to an ioctl. If anything there are some subtleties that require careful consideration.
The major upside is removing the context switch on a submission. The idea is that an application only talks to the kernel for queue setup/teardown, everything else happens in userland.
Yep. Future of GPU hardware programming? The one we will have to "standard"-ized à la RISC-V for CPUs?
The thing are the vulkan "fences", namely the GPU to CPU notifications. Probably hardware interrupts which will have to be forwarded by the kernel to the userland for an event ring buffer (probably a specific event file descriptor). There are alternatives though: we could think of userland polling/spinning on some cpu-mapped device memory content for notification or we could go one "expensive" step further which would "efficiently" remove the kernel for good here but would lock a CPU core (should be fine nowdays with our many cores CPUs): something along the line of a MONITOR machine instruction, basically a CPU core would halt until some memory content is written, with the possibility for another CPU core to un-halt it (namely spurious un-halting is expected).
Does nvidia handle their GPU to CPU notifications without the kernel too?
I had not realized that Apple did not implement any of the 32-bit ARM environment, but that cuts the legs out of this argument in the article:
"In Anandtech’s interview, Jim Keller noted that both x86 and ARM both added features over time as software demands evolved. Both got cleaned up a bit when they went 64-bit, but remain old instruction sets that have seen years of iteration."
I still say that x86 must run two FPUs all the time, and that has to cost some power (AMD must run three - it also has 3dNow).
Intel really couldn't resist adding instructions with each new chip (MMX, PAE for 32-bit, many more on this shorthand list that I don't know), which are now mostly baggage.
I don't have a link to hand right now, but I'll try to put one up for you this weekend. I'm very interested in your implementation - thanks, will take a good look!
Initially this was just a vehicle for me to get stuck in and learn some WebGPU, so no doubt I'm missing lots of opportunities for optimisation - but it's been fun as much as frustrating. I leaned heavily on the SMPTE specification document and the FFMPEG proresdec.c implementation to understand and debug.
No problem, just be aware there's a bunch of optimizations I haven't had time to implement yet. In particular, I'd to remove the reset kernel, fuse the VLD/IDCT ones, and try different strategies and hw-dependent specializations for the IDCT routine (AAN algorithm, packed FP16, cooperative matrices).
reply