
I had the exact same reaction!


Just curious, what lower level options are there? Inline assembly in OpenCL?


I never touched it myself, but AMD once exposed the "HSA" interface, which is the AMD GPU execution engine.

ROCm / HIP and OpenCL are built on top of that HSA level. I don't think any docs exist for it, but you can see a ton of references to HSA stuff if you browse the ROCm source code.

--------

It sounds like the parent post discusses details about how AMD GPUs "pick" the next kernel to run. I've been told that the AMD GPU is very advanced at this, but no adequate interface has ever been exposed to the programmer.


Correct; ROCm/HIP/OpenCL are all built on the HSA APIs, with a few AMD-specific extensions.

AMDGPU's command processor is exposed to you (the CP is what invokes "kernels" GPU-side): create signals (essentially just an atomic u64 in host memory, with a few extra bells and whistles to support interrupts) and use them in one of the barrier packet types in an HSA device queue. With these (with one sad caveat that work packets can't have deps themselves :( ) you can enqueue most computation graphs, and the CP will handle waiting for signals without any CPU involvement. Plus, GPU kernels can also concurrently write to this queue (though you can't create signals GPU-side...)

AMDGPUs are like shared memory machines, which I think is really cool.


> you can enqueue most computation graphs

Can those command graphs loop?

I know this interface is undocumented... but I've had an idea for a GPU language with Java- or Lisp-style memory management. The gist is that kernel_X() can execute, but may fail at any new() or malloc() call.

In such a case, I'd want the compute-graph to loop: while (kernel_X fails due to out-of-memory){ garbage_collect(); try kernel_X() again}.

-------

Not that I have the time to experiment with something like this, but I guess I've been curious to know if that sort of thing can even work.


> Can those command graphs loop?

Not directly (barrier packets wait for 0 only, plus the queue packets aren't preserved), but kernels can write to any dispatch queue themselves, so you can get the same effect at the end of your "loop body".

> I know this interface is undocumented... but I've had an idea for a GPU-language akin to Java or Lisp memory-management. The gist is that kernel_X() can execute, but may fail in any new() or malloc() command.

Memory allocation isn't special, and allocators can be layered: you can allocate memory ahead of time and then just run the allocation algorithms GPU-side. I wrote a Rust framework which cross-compiles code/MIR on demand; you can in theory have a Rust allocator and use it to allocate GPU/CPU memory from either the GPU or the CPU. The only part the GPU can't do (directly) is invoke syscalls, which, as you can probably guess, is the part needed to allocate virtual memory from the OS.

But as long as your allocator has enough spare virtual memory, it shouldn't need to do a syscall. And if you /really/ needed the ability GPU side, technically with signals you can actually just ask the CPU to allocate the virtual memory on the GPU's behalf and have the GPU spin until the allocation is "complete". Or with compiler support: automatically make the workgroup/kernel async and resume execution by enqueuing another kernel, but that sort of thing is kinda hard :).

Btw, the Rust framework is here: https://github.com/geobacter-rs/geobacter. I mostly work on it in my spare time, a scarce resource these days, so I admit it has some scuff.

> In such a case, I'd want the compute-graph to loop: while (kernel_X fails due to out-of-memory){ garbage_collect(); try kernel_X() again}.

Pretty much. Even garbage collection can (theoretically lol) happen on the GPU.


> Pretty much. Even garbage collection can (theoretically lol) happen on the GPU.

Oh, that's the plan. Semispace collection is very clearly a problem that can be solved in parallel: https://en.wikipedia.org/wiki/Cheney%27s_algorithm

That's just a breadth-first traversal over the fromspace. That's like... GPU programming 101 level material there. It's obviously parallel.

That's why Java/Lisp is the model I'm using, because they use semispace malloc / semispace garbage collection. 100% GPU-side malloc / garbage collection.

Nothing that I'm working on for real, mostly just theory-craft. But fun to think about in my spare time. I would expect that semispace garbage collection and allocation would be very efficient on GPUs, and could serve as the basis of some higher-level abstraction.

-------

The "while loop" would need a custom compiler to emit the trampoline / continuation, so that the kernel knows how to "restart" itself in cases where the malloc() fails and garbage collection was run.

Kernels exiting serves as the innate synchronization point, the "synchronized stop" in the stop-the-world garbage collection schemes.

If I could write a routine that saves off every "malloc" as a possible "continuation" point (possibly saving that information in a queue-data structure or stack-data structure of some kind), then it probably would work.


I use LLVM directly. The LLVM AMDGPU target machine is supported by AMD and they use it internally in HIP/OpenCL.

I don't think OpenCL should be used going forward; it's not really platform-independent. And SPIR-V... kinda sucks tbh. Plus, where's my single-source stuff (a la CUDA)?

