
Pretty much. Compilers can do a lot more than people give them credit for. At least AMD document their hardware, so it is actually possible to know low-level details. PTX can obfuscate that surprisingly badly for nvidia targets.


The CUDA C APIs are supported just as well in C as in C++ when using SCALE!

CUDA Fortran is not currently supported by SCALE, since we haven't seen much use of it "in the wild" to push it up our priority list.


It doesn't matter though. NVIDIA distributes tons of libraries built atop CUDA that you cannot distribute or use on AMD chips legally. Cutlass, CuBLAS, NCCL, etc.


SCALE doesn't use cuBlas and friends. For those APIs, it uses either its own implementations of the functions, or delegates to an existing AMD library (such as rocblas).

It wouldn't even be technically possible for SCALE to distribute and use cuBlas, since the source code is not available. I suppose maybe you could distribute cuBlas and run it through ZLUDA, but that would likely become legally troublesome.


> SCALE doesn't use cuBlas and friends. For those APIs, it uses either its own implementations of the functions, or delegates to an existing AMD library (such as rocblas).

And this is the problem. I guarantee you NVIDIA has more engineers working on cuBLAS et al than AMD does.

The NVIDIA moat is not CUDA the language or CUDA the library. It's CUDA the ecosystem. That means things like all the high performance libraries; all the high performance libraries with clustering support (does AMD even have a clustering solution like NVLink -- everyone forgets that NVIDIA also does high speed networking); all the high perf appliances (everyone also forgets that NVIDIA sells entire systems, not just GPUs); all the high perf servers (Triton inference server, etc). We can go on.

I commend the project volunteers for what they've done, but I would recommend getting VC money and competing directly with NVIDIA.


Correct, which is one of the main moats Nvidia has when it comes to training.


NCCL is open source and permissively licensed.


Sure and it works with NVIDIA GPUs only AFAIK.


Not any more ;)


Greetings, I work on SCALE.

It appears we implemented `--threads` but not `-t` for the compiler flag. Oops. In either case, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.

That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.

NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available.

bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementation just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.

cublasLt is a fair cop. We've got a ticket :D.


Hi, why do you believe that bfloat16 is not supported? Can you please provide some references (specifically the part about the hardware "doesn't do it")?

For the hardware you are focussing on (gfx11), the reference manual [2] and the list of LLVM gfx11 instructions supported [1] describe the bfloat16 vdot & WMMA operations, and these are in fact implemented and working in various software such as composable kernels and rocBLAS, which I have used (and can guarantee they are not simply being run as float). I've also used these in the AMD fork of llm.c [3]

Outside of gfx11, I have also used bfloat16 in CDNA2 & 3 devices, and they are working and being supported.

Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Cheers, -A

[1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html

[2] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...

[3] http://github.com/anthonix/llm.c


> Hi, why do you believe that bfloat16 is not supported?

Apologies, I appear to be talking nonsense. I conflated bfloat16 with nvidia's other wacky floating point formats. This is probably my cue to stop answering reddit/HN comments and go to bed. :D

So: ahem: bfloat16 support is basically just missing the fairly boring header.
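
For the curious, here's a minimal sketch (emphatically not our actual header) of the boring part that's missing. bf16 is just the top half of an IEEE-754 float, so the conversions are bit-twiddling (NaN handling omitted for brevity):

```
// Hedged sketch of a cuda_bf16.h-style shim, device-side only.
struct __nv_bfloat16 {
    unsigned short x;  // raw bf16 bits
};

__device__ inline __nv_bfloat16 __float2bfloat16(float f) {
    unsigned int bits = __float_as_uint(f);
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // round to nearest even before truncating
    return __nv_bfloat16{static_cast<unsigned short>(bits >> 16)};
}

__device__ inline float __bfloat162float(__nv_bfloat16 h) {
    return __uint_as_float(static_cast<unsigned int>(h.x) << 16);
}
```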

> Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Pretty much that, yes. Not much point reimplementing all the math libraries when AMD is doing that part of the legwork already.


OK, so in the case of llm.c, if you're just including the HIP headers, using hipblasLt, etc, what would be the benefit of using scale instead of hipify?


Seems like a big benefit would come from not forking the codebase into two versions!


It works exactly as well as other AMDGPU-related software (HIP etc.) works inside Docker.

There are some delightful AMD driver issues that make certain models of GPU intermittently freeze the kernel when used from docker. That was great fun when building SCALE's CI system :D.


Would love to give it a try! Thanks for answering my question.


[I work on SCALE]

CUDA has a couple of extra problems beyond just any other programming language:

- CUDA is more than a language: it's a giant library (for both CPU and GPU) for interacting with the GPU, and for writing the GPU code. This needed reimplementing (see the toy example below). At least for the device-side stuff we can implement it in CUDA, so when we add support for other GPU vendors the code can (mostly) just be recompiled and work there :D.

- CUDA (the language) is not actually specified. It is, informally, "whatever nvcc does". This differs significantly from what Clang's CUDA support does (which is ultimately what the HIP compiler is derived from).
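
To make the first point concrete, here's a toy example (nothing SCALE-specific, just garden-variety CUDA) showing the two halves that need covering: the host-side runtime API and the device-side language.

```
#include <cuda_runtime.h>
#include <cstdio>

// Device-side language: kernel syntax, builtin thread indices.
__global__ void scale_kernel(float* data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main() {
    float host[256];
    for (int i = 0; i < 256; ++i) host[i] = float(i);

    // Host-side runtime API: allocation, copies, kernel launch.
    float* dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    scale_kernel<<<1, 256>>>(dev, 2.0f);
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("%f\n", host[7]);  // expect 14.0
    return 0;
}
```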

PTX is indeed vastly annoying.


The openmp device runtime library was originally written in cuda. I ported that to hip for amdgpu, discovered the upstream hip compiler wasn't quite as solid as advertised, then ported it to openmp with some compiler intrinsics. The languages are all essentially C++ syntax with some spurious noise obfuscating llvm IR. The libc effort has gone with freestanding c++ based on that experience and we've now mostly fixed the ways that goes wrong.

You might also find raw c++ for device libraries saner to deal with than cuda. In particular you don't need to jury rig the thing to not spuriously embed the GPU code in x64 elf objects and/or pull the binaries apart. Though if you're feeding the same device libraries to nvcc with #ifdef around the divergence your hands are tied.


> You might also find raw c++ for device libraries saner to deal with than cuda.

Actually, we just compile all the device libraries to LLVM bitcode and be done with it. Then we can write them using all the clang-dialect, not-nvcc-emulating, C++23 we feel like, and it'll still work when someone imports them into their c++98 CUDA project from hell. :D



You're right that most people only use a small subset of cuda: we prioritised support for features based on what was needed for various open-source projects, as a way to try to capture the most common things first.

A complete API comparison table is coming soon, I believe. :D

In a nutshell:

- DPX: Yes.

- Shuffles: Yes. Including the PTX versions, with all their weird/wacky/insane arguments. (Example below.)

- Atomics: Yes, except the 128-bit atomics nvidia added very recently.

- MMA: In development, though of course we can't fix the fact that nvidia's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.

- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.
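
As a concrete illustration of the shuffle support (a bog-standard pattern, not code taken from SCALE itself), a warp sum reduction like this is the kind of thing that's covered; it assumes a 32-wide warp:

```
// Garden-variety warp sum reduction via __shfl_down_sync.
// Nothing SCALE-specific; the compiler has to lower this to the
// equivalent wave-wide data movement on AMD hardware.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 ends up with the sum across all 32 lanes
}
```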

> mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.

We plan to publish a compatibility table of which instructions are supported, but a list of the instructions used to produce each PTX instruction is not in general meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis due to the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns that this implies!

We're documenting which ones are expected to perform worse than on NVIDIA, though!


Have you seen anyone productively using TMA on Nvidia or async instructions on AMD? I’m currently looking at a 60% throughput degradation for 2D inputs on H100: https://github.com/ashvardanian/scaling-democracy/blob/a8092...


> You're right that most people only use a small subset of cuda

This is true first and foremost for the host-side API. From my StackOverflow and NVIDIA forums experience - I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, with issues which nobody seems to have stumbled onto before; or at least - not stumbled and wrote anything in public about it.


Oh yes, we found all kinds of bugs in Nvidia's cuda implementation during this project :D.

There's a bunch of pretty obscure functions in the device side apis too: some esoteric math functions, old simd "intrinsics" that are mostly irrelevant with modern compilers, etc.


[I work on SCALE]

Mapping inline ptx to AMD machine code would indeed suck. Converting it to LLVM IR right at the start of compilation (when the initial IR is being generated) is much simpler, since it then gets "compiled forward" with the rest of the code. It's as if you wrote C++/intrinsics/whatever instead.

Note that nvcc accepts a different dialect of C++ from clang (and hence hipcc), so there is in fact more that separates CUDA from hip (at the language level) than just find/replace. We discuss this a little in [the manual](https://docs.scale-lang.com/manual/dialects/)

Handling differences between the atomic models is, indeed, "fun". But since CUDA is a programming language with documented semantics for its memory consistency (and so is PTX) it is entirely possible to arrange for the compiler to "play by NVIDIA's rules".


Huh. Inline assembly is strongly associated in my mind with writing things that can't be represented in LLVM IR, but in the specific case of PTX - you can only write things that ptxas understands, and that probably rules out wide classes of horrendous behaviour. Raw bytes being used for instructions and for data, ad hoc self modifying code and so forth.

I believe nvcc is roughly an antique clang build hacked out of all recognition. I remember it rejecting templates with 'I' as the type name and working when changing to 'T', nonsense like that. The HIP language probably corresponds pretty closely to clang's cuda implementation in terms of semantics (a lot of the control flow in clang treats them identically), but I don't believe an exact match to nvcc was considered particularly necessary for the clang -x cuda work.

The ptx to llvm IR approach is clever. I think upstream would be game for that, feel free to tag me on reviews if you want to get that divergence out of your local codebase.


I certainly would not attempt this feat with x86 `asm` blocks :D. PTX is indeed very pedestrian: it's more like IR than machine code, really. All the usual "machine-level craziness" that would otherwise make this impossible is just unrepresentable in PTX (though you do run into cases of "oopsie, AMD don't have hardware for this so we have to do something insane").


It's a beautiful answer to a deeply annoying language feature. I absolutely love it. Yes, inline asm containing PTX definitely should be burned off at the compiler front end, regardless of whether it ultimately codegens as PTX or something else.

I've spawned a thread on the llvm board asking if anyone else wants that as a feature https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-feature... in the upstream. That doesn't feel great - you've done something clever in a proprietary compiler and I'm suggesting upstream reimplement it - so I hope that doesn't cause you any distress. AMD is relatively unlikely to greenlight me writing it so it's probably just more marketing unless other people are keen to parse asm in string literals.


nvcc is nowhere near that bad these days; it supports most C++ code directly (for example, I've written kernels that include headers like <span> or <algorithm> and they work just fine).


NVCC is doing much better than before in terms of "broken C++". There was indeed a time when lots of modern C++ just didn't work.

Nowadays the issues are more subtle and nasty. Subtle differences in overload resolution. Subtle differences in lambda handling. Enough to break code in "spicy" ways when you try to port it over.


What do you think the source of this is? My understanding was that Nvidia is basically adopting the clang frontend wholesale now so I'm curious where it differs.


The LLVM manual touches on some of the basics of why: https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-diffe...


Hi! Spectral engineer here!

SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm block to LLVM IR.

To put in a less compiler-engineer-ey way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.

Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).

Differing consistency _between architectures_ is the AMDGPU backend's problem.
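
For illustration only (not actual SCALE output): a PTX acquire load such as `ld.acquire.gpu.u32` has documented semantics no weaker than a device-scope C++ acquire load, so emitting something at least this strong is always correct:

```
// Sketch: lowering an acquire-ordered load using the GCC-style atomic
// builtins that clang accepts in device code.
__device__ inline unsigned int load_acquire(const unsigned int* addr) {
    return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}
```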


Ah, I was reading the 'deeper dive' section on my phone and missed that it was a comparison, not a warning. Thank you.

I'm curious how something like this example would translate:

===

Mapping lower-level ptx patterns to higher-level AMD constructs like __ballot, and knowing it's safe

```

  #ifdef INLINEPTX
  inline uint ptx_thread_vote(float rSq, float rCritSq) {
      uint result = 0;
      asm("{\n\t"
           ".reg .pred cond, out;\n\t"
           "setp.ge.f32 cond, %1, %2;\n\t"
           "vote.sync.all.pred out, cond, 0xffffffff;\n\t"
           "selp.u32 %0, 1, 0, out;\n\t"
           "}\n\t"
           : "=r"(result)
           : "f"(rSq), "f"(rCritSq));
      return result;
  }
  #endif
```

===

Again, I'm guessing there might be an equiv simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for safely doing so.

===

FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, etc libs), minimal cuda, and minimal opencl, with only small traces of inline ptx. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable.


I compiled your function with SCALE for gfx1030:

        .p2align        2                               ; -- Begin function _Z15ptx_thread_voteff
        .type   _Z15ptx_thread_voteff,@function
  _Z15ptx_thread_voteff:                  ; @_Z15ptx_thread_voteff
  ; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        v_cmp_ge_f32_e32 vcc_lo, v0, v1
        s_cmp_eq_u32 vcc_lo, -1
        s_cselect_b32 s4, -1, 0
        v_cndmask_b32_e64 v0, 0, 1, s4
        s_setpc_b64 s[30:31]
  .Lfunc_end1:
        .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                                        ; -- End function


What were the safety concerns you had? This code seems to be something like `return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0`, right?


It's supposed to be waiting for all threads to vote

I'm not familiar with AMD enough to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD commands.


Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!
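
For reference, a hedged sketch of the intrinsic-only version (hypothetical helper name):

```
// Every active lane in the warp votes on the predicate; the result is 1
// only if all of them satisfy it.
__device__ inline unsigned int thread_vote_intrinsic(float rSq, float rCritSq) {
    return __all_sync(0xffffffffu, rSq >= rCritSq) ? 1u : 0u;
}
```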


Yeah I think, after this snippet was written, cuda added __all_sync as an intrinsic. The divergent code before this was plain-ish cuda, and this snippet ensures they wait on the comparison vote before recurring.

So in the AMD version, the compiler correctly realized the synchronization was on the comparison, so adds the AMD version right before it. That seems like a straightforward transform here.

It'd be interesting to understand the comparison of what Nvidia primitives map vs what doesn't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, I'd expect most of our PTX to port for similar reasons. The story is a bit diff for libraries we call. E.g., cudf probably has little compute-tier ptx directly, but will call nvidia libs, and use weird IO bits like cufile / gpu direct storage.


Just to check here, if you're given something like the following PTX:

  wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?


Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.

But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!


What do you do when a builtin doesn't exist?


Add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend if an instruction exists and the backend knows about it.

If there's no instruction, either, you can write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (depends on the situation).
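
As a hedged illustration (not our actual code): take one of those old SIMD intrinsics mentioned earlier. If a target had no packed-halfword add instruction, `__vadd2` could be replicated in plain C++ like this and simply inlined at the call site:

```
// Per-halfword addition of two packed 16-bit lanes, with no carry between
// the lanes -- the documented behaviour of CUDA's __vadd2.
__device__ inline unsigned int emulated_vadd2(unsigned int a, unsigned int b) {
    unsigned int lo = (a + b) & 0x0000FFFFu;                                  // low lane, carry discarded
    unsigned int hi = ((a & 0xFFFF0000u) + (b & 0xFFFF0000u)) & 0xFFFF0000u;  // high lane, independent of low
    return hi | lo;
}
```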


AMD have "MIOpen" which is basically cuDNN-for-AMD. Ish.


And that thing has been left unreleased on Windows for almost a whole year, for unknown reasons. Even though there is activity on GitHub and build fixes land frequently, there is just no .exe or .msi for you to download. In fact, ROCm for Linux is on a major 6 release (which includes MIOpen), but somehow Windows is still on major 5 (without MIOpen) after almost a whole year.

It almost makes me wonder: is there a shady deal somewhere asking AMD to never release the SDK for Windows, to push the price of NVIDIA cards higher? Why do they keep developing these things without releasing them at all?


Since they cancelled the work on ZLUDA and have absolutely failed to do anything about the other options, I really believe there's some "don't do it or you'll get sued to hell and back" agreement. They can't be so dumb that they just miss it by accident.


probably because their focus is on data centers that mostly run linux

