One "CUDA core" is indeed one GPU thread. The lane of a GPU SIMD is nothing like CPU SIMD, and can independently branch (even if that branching can be much more expensive than on a CPU).
This is not true, just as an AMD "shader core" was not a GPU thread.
For example, the HD 2900 XT had 320 shader cores, but since it used a VLIW-5 ISA, that corresponds to only 64 GPU threads.
Similarly, an RTX 3080 has 8704 CUDA cores, but there are 2 FP32 ALUs per thread, resulting in 4352 threads across 68 SMs, so, just like Turing, there are 64 threads per SM.
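The arithmetic above can be sanity-checked in a few lines (the per-thread ALU count is the commenter's premise, not an official NVIDIA figure):

```python
# Back-of-the-envelope check of the RTX 3080 numbers.
cuda_cores = 8704          # marketing figure for the RTX 3080
fp32_alus_per_thread = 2   # premise: Ampere has two FP32 pipes per thread
sms = 68                   # SM count of the RTX 3080

threads = cuda_cores // fp32_alus_per_thread
print(threads)             # 4352 threads
print(threads // sms)      # 64 threads per SM, matching Turing
```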
To me, the philosophy of C (and even C++ before things became nuts) is that you should be able to reasonably guess the assembly code resulting from the C code, the idea being that C lets you write what amounts to assembly with much less typing.
These days, things seem to be moving in a more dogmatic direction with the underlying assumption that the vast majority of programmers are bad programmers.
You're fighting a losing battle, though. You can guess the rough shape of the assembly, but unless you have measured and identified a spot where you can make progress, your brain simply cannot keep track of all the little nuances of the optimizations a compiler performs.
It's worth adding that the compilers themselves aren't perfect: you can use LLVM's own machine model to try to predict how your loop will perform (llvm-mca), and last time I checked it wasn't amazingly accurate.
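For reference, a typical way to feed a loop to llvm-mca is to pipe compiler-generated assembly into it (paths and the target CPU here are illustrative; requires an LLVM toolchain with `clang` and `llvm-mca` installed):

```shell
# Compile to assembly on stdout, then analyze the hot loop with llvm-mca.
# -mcpu selects the scheduling model used for the throughput estimate.
clang -O2 -S -o - loop.c | llvm-mca -mcpu=skylake -timeline
```

The report estimates cycles per iteration and port pressure, which you can then compare against an actual measurement to see how far off the model is.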
You can predict a pseudo-assembly output from a mental model of the architecture(s) you're targeting. It doesn't have to be an exact match, register allocation and all.