In short, it's not an accident or incompetence that aspects of current desktop GPU execution models (e.g., thread blocks, scratchpad shared memory) are not exposed in Renderscript. It's a conscious decision to make sure you can get decent performance on not only those GPUs, but ARMv5-v8 CPUs (with and without SIMD instructions), x86, DSPs, etc. Getting good performance on these platforms from a language that does expose these constructs (e.g., CUDA) is still an open research problem (see MCUDA http://impact.crhc.illinois.edu/mcuda.aspx and friends).
Though Renderscript aims to achieve decent performance on a huge variety of platforms, even if it targeted only mobile GPUs, the major contenders (Imagination, ARM, Samsung, Qualcomm, NVIDIA) have wildly different architectures, and a language that is close to the metal on one will present a huge impedance mismatch on the others. Note that things are sufficiently different from desktop GPU design that we're just now seeing SoCs come out that support OpenCL (in hardware, that is; driver support seems to be lagging), and you can't run CUDA on Tegra 4.
Pretty much exactly this. Performance portability is our main concern, and we are willing to trade off some peak performance to get it because of how badly you will hurt yourself on different architectures. We are trying to solve low-hanging problems first before attacking more complex algorithms.
So if I read this correctly, you're effectively trying to solve the hybrid computing problem that everyone else is working on too (and the results so far are pretty disappointing IMO).
To which I have to respond that better is often the enemy of good enough.
I'd personally rather have a relatively OK solution like OpenCL in my hands today than a currently nonexistent ideal solution at some vague point in the future. Smart programmers will overcome hardware limitations all on their own and dumb programmers will trip you up no matter how much you rabbit-proof their fences IMO.
It's easy to see why OpenCL hasn't rolled out fully on mobile GPUs yet: writing and debugging a full OpenCL software stack is very expensive and time-consuming, and there's still not that much real programmer demand for OpenCL on mobile.
As for Renderscript, it's always sounded like a bit of "not invented here" syndrome on Google's part -- we've already got CUDA and OpenCL, and RS doesn't really bring much new to the table. They've already deprecated the 3D graphics part of Renderscript in Android 4.1, so perhaps they'll do the same to Renderscript Compute soon.
I'd much rather see Google invest their time in an Android version of something like the Accelerate API from iOS. It would be a lot more generally useful.
I suspect that as soon as Apple exposes OpenCL in any way on iOS, Android will shortly follow. Likewise, if Mozilla exposes WebCL in Firefox, Chrome will shortly follow. What I don't expect is for them to take the lead in doing so.
Say what you want of OpenCL/CUDA, but what other language smoothly subsumes SIMD, multi-threading, and multi-core awareness? I expected it to already be available on smart phones by now. What's taking so long?
If someone with that level of experience can find so many flaws so quickly, why aren't people with that level of domain knowledge brought in when the API is originally being developed? Or, if they are, why isn't there released documentation on why the API isn't as good as they wish it could be?
I worked on CUDA at NVIDIA for over four years and was the primary API designer for a large part of that time. I started on RS at Google in September.
Basically, he gives us too little credit for the execution model (it's young, it's improving very quickly and is not at all designed to emulate anything else that exists today) and assumes that GPU compute has the same tradeoffs on mobile as desktop (it doesn't at all). You'll see more from us soon.
Hi. Author here. Your name is quite famous in the GPGPU community, and it is great to hear that you now work on RSC. My experience does not compare to yours and I do hope my post is seen in a positive light. Would love to discuss the issues in depth sometime.
Anyway, if you were to ignore everything in the post except one item, that would be to please fix gather/scatter in RSC. A parallel computing API without proper gather/scatter is simply not very useful, irrespective of whether it is on desktop or mobile.
I will keep following RSC and look forward to the developments you are hinting at.
Desktop: high-end consumer GPUs have about 10-15x the single-precision FLOPs and 4-6x the bandwidth of a single Intel CPU socket. At this point, usually connected via PCIe Gen3. There are two real vendors (NVIDIA and AMD), and what comprises a system is generally the same (CPU + some number of GPUs).
Mobile: GPU has 3-5x the FLOPs of the CPU and no bandwidth advantage because of the shared memory pool between CPU and GPU. GPUs have very wide ranges of functionality. Even the CPUs behave very differently (Krait in Nexus 4 sometimes chews through code that the A15 in Nexus 10 chokes on and vice-versa). What comprises a system varies tremendously--CPU, CPU + GPU, CPU + GPU + other processors, etc.
A developer shouldn't be expected to have to tune for 20 different processors and system architectures in order to ship an app on the Android market. That's the problem we're trying to solve, not simply exposing access to GPU compute.
I suspect that at least some of these "flaws" are intentional, and are meant to make programming easier, at the expense of some performance.
For example, three of the poster's points (not allowing device property querying, not allowing the programmer to choose where a kernel runs, and not exposing local memory to the programmer) all make programming easier, though they also disallow some types of performance tuning.
One big potential reason for doing GPGPU on a mobile device is to get better energy efficiency per gigaflop, rather than to get huge overall performance like on a desktop GPGPU. In this context, squeezing out all possible performance may not be as important.
I think that Renderscript is not meant as a replacement for native C++ code. Rather, it's a platform-independent and easy way to give a programmer more performance headroom (beyond Java).
I guess that if you need real performance or more control you'll have to go the NDK route anyway. But if you just want to write another Instagram clone then Renderscript is the way to go.