In short, it's not an accident or incompetence that aspects of current desktop GPU execution models (e.g., thread blocks, scratchpad shared memory) are not exposed in Renderscript. It's a conscious decision to make sure you can get decent performance not only on those GPUs, but also on ARMv5-v8 CPUs (with and without SIMD instructions), x86, DSPs, etc. Getting good performance on these platforms from a language that does expose these constructs (e.g., CUDA) is still an open research problem (see MCUDA http://impact.crhc.illinois.edu/mcuda.aspx and friends).
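To make the contrast concrete, here is a schematic sketch in Python (a stand-in, not real RenderScript or CUDA code): a flat per-element kernel, which is roughly the model RenderScript exposes, versus an explicitly tiled kernel in the CUDA style, where the block size and the per-block scratch buffer bake hardware assumptions into the program.

```python
def flat_map_kernel(data, f):
    """RenderScript-style: one independent work item per element.
    No decomposition is specified, so the runtime is free to map
    this onto a CPU, a GPU, or a DSP however it likes."""
    return [f(x) for x in data]

def tiled_kernel(data, f, block_size=256):
    """CUDA-style: the programmer picks a thread-block size and stages
    each tile through a scratch buffer (standing in for __shared__
    memory). block_size=256 is an assumption tuned to one particular
    GPU; it may be a poor fit for a CPU's cache or another vendor's GPU."""
    out = []
    for start in range(0, len(data), block_size):
        scratch = data[start:start + block_size]  # cooperative staging
        out.extend(f(x) for x in scratch)
    return out
```

Both versions compute the same result; the point is that the second one encodes a hardware-specific decomposition that a portable runtime would then have to undo or re-tune on every other architecture, which is exactly the problem MCUDA and friends are wrestling with.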
Though Renderscript aims to achieve decent performance on a huge variety of platforms, even if the team only cared about mobile GPUs, the major contenders (Imagination, ARM, Samsung, Qualcomm, NVIDIA) have wildly different architectures, and a language that is close to the metal on one will present a huge impedance mismatch on the others. Note that things are sufficiently different from desktop GPU design that we're only now seeing SoCs come out that support OpenCL (in hardware; driver support seems to be lagging), and you can't run CUDA on Tegra 4.
Pretty much exactly this. Performance portability is our main concern, and we are willing to trade off some peak performance to get it because of how badly you will hurt yourself on different architectures. We are trying to solve low-hanging problems first before attacking more complex algorithms.
So if I read this correctly, you're effectively trying to solve the hybrid computing problem that everyone else is working on too (and the results so far are pretty disappointing IMO).
To which I have to respond that better is often the enemy of good enough.
I'd personally rather have a relatively OK solution like OpenCL in my hands today than a currently nonexistent ideal solution at some vague point in the future. Smart programmers will overcome hardware limitations all on their own and dumb programmers will trip you up no matter how much you rabbit-proof their fences IMO.