
I remember reading a blog post about the peculiarities of GPU programming which noted that, for most modern graphics cards (at the time), if you could keep your data in chunks no bigger than 64 KB apiece you could expect enormous performance gains on top of what you'd already see from using OpenCL/CUDA, because of a physical memory limit on the GPU itself.

I also remember thinking that a 64 KB row size limit for DynamoDB was very odd.

I wonder if these things are at all related.



You are talking about shared memory in CUDA (local memory in OpenCL). When you are reading from the same location over and over again (the most notable cases being linear algebra routines and filtering in signal processing), reading from DRAM is costly. On CPUs this is solved by having multiple layers of caches.

Early generations of NVIDIA GPUs did not have an automatic caching mechanism (or could not use it for CUDA, I forget) that could help solve this issue. But they did have memory available locally on each compute unit that you could manually read/write data into. This helped reduce the overall read/write overhead.

Even now that newer generations have caches, it is still beneficial to use this shared/local memory. And when the shared/local memory limits are hit, there are alternatives like textures in CUDA and images in OpenCL that are slightly slower, but still significantly better than reading from DRAM.
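
To make the manual-caching idea concrete, here's a minimal CUDA sketch (the kernel name and the RADIUS/BLOCK constants are just illustrative, not anything from this thread): each block stages its slice of the input, plus a small halo, into __shared__ memory once, and then every thread re-reads its neighbours from that on-chip tile instead of going back to DRAM.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define RADIUS 3
    #define BLOCK  256

    // Moving-average filter: each block copies its slice of the input (plus a
    // halo of RADIUS elements on each side) into on-chip __shared__ memory,
    // so the 2*RADIUS+1 reads per output element hit shared memory, not DRAM.
    __global__ void movingAverage(const float *in, float *out, int n) {
        __shared__ float tile[BLOCK + 2 * RADIUS];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int lid = threadIdx.x + RADIUS;                   // index into the tile

        // One coalesced DRAM read per element; out-of-range cells become 0.
        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x < RADIUS) {
            int left  = gid - RADIUS;
            int right = gid + BLOCK;
            tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
            tile[lid + BLOCK]  = (right <  n) ? in[right] : 0.0f;
        }
        __syncthreads();

        if (gid >= n) return;

        // All neighbour reads now come from shared memory.
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];
        out[gid] = sum / (2 * RADIUS + 1);
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in,  n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;

        movingAverage<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
        cudaDeviceSynchronize();

        printf("out[123] = %f\n", out[123]);  // expect 1.000000
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

The same tiling structure is what you'd fall back from when the shared-memory limit is hit, which is where the texture/image path mentioned above comes in.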


The only relation being that 64K is 2^16, I imagine.




