I know that this thread is old, but I think I can bring some relevant information that answers to the question asked.
Continuum Analytics has a package that contains libraries that resolves the CUDA computing for you. Basically you instrument your code that needs to be parallelized (within a function) with a decorator and you need to import a library. Thus, you don't need any knowledge about CUDA instructions.
Information can be found on NVIDIA page
https://developer.nvidia.com/anaconda-accelerate
or you can go directly to the Continuum Analytics' page
https://store.continuum.io/cshop/anaconda/
There is a 30 day trial period and a free licence for academics.
I use this extensively and accelerates my code between 10 to 50 times.