Portable CUDA on NVIDIA, AMD, Intel GPUs and CPUs

I’ve just uploaded a new version of the open source CUDAfy.NET SDK that targets Linux.  For those who do not know, CUDAfy.NET is an easy-to-use .NET wrapper that brings the NVIDIA CUDA programming model and the power of GPGPU to the world of C#, VB, F# and other .NET users.  Anyone with a basic understanding of the CUDA runtime model will be able to pick up and run with CUDAfy within a few minutes.

One criticism of CUDA is that it only targets NVIDIA GPUs.  OpenCL is often regarded as a better alternative, being supported by both NVIDIA and AMD on their GPUs, and also on CPUs by Intel and AMD.  That should have been the end of the story for CUDA, were it not for the fact that OpenCL remains, even in its 1.2 version, a bit of a dog’s breakfast in comparison to CUDA.  I guess that is what it gets for trying to be all things to all men.  Although in theory platform agnostic, to get the best out of it you need to take care of a number of often vendor-specific issues, and if you need higher-level maths libraries such as FFTs and BLAS then it gets tougher.  Ultimately, if you’ve been reared on CUDA, moving to OpenCL is not a nice proposition.


[Image: CUDAfy .NET brings together a rare mix of bedfellows]

Because CUDAfy is a wrapper, we were able to overcome this limitation.  The same code, with some small restrictions, can now target either OpenCL or CUDA.  The restrictions include no FFT, BLAS, SPARSE and RAND libraries, and no functions in structures.  Other than this you can use either OpenCL or CUDA notation for accessing thread, block and grid IDs and dimensions, the same launch parameters, the same handling of shared, constant and global memory, and the same host interface (the methods you use for copying data, etc.).

Now, with the release of the CUDAfy Linux library, you can run the same application on both Windows and Linux, 32-bit or 64-bit, and use AMD GPUs, NVIDIA GPUs, Intel CPUs/GPUs, AMD APUs/CPUs/GPUs and also some Altera FPGA cards such as those from BittWare and Nallatech.  To run such an app you need only have the relevant device drivers and OpenCL libraries installed, plus Mono on Linux.  If you have an NVIDIA or AMD GPU in the system and up-to-date drivers then it should just work.  For developing you’ll need the relevant CUDA or OpenCL SDK; NVIDIA, Intel, AMD and Altera all have theirs.  An advantage of OpenCL is that the compiler is built in: you do not need Visual C++ or gcc on the machine to make new GPU functions.
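To give a flavour of how little back-end selection costs at run time, here is a hypothetical start-up snippet that prefers CUDA when an NVIDIA device is present and otherwise falls back to OpenCL.  The `GetDeviceCount` and `GetDevice` calls are from `Cudafy.Host`, but treat the exact overloads as assumptions and check them against your SDK version.

```csharp
using System;
using Cudafy;
using Cudafy.Host;

static class BackendProbe
{
    // Prefer a CUDA device if one is installed; otherwise fall back to OpenCL.
    // This is a sketch, not taken from the CUDAfy samples.
    public static GPGPU GetBestDevice()
    {
        eGPUType type = CudafyHost.GetDeviceCount(eGPUType.Cuda) > 0
            ? eGPUType.Cuda
            : eGPUType.OpenCL;
        GPGPU gpu = CudafyHost.GetDevice(type, 0);
        Console.WriteLine("Using {0} device 0", type);
        return gpu;
    }
}
```

The returned `GPGPU` is then used exactly as in the example below; only the translation call would differ (CUDA versus the OpenCL architecture enumerator).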

The aim we have in mind is to cater for the increasing number of embedded or mobile devices with GPU capabilities.  NVIDIA has yet to support CUDA on its Tegra, but there are already ARM and other devices that have OpenCL libraries.  Examples include the Google Nexus 4 phone and Nexus 10 tablet.  The embedded market may often be less visible (excuse the pun) than its more obvious phone and tablet counterparts, but since these devices are hidden in so many machines their numbers are huge.  Power consumption is always an issue, which is why being able to use the GPU for general purpose processing instead of just graphics is so important.  Being able to take advantage of that in a straightforward way is therefore vital, and CUDAfy .NET can be a solution for such devices running Windows or Linux.


[Image: Google Nexus 10 has OpenCL libraries onboard]

CUDAfy .NET is an open source project hosted on CodePlex and is licensed under the LGPL.  You can freely link to the library from commercial applications.  Any changes you make to the source code must be contributed back.  Alternatively, there is a commercial license available that allows you to do the things you cannot under the LGPL, such as embedding (modified) versions into your application, and it includes support.  All donations received for CUDAfy .NET go to Harmony Through Education, a charity helping handicapped children in the third world.

Below is an example application written in C# that can be targeted for either CUDA or OpenCL by simply changing an enumerator.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyByExample
{
    public class add_loop_long
    {
        public const int N = 32 * 1024;

        public static void Execute()
        {
            // Translate all members with the Cudafy attribute in the given type to CUDA and compile.
            CudafyModule km = CudafyTranslator.Cudafy(typeof(add_loop_long)); // or = Cudafy(eArchitecture.OpenCL, typeof(add_loop_long));

            // Get the first CUDA device and load the module generated above.
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda, 0); // or eGPUType.OpenCL
            gpu.LoadModule(km);

            // Declare some arrays like normal
            int[] a = new int[N];
            int[] b = new int[N];
            int[] c = new int[N];

            // Allocate memory on the GPU of the same size as the specified arrays
            int[] dev_a = gpu.Allocate<int>(a);
            int[] dev_b = gpu.Allocate<int>(b);
            int[] dev_c = gpu.Allocate<int>(c);

            // Fill the arrays 'a' and 'b' on the CPU
            for (int i = 0; i < N; i++)
            {
                a[i] = i;
                b[i] = 2 * i;
            }

            // Copy the arrays 'a' and 'b' to the GPU
            gpu.CopyToDevice(a, dev_a);
            gpu.CopyToDevice(b, dev_b);

            // Launch the method 'add' on a grid of 128 blocks of one thread each
            gpu.Launch(128, 1).add(dev_a, dev_b, dev_c);

            // Copy the array 'c' back from the GPU to the CPU
            gpu.CopyFromDevice(dev_c, c);

            // Verify that the GPU did the work we requested
            bool success = true;
            for (int i = 0; i < N; i++)
            {
                if ((a[i] + b[i]) != c[i])
                {
                    Console.WriteLine("{0} + {1} != {2}", a[i], b[i], c[i]);
                    success = false;
                }
            }
            if (success)
                Console.WriteLine("We did it!");

            // Free the memory allocated on the GPU
            gpu.FreeAll();

            // Freeing the memory we allocated on the CPU is not necessary - this is .NET
        }

        [Cudafy]
        public static void add(GThread thread, int[] a, int[] b, int[] c)
        {
            // Get the id so we know what data this thread should process
            int tid = thread.blockIdx.x;
            while (tid < N)
            {
                c[tid] = a[tid] + b[tid];
                tid += thread.gridDim.x;
            }
        }
    }
}
To make good use of a GPU you need to think of processes where there are a lot of tasks of a similar nature to be handled.  There are the typical well-known tasks in physics, geology, image processing, life sciences, astronomy, etc., but even more mundane processes can be accelerated or made more efficient.  Think of pattern matching, going through large arrays of objects, serialization, summing, averaging, reversing and more.
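As one sketch of such a mundane task, here is a hypothetical partial-sum kernel written against the same `GThread` API as the `add` example above: each block strides through the input and accumulates its slice into one slot of a small `partials` array, which the host then sums.  The kernel name and signature are my own, not from the CUDAfy samples.

```csharp
using Cudafy;

public static class SumExample
{
    // Each block (one thread per block, as in the add example) sums a strided
    // slice of 'input' into partials[blockIdx.x]; the host adds the few
    // partial results afterwards. A sketch only - assumes the GThread API
    // shown in the add_loop_long listing.
    [Cudafy]
    public static void PartialSum(GThread thread, int n, int[] input, int[] partials)
    {
        int sum = 0;
        for (int i = thread.blockIdx.x; i < n; i += thread.gridDim.x)
            sum += input[i];
        partials[thread.blockIdx.x] = sum;
    }
}
```

On the host you would launch it with something like `gpu.Launch(128, 1).PartialSum(n, dev_in, dev_partials)`, copy the 128 partial sums back and add them on the CPU; with a single thread per block there is no need for a shared-memory reduction, at the cost of some parallelism.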

