That’s Embedded World over for another year

Erik at Embedded World 2015, manning the mighty 4DSP stand

This was the first time 4DSP exhibited at Embedded World; we were a co-exhibitor on the DSPValley stand.  We picked up very few new leads, but that was no surprise.  It was more important to meet existing customers and the people we are already in discussion with, and in that respect it was a great success: we really enjoyed meeting so many people and discussing both the technical and business aspects.  It is always good to put a face to the names.  While I try to visit customers when I can, it is not always feasible, viable or even desirable; if someone is just researching the market, or is clear on what they want, they do not always want a business development / salesperson, no matter how nice and well informed (!), to drop by.  Over the next month I will follow up with everyone and see how we can best proceed.

I must say the chaps from DSPValley did a great job and took on pretty much all the headache of organizing a stand at such an event.  Apart from sending some text and images, we basically just needed to turn up.  That was invaluable given how little time we had over the last few months.  Hats off to Bjorn and his colleagues.


CUDA Launch: Out of Resources Error and Strongly Typed Methods

You may see this cryptic error now and then when developing with CUDAfy.NET.  Typically the cause is passing the wrong parameters to the device function. This example is going to end in tears:

        [Cudafy]
        public static void Scale(GThread thread, ComplexFloat[] c, float scale)
        {
            int id = thread.get_global_id(0);
            c[id].R = c[id].R * scale;
            c[id].I = c[id].I * scale;
        }
        ...
        int N = 1024;
        // BUG: 1.0 / N produces a double (8 bytes); Scale expects a float (4 bytes)
        gpu.Launch(gridSize, BLOCK_LEN, "Scale", devBuffer, 1.0 / N);
        ...

It can be corrected by changing 1.0 to 1.0F so that we stay in single precision. As written above, the result of the division is a double-precision value, which is 8 bytes; the device function expects a single-precision float, which is 4 bytes. Hence the out of resources message. Kind of makes sense.
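You can see the mismatch from the C# side without any GPU at all. This little standalone sketch (nothing CUDAfy-specific, just the literal rules of C#) shows what each expression actually produces:

```csharp
using System;

class LiteralSizes
{
    static void Main()
    {
        int N = 1024;
        object bad  = 1.0  / N;   // double literal: the division yields a System.Double
        object good = 1.0F / N;   // float literal: the division stays a System.Single

        Console.WriteLine(bad.GetType());   // System.Double
        Console.WriteLine(good.GetType());  // System.Single
        Console.WriteLine(sizeof(double));  // 8 -- what actually gets marshalled
        Console.WriteLine(sizeof(float));   // 4 -- what Scale expects
    }
}
```

The kernel argument list is marshalled byte-for-byte, so the extra 4 bytes is exactly what trips the launch.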
A better policy is to use the strongly typed form of Launch.

        [Cudafy]
        public static void Scale(GThread thread, ComplexFloat[] c, float scale)
        {
            int id = thread.get_global_id(0);
            c[id].R = c[id].R * scale;
            c[id].I = c[id].I * scale;
        }
        ...
        int N = 1024;
        // Passing the method group Scale gives compile-time type checking;
        // 1.0F / N stays single precision.
        gpu.Launch(gridSize, BLOCK_LEN, Scale, devBuffer, 1.0F / N);
        ...

The only downside is a slight performance hit due to reflection, though it is nowhere near as large as that of the more elegant-looking dynamic launching that CUDAfy also supports.

        gpu.Launch(gridSize, BLOCK_LEN).Scale(devBuffer, 1.0F/N);

The first time this is called you can be hit with around 20 ms of dynamic runtime goodness; subsequent calls appear efficient. You have, however, also lost your strongly typed safety, which does not appear to worry millions of programmers around the world.

Portable CUDA on NVIDIA, AMD, Intel GPUs and CPUs

I’ve just uploaded a new version of the open source CUDAfy.NET SDK that targets Linux.  For those who do not know, CUDAfy.NET is an easy to use .NET wrapper that brings the NVIDIA CUDA programming model and the power of GPGPU to the world of C#, VB, F# and other .NET users.  Anyone with a basic understanding of the CUDA runtime model will be able to pick up CUDAfy and be up and running within a few minutes.

One criticism of CUDA is that it only targets NVIDIA GPUs.  OpenCL is often regarded as a better alternative, being supported by NVIDIA and AMD on their GPUs and by Intel and AMD on CPUs.  That should have been the end of the story for CUDA, were it not for the fact that OpenCL remains, even in its 1.2 version, a bit of a dog’s breakfast in comparison.  I guess that is what it gets for trying to be all things to all men.  Although in theory platform agnostic, in practice getting the best out of OpenCL means dealing with a number of vendor-specific issues, and if you need higher-level maths libraries such as FFT and BLAS it gets tougher still.  Ultimately, if you’ve been reared on CUDA, moving to OpenCL is not a nice proposition.


CUDAfy .NET brings together a rare mix of bed fellows

Because CUDAfy is a wrapper, we were able to overcome this limitation: the same code, with some small restrictions, can now target either OpenCL or CUDA.  The restrictions are no FFT, BLAS, SPARSE and RAND libraries, and no functions in structures.  Other than that you can use either OpenCL or CUDA notation for accessing thread, block and grid ids and dimensions, the same launch parameters, the same handling of shared, constant and global memory, and the same host interface (the methods you use for copying data and so on).

Now, with the release of the CUDAfy Linux library, you can run the same application on both Windows and Linux, 32-bit or 64-bit, and use AMD GPUs, NVIDIA GPUs, Intel CPUs/GPUs, AMD APUs/CPUs/GPUs and even some Altera FPGA cards such as those from BittWare and Nallatech.  To run such an app you need only the relevant device drivers and OpenCL libraries installed, plus Mono on Linux.  If you have an NVIDIA or AMD GPU in the system and up to date drivers then it should just work.

For developing you’ll need the relevant CUDA or OpenCL SDK; NVIDIA, Intel, AMD and Altera all have theirs.  An advantage of OpenCL is that the compiler is built in: you do not need Visual C++ or gcc on the machine to build new GPU functions.

The aim we have in mind is to cater for the increasing number of embedded and mobile devices with GPU capabilities.  NVIDIA has yet to support CUDA on its Tegra chips, but there are already ARM and other devices that ship with OpenCL libraries; examples include the Google Nexus 4 phone and Nexus 10 tablet.  The embedded market may be less visible (excuse the pun) than its more obvious phone and tablet counterparts, but since these devices are hidden in so many machines their numbers are huge.  Power consumption is always an issue, which is why being able to use the GPU for general purpose processing instead of just graphics is so important.  Being able to take advantage of that in a straightforward way is therefore vital, and CUDAfy.NET can be a solution for such devices running Windows or Linux.


Google Nexus 10 has OpenCL libraries onboard

CUDAfy .NET is an open source project hosted at CodePlex and licensed under the LGPL.  You can freely link to the library from commercial applications, but any changes you make to the library source code must be made available.  Alternatively there is a commercial license that allows you to do the things you cannot under the LGPL, such as embedding (modified) versions into your application, and it includes support.  All donations received for CUDAfy .NET go to Harmony Through Education, a charity helping handicapped children in the third world.

An example application written in C# can be targeted for either CUDA or OpenCL simply by changing an enumerator.
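To give a flavour of what that looks like, here is a minimal sketch of a dual-target program. The CudafyModes/CudafyTranslator/CudafyHost calls are the standard CUDAfy.NET way of selecting a back end; the AddVectors kernel itself is just an illustrative example, not code from the post:

```csharp
using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public class PortableExample
{
    [Cudafy]
    public static void AddVectors(GThread thread, float[] a, float[] b, float[] c)
    {
        int id = thread.get_global_id(0);
        c[id] = a[id] + b[id];
    }

    public static void Main()
    {
        // Flip this one enumerator to retarget the whole application.
        CudafyModes.Target = eGPUType.OpenCL;            // or eGPUType.Cuda
        CudafyTranslator.Language =
            CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;

        CudafyModule km = CudafyTranslator.Cudafy();     // translate the [Cudafy] methods
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
        gpu.LoadModule(km);

        const int N = 1024;
        float[] a = new float[N], b = new float[N], c = new float[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        float[] devA = gpu.CopyToDevice(a);
        float[] devB = gpu.CopyToDevice(b);
        float[] devC = gpu.Allocate<float>(N);

        gpu.Launch(N / 256, 256).AddVectors(devA, devB, devC);
        gpu.CopyFromDevice(devC, c);
        gpu.FreeAll();

        Console.WriteLine("c[10] = {0}", c[10]);
    }
}
```

Everything after the first three lines of Main is identical for both back ends, which is the whole point.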

Low Cost, High Performance FPGA and GPGPU Based Data Acquisition System

Xilinx evaluation boards such as the ML605 (Virtex-6), KC705 (Kintex-7) and VC707 (Virtex-7) give access to high end FPGAs on a relatively low budget.  In fact these boards cost about the same as the bare FPGA they carry, so you are essentially getting all the other features, such as USB, memory, Ethernet and the PCI Express interface, for free.  A very interesting use for these boards is in combination with analogue to digital and digital to analogue converters (ADCs and DACs).  The FMC form factor has grown rapidly in popularity, giving a huge range of high performance modules, and the Xilinx evaluation boards listed above can each take up to two FMC modules.  This makes an extremely powerful processing platform.  Below you can see a Xilinx KC705 Kintex-7 evaluation board and a 4DSP FMC150 ADC/DAC daughter module.  We have many more FMC modules available.

Xilinx KC705 evaluation board with the 4DSP FMC150 ADC/DAC module

Xilinx Evaluation Board Problems

The main disadvantage is that the PCBs are on the large side: with FMCs mounted and the board placed in a PCI Express slot of a standard PC or workstation, it is no longer possible to close the case.  The analogue cables are also awkwardly positioned, protruding from the open machine.  This makes such systems fragile and liable to damage should a cable be accidentally pulled or an item fall into the open computer.

4DSP GPUDirect demonstration system with a Xilinx ML605 and an NVIDIA Quadro 4000

1U Rack Solution

To provide a solution we have come up with a 1U 19” rack platform that houses a Mini-ITX motherboard, a Xilinx evaluation board in a PCIe x16 slot via a riser, hard drives and FMC modules.  Short analogue cables connect the FMC analogue connectors to the front panel.  This has the secondary advantage of allowing larger, more robust and convenient SMA connectors in place of the customary SSMC or MMCX connectors of the FMC modules.  The power supply is an external laptop-style unit rated at 120 W, which is more than sufficient.

Photographs of the 1U rack system

400Mbytes/second Storage to a Single Drive

In the photographs you can see a system with a 4DSP FMC108 eight-channel 250 MHz ADC.  Using a single SSD we can acquire and store data at more than 400 Mbytes/second sustained.  Furthermore, we can elect to do some of the processing on the CPU.

OpenCL and General Purpose GPU on Intel CPU

Using the latest Intel Core CPUs we get good OpenCL performance.  This allows us to exploit the rapid development possibilities of the GPU and CPU, placing, say, complex image processing there while leaving simpler pre-processing tasks to the FPGA.  This flexibility can be a huge benefit during proof of concept stages when algorithms are changing frequently.  In my next post I’ll look at some code and show how we bring the FPGA board, Intel CPU and storage together.
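In the meantime, you can already check which OpenCL devices your drivers expose, Intel CPUs included, using CUDAfy’s device enumeration. A small sketch (the device names and counts you see will of course depend on the drivers installed):

```csharp
using System;
using Cudafy;
using Cudafy.Host;

class ListOpenCLDevices
{
    static void Main()
    {
        // Enumerate every OpenCL device the installed drivers expose:
        // Intel CPUs, AMD APUs, and NVIDIA/AMD GPUs all show up here.
        foreach (GPGPUProperties prop in CudafyHost.GetDeviceProperties(eGPUType.OpenCL))
        {
            Console.WriteLine("{0}: {1} ({2} multiprocessors)",
                prop.DeviceId, prop.Name, prop.MultiProcessorCount);
        }
    }
}
```

If the Intel OpenCL runtime is installed, the CPU appears in this list just like a GPU and can be selected with CudafyHost.GetDevice in the usual way.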


OpenCL logo (Photo credit: Wikipedia)

Availability

Each 1U chassis is made from aluminium and customized with the required drillings.  It is a relatively small amount of work for a very professional result that will protect your evaluation system and even allow you to take it out of the lab.  If you are interested in this platform, or something similar for your Xilinx FPGA evaluation board based projects, then please contact us.

NVIDIA – Clear as Mud – Hyper-Q and Dynamic Parallelism on Laptops

Until now, NVIDIA CUDA’s powerful new Hyper-Q and Dynamic Parallelism features were only available on Tesla Kepler K20 and some Quadro K-series cards; GeForce cards do not support them.  The main reason, apparently, is that these functions require features only available on some Kepler architectures.  However, that is not the full story, since the GeForce Titan supposedly does support Hyper-Q and Dynamic Parallelism, but not dual copy engines or GPUDirect RDMA, as we’ll see below.

Now that is rather disappointing news if you want to experiment with these features on a lower cost platform, but there is an alternative: Hyper-Q and Dynamic Parallelism will be available on certain laptops with the GK208 chip.  And not only expensive laptops; there will likely be an Acer model coming in at around $700.  Why is this?  Many programmers develop their CUDA algorithms on a laptop and only later move to large, expensive (and very likely shared) Tesla workstations.

I’m not clear on whether dual copy engines will also feature.  Probably not.  Titan also misses this powerful ability to transfer data to and from the GPU concurrently, and for the right algorithm it can make a big difference.  I’ve got a GTX680 here which wipes the floor with the Quadro 4000 in terms of single-precision floating point processing, yet when there is a lot of data transfer the advantage can switch around.  Dual copy engines are actually present on almost all NVIDIA cards, only not enabled.  On some earlier series cards they could be enabled by hacking the firmware; on some Kepler cards they and other Tesla features can be enabled by unleashing a soldering iron on your board.

With NVIDIA, accurate information is about as hard to get as water in the desert.  There is more misinformation flying around than during the Iraq war.  Supposedly the GT730M will have the GK208.  Except when it has a GK107.  Then there is GPUDirect.  This marketing word, in its (Linux only) GPUDirect RDMA flavour (as opposed to the GPUDirect for Video variant), refers to the ability to use a little-used PCI Express feature for peer to peer data transfers.  It is a fantastically important capability, allowing data to travel between GPUs, network cards and other PCIe devices without burdening the CPU and system memory.  At GTC 2012 this was said to be a Kepler feature: all that was needed was CUDA 5.0 and the latest drivers, and you could then adapt your own device drivers to support it.  In our case that was for an FPGA card, so we could DMA directly to the GPU and thus reduce latency and CPU load.  Then, without warning, this support was removed from all but the Tesla K20.  I think.

One piece of information I got from a source at NVIDIA is that they do not want the support headache of massive numbers of (GeForce) people using advanced features like these, so they restrict them to their own high end cards.  For GPUDirect RDMA I can see some logic in this, in that you are also reliant on correctly modifying the driver of another card, and the motherboard architecture matters too.  Using a PCIe bridge such as a PLX device is best; if you only have the PCIe switch of the CPU it may not work at all.  So a complaint to NVIDIA could actually be the fault of the motherboard, motherboard drivers, chipset, third-party PCIe device, third-party built NVIDIA GPU and so on: too many variables.  Unfortunately the Tesla K20 and Titan cards are too power hungry for many applications outside high performance computing, where a low cost GeForce would work fine (think embedded systems).

Will the GK208 make it to desktop GPUs?  And more to the point, will it have these K20 features enabled?  It depends on who you read; here is a GT630 with the GK208?!?!  The moral of the story is that NVIDIA needs to get clear on which cards support what.  We need clear information and guarantees.  Prancing around smugly announcing the latest supercomputer populated with their CUDA cards is not enough; the rest of us mortals are generally left speculating and making fools of ourselves in front of our customers after trying to convince them to leap into GPGPU.  End users are right to be concerned about product road maps, product lifetimes, and driver and feature support.