Portable CUDA on NVIDIA, AMD, Intel GPUs and CPUs

I’ve just uploaded a new version of the open-source CUDAfy.NET SDK that targets Linux. For those who do not know, CUDAfy.NET is an easy-to-use .NET wrapper that brings the NVIDIA CUDA programming model and the power of GPGPU to the world of C#, VB, F# and other .NET users. Anyone with a basic understanding of the CUDA runtime model will be able to pick up and run with CUDAfy within a few minutes.

One criticism of CUDA is that it only targets NVIDIA GPUs. OpenCL is often regarded as a better alternative, being supported not only by NVIDIA and AMD on their GPUs, but also on CPUs by Intel and AMD. That should have been the end of the story for CUDA, were it not for the fact that OpenCL remains, even in its 1.2 version, a bit of a dog’s breakfast in comparison. I guess that is the price of trying to be all things to all men. Although in theory platform agnostic, in practice you need to take care of a number of vendor-specific issues to get the best out of it, and if you need higher-level maths libraries such as FFT and BLAS then it gets tougher still. Ultimately, if you’ve been reared on CUDA, moving to OpenCL is not a pleasant proposition.

[Image: CUDAfy .NET brings together a rare mix of bedfellows]

Because CUDAfy is a wrapper, we were able to overcome this limitation. The same code, with some small restrictions, can now target either OpenCL or CUDA. The restrictions are: no FFT, BLAS, SPARSE or RAND libraries, and no functions in structures. Other than this, you can use either OpenCL or CUDA notation for accessing thread, block and grid IDs and dimensions, with the same launch parameters, the same handling of shared, constant and global memory, and the same host interface (the methods you use for copying data and so on). Now, with the release of the CUDAfy Linux library, you can run the same application on both Windows and Linux, 32-bit or 64-bit, and use AMD GPUs, NVIDIA GPUs, Intel CPUs/GPUs, AMD APUs/CPUs/GPUs and also some Altera FPGA cards such as those from BittWare and Nallatech. To run such an app you only need the relevant device drivers and OpenCL libraries installed, plus Mono on Linux. If you have an NVIDIA or AMD GPU in the system with up-to-date drivers then it should just work. For developing you’ll need the relevant CUDA or OpenCL SDK; NVIDIA, Intel, AMD and Altera all have theirs. An advantage of OpenCL is that the compiler is built in, so you do not need Visual C++ or gcc on the machine to build new GPU functions.
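To illustrate the notation point, here is a minimal sketch of two equivalent CUDAfy kernels, one written with CUDA-style indexing and one with OpenCL-style indexing; it assumes a CUDAfy version whose GThread exposes the OpenCL-style get_global_id accessor, and both forms translate to either back end:

    using Cudafy;

    public class Kernels
    {
        // CUDA-style: compute a global index from block and thread IDs.
        [Cudafy]
        public static void ScaleCudaStyle(GThread thread, float[] data, float factor)
        {
            int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            if (i < data.Length)
                data[i] *= factor;
        }

        // OpenCL-style: ask for the global work-item ID directly.
        [Cudafy]
        public static void ScaleOpenCLStyle(GThread thread, float[] data, float factor)
        {
            int i = thread.get_global_id(0);
            if (i < data.Length)
                data[i] *= factor;
        }
    }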

The aim we have in mind is to cater for the increasing number of embedded and mobile devices with GPU capabilities. NVIDIA has yet to support CUDA on its Tegra chips, but there are already ARM and other devices that ship with OpenCL libraries; examples include the Google Nexus 4 phone and Nexus 10 tablet. The embedded market may often be less visible (excuse the pun) than its more obvious phone and tablet counterparts, but since these devices are hidden in so many machines their numbers are huge. Power consumption is always an issue, which is why being able to use the GPU for general-purpose processing instead of just graphics is so important. Being able to take advantage of that in a straightforward way is therefore vital, and CUDAfy .NET can be a solution for such devices running Windows or Linux.

[Image: Google Nexus 10 has OpenCL libraries on board]

CUDAfy .NET is an open-source project hosted on CodePlex and is licensed under the LGPL. You can freely link to the library from commercial applications; any changes you make to the library’s source code must be contributed back. Alternatively, there is a commercial license available that allows the things you cannot do under the LGPL, such as embedding modified versions in your application, and it includes support. All donations received for CUDAfy .NET go to Harmony Through Education, a charity helping handicapped children in the third world.

Below is an example application written in C# that can be targeted for either CUDA or OpenCL by simply changing an enumerator.
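A minimal sketch of such a program, modelled on the classic CUDAfy vector-add sample (array sizes and launch parameters here are illustrative):

    using System;
    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

    public class VectorAdd
    {
        public const int N = 1024;

        [Cudafy]
        public static void add(GThread thread, int[] a, int[] b, int[] c)
        {
            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            if (tid < N)
                c[tid] = a[tid] + b[tid];
        }

        public static void Main()
        {
            // The only line that changes to switch back end:
            CudafyModes.Target = eGPUType.OpenCL;   // or eGPUType.Cuda
            CudafyTranslator.Language =
                CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;

            // Translate this class to CUDA C or OpenCL and load it.
            CudafyModule km = CudafyTranslator.Cudafy();
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            gpu.LoadModule(km);

            int[] a = new int[N], b = new int[N], c = new int[N];
            for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

            // The host interface is identical for both back ends.
            int[] devA = gpu.CopyToDevice(a);
            int[] devB = gpu.CopyToDevice(b);
            int[] devC = gpu.Allocate<int>(c);

            gpu.Launch(N / 256, 256).add(devA, devB, devC);
            gpu.CopyFromDevice(devC, c);
            gpu.FreeAll();

            Console.WriteLine("c[100] = {0}", c[100]);
        }
    }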



Low Cost, High Performance FPGA and GPGPU Based Data Acquisition System

The Xilinx evaluation boards such as the ML605 (Virtex-6), KC705 (Kintex-7) and VC707 (Virtex-7) give access to high-end FPGAs on a relatively low budget. In fact the cost of these boards is about the same as you’d pay for the FPGA on the board alone, so essentially you are getting all the other features such as USB, memory, Ethernet and the PCI-Express interface for free. A very interesting use for these boards is in combination with analogue-to-digital or digital-to-analogue converters (ADCs and DACs). The FMC form factor has grown rapidly in popularity, giving rise to a huge range of high-performance modules, and the Xilinx evaluation boards listed above can each take up to two such FMC modules. This makes for an extremely powerful processing platform. Below you can see a Xilinx KC705 Kintex-7 evaluation board and a 4DSP FMC150 ADC/DAC daughter module. Many more FMC modules are available.

[Image: Xilinx KC705 evaluation board with 4DSP FMC150 daughter module]

Xilinx Evaluation Board Problems

The main disadvantage is that the PCBs are on the large side; when FMCs are mounted and the board is placed in a PCI-Express slot of a standard PC or workstation, it is no longer possible to close the case. The analogue cables also protrude awkwardly from the open side. This makes such systems very fragile and liable to damage should a cable be accidentally pulled or an item fall into the open computer.

[Image: 4DSP GPUDirect setup with ML605 and Quadro 4000]

1U Rack Solution

To solve this we have come up with a 1U 19” rack platform that can house a Mini-ITX motherboard, a Xilinx evaluation board in a PCIe x16 slot via a riser, hard drives and FMC modules. Short analogue cables connect the FMC analogue connectors to the front panel. This has the secondary advantage of allowing larger, more robust and convenient SMA connectors in place of the customary SSMC or MMCX connectors of the FMC modules. The power supply is an external laptop-style unit rated at 120W, which is more than sufficient.


400Mbytes/second Storage to a Single Drive

In the photographs you can see a system with a 4DSP FMC108 8-channel 250MHz ADC. Using a single SSD we can acquire and store data at more than 400Mbytes/second sustained. Furthermore, we can elect to do some of the processing on the CPU.
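As a rough illustration of what the software side involves (a hypothetical sketch, not our actual acquisition code), sustaining this kind of rate is mostly a matter of writing large sequential blocks and keeping the drive continuously busy:

    using System;
    using System.Diagnostics;
    using System.IO;

    class WriteThroughput
    {
        static void Main()
        {
            const int BlockSize = 4 * 1024 * 1024;   // 4 MB blocks keep the SSD streaming
            const int BlockCount = 256;              // 1 GB total for this test
            byte[] block = new byte[BlockSize];

            var sw = Stopwatch.StartNew();
            using (var fs = new FileStream("acquisition.bin", FileMode.Create,
                       FileAccess.Write, FileShare.None, BlockSize,
                       FileOptions.SequentialScan))
            {
                for (int i = 0; i < BlockCount; i++)
                    fs.Write(block, 0, BlockSize);   // in the real system each block
                                                     // would come from the FPGA DMA buffer
            }
            sw.Stop();

            double mbPerSec = (double)BlockSize * BlockCount / (1 << 20)
                              / sw.Elapsed.TotalSeconds;
            Console.WriteLine("{0:F0} Mbytes/second", mbPerSec);
        }
    }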

OpenCL and General Purpose GPU on Intel CPU

Using the latest Intel Core CPUs we get good OpenCL performance. This lets us exploit the rapid development possibilities of GPU and CPU programming, putting, say, complex image processing there while leaving the simpler pre-processing tasks to the FPGA. This flexibility can be a huge benefit during proof-of-concept stages, when algorithms are changing frequently. In my next post I’ll look at some code and show how we bring the FPGA board, Intel CPU and storage together.
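For instance, picking the Intel CPU as the OpenCL compute device from CUDAfy might look like the following sketch; device ordering and names vary with the installed SDKs, so we match on the device name:

    using System;
    using Cudafy;
    using Cudafy.Host;

    class DevicePicker
    {
        static void Main()
        {
            // List the available OpenCL devices; with the Intel SDK
            // installed, one of them is the CPU itself.
            int count = CudafyHost.GetDeviceCount(eGPUType.OpenCL);
            for (int id = 0; id < count; id++)
            {
                GPGPU gpu = CudafyHost.GetDevice(eGPUType.OpenCL, id);
                GPGPUProperties props = gpu.GetDeviceProperties();
                Console.WriteLine("{0}: {1}", id, props.Name);
                // Select, say, the device whose name contains "Intel"
                // to run kernels on the CPU.
            }
        }
    }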

[Image: OpenCL logo (Photo credit: Wikipedia)]

Availability

Each 1U chassis is made from aluminium and is customized with the required drillings. It’s a relatively small amount of work for a very professional result that will protect your evaluation system and even allow you to take it out of the lab. If you are interested in this platform, or something similar for your Xilinx FPGA evaluation board based projects, then please contact us.

NVIDIA – Clear as Mud – Hyper-Q and Dynamic Parallelism on Laptops

Until now, NVIDIA CUDA’s powerful new Hyper-Q and Dynamic Parallelism features were only available on the Tesla Kepler K20 and some Quadro K-series cards; GeForce cards do not support them. The main reason, apparently, is that these functions require features only present in some Kepler architectures. However, that is not the full story, since the GeForce Titan supposedly does support Hyper-Q and Dynamic Parallelism, but not dual copy engines or GPUDirect RDMA, as we’ll see below.

Now that is rather disappointing news if you want to experiment with these features on a lower-cost platform, but there is an alternative. Hyper-Q and Dynamic Parallelism will be available on certain laptops with the GK208 chip, and not only expensive laptops: there will likely be an Acer model coming in at around $700. Why is this? Many programmers develop their CUDA algorithms on a laptop and then move to the large, expensive (and very likely shared) Tesla workstations later. I’m not clear on whether dual copy engines will also feature; probably not. Titan also misses this powerful ability to transfer data to and from the GPU concurrently, and for the right algorithm it can make a big difference. I’ve got a GTX 680 here which wipes the floor with the Quadro 4000 in terms of single-precision floating-point processing, yet when there is a lot of data transfer the advantage can switch around. Dual copy engines are actually present on almost all NVIDIA cards, only not enabled. In some earlier-series cards they could be enabled by hacking the firmware; in some Kepler cards they and other Tesla features can be enabled by taking a soldering iron to your board.
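To make the dual-copy-engine point concrete, here is a hypothetical sketch of overlapping an upload and a download using CUDAfy’s asynchronous copies with pinned host memory and two streams; exact method signatures vary between CUDAfy versions, so treat this as an outline rather than a definitive implementation:

    using System;
    using Cudafy;
    using Cudafy.Host;

    class DualCopyOverlap
    {
        const int N = 1 << 20;

        static void Main()
        {
            GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda, 0);

            // Asynchronous transfers require pinned (page-locked) host memory.
            IntPtr hostIn = gpu.HostAllocate<float>(N);
            IntPtr hostOut = gpu.HostAllocate<float>(N);
            float[] devA = gpu.Allocate<float>(N);
            float[] devB = gpu.Allocate<float>(N);

            // With two copy engines, the upload on stream 1 and the
            // download on stream 2 can run at the same time.
            gpu.CopyToDeviceAsync(hostIn, 0, devA, 0, N, 1);
            gpu.CopyFromDeviceAsync(devB, 0, hostOut, 0, N, 2);

            gpu.SynchronizeStream(1);
            gpu.SynchronizeStream(2);

            gpu.HostFreeAll();
            gpu.FreeAll();
        }
    }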

With NVIDIA, accurate information is about as hard to get as water in the desert; there is more misinformation flying around than during the Iraq war. Supposedly the GT730M will have the GK208. Except when it has a GK107. Then there is GPUDirect. This marketing word, in its (Linux-only) GPUDirect RDMA flavour (as opposed to the GPUDirect for Video variant), refers to the ability to use a little-used PCI-Express feature for peer-to-peer data transfers. This is a fantastically important capability, allowing data to travel between GPUs, network cards and other PCIe devices without burdening the CPU and system memory. At GTC 2012 this was said to be a Kepler feature, and all that was needed was CUDA 5.0 and the latest drivers; you could then adapt your own device drivers to support it. In our case that was for an FPGA card, so we could DMA directly to the GPU and thus reduce latency and the load on the CPU. Then, without warning, this support was removed from all but the Tesla K20. I think.

One piece of info I got from a source at NVIDIA is that they do not want the support headache of massive numbers of (GeForce) people using advanced features like these, so they restrict them to their own high-end cards. For GPUDirect RDMA I can see some logic in this, in that you are also reliant on correctly modifying the driver of another card, and the motherboard architecture matters too. Using a PCIe bridge such as a PLX device is best; if you only have the PCIe switch of the CPU then it may not work at all. So complaints to NVIDIA could actually be the fault of the motherboard, motherboard drivers, chipset, third-party PCIe device, third-party-built NVIDIA GPU and so on: too many variables. Unfortunately the Tesla K20 and Titan cards are too power-hungry for many applications outside high-performance computing, where a low-cost GeForce would work fine (think embedded systems).

Will the GK208 make it to desktop GPUs? And if so, will it have these K20 features enabled? It depends who you read. Here is a GT630 with the GK208?!?! The moral of the story is that NVIDIA needs to get clear on which cards support what. We need clear information and guarantees. Prancing around smugly announcing the latest supercomputer populated with their CUDA cards is not enough. The rest of us mortals are left speculating and making fools of ourselves in front of the customers we have tried to convince to leap into GPGPU. End users are right to be concerned about product road maps, product lifetimes, and driver and feature support.

Ethics and your Graphics Processing Unit (GPU)

With great power comes great responsibility, goes the phrase. In computing circles we’d be liable to think more along the lines of with great power comes great electricity bills, or great cooling problems. But should we also be more often considering the original intention? Should we, as the engineers wielding the computing power, be concerned with how this technology could be abused? A quick trawl of the internet shows precious little concern for such issues; we are almost all completely entranced by the rush of technical possibilities coming at us. If we give the matter any thought at all, we tend to think in altruistic terms, of the great potential for a safer, more organized, more open, more equal, more efficient and faster world.

In 2011 there was a flood of news reports about how GPUs (Graphics Processing Units) could be used for more than pretty graphics, targeting tasks such as decryption and password cracking. This was of course a good way to raise publicity for GPGPU (General-Purpose GPU computing) and create a new market for the likes of NVIDIA and AMD. Programming tools such as CUDA and OpenCL made it much easier to leverage the massively parallel architectures of GPUs for non-graphics tasks, and decrypting secure data and cracking passwords were apparently well suited to such devices. [1][2][3]

From an engineering point of view it is exciting to understand the challenges of cracking modern encryption methods and to see the effect of password length on complexity. Up to eight random characters is apparently quite straightforward and can be done within hours; going to ten characters can suddenly take decades, and going beyond that can quickly take millennia. The mathematical theory behind all this is quite fascinating. [4]
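As a rough back-of-the-envelope illustration (my own figures, not taken from the references): with the 95 printable ASCII characters, an 8-character random password gives 95^8 ≈ 6.6 × 10^15 combinations, which at a hypothetical 10^11 guesses per second falls in under a day. Ten characters gives 95^10 ≈ 6 × 10^19 combinations, roughly two decades at the same rate, and each further character multiplies the effort by 95 again.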

Another area often mentioned in the same breath as GPUs is computational finance. This is the world of the so-called quants, chaps attracted from the fields of science into financial engineering. High-performance computing is used to predict stock market movements, calculate pricing, quantify risk, etc. The more horsepower we have, the more chance we have of outsmarting the competition. We learn of such things as high-frequency trading, where automated buying and selling happen so rapidly that an investment is sometimes held for only milliseconds. We learn that latency is critical, and that by locating a computer centre closer to the exchange and improving data throughput we can gain further advantage. [5]

If you go to a high performance computing conference you will be able to attend any number of talks given by academics or software engineers detailing the astonishing breakthroughs they have made in areas such as these. But while listening, should we not also use our well-trained and agile minds to question the greater ramifications of what we are actually doing?

Is the cracking of passwords and accessing of confidential information always good? Is being able to see finance as mathematical models, devoid of a bricks-and-mortar, flesh-and-blood reality, really of sound use? If we touch on such questions at all, we hear about system administrators who lose access to vital company information. We hear about people losing access to personal, irreplaceable documents and photographs. We hear about security agencies needing access to the communications of criminals and terrorists. In finance, we are told that we will get better liquidity, better market stability, and that the competition will bring value.

But do we really believe all this? Does this kind of computing power really give us a more stable, more efficient financial sector? We learn that high-frequency trading skims money off the transactions between investors and businesses, a form of unauthorized taxation as it pre-empts genuine trades. We witness phenomena such as the flash crash of 2010, when algorithms compounded on errors to create chaotic spikes. [6] We can hide behind the maths of risk to the exclusion of real facts, such as the unsustainable house of cards that was sub-prime. While the banking sector has pretty much recovered, there are legions of ordinary people with reduced pensions, bankrupt businesses, lost savings, no work, and austerity measures cutting their benefits and services. [7]

The impact of being able to break passwords, decrypt secure communication and monitor all internet traffic is altogether more sinister, and raises questions about the kind of world we wish to live in. We naively assume that we have nothing to hide and that such technology is used for our collective safety; we can intercept terrorist plots and illegal business activities, for example. It is already technically possible to monitor all internet traffic in a small to medium-sized country [8][9], and within a year or two it will be affordable to do so for any country. The cost of such processing power would apparently come in at less than one modern fighter jet; scaling up from the systems already available, this is quite believable. [10]

When discussing such issues with an engineer friend, he claimed not to be worried, because it would likely not be possible to monitor the data in any useful way. This underestimates the ingenuity and rate of progress of the computing world. Google serves a significant proportion of the world’s internet users and already tracks a massive range of statistics about them. Gmail scans emails and displays adverts based on their content; search history, links clicked and more are stored. Google has already shown that the theory works fine.

As people, and as engineers, we like to assume that the technology we create will be used for good. We tout the so-called Twitter and Facebook revolutions as examples of how technology is opening up the world and allowing repressed peoples to overthrow corrupt despots [11], but we fail to see how the same technologies allow those same despots to monitor their own people. We do not often hear how the Iranian government used Facebook to identify protestors and their families. [12] Nor do we often hear how easy it is to track the internet activities of our heroic freedom campaigners. And what should happen if we become disillusioned with our own governments? Would we be allowed to democratically protest and oust them, or would the technological might and subsequent rule of law be used against us under a flimsy patriotic pretext?

And where will it all stop? If our online and mobile communications can be monitored, what about our offline personal discussions with our fellow freedom fighters / terrorists (delete according to viewpoint)? As Google Glass makes its entrance, backed by on-the-fly translation [13] and facial recognition [14], we see that, technically speaking, everything we do, see, hear and say could also be monitored. If we are not careful we will find ourselves engaged in a technology-enabled race to the bottom of morality, a desperate fight to protect a way of life we have already lost.

References

1. https://securityledger.com/new-25-gpu-monster-devours-passwords-in-seconds/
2. http://erratasec.blogspot.nl/2011/06/password-cracking-mining-and-gpus.html#.UZ90kLVmh8E
3. http://www.cyint.in/products_decryptiontools.htm
4. http://en.wikipedia.org/wiki/Password_cracking
5. http://en.wikipedia.org/wiki/High_frequency_trading
6. http://en.wikipedia.org/wiki/2010_Flash_Crash
7. http://www.motherjones.com/mojo/2013/05/bank-record-profits-fdic-unemployment-housing
8. http://surveillance.rsf.org/en/amesys/
9. http://www.defenceweb.co.za/index.php?option=com_content&view=article&id=18932&catid=74&Itemid=30
10. “Freedom and the Future of the Internet”, Julian Assange, 2012. http://emilkirkegaard.dk/en/?p=3429
11. http://en.wikipedia.org/wiki/Twitter_Revolution
12. “The Net Delusion”, Evgeny Morozov, 2012. http://www.publicaffairsbooks.com/morozovch1.pdf, p10
13. http://www.huffingtonpost.com/2012/07/23/google-glass-inspired-specs-auto-translate_n_1695008.html
14. http://www.guardian.co.uk/technology/2013/jun/03/google-glass-facial-recognition-ban