I read that Huffman coding does not work on GPU but this paper claims otherwise - cuda

I have read in several places that building a Huffman encoder on a GPU is not very efficient because the algorithm is sequential. But this paper offers a possible implementation and claims it is faster than the CPU: http://tesla.rcub.bg.ac.rs/~taucet/docs/papers/PAVLE-AnaBalevic09.pdf
Please advise whether the results of the paper are incorrect.

It looks like an interesting approach, but I'll offer one caveat: there is very little information about the baseline CPU implementation, and it is most likely single-threaded and not particularly optimised. It's human nature to want to make one's optimised implementation look as good as possible, so people tend to use a mediocre baseline benchmark in order to report an impressive speed-up ratio. For all we know, a suitably optimised multi-threaded CPU implementation could match the GPGPU performance, in which case the GPGPU implementation would not be so impressive. Before investing a lot of effort in a GPGPU implementation, I would first exhaust all the optimisation possibilities on the CPU (perhaps even using the parallel algorithm described in the paper, and exploiting SIMD, threading, etc.), since a CPU implementation that meets your performance requirements would be far more portable and useful than a solution tied to a particular GPU architecture.

You are right: the Huffman algorithm is sequential, though that is not a bottleneck for high-speed encoding. Please have a look at the following session from GTC 2012. It is a real solution, not just an example.
There you can find benchmarks for Huffman encoding and decoding on both CPU and GPU. Huffman encoding on the GPU is much faster than on the CPU. JPEG decoding on the GPU can be much slower than on the CPU only in the case where the JPEG image has no restart markers.
If you need Huffman coding for something other than JPEG, you should use a two-pass algorithm: collect statistics on the first pass and do the encoding on the second. Both passes can be done in parallel, so it's better to use the GPU instead of the CPU.
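For the first pass, a byte-frequency histogram maps naturally onto the GPU. Below is a minimal sketch of that statistics pass (my own illustration with made-up names, not code from the GTC session): each block accumulates counts in shared memory and then merges them into the global histogram with atomics.
```
#include <cuda_runtime.h>

// Pass 1 of the two-pass scheme: count symbol frequencies in parallel.
// 'hist' holds 256 bins and must be zero-initialised before the launch.
__global__ void byteHistogram(const unsigned char *data, size_t n,
                              unsigned int *hist)
{
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = idx; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge this block's partial counts into the global histogram.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);
}
```
The resulting histogram is only 256 counts, so building the Huffman tree and code table from it on the CPU is cheap; it is the per-symbol work that benefits from the GPU.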
There are a lot of papers saying that the GPU is not suitable for Huffman coding. That just means there have been a lot of attempts to solve the problem. The idea behind the solution is quite simple: do Huffman encoding on small chunks of data and process those chunks in parallel.
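To make the chunked idea concrete, here is a rough sketch (again my own illustration, not code from the paper): one kernel measures the encoded bit length of each chunk, an exclusive prefix sum over those lengths gives every chunk its starting bit offset in the output, and a second kernel writes the codewords into a zero-initialised output buffer, using atomicOr so that neighbouring chunks can safely share the word at their boundary. It assumes codewords of at most 32 bits.
```
#include <cuda_runtime.h>

// Phase A: each thread sums the code lengths of the bytes in its chunk.
// The resulting 'bits' array is exclusive-scanned (e.g. with Thrust) to
// produce per-chunk bit offsets before Phase B runs.
__global__ void chunkBitLengths(const unsigned char *in, size_t n, int chunkSize,
                                const unsigned char *codeLen,   // 256 entries
                                unsigned long long *bits, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    size_t begin = (size_t)c * chunkSize;
    size_t end   = begin + chunkSize;
    if (end > n) end = n;
    unsigned long long total = 0;
    for (size_t i = begin; i < end; ++i)
        total += codeLen[in[i]];
    bits[c] = total;
}

// Phase B: each thread emits its chunk's codewords at its bit offset.
// 'out' must be zero-initialised.
__global__ void chunkEncode(const unsigned char *in, size_t n, int chunkSize,
                            const unsigned int *code,           // 256 codewords
                            const unsigned char *codeLen,
                            const unsigned long long *bitOffset,
                            unsigned int *out, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    size_t begin = (size_t)c * chunkSize;
    size_t end   = begin + chunkSize;
    if (end > n) end = n;
    unsigned long long pos = bitOffset[c];
    for (size_t i = begin; i < end; ++i) {
        unsigned int cw  = code[in[i]];
        int          len = codeLen[in[i]];
        for (int b = 0; b < len; ++b) {                 // MSB-first output
            if (cw & (1u << (len - 1 - b))) {
                unsigned long long p = pos + b;
                atomicOr(&out[p >> 5], 1u << (31 - (int)(p & 31)));
            }
        }
        pos += len;
    }
}
```
Decoding needs the chunk offsets stored alongside the bit stream, which is essentially the same trade-off as the restart markers mentioned above for JPEG.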

Related

When does it make sense to use a GPU?

I have code that does a lot of operations on objects which can be represented as arrays.
When does it make sense to use GPGPU environments (like CUDA) in an application? Can I predict performance gains before writing real code?
Whether it is worthwhile depends on a number of factors. Element-wise independent operations on large arrays/matrices are a good candidate.
For your particular problem (machine learning/fuzzy logic), I would recommend reading some related documents, such as Large Scale Machine Learning using NVIDIA CUDA and Fuzzy Logic-Based Image Processing Using Graphics Processor Units, to get a feeling for the speedups achieved by other people.
As already mentioned, you should specify your problem. However, if large parts of your code involve operations on your objects that are independent, in the sense that object n does not have to wait for the results of the operations on objects 0 to n-1, GPUs may enhance performance.
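As a minimal illustration of that kind of independence (a generic sketch with made-up names, not code for your specific objects): each output element below depends only on the inputs at the same index, so every thread can proceed without waiting for any other.
```
#include <cuda_runtime.h>

// Element-wise, independent update: out[i] depends only on x[i] and y[i],
// so all n elements can be processed concurrently, one thread per element.
__global__ void saxpy(float a, const float *x, const float *y,
                      float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];
}

// Typical launch, assuming d_x, d_y, d_out are device pointers:
// saxpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, d_out, n);
```
If, by contrast, element n needs the result for element n-1, the loop is inherently sequential and the GPU is unlikely to help.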
You could go to the CUDA Zone to get a general idea of what CUDA can do, and can do better than a CPU:
https://developer.nvidia.com/category/zone/cuda-zone
CUDA already provides a lot of performance libraries, tools, and an ecosystem to reduce the development difficulty. It can also help you understand what kinds of operations CUDA is good at:
https://developer.nvidia.com/cuda-tools-ecosystem
Furthermore, NVIDIA provides benchmark reports on some of the most common and representative operations, so you can check whether your code would benefit:
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/CUDA_5.0_Math_Libraries_Performance.pdf

ArrayFire versus raw CUDA programming?

I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.
I tried rewriting my program with ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, < 100% speedup), and the returned results are not quite right (< 1% error compared to CPU routine, assuming the CPU routine's results are correct).
My task is mainly element-wise float-32 maths operations on large matrices (300MB-500MB size), with little if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory since there is a lot of data-reading, etc. The GPU I tested is a GeForce 580GTX with 3GB of video memory.
Is there still significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems there are memory-access tricks for faster data access and for reducing bank conflicts. Does ArrayFire use these general tricks automatically or not?
Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.
First and foremost, code is really required here for anyone to help with specificity. Can you share the code you wrote?
Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that gives you the ability to write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, painstakingly hand-optimized CUDA code. ArrayFire (and some other GPU libraries, like CUBLAS) have many man-years of optimizations poured into them, and are typically going to give better results than most people will have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or any other library). There are variables that can and should be tweaked when calling ArrayFire library functions to get the best performance. If you post your code, we can help share some of those here.
Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.
Fourth, yes, ArrayFire uses all the optimizations that are described in the NVIDIA CUDA Programming Guide (e.g. faster data transfer and reducing memory bank conflicts, as you mention). That's where the bulk of ArrayFire development is focused: optimizing those sorts of things.
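To give a feel for what those memory-access tricks are about (plain CUDA for illustration, not ArrayFire internals; names are made up), the two kernels below do the same arithmetic and differ only in access pattern. The coalesced version is typically several times faster because adjacent threads touch adjacent addresses.
```
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive floats, so each warp's
// loads are served by a few wide memory transactions.
__global__ void scaleCoalesced(const float *in, float *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// Strided: consecutive threads read elements 'stride' apart, scattering
// each warp's loads across many transactions (writes stay coalesced).
__global__ void scaleStrided(const float *in, float *out, int n, int stride, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[(i * (long long)stride) % n];
}
```
Getting from the second pattern to the first is exactly the kind of restructuring that a library can do for you.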
Finally, the data discrepancies you noticed are likely due to the nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU; rather, both are working with finite precision, just in slightly different ways. If you're using single precision instead of double, consider switching. Posting code will let us help with that too.
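As a tiny, self-contained illustration of why two correct implementations can disagree (host-only code, nothing ArrayFire-specific): summing the same single-precision values in a different order already changes the result, and a parallel reduction on the GPU effectively changes that order.
```
#include <cstdio>

int main()
{
    const int n = 1000000;
    float forward = 0.0f, backward = 0.0f;

    // Same terms, different summation order.
    for (int i = 1; i <= n; ++i)  forward  += 1.0f / i;
    for (int i = n; i >= 1; --i)  backward += 1.0f / i;

    // The results differ purely because of floating-point rounding,
    // not because either loop is "wrong".
    printf("forward  = %.8f\nbackward = %.8f\n", forward, backward);
    return 0;
}
```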
Happy to expand my answer once code is posted.

GPGPU Applications other than Image processing?

I am in search of a few CPU applications which can be ported to GPGPU for better efficiency.
Where else can GPGPU be used, other than in the image processing area?
This is actually for my graduate project.
The specialized processing architectures of GPU compute engines are useful for just about any data crunching problem where you have:
a non-trivial amount of data,
a non-trivial computation to perform on every element of that data, and
input data for each output element that fits in GPU memory, or can be choreographed to arrive in GPU memory when it is needed.
It helps if the computation can be performed independently on all data elements at the same time, but this is not strictly required.
Image processing happens to be one example of that scenario - a finite (but large) number of pixels to process, and many image algorithms can be executed on each pixel in parallel.
Other examples include:
generalized signal analysis, such as processing audio signals (image processing is just a specialized form of signal analysis);
pattern recognition, where much of the challenge is to separate the signal from the noise (voice recognition, anyone?);
three-dimensional surface matching, such as figuring out the shapes of organic compounds based on the flex angles of their chemical bonds, or figuring out whether two organic compounds are likely to interact in interesting ways (e.g. bioreceptors);
physical modelling of all kinds (collision simulations, seismic analysis, etc.);
and, of course, cryptography, where you can always spend more compute time going over the same data again and again.
GPU compute engines are not well suited for problems where the volume of data significantly dwarfs the computations to be performed. GPUs work well on stuff in memory. Moving data into or out of GPU memory is often the most expensive step of an entire computation, so you want to make sure you have enough computation going on to "make up" for the cost of loading the data into memory. If the data is too big to fit into memory you have to adopt distributed computing tactics.
For example, calculating a primary-key index of a petabyte database probably isn't a great fit for a GPU, since most of the effort will be spent just getting the data off the hard disk into memory. The index computation itself is fairly trivial, which doesn't make for a very interesting GPU win, and while I'm sure the data could be carved up into chunks and the chunks indexed independently by a boatload of GPU cores, variability in the data will likely prevent the GPU from operating at its full capacity. (GPU code works best when all the "oarsmen" (processor cores / threads) are pulling in the same direction: uniform execution on separate data.) While database indexing might see some benefit from a GPU approach, it certainly won't be as big a performance improvement over a CPU baseline as something better suited to the GPU execution model, like signal processing.
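A quick way to see how expensive that data movement is in practice is to time the host-to-device copy by itself. A minimal sketch using CUDA events (the buffer size is arbitrary):
```
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u << 20;            // 256 MB test buffer
    float *h = (float *)malloc(bytes);
    memset(h, 0, bytes);
    float *d = nullptr;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // PCIe transfer
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device: %.1f ms  (%.2f GB/s)\n",
           ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    free(h);
    return 0;
}
```
Unless the kernels launched afterwards do substantially more work than this copy, the transfer dominates the end-to-end time, and the CPU version wins by simply not having to make the trip.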
Brute force crypto attacks? MD5 has been done by Whitepixel and SHA-256 has been done by all the Bitcoin miners. On the other hand, I am not aware of any GPU implementations of bcrypt() or scrypt(), but an academic working in the area is probably a better person to ask.
The easiest way to figure out what kinds of applications are well suited for GPGPU is to look at the speedups other groups have achieved. Here are a couple of links with that information:
NVIDIA's case studies
AccelerEyes' case studies
It looks like military/aerospace, life science, energy, finance, manufacturing, media, and a few other smaller industries have examples of strong speedups.

Is GPGPU a hack?

I started working on GPGPU a few days ago and successfully implemented Cholesky factorization with good performance. Then I attended a conference on High Performance Computing where some people said that "GPGPU is a hack".
I am still confused about what this means and why they called it a hack. One person said it is a hack because you are converting your problem into a matrix and doing operations on it. But I am still confused: do people really think it is a hack, and if so, why?
Can anyone help me understand why they called it a hack, when I found nothing wrong with it?
One possible reason for such an opinion is that the GPU was not originally intended for general-purpose computation. Also, programming a GPU is less traditional and more hardcore, and therefore more likely to be perceived as a hack.
The point that "you convert the problem into a matrix" is not reasonable at all. Whatever task you solve by writing code, you choose reasonable data structures. In the case of the GPU, matrices are likely the most reasonable data structures, and using them is not a hack but just a natural choice.
However, I suppose it's a matter of time before GPGPU becomes widespread. People just have to get used to the idea. After all, who cares which unit of the computer runs the program?
On the GPU, efficient memory access is paramount to achieving optimal performance. This often involves restructuring or even choosing entirely new algorithms and data structures. This is one reason why GPU programming can be perceived as a hack.
Secondly, adapting an existing algorithm to run on the GPU is not in and of itself science. The relatively low scientific contribution of some GPU algorithm-related papers has led to a negative perception of GPU programming as strictly "engineering".
Obviously, only the person who said that can say for certain why he said it, but here's my take:
A "Hack" is not a bad thing.
It forces people to learn new programming languages and concepts. For people who are just trying to model the weather or protein folding or drug reactions, this is an unwelcome annoyance. They didn't really want to learn FORTRAN (or whatever) in the first place, and now they have to learn another programming system.
The programming tools are NOT very mature yet.
The hardware isn't as reliable as CPUs (yet), so all of the calculations have to be done twice to make sure you've got the right answer. One reason for this is that GPUs don't come with error-correcting memory yet, so if you're trying to build a supercomputer with thousands of processors, the probability of a cosmic ray flipping a bit in your numbers approaches certainty.
As for the comment "you are converting your problem into a matrix and doing operations on it", I think that shows a lot of ignorance. Virtually ALL of high-performance computing fits that description!
One of the major problems with GPGPU for the past few years, and probably for the next few, is that programming GPUs for arbitrary tasks has not been very easy. Up until DX10 there was no integer support on GPUs, and branching is still very poor. This is very much a situation where, in order to get maximum benefit, you have to write your code in a very awkward manner to extract all sorts of efficiency gains from the GPU. This is because you're running on hardware that is still dedicated to processing polygons and textures, rather than abstract parallel tasks.
Obviously, that's my take on it, and YMMV.
GPGPU harks back to the days of the math co-processor. A hack is a shortcut to solving a long-winded problem. GPGPU is a hack just like NAT on top of IPv4 is a hack. Computational problems, just like networks, are getting bigger as we try to do more, and GPGPU is a useful interim solution. Whether it stays outside the core CPU chip with a separate, cranky API, or gets absorbed into the CPU via API or manufacturing, is up to the pathfinders.
I suppose he meant that using GPGPU forced you to restructure your implementation so that it fitted the hardware rather than the problem domain. An elegant implementation should fit the latter.
Note that the word "hack" may have several different meanings:
http://www.urbandictionary.com/define.php?term=hack

Compression library using Nvidia's CUDA [closed]

Does anyone know of a project which implements standard compression methods (like Zip, GZip, BZip2, LZMA, ...) using NVIDIA's CUDA library?
I was wondering whether algorithms that can make use of a lot of parallel tasks (like compression) wouldn't run much faster on a graphics card than on a dual- or quad-core CPU.
What do you think about the pros and cons of such an approach?
We have finished the first phase of research into increasing the performance of lossless data compression algorithms.
Bzip2 was chosen for the prototype; our team optimized only one operation, the Burrows–Wheeler transform, and got some results: a 2x-4x speed-up on well-compressible files. The code works faster on all our tests.
We are going to complete bzip2 and then support Deflate and LZMA for some real-life tasks such as HTTP traffic and backup compression.
Blog link:
http://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx
Not aware of anyone having done that and made it public. Just IMHO, it doesn't sound very promising.
As Martinus points out, some compression algorithms are highly serial. Block compression algorithms like LZW can be parallelized by coding each block independently. Zipping a large tree of files can be parallelized at the file level.
However, none of these is really SIMD-style parallelism (Single Instruction Multiple Data), and they're not massively parallel.
GPUs are basically vector processors, where you can be doing hundreds or thousands of ADD instructions all in lock step, and executing programs where there are very few data-dependent branches.
Compression algorithms in general sound more like an SPMD (Single Program Multiple Data) or MIMD (Multiple Instruction Multiple Data) programming model, which is better suited to multicore cpus.
Video compression algorithms can be accelerated by GPGPU processing like CUDA only to the extent that there is a very large number of pixel blocks being cosine-transformed or convolved (for motion detection) in parallel, and the IDCT or convolution subroutines can be expressed with branchless code.
GPUs also like algorithms with high numeric intensity (the ratio of math operations to memory accesses). Algorithms with low numeric intensity (like adding two vectors) can be massively parallel and SIMD, yet still run slower on the GPU than the CPU because they're memory bound.
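For a concrete sense of how low that intensity can be, consider plain vector addition (a generic sketch): each element costs one floating-point add but about 12 bytes of memory traffic, so memory bandwidth, not arithmetic, sets the speed limit. The bandwidth figure in the comment is illustrative, not a measurement.
```
#include <cuda_runtime.h>

// Vector add: 1 FLOP per element, but 12 bytes of traffic
// (read a[i], read b[i], write c[i]) -> roughly 0.08 FLOP/byte.
// On a GPU with, say, 150 GB/s of memory bandwidth that caps the
// kernel at around 12 GFLOP/s, far below its arithmetic peak,
// no matter how many cores sit idle waiting for data.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```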
Typically, compression algorithms cannot make use of parallel tasks; it is not easy to make these algorithms highly parallelizable. In your examples, TAR is not a compression algorithm, and the only algorithm that might be highly parallelizable is BZIP, because it is a block compression algorithm. Each block can be compressed separately, but this would require lots and lots of memory. LZMA does not work in parallel either; when you see 7-Zip using multiple threads, this is because 7-Zip splits the data stream into two different streams that are each compressed with LZMA in a separate thread, so the compression algorithm itself is not parallel. This splitting only works when the data permits it.
Encryption algorithms have been quite successful in this area, so you might want to look into that. Here is a paper related to CUDA and AES encryption: http://www.manavski.com/downloads/PID505889.pdf
We're making an attempt at porting bzip2 to CUDA. :) So far (and with only rough tests done), our Burrows-Wheeler Transform is 30% faster than the serial algorithm. http://bzip2.github.com
30% is nice, but for applications like backups it's not enough by a long shot.
My experience is that the average data stream in such instances gets 1.2-1.7:1 compression using gzip and ends up limited to an output rate of 30-60 MB/s (this is across a wide range of modern, circa 2010-2012, medium- to high-end CPUs).
The limitation here is usually the speed at which data can be fed into the CPU itself.
Unfortunately, in order to keep an LTO5 tape drive happy, it needs a raw (uncompressible) data rate of around 160 MB/s. If fed compressible data, it requires even faster data rates.
LTO compression is clearly a lot faster, but somewhat inefficient (roughly equivalent to gzip -1, which is good enough for most purposes). LTO4 drives and up usually have built-in AES-256 encryption engines which can also sustain these kinds of speeds.
What this means for my case is that I'd need a 400% or better improvement in order to consider it worthwhile.
Similar considerations apply across LANs. At 30 MB/s, compression is a hindrance on gigabit-class networks, and the question becomes whether to spend more on networking or on compression... :)