Does FFmpeg utilize CUDA or any other hardware acceleration yet?

Simple question, but I am having trouble finding the answer.
We are deciding on a transcoding engine (preferably open source), and it looks to me as though FFmpeg does not utilize hardware acceleration, but I am not sure.
I believe FFmpeg uses libavcodec, the same library used in countless other products such as HandBrake. I find it hard to believe that they don't support hardware acceleration; hence my question.

libavcodec has an API that allows clients to implement hardware decoding. I don't think HandBrake supports it.
This does not use CUDA kernels or any other kind of SIMD language, none of which are suited to the task. It uses the dedicated decoder hardware packaged with the GPU (or with newer CPUs). CUDA happens to provide an API to access this hardware, which is what "CUDA support" means here.

As far as I know, FFmpeg does not utilize CUDA. If you are curious about something that does, the CoreAVC Video Decoder has such an option in its H.264 decoder.

I use Loiloscope. It has featured CUDA-accelerated transcodes since its first release.

If you call avcodec_find_decoder() to get a decoder, FFmpeg will not use hardware acceleration for decoding. Instead, call avcodec_find_decoder_by_name() with the name of a specific hardware decoder to get GPU utilization. For example:
AVCodec *avcodec_h264dec = avcodec_find_decoder_by_name("h264_cuvid");
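A slightly fuller sketch of how that decoder might be set up (a sketch only, with error handling trimmed; the helper name is mine, and it assumes an FFmpeg build compiled with NVDEC/CUVID support and that the stream's codec parameters are copied into the context elsewhere):

#include <libavcodec/avcodec.h>

/* Sketch: prefer NVIDIA's NVDEC-backed H.264 decoder, fall back to software. */
static AVCodecContext *open_h264_decoder(void)
{
    const AVCodec *dec = avcodec_find_decoder_by_name("h264_cuvid");
    if (!dec)
        dec = avcodec_find_decoder(AV_CODEC_ID_H264);   /* software fallback */

    AVCodecContext *ctx = avcodec_alloc_context3(dec);
    if (!ctx)
        return NULL;

    /* ... copy the stream's codec parameters into ctx here ... */

    if (avcodec_open2(ctx, dec, NULL) < 0) {
        avcodec_free_context(&ctx);
        return NULL;
    }
    return ctx;
}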

Related

Fallback support for NVIDIA libraries

I'm planning to use the GPU for an application with intensive matrix manipulation, and I want to use NVIDIA's CUDA support. My only doubt is: is there any fallback support? That is, if I use these libraries, do I have the possibility of running the application in a non-CUDA environment (without GPU support, of course)? I'd like to be able to debug the application without being constrained to that environment. I couldn't find this information; any tips?
There is no fallback support built into the libraries (e.g. CUBLAS, CUSPARSE, CUFFT). You would need your code to check for an existing CUDA environment and, if it finds none, take your own alternate code path, perhaps using other libraries. For example, CUBLAS functions can mostly be duplicated by other BLAS libraries (e.g. MKL), and CUFFT functions can largely be replaced by other FFT libraries (e.g. FFTW).
How to detect a CUDA environment is covered in other SO questions. In a nutshell, if your application bundles (e.g. statically links) the CUDART library, then you can run a procedure similar to the one in the deviceQuery sample code to determine which GPUs (if any) are available.
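A minimal sketch of such a check, assuming the application links CUDART so the call itself is always available (the helper name is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

/* Returns 1 if at least one CUDA device is usable, 0 otherwise.
   cudaGetDeviceCount reports an error (e.g. cudaErrorNoDevice or
   cudaErrorInsufficientDriver) when no usable GPU or driver is present. */
static int has_cuda_device(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    return (err == cudaSuccess) && (count > 0);
}

int main(void)
{
    if (has_cuda_device())
        printf("CUDA device found: take the CUBLAS/CUFFT path\n");
    else
        printf("No CUDA device: take the MKL/FFTW fallback path\n");
    return 0;
}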

Aren't NPP functions completely optimized?

I developed a naive function for mirroring an image horizontally or vertically using CUDA C++.
Then I came to know that the NVIDIA Performance Primitives (NPP) library also offers a function for image mirroring.
Just for the sake of comparison, I timed my function against NPP. Surprisingly, my function outperformed it (only by a small margin, but still...).
I confirmed the results several times using the Windows timer as well as the CUDA timer.
My question is: aren't NPP functions completely optimized for NVIDIA GPUs?
I'm using CUDA 5.0, GeForce GTX460M (Compute 2.1), and Windows 8 for development.
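For reference, a naive horizontal mirror along the lines described might look like the following (a sketch only; the kernel name, 8-bit single-channel layout, and launch configuration are illustrative, not the asker's actual code):

/* Mirror each row of an 8-bit, single-channel image around its vertical axis. */
__global__ void mirrorHorizontal(const unsigned char *src, unsigned char *dst,
                                 int width, int height, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        dst[y * pitch + x] = src[y * pitch + (width - 1 - x)];
}

/* Typical launch: one thread per pixel.
   dim3 block(16, 16);
   dim3 grid((width + 15) / 16, (height + 15) / 16);
   mirrorHorizontal<<<grid, block>>>(d_src, d_dst, width, height, pitch); */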
I risk getting no votes by posting this answer. :)
NVIDIA continuously works to improve all of our CUDA libraries. NPP is a particularly large library, with 4000+ functions to maintain. We have a realistic goal of providing libraries with a useful speedup over a CPU equivalent, that are tested on all of our GPUs and supported OSes, and that are actively improved and maintained. The function in question (Mirror) is a known performance issue that we will improve in a future release. If you need a particular function optimized, your best way to get it prioritized is to file an RFE (Request for Enhancement) bug using the bug submission form available to registered NVIDIA CUDA developers.
As an aside, I don't think any library can ever be "fully optimized". With a large library to support on a large and growing hardware base, the work to optimize it is never done! :)
We encourage folks to continue to try and outdo NVIDIA libraries, because overall it advances the state of the art and benefits the computing ecosystem.

Tools to help parallelize H.264?

I am working with an H.264 decoder using the JM reference software. I am looking for parallelization tools to help parallelize the reference H.264 decoder code for multiprocessor mapping. Please make suggestions, as I am relatively new to this area.
There is no naive way to solve this -- much less a general "automated conversion" approach.
Only a detailed understanding of how H.264 works and careful application of correct parallelization techniques, following a correctly parallelized algorithm, will yield useful results.
H.264, like most video formats, relies on temporal data frames and effectively only computes "a running delta", which makes this problem very complex. This is just one of the techniques used to achieve such good compression, but the complexity of the format does not stop there: most of the data is related in some fashion! (The more dependent the data is, the less well suited it is to parallel processing.)
I would suggest looking for a (non-reference Open Source) implementation that uses threads, if such an implementation exists. Perhaps look at the codec used by VLC? (In the end I suspect more benefit comes from offloading to special hardware-assist modules such as those bundled with modern ATI or NVidia GPUs.)
If you are really interested in pursuing this, see...
Efficient Parallelization of H.264 Decoding with Macro Block Level Scheduling
Parallel Scalability of H.264
A Highly Scalable Parallel Implementation of H.264
...and the million other white papers out there (search for "parallel decode h.264").

Using High Level Shader Language for computational algorithms

So, I heard that some people have figured out ways to run programs on the GPU using High Level Shader Language, and I would like to start writing my own programs that run on the GPU rather than on my CPU, but I have been unable to find anything on the subject.
Does anyone have any experience with writing programs for the GPU or know of any documentation on the subject?
Thanks.
For computation, CUDA and OpenCL are more suitable than shader languages. For CUDA, I highly recommend the book CUDA by Example. The book is aimed at absolute beginners to this area of programming.
The best way, I think, to start is to:
Have a CUDA-capable card from NVIDIA
Download the driver + toolkit + SDK
Build the examples
Read the CUDA Programming Guide
Start by recreating the cudaDeviceInfo example
Try to allocate memory on the GPU
Try to create a little kernel (a sketch of these last two steps follows this answer)
From there you should be able to gain enough momentum to learn the rest.
Once you learn CUDA, OpenCL and the others are a breeze.
I am suggesting CUDA because it is the one most widely supported and tested.
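A minimal sketch covering the last two steps in the list above, allocating device memory and launching a small kernel (names are illustrative and error checking is omitted for brevity):

#include <cuda_runtime.h>
#include <cstdio>

/* Add 1.0 to every element of an array living in GPU memory. */
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float host[n] = {0.0f};

    float *dev = NULL;
    cudaMalloc(&dev, n * sizeof(float));                        /* allocate on the GPU */
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(dev, n);                   /* launch the kernel */

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    std::printf("host[0] = %f\n", host[0]);                     /* expect 1.000000 */
    return 0;
}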

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.
It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.
Any insight into this problem is appreciated. Thanks.
There is a GPU hash table implementation presented in "CUDA by Example" by Jason Sanders and Edward Kandrot.
Fortunately, you can get information on this book and freely download the examples' source code from this page:
http://developer.nvidia.com/object/cuda-by-example.html
In this implementation, the table is pre-allocated on the CPU, and safe multithreaded access is ensured by a lock function based on the atomic function atomicCAS (compare-and-swap).
Moreover, newer hardware generations (compute capability 2.0 and up) combined with CUDA >= 4.0 are supposed to support the new/delete operators directly on the GPU (http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html), which could serve your implementation. I haven't tested these features yet.
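For illustration, the lock mentioned above is the kind of atomicCAS-based spin lock sketched below (not the book's exact code; in practice one thread per warp should acquire it to avoid intra-warp live-lock on older GPUs):

struct Lock {
    int *mutex;   /* 0 = unlocked, 1 = locked; one int per bucket, in device memory */

    __device__ void lock(void)
    {
        while (atomicCAS(mutex, 0, 1) != 0)
            ;                     /* spin until we swap 0 -> 1 */
    }

    __device__ void unlock(void)
    {
        __threadfence();          /* make writes in the critical section visible */
        atomicExch(mutex, 0);     /* release the lock */
    }
};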
cuCollections is a relatively new open-source library started by NVIDIA engineers aiming at implementing efficient containers on the GPU.
cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.
Similar to how Thrust and CUB provide STL-like, GPU accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.
cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.
At the moment it provides a fixed-size hash table, cuco::static_map, and one that can grow, cuco::dynamic_map.
I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.
AFAIK, the hash table given in "CUDA by Example" does not perform too well.
Currently, I believe, the fastest hash table on CUDA is given in Dan Alcantara's PhD dissertation. Look at chapter 6.
BTW, warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA accelerators ("hashing at the speed of light on modern CUDA-accelerators"). You can find it here:
https://github.com/sleeepyjack/warpcore