Tools to help parallelize H.264? - h.264

I am working with H.264 Decoder using Jm reference software. I am looking for some parallelization tools for parallelizing the reference code of H.264 decoder for multiprocessor mapping.Plz suggest as I am relatively new to this area.

There is no naive way to solve this -- much less a general "automated conversion" approach.
Only a detailed understanding of how H.264 works and careful application of correct parallelization techniques following a correctly parallized algorithm will yield useful results.
H.264, like most Video Formats, relies on temporal data frames and effectively only computes "a running delta", which makes this problem very complex. This is just one of the techniques used to achieve such good compression but the complexity of the format does not stop there: most of the data is related in some fashion! (The more dependenent the data is, the less ideal it is suited for parallel processing.)
I would suggest looking for a (non-reference Open Source) implementation that uses threads, if such an implementation exists. Perhaps look at the codec used by VLC? (In the end I suspect more benefit comes from offloading to special hardware-assist modules such as those bundled with modern ATI or NVidia GPUs.)
If you are really interested in pursuing this, see...
EFFICIENT PARALLELIZATION OF H.264 DECODING WITH MACRO BLOCK LEVEL SCHEDULING
Parallel Scalability of H.264
A Highly Scalable Parallel Implementation of H.264
...and the million other white papers out there (search for "parallel decode h.264").

Related

comparing h.264 encoding decoding performance

I am beginner of video codec. not an video codec expert
I just want to know base on the same criteria, Comparing H254 encoding/decoding which is more efficiency.
Thanks
Decoding is more efficient. To be useful, decoding must run in real time, where encoding does not (except in videophone / conferencing applications).
How much more efficient? An encoder can generate motion vectors. The more compute power used on generating those motion vectors, the more accurate they are. And, the more accurate they are, the more bandwidth is available for the difference frames, so the quality goes up.
So, the kind of encoding used to generate video for streaming or distribution on DVD or BD discs can run many times slower than real time on server farms. But decoding for that kind of program is useless unless it runs in real time.
Even in the case of real-time encoding it takes more power (actual milliwatts, compute cycles, etc) than decoding.
It's true of H.264, H.265, VP8, VP9, and other video codecs.

Normal Cuda Vs CuBLAS?

Just of curiosity. CuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal Cuda code easily, without using CuBLAS. So what is the major difference between the CuBLAS library and your own Cuda program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There's several reasons you'd chose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library then you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevent to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.

I read that Huffman coding does not work on GPU but this paper claims otherwise

I have read in several places that building a huffman encoder in GPU is not very efficient because the algorithm is sequential. But this paper offers a possible implementation and claims it to be faster than CPU http://tesla.rcub.bg.ac.rs/~taucet/docs/papers/PAVLE-AnaBalevic09.pdf .
Please advice if the results of the paper are incorrect
It looks like an interesting approach but I'll just offer one caveat: there is very little information about the baseline CPU implementation, but it is most likely single threaded and may not be particularly optimised. It's human nature for people to want to make their optimised implementation look as good as possible, so they tend to use a mediocre baseline benchmark in order to give an impressive speed up ratio. For all we know it may be that a suitably optimised multi-threaded implementation on the CPU could match the GPGPU performance, in which case the GPGPU implementation would not be so impressive. Before investing a lot of effort in a GPGPU implementation I would want to first exhaust all the optimisation possibilities on the CPU (perhaps even using the parallel algorithm as described in the paper, maybe exploit SIMD, threading, etc), since a CPU implementation that meets your performance requirements would be a lot more portable and useful than a solution tied to a particular GPU architecture.
You are right - Huffman algorithm is sequential, though it's not a bottleneck for high speed encoding. Please have a look at the following session on GTC 2012. This is real solution, not just an example.
You can find there some benchmarks for CPU and GPU concerning Huffman encoding and decoding. Huffman encoding on GPU is much faster than on CPU. JPEG decoding on GPU could be much slower in comparison with CPU only in the case when there are no restart markers in JPEG image.
If you need Huffman not for JPEG, then one should use two-pass algorithm. At first pass one can collect statistics and do encoding on the second pass. Both passes could be done in parallel, so it's better to use GPU instead of CPU.
There are a lot of papers saying that GPU is not suitable for Huffman. It just means that there were a lot of attempts to solve the problem. The idea for solution is quite simple: do Huffman encoding for small chunks of data and try to process these chunks in parallel.

Does FFMPEG utilize CUDA or any other hardware acceleration yet?

Simple question, but I am having trouble finding the answer.
We are deciding on a transcoding engine (preferrably open source) and it looks to me that FFMPEG does not utilize hardware acceleration, but I am not sure.
I believe ffmpeg uses libavcodec, the same library used in countless other products, such as Handbrake. I can't believe they don't support hardware acceleration, therefore, my question.
libavcodec has API which allows clients to implement hardware decoding. I don't think Handbrake supports it.
This does not use CUDA kernels or any other kind of SIMD language, all of which are useless for the task. It uses dedicated decoder hardware packaged with the GPU (or newer CPU). CUDA happens to provide an API to access this, which is what "CUDA support" means.
As far as I know ffmpeg does not utilize CUDA, if you are curious about something that does - CoreAVC Video Decoder has such an option in their H.264 decoder.
I use Loiloscope. It has featured CUDA accelerated transcodes since it's first release.
If you call avcodec_find_decoder() to get decoder, FFmpeg will not utilize hardware acceleration to decode. Instead, calling avcodec_find_decoder_by_name() with specific hardware decoder will gain the GPU utilization. For example:
AVCodec *avcodec_h264dec = avcodec_find_decoder_by_name("h264_cuvid");

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.
It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.
Any insight into this problem are appreciated. Thanks.
There is a GPU Hash Table implementation presented in "CUDA by example", from Jason Sanders and Edward Kandrot.
Fortunately, you can get information on this book and download the examples source code freely on this page:
http://developer.nvidia.com/object/cuda-by-example.html
In this implementation, the table is pre-allocated on CPU and safe multithreaded access is ensured by a lock function based upon the atomic function atomicCAS (Compare And Swap).
Moreover, newer hardware generation (from 2.0) combined with CUDA >= 4.0 are supposed to be able to use directly new/delete operators on the GPU ( http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html?utm_source=http://forums.nvidia.com&utm_medium=http://forums.nvidia.com&utm_term=Developers&utm_content=Developers&utm_campaign=CUDA4 ), which could serve your implementation. I haven't tested these features yet.
cuCollections is a relatively new open-source library started by NVIDIA engineers aiming at implementing efficient containers on the GPU.
cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.
Similar to how Thrust and CUB provide STL-like, GPU accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.
cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.
At the moment it provides a fixed size hashtable cuco::static_map and one that can grow cuco::dynamic_map.
I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.
AFAIK, the hash table given in "Cuda by Example" does not perform too well.
Currently, I believe, the fastest hash table on CUDA is given in Dan Alcantara's PhD dissertation. Look at chapter 6.
BTW, warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA-accelerators. Hashing at the speed of light on modern CUDA-accelerators. You can find it here:
https://github.com/sleeepyjack/warpcore