Is there something like Hadoop, but based on GPU? - cuda

Is there something like Hadoop, but based on GPU? I would like to do some research on distributed computing. Thank you for your help!
Yik,

Mars: A framework for GPU MapReduce comes to mind. Is that what you're thinking? There are other examples using GPUs by a traditional Hadoop system, and running Hadoop on a GPU entirely.

Related

Does JETSON NANO support RAPIDS?

Is it possible to run data science tools like RAPIDS on a JETSON NANO? After some searching, I am still not very clear... also, if it does, will data analysis run faster on it than on a CPU? Any insights will be appreciated. Thanks.
No, sadly, the Jetson Nano doesn't support RAPIDS. The Nano's GPU uses an older architecture GPU, Maxwell, which RAPIDS does not support (Pascal or better). The TX2, Jetson NX, and the Xaviers, which have a compatible GPU architecture, are the only ones that community members have seen success with. I know that there or some community members attempting a port of one or two of the RAPIDS libraries to the Nano, but that is a personal effort by them and YMMV.

classification model with low inference time

Could anyone suggest a classification model with inference time may be less than a second?
I have trained mobile net and squeeze net but both take around 2 secs for a single image inference.
Any suggestions would be really helpful.Thanks in advance.
The speed depends on the hardware specifications. To run this model even faster, you can try installing 'OpenVino' (works only for intel hardware), which helps to run the model 3x faster. Refer the below comparison from Intel.
Performance improvement using Openvino software
You have free version of 'Openvino' for CPU. This software works even for GPU inbuilt devices.
Check more details in the below link:
https://software.intel.com/en-us/openvino-toolkit

Resume Training Caffe using Different GPU?

Please forgive what may be an insane question.
My setup: two machines, one with a GTX1080, one with an RX Vega 64. Retraining/fine tuning a bvlc_googlenet model on the GTX1080.
If I build Caffe for the Vega 64, then can I take a snapshot from the GTX1080 machine and restart training on the Vega 64? Would this work in the sense that the training would continue in a normal manner?
What if I moved the GTX1080 snapshot to a Volta V100 in AWS? Would this work?
I know Caffe will to some degree abstract the hardware, but I don't know how well it can do that. I need the GTX 1080 for something else...
Thanks in advance!
To my knowledge, this should work without a problem. Weight files and training snapshots are just blobs of numbers, that you should be able to resume on other hardware (e.g. CPU/GPU), a different machine with a different operating system, or between 32 and 64 bit processes.

Using High Level Shader Language for computational algorithms

So, I heard that some people have figured out ways to run programs on the GPU using High Level Shader Language and I would like to start writing my own programs that run on the GPU rather than my CPU, but I have been unable to find anything on the subject.
Does anyone have any experience with writing programs for the GPU or know of any documentation on the subject?
Thanks.
For computation, CUDA and OpenCL are more suitable than shader languages. For CUDA, I highly recommend the book CUDA by Example. The book is aimed at absolute beginners to this area of programming.
The best way I think to start is to
Have a CUDA Card from Nvidia
Download Driver + Toolkit + SDK
Build the examples
Read the Cuda Programming Guide
Start to recreate the cudaDeviceInfo example
Try to allocate memory in the gpu
Try to create a little kernel
From there you should be able to gain enough momentum to learn the rest.
Once you learn CUDA then OpenCL and other are a breeze.
I am suggesting CUDA because is the one most widely supported and tested.

best way of using cuda

There are ways of using cuda:
auto-paralleing tools such as PGI workstation;
wrapper such as Thrust(in STL style)
NVidia GPUSDK(runtime/driver API)
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book using the GPU SDK due to the control that it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and shown on several problems to have 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators they have are useful for a wide class of problems too.
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I've personally, as well as many others, have run into some problems and bugs in them, that are hard to fix (There also is no documentation for this "library" and I've heard that NVIDIA does not support it at all)
Instead, as you mentioned, use the one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher level API, and so you don't have to worry quite as much about all of the low level implementation details as you do in the Device API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would highly benefit from the things thrust can offer, like reduction, prefix, sum, then thrust is definitely worth a try and I bet you can't write the code faster yourself in pure CUDA C.
However if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I had already successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extend, but as far as I know you're launching new kernels for each thrust call, if you want to have all in one big fat kernel (taking too frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find the pure CUDA C actually easier to learn, as it gives you quite a good understanding on what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I never used auto-paralleing tools such as PGI workstation, but I wouldn't advise to add even more "magic" into the equation.