How is TVM different from MLIR?

As per my understanding, both TVM and MLIR serve as compiler infrastructure for deep learning neural networks. Is my understanding correct?
And which would be better if we are building a compiler for custom hardware that runs deep learning inference?

I found this discussion helpful:
https://discuss.tvm.apache.org/t/google-lasted-work-mlir-primer/1721

Related

Overview for Deep Learning Networks

I am fairly new to deep learning and get quite overwhelmed by the many different nets and their fields of application. So I want to know if there is some kind of overview of which different kinds of networks exist, what their key features are, and what purpose they serve.
For example, I know about LeNet, ConvNet, AlexNet - somehow they are the same but still differ?
There are basically two types of learning for neural networks: supervised and unsupervised. Both need a training set to "learn". Imagine the training set as a massive book from which you can learn specific information. In supervised learning, the book comes with an answer key but without the solution manual; in contrast, unsupervised learning comes without an answer key or a solution manual. But the goal is the same: to find patterns between the questions and answers (supervised learning) or among the questions themselves (unsupervised learning).
Now that we have differentiated between those two, we can go into the models. Let's discuss supervised learning, which basically has three main models:
artificial neural network (ANN)
convolutional neural network (CNN)
recurrent neural network (RNN)
ANN is the simplest of all three. I believe you already understand it, so we can move on to CNN.
Basically, in a CNN all you have to do is convolve your input with feature detectors. Feature detectors are matrices with dimensions (rows, columns, depth), where depth is the number of feature detectors. The goal of convolving the input is to extract information related to spatial structure. Say you want to distinguish between cats and dogs: cats have whiskers but dogs do not, cats have different eyes than dogs, and so on. The downside is that more convolution layers mean slower computation. To mitigate that, we apply a processing step called pooling or downsampling, which reduces the size of the feature maps while minimizing the loss of features or information. The next step is flattening, i.e. squashing all those 3D matrices into an (n, 1) vector so it can be fed into an ANN. The rest is then a normal ANN. Because a CNN is inherently able to detect certain features, it is mostly (maybe always) used for classification, for example image classification, time series classification, or even video classification. For a crash course in CNNs, check out this video by Siraj Raval. He's my favourite YouTuber of all time!
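To make the convolve, pool, flatten pipeline concrete, here is a minimal, hypothetical C++ sketch (one input channel, one 3x3 feature detector, 2x2 max pooling); all sizes and kernel values are made up for illustration, and a real CNN would of course stack many detectors and layers:

    #include <cstdio>
    #include <vector>
    #include <algorithm>

    // Minimal single-channel CNN front end: one 3x3 convolution,
    // 2x2 max pooling, then flattening into a vector for an ANN.
    int main() {
        const int H = 6, W = 6, K = 3;          // input and kernel sizes (made up)
        std::vector<float> input(H * W, 1.0f);   // dummy input image
        float kernel[K][K] = {{ 1, 0, -1 },      // a hypothetical edge-like feature detector
                              { 1, 0, -1 },
                              { 1, 0, -1 }};

        // Convolution (valid padding): output is (H-K+1) x (W-K+1)
        const int CH = H - K + 1, CW = W - K + 1;
        std::vector<float> conv(CH * CW, 0.0f);
        for (int y = 0; y < CH; ++y)
            for (int x = 0; x < CW; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        conv[y * CW + x] += input[(y + ky) * W + (x + kx)] * kernel[ky][kx];

        // 2x2 max pooling: halves each spatial dimension
        const int PH = CH / 2, PW = CW / 2;
        std::vector<float> pooled(PH * PW);
        for (int y = 0; y < PH; ++y)
            for (int x = 0; x < PW; ++x)
                pooled[y * PW + x] = std::max(
                    std::max(conv[(2 * y) * CW + 2 * x],     conv[(2 * y) * CW + 2 * x + 1]),
                    std::max(conv[(2 * y + 1) * CW + 2 * x], conv[(2 * y + 1) * CW + 2 * x + 1]));

        // Flattening: the pooled feature map is already a contiguous (n, 1) vector,
        // ready to be fed into a fully connected ANN.
        printf("flattened length: %zu\n", pooled.size());
        return 0;
    }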
Arguably the most sophisticated of the three, an RNN is best described as a neural network that has "memory": it introduces "loops" within itself that allow information to persist. Why is this important? As you are reading this, your brain uses previous memory to comprehend all of this information. You don't rethink everything from scratch, which is what traditional neural networks do: they forget everything and re-learn again. Plain RNNs aren't very effective, so when people talk about RNNs they mostly refer to the LSTM, which stands for Long Short-Term Memory. If that seems confusing, Christopher Olah gives an in-depth explanation in a very simple way. I advise you to check out his post for a complete understanding of how RNNs, and especially the LSTM variant, work.
As for unsupervised learning, I'm sorry, I haven't had the time to learn it yet, so this is the best I can do. Good luck and have fun!
They are the same type of network: convolutional neural networks. The problem with an overview is that as soon as you post something, it is already outdated. Most of the networks you mention are already old, even though they are only a few years old.
Nevertheless you can take a look at the networks supplied by caffe (https://github.com/BVLC/caffe/tree/master/models).
In my personal view, the most important concepts in deep learning are recurrent networks (https://keras.io/layers/recurrent/), residual connections, and inception blocks (see https://arxiv.org/abs/1602.07261). The rest are largely theoretical concepts, which would not fit in a Stack Overflow answer.

A simple Convolutional neural network code

I am interested in convolutional neural networks (CNNs) as an example of a computationally intensive application that is suitable for acceleration using reconfigurable hardware (let's say an FPGA).
In order to do that I need to examine a simple CNN implementation that I can use to understand how they are implemented, how the computations in each layer take place, and how the output of each layer is fed into the input of the next one. I am familiar with the theoretical part (http://cs231n.github.io/convolutional-networks/).
But I am not interested in training the CNN; I want a complete, self-contained CNN implementation that is pre-trained and in which all the weight and bias values are known.
I know that there are plenty of CNN libraries, e.g. Caffe, but the problem is that there is no trivial example code that is self-contained. Even for the simplest Caffe example, "cpp_classification", many libraries are invoked, the architecture of the CNN is expressed as a .prototxt file, and other types of inputs such as .caffemodel and .binaryproto are involved. OpenCV libraries are invoked too. There are layers and layers of abstraction and different libraries working together to produce the classification outcome.
I know that those abstractions are needed to produce a "usable" CNN implementation, but for a hardware person who needs bare-bones code to study, this is too much unrelated work.
My question is: Can anyone guide me into a simple and self-contained CNN implementation that I can start with?
I can recommend tiny-cnn. It is simple, lightweight (header-only) and CPU-only, while providing several layers frequently used in the literature (for example pooling layers, dropout layers, and local response normalization layers). This means that you can easily explore an efficient implementation of these layers in C++ without requiring knowledge of CUDA and without digging through the I/O and framework code required by a framework such as Caffe. The implementation lacks some comments, but the code is still easy to read and understand.
The provided MNIST example is quite easy to use (I tried it myself some time ago) and trains efficiently. After training and testing, the weights are written to file. You then have a simple pre-trained model from which you can start; see the provided examples/mnist/test.cpp and examples/mnist/train.cpp. The model can easily be loaded for testing (or recognizing digits), so you can step through the code while executing a learned model.
If you want to inspect a more complicated network, have a look at the Cifar-10 Example.
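If you want something even more bare-bones, with no framework at all and every weight visible, note that a forward pass is just a few nested loops. Below is a minimal, hypothetical C++ sketch of a single fully connected layer with ReLU; the weights and biases are made-up placeholders that you would replace with values dumped from a trained model (for example from the tiny-cnn MNIST example above):

    #include <cstdio>
    #include <vector>

    // One fully connected layer with a ReLU activation, weights known up front.
    std::vector<float> dense_relu(const std::vector<float>& in,
                                  const std::vector<std::vector<float>>& W,
                                  const std::vector<float>& b) {
        std::vector<float> out(W.size(), 0.0f);
        for (size_t i = 0; i < W.size(); ++i) {            // one output neuron per row of W
            for (size_t j = 0; j < in.size(); ++j)
                out[i] += W[i][j] * in[j];
            out[i] += b[i];
            if (out[i] < 0.0f) out[i] = 0.0f;               // ReLU
        }
        return out;
    }

    int main() {
        std::vector<float> input = { 0.5f, -1.0f, 2.0f };                  // made-up input
        std::vector<std::vector<float>> W = {{ 0.1f, 0.2f, 0.3f },          // made-up weights
                                             {-0.4f, 0.5f, 0.6f }};
        std::vector<float> b = { 0.01f, -0.02f };                           // made-up biases
        std::vector<float> out = dense_relu(input, W, b);
        for (float v : out) printf("%f\n", v);
        return 0;
    }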
This is the simplest implementation I have seen: DNN McCaffrey
Also, the source code for this by Karpathy looks pretty straightforward.

Best way of using CUDA

There are several ways of using CUDA:
1. auto-parallelizing tools such as PGI Workstation;
2. wrappers such as Thrust (in STL style);
3. the NVIDIA GPU SDK (runtime/driver API).
Which one is better for performance, learning curve, or other factors?
Any suggestions?
Performance will likely rank 3, 2, 1 (best first).
Learning curve is (1+2), then 3 (easiest first).
If you become a CUDA expert, then it will be next to impossible to beat the performance of hand-rolled code that uses the GPU SDK and all the tricks in the book, because of the control it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95%+ of the efficiency of hand-rolled CUDA. The reductions, scans, and many cool iterators it provides are useful for a wide class of problems too.
Auto-parallelizing tools tend not to do quite as good a job with the different memory types that karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
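To give a flavour of the Thrust style, here is a minimal sketch of a parallel reduction on the device (the data is made up; thrust::device_vector and thrust::reduce ship with the CUDA Toolkit):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    int main() {
        // Fill a device vector and sum it on the GPU in one call.
        const int N = 1 << 20;
        thrust::device_vector<float> d(N, 1.0f);            // data lives on the GPU
        float sum = thrust::reduce(d.begin(), d.end(),      // parallel reduction
                                   0.0f, thrust::plus<float>());
        printf("sum = %f (expected %d)\n", sum, N);
        return 0;
    }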
Go with the traditional CUDA SDK, for both performance and a smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application; there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I personally, as well as many others, have run into problems and bugs with them that are hard to fix (there is also no documentation for this "library", and I've heard that NVIDIA does not support it at all).
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
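For comparison, a minimal Runtime API program looks roughly like this (a plain vector add, with error checking omitted for brevity):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread adds one pair of elements.
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);
        float *ha = new float[N], *hb = new float[N], *hc = new float[N];
        for (int i = 0; i < N; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vadd<<<(N + 255) / 256, 256>>>(da, db, dc, N);       // launch with the Runtime API
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("hc[0] = %f\n", hc[0]);                        // expect 3.0
        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }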
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust offers, like reduction or prefix sum, then Thrust is definitely worth a try, and I bet you can't write the code faster yourself in pure CUDA C.
However, if you're porting already-parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I have already had successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know you launch a new kernel for each Thrust call; if you want everything in one big fat kernel (taking overly frequent kernel launches out of the equation), you have to use plain CUDA C with the SDK.
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" into the equation.

Financial applications on GPGPU

I want to know what sort of financial applications can be implemented using a GPGPU. I'm aware of option pricing / stock price estimation using Monte Carlo simulation on a GPGPU using CUDA. Can someone enumerate the various possibilities of utilizing GPGPU for applications in the finance domain?
There are many financial applications that can be run on the GPU in various fields, including pricing and risk. There are some links on NVIDIA's Computational Finance page.
It's true that Monte Carlo is the most obvious starting point for many people. Monte Carlo is a very broad class of applications, many of which are amenable to the GPU. Many lattice-based problems can also be run on the GPU. Explicit finite difference methods run well and are simple to implement; there are many examples on NVIDIA's site as well as in the SDK, and the method is also used a lot in oil & gas codes, so there is plenty of material. Implicit finite difference methods can also work well depending on the exact nature of the problem; Mike Giles has a 3D ADI solver on his site, which also has other useful finance material.
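To give a flavour of the Monte Carlo case, here is a minimal, hypothetical CUDA/cuRAND sketch that prices a European call under geometric Brownian motion; the parameters are made up, there is no variance reduction, and a production pricer would be considerably more careful:

    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    // One path per thread: simulate the terminal price under GBM and store the payoff.
    __global__ void mc_call(float* payoff, int n, float S0, float K,
                            float r, float sigma, float T, unsigned long long seed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        curandState state;
        curand_init(seed, i, 0, &state);
        float z = curand_normal(&state);                      // standard normal draw
        float ST = S0 * expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
        payoff[i] = expf(-r * T) * fmaxf(ST - K, 0.0f);       // discounted call payoff
    }

    int main() {
        const int N = 1 << 20;                                // number of paths (made up)
        float* d_payoff;
        cudaMalloc(&d_payoff, N * sizeof(float));
        mc_call<<<(N + 255) / 256, 256>>>(d_payoff, N, 100.0f, 100.0f, 0.05f, 0.2f, 1.0f, 1234ULL);

        // Average on the host for simplicity; a reduction on the GPU would be faster.
        float* h = new float[N];
        cudaMemcpy(h, d_payoff, N * sizeof(float), cudaMemcpyDeviceToHost);
        double sum = 0.0;
        for (int i = 0; i < N; ++i) sum += h[i];
        printf("MC price ~= %f\n", sum / N);
        cudaFree(d_payoff); delete[] h;
        return 0;
    }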
GPUs are also good for linear algebra type problems, especially where you can leave the data on the GPU to do a reasonable amount of work. NVIDIA provides cuBLAS with the CUDA Toolkit, and you can get cuLAPACK too.
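A minimal cuBLAS sketch, multiplying two tiny made-up matrices in single precision (cuBLAS assumes column-major storage):

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int N = 2;                                       // tiny made-up matrices
        float hA[N * N] = { 1, 2, 3, 4 };                      // column-major: A = [1 3; 2 4]
        float hB[N * N] = { 5, 6, 7, 8 };                      // column-major: B = [5 7; 6 8]
        float hC[N * N] = { 0, 0, 0, 0 };

        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&dB, sizeof(hB)); cudaMalloc(&dC, sizeof(hC));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C = alpha * A * B + beta * C, all N x N, leading dimension N.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);

        cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
        printf("C[0,0] = %f\n", hC[0]);                        // expect 23.0
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }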
Basically, anything that requires a lot of parallel mathematics. As you originally stated, Monte Carlo simulation of options that cannot be priced with closed-form solutions is an excellent candidate. Anything that involves large matrices and operations on them will be ideal; after all, 3D graphics uses a lot of matrix mathematics.
Given that many trader desktops sometimes have 'workstation'-class GPUs in order to drive several monitors, possibly with video feeds and limited 3D graphics (volatility surfaces, etc.), it would make sense to run some of the pricing analytics on the GPU rather than pushing the responsibility onto a compute grid; in my experience the compute grids are frequently struggling under the weight of everyone in the bank trying to use them, and some of the grid computing products leave a lot to be desired.
Outside of this particular problem, there's not a great deal more that can easily be achieved with GPUs, because their instruction set and pipelines are more limited in functional scope compared to a regular CISC CPU.
The problem with adoption has been one of standardisation: NVIDIA had CUDA, ATI had Stream. Most banks have enough vendor lock-in to deal with without hooking their derivatives analytics (which many regard as extremely sensitive IP) into a graphics card vendor's acceleration technology. I suppose that with the availability of OpenCL as an open standard this may change.
F# is used a lot in finance, so you might check out these links
http://blogs.msdn.com/satnam_singh/archive/2009/12/15/gpgpu-and-x64-multicore-programming-with-accelerator-from-f.aspx
http://tomasp.net/blog/accelerator-intro.aspx
High-end GPUs are starting to offer ECC memory (a serious consideration for financial and, eh, military applications) and high-precision types.
But it really is all about Monte Carlo at the moment.
You can go to workshops on it, and from their descriptions you'll see that they focus on Monte Carlo.
A good start would probably be to check NVIDIA's website:
CUDA's Finance Showcases
CUDA's Finance Tutorials
Using a GPU introduces limitations on the architecture, deployment, and maintenance of your app.
Think twice before you invest effort in such a solution.
For example, if you're running in a virtualized environment, it would require all physical machines to have GPU hardware installed, plus special vGPU hardware and software support and licenses.
What if you decide to host your service in the cloud (e.g. Azure, Amazon)?
In many cases it is worth building your architecture in advance to support scaling out and to be flexible and scalable (with some overhead, of course), rather than scaling up and squeezing as much as you can out of your hardware.
Answering the complement of your question: anything that involves accounting can't be done on a GPGPU (or in binary floating point, for that matter).
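The reason is the usual binary floating-point rounding problem; a quick illustration in C++:

    #include <cstdio>

    int main() {
        // 0.10 (ten cents) has no exact binary representation, so repeated addition drifts.
        double total = 0.0;
        for (int i = 0; i < 1000; ++i) total += 0.10;          // add ten cents 1000 times
        printf("total = %.17g, equals 100? %s\n",
               total, total == 100.0 ? "yes" : "no");          // typically prints "no" with IEEE 754 doubles
        return 0;
    }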

OpenCL examples with benchmarks

I'm looking for some introductory OpenCL examples that illustrate the types of applications which can experience large (e.g., 50x-1000x) speed increases. CUDA has lots of nice examples, but I haven't found the same thing for OpenCL.
A nice example might be global optimization of complex functions via particle swarms, simulated annealing, evolutionary algorithms, ant colony optimization, etc.
The algorithms you are describing are neither simple nor introductory from the perspective of GPU programming. The reason CUDA has examples in these areas is that it has been around long enough for people to have developed these examples. There is currently no publicly available version of OpenCL that runs on GPUs. Both ATI and NVIDIA are offering beta versions of their OpenCL drivers, but ATI's supports only CPU computation and NVIDIA's requires signing an NDA to get. Simply put, OpenCL has not been around long enough for comprehensive examples like these to have been developed and demonstrated.
That said, gaining access to NVIDIA's OpenCL drivers is not difficult. You can find out how to do so on their forums here. I assume that the OpenCL distribution contains some sample programs to help you get started.
This also means that it's an excellent opportunity for you to develop some of these benchmarks and post your results. Then people will refer to your work rather than you referring to their work. I wouldn't expect too many surprises though. OpenCL performance should be roughly on par with CUDA performance once it becomes widely available and supported.
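If you do end up rolling your own benchmark once you have a driver, the host-side boilerplate for a simple timed OpenCL kernel looks roughly like this (a hypothetical vector add; note that a memory-bound kernel like this will not show the 50x-1000x gains you mention, which generally need arithmetic-heavy workloads):

    #include <cstdio>
    #include <vector>
    #include <chrono>
    #include <CL/cl.h>

    // OpenCL C source for a trivial kernel, kept as a string on the host.
    static const char* src =
        "__kernel void vadd(__global const float* a, __global const float* b,"
        "                   __global float* c) {"
        "    int i = get_global_id(0);"
        "    c[i] = a[i] + b[i];"
        "}";

    int main() {
        const size_t N = 1 << 20;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(float), a.data(), &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(float), b.data(), &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", &err);
        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        auto t0 = std::chrono::high_resolution_clock::now();
        clEnqueueNDRangeKernel(q, k, 1, NULL, &N, NULL, 0, NULL, NULL);
        clFinish(q);                                          // wait for the kernel to finish
        auto t1 = std::chrono::high_resolution_clock::now();

        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, N * sizeof(float), c.data(), 0, NULL, NULL);
        printf("c[0] = %f, kernel time = %f ms\n", c[0],
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        // (buffer/program/context releases omitted for brevity)
        return 0;
    }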
There are some great examples in the SDK from NVIDIA:
http://developer.nvidia.com/object/get-opencl.html
Our team has been working on OpenCL algorithms and acceleration, and we would like to suggest the article
http://www.cmsoft.com.br/index.php?view=article&catid=1:latest-news&id=247:opencl-simulated-annealing
as a sample implementation of the simulated annealing algorithm for minimization.
You could try the following two books:
Programming Massively Parallel Processors: A Hands-on Approach (NVIDIA) - chapters 1 and 2
The OpenCL Programming Book: Parallel Programming for Multicore CPU and GPU - the history and components chapters
Both go into detail explaining why the development was made and where the true bonuses can be found.
Not sure about benchmarking though; I haven't had any luck there myself either.