Speed and memory problems when deploying a deep learning model [closed] - deep-learning

I need to build a deep learning model for image classification. I have to train the model on my images and then deploy it on real machines.
In short, my main problems are:
The images are very large, which leads to CUDA out-of-memory errors. What should I do to keep the model from running out of memory?
I also need very fast inference, because the model will run in a real deployment environment where a timely response matters.
I need to solve both problems before I can deploy the model.

I think it is important to reduce the size of the images: resize them if necessary, which can significantly cut the memory cost.
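For instance, a minimal sketch with torchvision, assuming a PyTorch pipeline; the 224x224 target size is only an illustrative choice, not a requirement of any particular model:

    from torchvision import transforms

    # Downscale every image before it reaches the GPU. 224x224 is only an
    # illustrative choice -- pick the smallest resolution that still keeps
    # the features your classifier needs.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),  # resize the PIL image
        transforms.ToTensor(),          # HWC uint8 -> CHW float32 in [0, 1]
    ])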

I think you can also try different batch sizes, because the batch size directly affects both the training and inference speed of a deep learning model (see the sketch below). That said, a better GPU is probably even more important for image classification with a deep network.
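A rough sketch of where the batch size enters a PyTorch training setup; train_dataset is a placeholder for your own dataset object, not something defined here:

    from torch.utils.data import DataLoader

    # train_dataset is assumed to be your own Dataset instance.
    train_loader = DataLoader(
        train_dataset,
        batch_size=16,    # halve this if CUDA reports an out-of-memory error
        shuffle=True,
        num_workers=4,    # parallel data loading keeps the GPU busy
        pin_memory=True,  # speeds up host-to-GPU transfers
    )

Larger batches usually raise throughput but cost more GPU memory, so the two problems in the question trade off against each other here.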

I think you need a better GPU card, as deep learning is resource hungry.

YOLO implementation in an ordinary Python script [closed]

I would like to implement YOLO from scratch. I have seen code available on GitHub, but I want to try it from scratch. Is it possible to implement YOLO in an ordinary Python script without using Darkflow? I am planning to implement it in Keras.
All kinds of neural networks can be implemented in Python from scratch; if you really want to do so, you can. You can use the NumPy and SciPy libraries for the vector and matrix calculations.
What you are going to do is a time-consuming task and will not be easy, but if you work at it you can do it. And don't forget to share the code with us.
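As a small taste of that NumPy side, here is a sketch of intersection over union, one of the box calculations YOLO leans on; the [x1, y1, x2, y2] corner format is an assumption for the example, not something fixed by the papers:

    import numpy as np

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
        # Corners of the overlapping rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])

        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    print(iou(np.array([0, 0, 10, 10]), np.array([5, 5, 15, 15])))  # ~0.143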
First, you will need a basic understanding of the YOLO network, so I would suggest reading the research papers. The original YOLO paper and the second paper discuss many details about the network and how it works; they will give you a better understanding of the architecture and will be helpful when debugging your own implementation.
The third paper is easier than the other two, because it only explains the modifications they made. So, to get a full understanding of the network, you still have to read all three papers.
Original YOLO paper
YOLO9000 (YOLO version 2)
YOLOv3
After you have downloaded YOLO, you will find a file called yolo.cfg, which you can open in any text editor.
At the top of the file they define some hyperparameters; you can learn what those parameters mean from the papers.
After that, they describe the YOLO network much the way Caffe does in its prototxt files. It is not exactly the same format as a prototxt file, but you get the idea, and it is very helpful when building your own network.
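To get a feel for that layout, here is a rough sketch of reading yolo.cfg into a list of sections before rebuilding them as Keras layers; the file path is an assumption, and a real cfg has more option types than this handles:

    def parse_cfg(path):
        """Read a Darknet .cfg file into a list of {'type': ..., option: value} dicts."""
        blocks = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue                      # skip blank lines and comments
                if line.startswith('['):          # a new section, e.g. [convolutional]
                    blocks.append({'type': line[1:-1]})
                else:
                    key, value = line.split('=', 1)
                    blocks[-1][key.strip()] = value.strip()
        return blocks

    # blocks[0] holds the [net] hyperparameters; the rest describe the layers.
    blocks = parse_cfg('yolo.cfg')  # path is an assumption -- point it at your copy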
They have also written the YOLO network in such a way that it changes considerably when switching from training mode to testing mode. You can find all of that information in the research papers, so keep it in mind too.
Happy Coding !!!

Differentiating between memory-bound and compute-bound CUDA kernels [closed]

I am trying to write a static analyzer for differentiating between data-intensive and computation-intensive CUDA kernels. As far as I have researched this topic, there is not much literature on it. One way to accomplish this is to calculate the CGMA (compute-to-global-memory-access) ratio of the kernel: if it is 'too high', the kernel is probably compute intensive; otherwise, memory intensive.
The problem with this method is that I can't decide on a threshold value for the ratio, that is, above what value a kernel should be classified as compute intensive. One option is to use the ratio of CUDA cores to load/store units as the threshold. What does SO think?
I came across this paper in which they calculate a parameter called 'memory intensity'. First, they compute a parameter called the activity factor, which is then used to calculate the memory intensity. Please find the paper here; memory intensity is defined on page 6.
Does a better approach exist? I am somewhat stuck in my research because of this and would really appreciate help.
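One common way to make such a threshold concrete is a roofline-style balance point: divide a card's peak arithmetic throughput by its peak memory bandwidth to get the FLOP-per-byte ratio at which it stops being memory bound, then compare the kernel's CGMA against it. A rough sketch with purely illustrative, made-up hardware numbers:

    # Rough roofline-style balance point for a hypothetical GPU.
    # The numbers below are illustrative, not measurements of any real card.
    peak_flops = 10e12        # 10 TFLOP/s of single-precision compute
    peak_bandwidth = 500e9    # 500 GB/s of global-memory bandwidth

    # FLOPs the GPU can do per byte fetched before compute becomes the bottleneck
    balance_point = peak_flops / peak_bandwidth   # = 20 FLOP/byte here

    def classify(kernel_flops, kernel_bytes):
        """Label a kernel by comparing its compute-to-memory ratio to the balance point."""
        intensity = kernel_flops / kernel_bytes
        return 'compute bound' if intensity > balance_point else 'memory bound'

    # e.g. a kernel doing 4 FLOPs per 4-byte load sits well below 20 FLOP/byte
    print(classify(kernel_flops=4, kernel_bytes=4))   # -> 'memory bound'

The same spirit underlies the suggestion of using the ratio of CUDA cores to load/store units: either way the threshold is a property of the card you are targeting, not a universal constant.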

GPGPU performance in high-level languages [closed]

For my science fair project I have to write a computationally intensive algorithm that is well suited to parallelization. I have read about OpenCL and CUDA, and it seems they are mainly used from C/C++. While it would not be that difficult for me to pick up enough C to write a simple main, I was wondering how big the performance hit would be if I used Java or Python bindings for my GPU computation. Specifically, I am most interested in the performance hit with CUDA, because that's the framework I plan to use.
In general, every time you add an abstraction layer you lose performance, but in the case of CUDA this is not completely true: whether you use Python or Java, you will still end up writing your CUDA kernels in C/Fortran, so the performance on the GPU side will be the same as with C/Fortran (check some pyCUDA examples here).
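For example, a minimal sketch along the lines of the linked pyCUDA examples; the kernel body is plain CUDA C, and Python only handles compilation, data transfer, and the launch:

    import numpy as np
    import pycuda.autoinit            # creates a CUDA context on import
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The kernel itself is ordinary CUDA C, so the GPU-side code is the same
    # as it would be when launched from a C host program.
    mod = SourceModule("""
    __global__ void double_them(float *dest, float *src)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        dest[i] = 2.0f * src[i];
    }
    """)
    double_them = mod.get_function("double_them")

    src = np.random.randn(400).astype(np.float32)
    dest = np.zeros_like(src)
    double_them(drv.Out(dest), drv.In(src), block=(400, 1, 1), grid=(1, 1))
    print(np.allclose(dest, 2 * src))   # -> True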
The bad news is that Java and Python will never achieve the performance of compiled languages such as C on certain tasks; see this SO answer for a more detailed discussion of the topic. Here is a good discussion about C versus Java, also on SO.
There are many questions and discussions comparing the performance of interpreted and compiled languages, so I encourage you to read some of them.

What is the absolutely fastest way to output a signal to external hardware on a modern PC? [closed]

I was wondering: what is the absolutely fastest way (lowest latency) to produce an external signal (for example, a CMOS state change from 0 to 1 on a wire connected to another device) from a PC, counting from the moment the CPU's assembler program knows that the signal must be produced?
I know that network devices, USB, and VGA monitor output have large latencies compared to other interfaces (SATA, PCI-E). Which interface, or what hardware modification, could provide near-zero output latency from, say, an assembler program?
I don't know if it is really the fastest interface available, because that also depends on your definition of "external", but http://en.wikipedia.org/wiki/InfiniBand certainly comes close to what your question aims at. Latency is 200 nanoseconds and below in certain scenarios.

How do you estimate an EAI project using function points? [closed]

How do you estimate an EAI project using function points?
FP analysis is inappropriate for integration projects of any sort, as it presupposes that you can specify the application up front. Most of the work in any integration project of non-trivial complexity is reverse-engineering the nuances of the environment, and the environment will typically not be exhaustively documented in the sort of cases where you would expect to use an EAI system.
By the time you have actually done that level of reverse engineering, to the point of having a complete specification, you have done most of the work in the project; the actual development is fairly short and sweet by comparison. Therefore function point analysis only provides an estimate for a small part of the effort.
As an aside, much of my work is on data warehouse systems for commercial insurance companies, where extensive prototyping and reconciliation exercises to produce detailed specification documents are actually quite appropriate to the environment. Typically this takes longer than developing the production system, as most of the data issues are resolved in the prototyping work. EAI systems have a similar class of implementation issues.
Well, given that FP counting is based on data storage and the end-user interface, I'm not sure it's even meaningful for EAI (from what little I remember).
I would say you can't, at least not in a useful way. FP counting is generally viewed as a dubious practice of varying accuracy, and applying it to an integration project would just add more fuzziness.