Can we use a different CPU architecture (and backend) for training (calibration) and inference of a quantized PyTorch model?
The only post on this subject that I've found states:
static quantization must be performed on a machine with the same
architecture as your deployment target. If you are using FBGEMM, you
must perform the calibration pass on an x86 CPU; if you are using
QNNPACK, calibration needs to happen on an ARM CPU
But there is nothing about this in the official documentation.
The information in the link you posted is correct. You should use the same backend in both cases. This is also mentioned in the official documentation:
"When preparing a quantized model, it is necessary to ensure that qconfig and the engine used for quantized computations match the backend on which the model will be executed."
You can find it here: https://pytorch.org/docs/stable/quantization.html
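As a minimal sketch of what matching the qconfig and engine to the target backend looks like (the toy model and the random calibration data are placeholders for your own; the import path is torch.ao.quantization in newer releases):

```python
import torch
from torch import nn
import torch.quantization as tq  # torch.ao.quantization in newer PyTorch releases


class TinyModel(nn.Module):
    """A toy model, only here to make the example self-contained."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # tensors become quantized here
        self.fc = nn.Linear(8, 4)
        self.dequant = tq.DeQuantStub()  # tensors become float again here

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))


model = TinyModel().eval()

# The engine and qconfig must match the backend the model will run on:
# "fbgemm" for x86 servers, "qnnpack" for ARM (e.g. mobile, Raspberry Pi).
torch.backends.quantized.engine = "fbgemm"
model.qconfig = tq.get_default_qconfig("fbgemm")

prepared = tq.prepare(model)              # insert observers
with torch.no_grad():
    for _ in range(10):                   # calibration pass with representative data
        prepared(torch.randn(2, 8))
quantized = tq.convert(prepared)          # produce the quantized model
```

Calibrating with the "fbgemm" settings and then running the model with the "qnnpack" engine (or vice versa) is exactly the mismatch the documentation warns against.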
Recently I have been reading the code of cuGraph. The documentation mentions that the Louvain and Katz algorithms support multi-GPU. However, when I read the C++ code for Louvain, I cannot find anything related to multi-GPU. Specifically, according to a prior post, multi-GPU support can be implemented by calling cudaSetDevice, yet I cannot find this function in the Louvain code. Am I missing anything?
cuGraph supports multi-GPU by leveraging Dask. I encourage you to read the Dask cuGraph documentation that shows an example using PageRank.
For a Louvain example, I recommend looking at the docstring of the cugraph.dask.louvain function.
For completeness: under the hood, cuGraph uses RAFT to manage the underlying NCCL and UCX communication.
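As a rough sketch of that multi-GPU workflow on a single node (module paths, argument names, and the input file below follow the documented pattern but may differ between cuGraph releases):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
from cugraph.dask.comms import comms as Comms

cluster = LocalCUDACluster()      # one Dask worker per local GPU
client = Client(cluster)
Comms.initialize(p2p=True)        # sets up the NCCL/UCX communicators via RAFT

# Edge list partitioned across the workers ("edges.csv" is a placeholder path)
edges = dask_cudf.read_csv("edges.csv", names=["src", "dst"],
                           dtype=["int32", "int32"])

G = cugraph.Graph()
G.from_dask_cudf_edgelist(edges, source="src", destination="dst")

parts, modularity = dask_cugraph.louvain(G)   # runs across all the GPUs

Comms.destroy()
client.close()
cluster.close()
```

This is also why you do not see cudaSetDevice inside the Louvain C++ code: device placement is handled at the Dask/RAFT layer (one worker process per GPU) rather than inside the algorithm itself.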
I have a TFLite model deployed to a Raspberry Pi. I'm using a Coral USB Accelerator to speed up inference, which contains an Edge TPU. I'm interested in experimenting with the impact of using different dataflows on the energy efficiency of this deployment.
Does anyone know how I could specify a particular dataflow, such as row-stationary or output-stationary, when accelerating a TFLite model using an Edge TPU?
For reference: https://people.csail.mit.edu/emer/papers/2017.05.ieee_micro.dnn_dataflow.pdf
According to the Coral support team:
"As far as we know, there is not a specific way available to run the model with a particular data-flow.
It is recommended to visit the model compatibility page at https://coral.ai/docs/edgetpu/models-intro/ to understand how a TFLite model is mapped to the EdgeTPU."
Let's suppose that my chip doesn't support any API like Keras, TensorFlow, or sklearn; however, I need to implement a deep learning model in Python.
Is it possible to build and train my model in Python, and then call the best resulting model for prediction from C++?
Where must I save the resulting best model so that it can be called in the next steps? Must I save it on the chip? Do I need to install TensorFlow and Keras on my chip in this case?
TERMINOLOGY
You seem to be confused about terminology. Here's a somewhat simplified overview.
Your chip is the hardware (CPU or GPU), and will include circuitry to support its instruction set (move data to/from local memory, perform math and logic operations, etc.). A CPU/GPU chip that cannot support your ML software is hard to visualize, and would not support Python or C++, either. The chip comes on a board, which includes a lot of peripheral connections, secondary memory, etc.
Then your operating system (basic software) is installed on the hardware. This OS manages resources: jobs, processes, memory allocation, etc. If there's a failure in support, it would be here, not in the chip. Finally, you install your desired applications (software tools, programs, etc.) as additions to the OS.
C++ and Python are two popular high-level languages; from the OS's point of view they are just more applications. These languages support TensorFlow and Keras (machine learning frameworks) and scikit-learn (a scientific/statistical package; sklearn is the name you import).
DIRECT ANSWER
Yes, you can write your NN in Python. Yes, you can call it from C++. Python depends on C/C++ libraries; there is a viable interface between the two.
There is no particular method you must use to save your model and call it later: if you're writing your own model in Python, you get to decide the storage format and location. All you need is to have your Python and C++ programs "agree" on the format. Since you're writing them both, you can choose whatever works for you.
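For illustration only, here is one way the Python side could dump trained weights in a tiny self-describing format (a length-prefixed JSON header listing the shapes, followed by the raw float32 data) that a C++ loader can read back with plain file I/O. The layer names, shapes, and file name are all made up for this sketch:

```python
import json
import numpy as np

# Pretend these came out of your own training loop.
weights = {
    "W1": np.random.randn(4, 3).astype(np.float32),
    "b1": np.zeros(3, dtype=np.float32),
}

with open("model.bin", "wb") as f:
    header = {name: list(arr.shape) for name, arr in weights.items()}
    header_bytes = json.dumps(header).encode("utf-8")
    f.write(len(header_bytes).to_bytes(4, "little"))  # header length prefix
    f.write(header_bytes)                             # shapes, in a fixed order
    for name in header:                               # raw data in the same order
        f.write(weights[name].tobytes())
```

The C++ program then reads the 4-byte length, parses the JSON header, and reads each float32 block in the same order.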
RECOMMENDATION
Don't write these yourself, unless you really want the exercise. Instead, install a framework (TensorFlow, Caffe, Neon, Torch, MXNet, Keras, ...). Then, simply follow the given tutorials to learn how to build, save, and restore your model.
Can anyone please explain, or point me to a good source on, what a CUDA context is? I searched the CUDA developer guide and was not satisfied with it.
Any explanation or help will be great.
The CUDA API exposes the features of a stateful library: two consecutive calls relate to one another. In short, the context is its state.
The runtime API is a wrapper/helper around the driver API. In the driver API the context is explicitly exposed, and you can maintain a stack of contexts for convenience. There is one specific context which is shared between the driver and runtime APIs (see the primary context).
The context holds all the management data to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, etc.
Finally, note that this post is based more on experience than on the documentation.
Essentially, it is a data structure that holds information relevant to maintaining a consistent state between the calls that you make, e.g. (open) (execute) (close).
This is so that the functions that you invoke can send the signals in the right direction even if you don't specifically tell them what that direction is.
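If it helps to see the stack behaviour in code, here is a small sketch using PyCUDA's driver-API bindings, which make the context explicit (it assumes PyCUDA is installed and at least one CUDA device is present):

```python
import pycuda.driver as cuda

cuda.init()                    # initialise the driver API; no context exists yet
dev = cuda.Device(0)

ctx = dev.make_context()       # create a context and push it onto this thread's stack
buf = cuda.mem_alloc(1 << 20)  # the allocation is owned by (recorded in) `ctx`
buf.free()

ctx.pop()                      # no longer current, but the context still exists
ctx.push()                     # current again: same allocations, modules, mappings
ctx.pop()                      # finally remove it; PyCUDA releases the context
                               # when the object goes away
```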
I am looking for options to serve parallel predictions using a Caffe model on a GPU. Since the GPU comes with limited memory, what options are available to achieve parallelism while loading the net only once?
I have successfully wrapped my segmentation net with tornado wsgi + flask, but at the end of the day this is mostly equivalent to serving from a single process. https://github.com/BVLC/caffe/blob/master/examples/web_demo/app.py
Is having my own copy of the net in each process a strict requirement, since the net is read-only after training is done? Is it possible to rely on fork for parallelism?
I am working on a sample app which serves results from the segmentation model. It uses copy-on-write, loads the net in the master once, and serves memory references to the forked children. I am having trouble starting this setup in a web-server setting: I get a memory error when I try to initialize the model. The web server I am using here is uWSGI.
Has anyone achieved parallelism in the serving layer by loading the net only once (since GPU memory is limited)? I would be grateful if anyone could point me in the right direction.
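For concreteness, this is roughly the pattern described above (load the read-only net once in the parent, fork workers that reuse it copy-on-write). The prototxt/caffemodel paths are placeholders, and note that CUDA state created before a fork generally cannot be used in the children, which is one reason this is usually done in CPU mode or with the GPU initialised after the fork:

```python
import os
import caffe

# Load the (read-only) net exactly once in the parent process.
caffe.set_mode_cpu()  # see the note above about GPU state and fork()
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

children = []
for _ in range(4):    # fork a few worker processes
    pid = os.fork()
    if pid == 0:
        # Child: the pages holding `net` are shared copy-on-write with the
        # parent. A real worker would pull requests from a socket or queue
        # and call net.forward(...) here.
        os._exit(0)
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)
```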