Memory requirements for FFT on STM32F103C8

I have a limited system and would like to implement an FFT on an STM32F103C8 without any extra memory buffers.
So I want to know how much memory is needed if I have a single 2592x1944 image at 8 bits per pixel.
Actually, I want a processing chain such as
Original image ---> FFT ---> Blur ---> IFFT ---> Modified image
What are the memory requirements for an FFT on the STM32F103C8?

So I want to know how much memory is needed if I have a single 2592x1944 image at 8 bits per pixel.
Much more than you have. This isn't going to work out.
2592 x 1944 at 8 bpp is roughly 5 MB. Your microcontroller has 20 KB of RAM, which isn't even enough to store eight lines of your image.
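
To make the gap concrete, here is a rough back-of-the-envelope sketch in C. The assumption of a full-frame complex-float FFT buffer (two 4-byte floats per pixel) is mine; real FFT routines would add their own working storage on top of this.

    /* Back-of-the-envelope memory check; sizes are illustrative. */
    #include <stdio.h>

    int main(void) {
        const unsigned long width  = 2592UL;
        const unsigned long height = 1944UL;

        const unsigned long image_bytes = width * height;               /* 8 bpp grayscale          */
        const unsigned long fft_bytes   = width * height * 2UL * 4UL;   /* complex float: re + im   */
        const unsigned long sram_bytes  = 20UL * 1024UL;                /* STM32F103C8 on-chip SRAM */

        printf("raw image         : %lu bytes (~%.1f MB)\n", image_bytes, image_bytes / 1048576.0);
        printf("complex FFT buffer: %lu bytes (~%.1f MB)\n", fft_bytes,   fft_bytes   / 1048576.0);
        printf("available SRAM    : %lu bytes\n", sram_bytes);
        return 0;
    }

The raw frame alone is about 250 times the available SRAM, before any FFT working buffers are counted.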

Related

Using VAE-GAN architecture on larger images

I'm using a VAE-GAN architecture that was originally used on low-res images (MNIST, faces) to train on audio spectrograms, which are much higher resolution. Does anyone have recommendations for what to change in the architecture to make this work?
A few things I can think of -- increasing the kernel size, the number of layers/nodes. But it is already quite slow to train.
Any ideas appreciated!

How to choose the hyperparameters and strategy for a neural network with a small dataset?

I'm currently doing semantic segmentation. However, I have a really small dataset: only around 700 images, which data augmentation (for example, flipping) could bring up to 2100 images.
I'm not sure that's enough for my task (semantic segmentation with four classes).
I want to use batch normalization and mini-batch gradient descent.
What really makes me scratch my head is that if the batch size is too small, batch normalization doesn't work well, but with a larger batch size it seems equivalent to full-batch gradient descent.
I wonder if there's something like a standard ratio between the number of samples and the batch size?
Let me first address the second part of your question, "strategy for a neural network with a small dataset". You may want to take a network pretrained on a larger dataset and fine-tune it using your smaller dataset. See, for example, this tutorial.
Second, you ask about the batch size. Indeed, a smaller batch will make the algorithm wander around the optimum, as in classical stochastic gradient descent; the sign of this is noisy fluctuations in your losses. With a larger batch size there is typically a smoother trajectory towards the optimum. In any case, I suggest you use an algorithm with momentum, such as Adam, which will aid the convergence of your training.
Heuristically, the batch size can be kept as large as your GPU memory can fit; if the GPU memory is not sufficient, reduce the batch size.

What is the concept of mini-batch for FCN (semantic segmentation)?

What is the concept of mini-batch when we are sending one image to FCN for semantic segmentation?
The default value in data layers is batch_size: 1. That means that on every forward and backward pass, one image is sent through the network. So what is the mini-batch size in that case? Is it the number of pixels in the image?
The other question is: what if we send a few images through the net together? Does it affect convergence? In some papers I see a value of 20 images.
Thanks
The batch size is the number of images sent through the network in a single training step. The gradient is calculated for all the samples in one sweep, resulting in large performance gains through parallelism when training on a graphics card or CPU cluster.
The batch size has multiple effects on training. First, it provides more stable gradient updates by averaging the gradients within the batch. This can be both beneficial and detrimental; in my experience it was more beneficial than detrimental, but others have reported different results.
To exploit parallelism, the batch size is usually a power of 2: 8, 16, 32, 64 or 128. Finally, the batch size is limited by the VRAM of the graphics card, which needs to hold all the images, the intermediate results at every node of the graph, and additionally all the gradients.
This can blow up very fast, in which case you need to reduce the batch size or the network size.

How to process a task of arbitrary size using CUDA?

I'm starting to learn CUDA and have to dive straight into a project, so I'm currently lacking a solid theoretical background; I'll be picking it up along the way.
While I understand that the way the hardware is built requires the programmer to deal with thread blocks and grids, I haven't been able to find an answer to the following questions in my introductory book:
What happens when the task size is greater than the number of threads a GPU can process at a time? Will the GPU then proceed through the array the same way a CPU would, i.e. sequentially?
Thus, should I worry if the number of thread blocks that a given task requires exceeds the number that can run simultaneously on the GPU? I've found the notion of a "thread block limit" so far, and it's obviously higher than what a GPU can process at a given moment in time, so is that the real (and only) limit I should be concerned with?
Other than choosing the right block size for the given hardware, are there any other problems to consider when setting up a kernel for execution? I'm at a loss regarding launching a task of arbitrary size. I've even considered going with OpenCL instead of CUDA, because there appears to be no explicit block-size calculation involved when launching a kernel to execute over an array.
I'm fine with this being closed as a duplicate if it is one; just be sure to point me at the original question.
The number of thread blocks can be arbitrary; if it is larger than what the GPU can run at once, the hardware simply processes the blocks in sequence. This link gives you a basic view.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model
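To make that concrete, here is a minimal sketch (the kernel name addOne and the device pointer d_data are placeholders, not anything from your code): one thread per element, with the grid sized to cover the whole array.

    // One thread per element; the guard handles the partially filled last block.
    __global__ void addOne(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;
    }

    // Host side: ask for enough blocks to cover n. The scheduler runs the blocks
    // in waves, so the grid may be far larger than what the GPU executes at any instant.
    // int threadsPerBlock = 256;   // a common starting point; tune for your device
    // int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
    // addOne<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);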
On the other hand, you can use a limited number of threads to handle a task of arbitrary size by increasing the work per thread. This link shows you how to do that and why it is better.
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
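A rough sketch of that grid-stride idea (names are again placeholders): a fixed launch size covers an array of any length because each thread strides through the data.

    // Grid-stride loop: each thread starts at its global index and jumps by the
    // total number of launched threads until the whole array is covered.
    __global__ void addOneStrided(float *data, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            data[i] += 1.0f;
        }
    }

    // e.g. addOneStrided<<<128, 256>>>(d_data, n);  // works for any n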
You may want to read the following two for a full answer.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

How does the speed of CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that every 8 cores are controlled by a single multiprocessor, and that each block of threads is assigned to a single multiprocessor, I would expect that launching a grid of 30 blocks should take the same execution time as a single block. However, things don't scale that nicely, and I never got this nice scaling, even with 8 threads per block. Going to the other extreme with 512 threads per block, I get approximately the same time as one block only when the grid contains at most 5 blocks. This was disappointing when I compared the performance with the same task parallelized with MPI on an 8-core CPU machine.
Can some one explain that to me?
By the way, the computer actually contains two of these Tesla cards, so does it distribute blocks between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: Based on Pedro's request, here is a plot depicting the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads per block is 512. The numbers are rough, since I observed quite a large variance of the times for large numbers of blocks.
The speed is not a simple linear function of the number of blocks. It depends on many things: for example, memory usage, the number of instructions executed per block, and so on.
If you want to do multi-GPU computing, you need to modify your code accordingly; otherwise you can only use one GPU card.
It seems to me that you have simply taken a C program and compiled it with CUDA without much thought.
Dear friend, this is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access patterns - there are a number of memory spaces in a GPU, and each requires consideration of how to use it best
thread divergence problems - performance will only be good if most of your threads follow the same code path most of the time (see the sketch below)
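As a purely illustrative sketch of the divergence point (the kernel is hypothetical, not taken from your code): threads within the same warp take different branches, so the hardware has to execute both paths one after the other.

    // Warp divergence sketch: an even/odd split inside every warp forces both
    // branches to run serially, roughly halving throughput for this kernel.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }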
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The catch is that the memory spaces of the two cards are separate and not easily 'visible' to each other - you have to design your algorithm to take this into account.
A typical C program written in the pre-GPU era will often not be easily transplantable unless it was originally written with MPI in mind.
To make each MPI process work with a different GPU card, you can use cudaSetDevice().
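A minimal sketch of that approach, assuming one MPI process per GPU on a single machine (as in your setup) and omitting all error checking; the rank-to-device mapping is the simplest possible one.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, deviceCount = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&deviceCount);

        // Bind this process to one card before any other CUDA call.
        if (deviceCount > 0)
            cudaSetDevice(rank % deviceCount);

        /* ... allocate device memory, launch kernels, exchange results via MPI ... */

        MPI_Finalize();
        return 0;
    }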