Difference between CUDA out of memory and misaligned address - caffe

I get these errors when training with Caffe. Both seem to be connected with memory, because both go away when I reduce the batch size; the misaligned address error just seems to appear before the out-of-memory one.
For example, with batch size = 32 (or more) I get the out-of-memory error. With batch size = 16 I get the misaligned address error. With batch size = 8 (or less) there is no error.
What is actually the difference between these two errors, and what do they mean?
I use CUDA 8.0 and cuDNN 5.1.

Related

CUDA kernel launched from Nsight Compute gives inconsistent results

I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or whether this is expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue is that with Unified Memory (UVM) or zero-copy memory, Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help; it may be worth trying to see if anything changes.
I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
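As a minimal sketch of that kind of fix (an illustrative histogram kernel, not the asker's actual code): zero the shared-memory accumulators before any atomicAdd and synchronize, so that every replay run in Nsight Compute starts from the same state.

    __global__ void block_histogram(const int* data, int n, int* global_bins) {
        __shared__ int bins[256];

        // Initialize shared memory first; skipping this leaves garbage that
        // atomicAdd then accumulates into, and kernel replay amplifies it.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            bins[i] = 0;
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&bins[data[idx] & 255], 1);
        __syncthreads();

        // Fold the per-block result into the global histogram.
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&global_bins[i], bins[i]);
    }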

Why does my CPU seem to lose the ability to decode?

I ran into this problem while finishing a lab for my OS course, in which we implement a kernel with system-call support (platform: QEMU/i386).
When testing the kernel, after it loads the user program into memory and switches the CPU from kernel mode to user mode with the 'iret' instruction, the CPU behaves strangely, as follows:
the %EIP register increases by 2 each time, no matter how long the current instruction is;
no instruction seems to be executed, since no other registers change in the meantime.
Your guest has probably ended up executing a block of zeroed-out memory. On i386, zeroed memory disassembles to a succession of "add BYTE PTR [eax],al" instructions, each of which is two bytes long (0x00 0x00), and if eax happens to point to memory which reads as zeroes, this is effectively a 2-byte no-op, which corresponds to what you are seeing. This might happen because you set up the iret incorrectly and it isn't returning to the address you expected, or because you've got the MMU setup wrong and the userspace program isn't in the memory where you expect it to be, for instance.
You could confirm this theory using QEMU's debug options (e.g. -d in_asm,cpu,exec,int,unimp,guest_errors -D qemu.log will log a lot of execution information to a file), which should (among a lot of other data) show you what instructions it is actually executing.
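If the iret setup is the suspect, it can also help to double-check the frame built on the kernel stack just before executing iret. The sketch below is only illustrative (the struct and field names are mine, not from any particular course skeleton): returning from ring 0 to ring 3 on i386 consumes exactly these five 32-bit values, and if eip or the selectors are wrong you land in unexpected memory such as the zeroed block described above.

    #include <stdint.h>

    /* Illustrative layout of the values `iret` pops when returning from
     * kernel mode (ring 0) to user mode (ring 3) on i386. They must be
     * pushed in reverse order (ss first, eip last) before the iret. */
    struct iret_frame {
        uint32_t eip;     /* user-mode entry point of the loaded program */
        uint32_t cs;      /* user code segment selector, RPL = 3 */
        uint32_t eflags;  /* typically with the IF bit set */
        uint32_t esp;     /* initial user-mode stack pointer */
        uint32_t ss;      /* user stack segment selector, RPL = 3 */
    };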

Is it possible to reduce the number of images per batch in convert_imageset.cpp to tackle GPU out-of-memory errors?

I was trying to run FCN on my data in Caffe. I was able to convert my image sets into LMDB with Caffe's built-in convert_imageset tool. However, once I wanted to train the net, it gave me the following error:
Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
.....
Aborted (core dumped)
I went through many online resources for this memory failure, and most of them suggest reducing the batch size. I even reduced the size of the images to 256x256, but I could not solve the issue.
I checked the GPU memory with nvidia-smi; the card is an Nvidia GT 730 with 1998 MiB of memory. Since the batch size in train_val.prototxt is already 1, I cannot do anything more in train_val.prototxt. So my questions are:
1. By looking at the log output in the terminal, I realized that whenever convert_imageset converts the data into LMDB, it takes the images in groups of 1000. Is it possible to change this number in lines 143 and 151 of convert_imageset.cpp to something smaller (for example 2, to take two images at a time), recompile Caffe, and then convert the images to LMDB using convert_imageset? Does that make sense?
2. If the answer to question 1 is yes, how can I compile Caffe again? Should I remove the build folder and redo the Caffe installation from scratch?
3. How does Caffe process the LMDB data? Does it take a batch of those 1000 images shown while running convert_imageset?
Your help is really appreciated.
Thanks...
AFAIK, the number of entries committed to LMDB in each transaction (txn->Commit();) has no effect on CUDA out-of-memory errors.
If you do want to re-compile Caffe for whatever reason, simply run make clean. This will clear everything and let you re-compile from scratch.
Again, AFAIK, Caffe reads batch_size images from LMDB at a time, regardless of the transaction size used when writing the dataset.
Are you sure batch_size is set to 1 for both TRAIN and TEST phases?
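For reference, the write path in convert_imageset.cpp is shaped roughly like the sketch below (paraphrased from memory of Caffe's db wrapper rather than copied from any particular revision, so verify it against your own copy around the lines you mention). The 1000 only controls how often the write transaction is flushed to disk while building the LMDB; it does not change how many images the data layer later loads per iteration.

    #include <boost/scoped_ptr.hpp>
    #include <string>
    #include <utility>
    #include <vector>
    #include "caffe/util/db.hpp"

    namespace db = caffe::db;

    // entries: (key, serialized Datum) pairs, prepared elsewhere.
    void write_entries(const std::vector<std::pair<std::string, std::string> >& entries,
                       const std::string& db_path) {
        const int kCommitEvery = 1000;  // the group size seen in the log
        boost::scoped_ptr<db::DB> lmdb(db::GetDB("lmdb"));
        lmdb->Open(db_path, db::NEW);
        boost::scoped_ptr<db::Transaction> txn(lmdb->NewTransaction());

        int count = 0;
        for (size_t i = 0; i < entries.size(); ++i) {
            txn->Put(entries[i].first, entries[i].second);
            if (++count % kCommitEvery == 0) {
                txn->Commit();                      // flush this group to disk
                txn.reset(lmdb->NewTransaction());  // start the next group
            }
        }
        if (count % kCommitEvery != 0) txn->Commit();  // commit the remainder
    }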

C++/CLI memory allocation with new throws an exception

I have (I believe) a very classic problem with memory allocation using "new".
Here is the piece of code I use:
float * _normals = NULL;
try {
    _normals = new float[_x*_y*_z*3];
} catch (System::Exception ^ e) {
    Windows::Forms::MessageBox::Show("Normals:\n" + e->Message);
    if (e->InnerException != nullptr && e->InnerException->Message != nullptr)
        Windows::Forms::MessageBox::Show("Details:\n" + e->InnerException->Message);
    _file->Close();
    return 0;
}
I don't know if you can tell from this piece of code, but this is a mixed managed/unmanaged program. I don't know if that matters.
Now, when I run this code and try to allocate, say, 256*256*128*3 floats, it runs normally. When I go with 492*492*442 floats, it throws an "External component has thrown an exception" exception. That allocation is around 1.2 GB. My system has 6 GB of RAM, with around 3 GB free. Can you tell what the problem is from this information? Can I handle it? I read somewhere about program memory space; maybe the program's memory space is not enough? (I don't know anything about that matter, so please enlighten me if you can.)
Please ask if you need more information.
Thank you in advance
The address space for a 32-bit Windows program (Windows is implied by C++/CLI) running on a 64-bit operating system is either:
2 GB by default
4 GB if linked with /LARGEADDRESSAWARE. This flag can also be added later by editbin.
Your problem is address space fragmentation. Just because you've only allocated, say 100MB, doesn't mean that you can allocate another 1.9GB chunk in a 2GB address space. Your new allocation needs to have contiguous addresses.
If, for example, a DLL used by your non-LAA process had a load-address at 0x40000000, then you could allocate a 1GB block below it, or an almost-1GB block above it, but you could never allocate a single block larger than 1GB.
The easiest solution is to compile as 64-bit. Even though the address space will still be fragmented, the open spaces between allocations will be much larger and not cause you problems.
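To see how close the request comes to those limits, note that with _x = 492, _y = 492, _z = 442 the code above asks for 492*492*442*3 floats, i.e. roughly 1.28 GB that must fit in a single contiguous free range. A minimal native sketch (not the asker's code; the function name and reporting are mine) that catches the failure at the C++ level and reports the requested size:

    #include <cstddef>
    #include <cstdio>
    #include <new>

    float* allocate_normals(std::size_t x, std::size_t y, std::size_t z) {
        const std::size_t count = x * y * z * 3;         // number of floats
        const std::size_t bytes = count * sizeof(float); // contiguous bytes needed
        try {
            return new float[count];
        } catch (const std::bad_alloc&) {
            std::fprintf(stderr, "allocation of %zu bytes failed\n", bytes);
            return nullptr;                              // caller decides how to recover
        }
    }

If the same sizes succeed in a 64-bit build but fail in the 32-bit one, that points squarely at the address-space fragmentation described above.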

How to debug: CUDA kernel fails when there are many threads?

I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (constant). The threads are independent except for reading from a constant array and the tree. Each thread can perform some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I use 256 threads per block and (N+256-1)/256 blocks per grid.
The problem is that the program works for N = 100000 threads or fewer but fails for more than that. It also works on the CPU, or on the GPU thread by thread. When N is large (e.g. > 100000), the kernel crashes and the subsequent cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
I have set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1 GB; cudaDeviceSetLimit succeeded but the problem remains.
Does anyone know a common reason for this problem, or any hints for further debugging? I tried adding some printf's, but there are tons of output. Moreover, once a thread crashes, all remaining printf output is discarded, so it is hard to identify the problem.
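One thing worth ruling out, given the per-thread malloc calls (my suggestion; the thread does not confirm this is the cause): device-side malloc returns NULL once the heap set with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) is exhausted, and dereferencing that NULL is exactly the kind of crash that only appears at large N. A minimal sketch of checking for it:

    #include <cstdio>

    __global__ void bfs_like_kernel(int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        // Per-thread queue storage, as in the question's BFS setup.
        int* queue = static_cast<int*>(malloc(1024 * sizeof(int)));
        if (queue == NULL) {
            // Device-side malloc returns NULL when the heap limit is used up.
            printf("thread %d: device malloc failed\n", tid);
            return;
        }
        // ... BFS work using the queue ...
        free(queue);
    }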
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive, or the error codes may steer you in another direction.
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
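As a minimal sketch of the error checking suggested above (the macro name is mine, not from the question): wrap every CUDA runtime call, and after each kernel launch check both the launch error and the asynchronous execution error.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                                   \
        do {                                                                   \
            cudaError_t err = (call);                                          \
            if (err != cudaSuccess) {                                          \
                std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                             cudaGetErrorString(err), __FILE__, __LINE__);     \
                std::exit(EXIT_FAILURE);                                       \
            }                                                                  \
        } while (0)

    // Usage around a kernel launch:
    //   myKernel<<<blocks, threads>>>(args);
    //   CUDA_CHECK(cudaGetLastError());        // launch/configuration errors
    //   CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during execution
    //   CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));

A TDR reset typically surfaces here as a launch-timeout error (cudaErrorLaunchTimeout), which distinguishes it from the out-of-bounds accesses that cuda-memcheck would flag.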