Undefined behavior with std::vector - STL

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string name;
    std::vector<double> v(5, 1);
    std::cout << v.capacity() << std::endl;
    v[1000000] = 10.;                        // element 1000000 of a 5-element vector
    std::cout << v[1000000] << std::endl;
    std::cout << v.capacity() << std::endl;
    return 0;
}
Is this code undefined behavior? It seems that no allocation is made on the fly, so I am wondering how the program is able to handle the item assignment. I am using macOS Monterey and this prints "10" as "expected".

std::vector<double> v(5, 1);
std::cout << v.capacity() << std::endl;
v[1000000] = 10.;
From your question, I'm pretty sure you know this is undefined behavior. Your question is really "why doesn't this crash?"
Undefined behavior means anything can happen. Usually what happens is that the app seems to work. The vector will ask std::allocator for ~40 bytes of memory, std::allocator will ask the operating system for a large chunk of memory, then std::allocator will give 40 bytes of that to the vector and save the rest of the chunk for whatever asks for memory next. Then your code assigns 10 to the place in (virtual) memory that's ~8MB past the memory allocated to the vector.
One possibility from there is that that address in memory is not currently mapped in your process. The OS will detect this and usually crash your app. Or it might just give that memory to your app so that it keeps running. So you never really know what will happen.
Another option is that if that address happens to already be allocated in your process, either coincidentally by std::allocator or coincidentally by something else, then the write succeeds in putting 10 at that place in memory. Hopefully that memory wasn't important, like networking code, in which case the next network call could send the contents of your memory, including your passwords, to whatever server it happened to be talking to at the time. Or maybe that memory isn't currently used for anything, and nothing breaks. Or maybe it works 998 times in a row, and then on the 999th time it erases your files. Unlikely, but possible.
So yes, most of the time "undefined behavior works anyway". But don't do it, you will regret it.
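For contrast, here is a minimal sketch of the defined-behavior alternatives, using the index from the question: at() bounds-checks and throws std::out_of_range instead of scribbling on memory, and resize() makes the element actually exist before you assign to it.

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<double> v(5, 1);

    // at() bounds-checks and throws instead of invoking undefined behavior.
    try {
        v.at(1000000) = 10.;
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << std::endl;
    }

    // Alternatively, grow the vector first so the element really exists.
    v.resize(1000001);
    v[1000000] = 10.;
    std::cout << v[1000000] << std::endl;
    return 0;
}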

Returning custom cudaError or force copy to host from device

I have a CUDA kernel which is called many times and which adds values to an array of allocated size N. I keep track of the inserted elements with a device variable that I update with atomicAdd.
When the number of added values approaches N, I would like to know it so I can call cudaMalloc again and reallocate the array. The most obvious solution is to do a cudaMemcpy of that device variable every time the kernel is called, and thereby keep track of the size of the array on the host. What I would like to know is whether there is a way of ONLY doing the cudaMemcpy to the host when the values are approaching N.
One possible solution I had thought of is setting the cudaError_t return value to 30 (cudaErrorUnknown), or some custom error, which I could later check. But I haven't found how to do it and I guess that it's not possible. Is there any way to do what I want and do the cudaMemcpy only when the device finds that it's running out of memory?
But I haven't found how to do it and I guess that it's not possible
Error numbers from the runtime are set by the host driver. They are not available to the programmer, and they cannot be set in kernels either, so your guess is correct. There are assertions available in device code for debugging, and there are ways to cause a kernel to abort abnormally, but the latter causes context destruction and a loss of the contents of device memory, which I suspect won't help you.
About the best you can do is use a mapped host or managed allocation as a way for the host to keep track of the consumption of allocated memory on the device. Then you don't need an explicit memcpy and the latency will be minimized, but you will need some sort of synchronization on the running kernel in that case.
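A minimal sketch of that approach, assuming a kernel that bumps a fill counter with atomicAdd; the names append_values and fill_count, the capacity N, and the 90% threshold are illustrative, not from the question:

#include <cuda_runtime.h>

__global__ void append_values(int* fill_count /*, ... */)
{
    // Threads that append an element bump the shared fill counter.
    atomicAdd(fill_count, 1);
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory

    // Mapped (zero-copy) allocation: the device writes straight into
    // host-visible memory, so no explicit cudaMemcpy is needed.
    int* h_count;
    cudaHostAlloc((void**)&h_count, sizeof(int), cudaHostAllocMapped);
    *h_count = 0;

    int* d_count;
    cudaHostGetDevicePointer((void**)&d_count, h_count, 0);

    const int N = 1 << 20;                   // capacity of the device-side array
    for (int i = 0; i < 100; ++i) {
        append_values<<<64, 256>>>(d_count);
        cudaDeviceSynchronize();             // make the counter value visible
        if (*h_count > (9 * N) / 10) {
            // Approaching capacity: reallocate the array here.
            break;
        }
    }
    cudaFreeHost(h_count);
    return 0;
}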

What happens when a C++ program gets a runtime error?

Say, in C++, if I dereference a pointer pointing to memory that was already released, I get a bad access message and control goes back to the OS. Can someone explain what happens there in a little more detail? It is an interview question on OS/compiler topics.
Once you delete that memory, however, C++ marks it as free and may hand it out to anyone that asks for it.
That is because deleting a block of memory does not zero the value of the pointers that point to it. Deleting memory merely makes a note that the memory is available to be allocated for some other purpose. Until that happens, the memory may appear to be intact -- but you can't count on it, and on some compiler/runtime/architecture combinations your program will behave differently -- it may even crash.
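A minimal sketch of the effect described above; whether the read after delete prints the old value, prints garbage, or crashes is entirely up to the implementation:

#include <iostream>

int main()
{
    int* p = new int(42);
    std::cout << *p << std::endl;   // fine: prints 42

    delete p;   // the allocator notes the block is free;
                // p itself still holds the stale address

    std::cout << *p << std::endl;   // undefined behavior: may print 42,
                                    // print garbage, or crash
    return 0;
}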

C++/CLI memory allocation with new throws an exception

I have (I believe) a very classic problem with memory allocation using "new".
Here is the piece of code I use:
float* _normals = NULL;
try {
    _normals = new float[_x*_y*_z*3];
} catch (System::Exception^ e) {
    Windows::Forms::MessageBox::Show("Normals:\n" + e->Message);
    if (e->InnerException != nullptr && e->InnerException->Message != nullptr)
        Windows::Forms::MessageBox::Show("Details:\n" + e->InnerException->Message);
    _file->Close();
    return 0;
}
I don't know if you can tell from this piece of code, but this is a mixed managed/unmanaged program. I don't know if that matters.
Now, when I run this code and try to allocate, say, 256*256*128*3 floats, it runs normally. When I go with 492*492*442*3 floats, it throws an "External component has thrown an exception" exception. That is around 1.2GB, right? My system has 6GB of RAM, with around 3GB free. Can you tell the problem from this information? Can I handle it? I read somewhere about program memory space; maybe the program's memory space is not enough? (I don't know anything about that matter. If you can, enlighten me.)
Please ask if you need more information.
Thank you in advance
Address space for a 32-bit Windows program (Windows is implied by C++/CLI) running on a 64-bit operating system is either
2 GB by default
4 GB if linked with /LARGEADDRESSAWARE. This flag can also be added later by editbin.
Your problem is address space fragmentation. Just because you've only allocated, say, 100MB doesn't mean that you can allocate another 1.9GB chunk in a 2GB address space. Your new allocation needs contiguous addresses.
If, for example, a DLL used by your non-LAA process had a load-address at 0x40000000, then you could allocate a 1GB block below it, or an almost-1GB block above it, but you could never allocate a single block larger than 1GB.
The easiest solution is to compile as 64-bit. Even though the address space will still be fragmented, the open spaces between allocations will be much larger and not cause you problems.
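If you have to stay 32-bit, the best you can do is detect the failure and degrade gracefully. A minimal native-C++ sketch using the sizes from the question; nothrow new returns nullptr instead of throwing, so a fragmented address space can be handled explicitly:

#include <cstddef>
#include <iostream>
#include <new>

int main()
{
    const std::size_t x = 492, y = 492, z = 442;
    const std::size_t count = x * y * z * 3;           // ~321M floats
    std::cout << "requesting " << count * sizeof(float)
              << " bytes" << std::endl;                // ~1.28 GB

    // nothrow new returns nullptr on failure instead of throwing.
    float* normals = new (std::nothrow) float[count];
    if (normals == nullptr) {
        std::cout << "allocation failed: no contiguous block that large"
                  << std::endl;
        return 1;
    }
    delete[] normals;
    return 0;
}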

CUDA Shared Memory not exclusive to block while debugging

Basically, I am having a difficult time understanding exactly what is going wrong here.
Shared memory does not appear to be behaving in a block-exclusive manner while debugging. When running the code normally, nothing is printed. However, if I attempt to debug it, shared memory appears to be shared between blocks and the print statement is reached.
This is an example, obviously this isn't terribly useful code, but it reproduces the issue on my system. Am I doing something wrong? Is this a bug or expected behavior from the debugger?
__global__ void test()
{
    __shared__ int result[1];

    if (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0)
        result[0] = 4444;
    else
    {
        if (result[0] == 4444)
            printf("This should never print if shared memory is unique\n");
    }
}
And to launch it:
test<<<dim3(8,8,1), dim3(8,8,1)>>>();
It is also entirely possible that I have completely misunderstood shared memory.
Thanks for the help.
Other Information:
I am using a GTX 460. compute_20 and sm_20 are set for the project. I am writing the code in Visual Studio 2010 using Nsight 3.0 Preview.
There is a subtle but important difference between

    shared memory is shared between blocks and the print statement is reached

and

    shared memory is re-used by successive blocks and the print statement is reached
You are assuming the former, but the latter is what is really happening.
Your code, with the exception of the first block, is reading from uninitialised memory. That, in itself, is undefined behaviour. C++ (and CUDA) don't guarantee that statically declared memory is set to any value when it comes into or goes out of scope. You can't assume that result doesn't already hold the value 4444, especially when it is probably stored in the same shared scratch space as a previous block which may have set it to 4444.
The entire premise of the code and this question is flawed, and you should draw no conclusions from the result you see other than that undefined behaviour is undefined.
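For completeness, a sketch of what the kernel would have to look like for the test to be meaningful: each block initializes its own copy of the shared variable and synchronizes before any thread reads it (the kernel name is illustrative):

#include <cstdio>

__global__ void test_fixed()
{
    __shared__ int result[1];

    // One thread per block initializes the block's own copy...
    if (threadIdx.x == 0 && threadIdx.y == 0)
        result[0] = (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0)
                        ? 4444 : 0;
    __syncthreads();   // ...and every thread waits for that write.

    // Non-zero blocks now read a defined value (0), so this never prints,
    // whether or not the physical scratch space was recycled.
    if (!(blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0)
            && result[0] == 4444)
        printf("unreachable: each block sees its own initialized copy\n");
}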

Copying an integer from GPU to CPU

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit, yes. Issuing the command, waiting for the GPU to respond, etc... The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve thousands of memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would probably be better to do the whole task on the CPU).
What is the best way to do this?
Well, I would suggest simply trying it yourself. As you said, you can either use mapped pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to the host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadSynchronize() (deprecated in later CUDA versions in favour of cudaDeviceSynchronize()) to make sure the kernel has ended its execution. Kernel calls are asynchronous.
You can use cudaMemcpyAsync, which is also asynchronous, but the GPU cannot run a kernel and execute cudaMemcpyAsync in parallel unless you use streams.
I never actually tried it, but if your program won't crash when the loop executes too many times, you might try to skip the synchronisation and let it iterate until the special value is seen in RAM. With that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will, however, need to somehow prevent the loop from iterating too many times; CUDA events may be helpful.
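For reference, the plain cudaMemcpy variant looks like this minimal sketch (the kernel body and the stop condition are placeholders); on the default stream, cudaMemcpy also implicitly waits for the preceding kernel to finish:

#include <cuda_runtime.h>

__global__ void step(int* d_flag)
{
    // ... real work here; one thread raises the flag when done ...
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *d_flag = 1;   // illustrative condition
}

int main()
{
    int* d_flag;
    cudaMalloc((void**)&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    int h_flag = 0;
    for (int i = 0; i < 1000 && !h_flag; ++i) {
        step<<<64, 256>>>(d_flag);
        // One small copy per iteration; it also synchronizes with the
        // kernel launched on the default stream above.
        cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_flag);
    return 0;
}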
Why not use pinned memory? If your system supports it, see the CUDA C Programming Guide's section on pinned memory.
Copying data to and from the GPU is much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value, then this will result in very slow performance; don't do it.
What you are describing sounds like a serial algorithm; your algorithm needs to be parallelised in order to make it worth doing with CUDA. If you can't rewrite your algorithm as a single write of multiple data to the GPU, multiple threads, and a single write of multiple data back to the CPU, then your algorithm should be done on the CPU.
If you need the value computed in the previous kernel call to launch the next one, then the sequence is serialized and your only choice is a cudaMemcpy(dst, src, sizeof(int), cudaMemcpyDeviceToHost); after each launch.
If the kernel launch parameters do not depend on the previous launch, then you can store the result of each kernel invocation in GPU memory and download all the results at once.
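A sketch of that batched pattern, assuming the launches really are independent (the kernel and its computation are placeholders):

#include <cuda_runtime.h>

__global__ void work(int* results, int iter)
{
    // Each invocation writes its single result into its own slot.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        results[iter] = iter * iter;   // placeholder computation
}

int main()
{
    const int iters = 100;
    int* d_results;
    cudaMalloc((void**)&d_results, iters * sizeof(int));

    for (int i = 0; i < iters; ++i)
        work<<<64, 256>>>(d_results, i);

    // One transfer at the end instead of one per kernel call.
    int h_results[iters];
    cudaMemcpy(h_results, d_results, iters * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_results);
    return 0;
}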