What's a good way to zero-out cudaMalloc'd data? - cuda

What's a good way to zero out cudaMalloc'd data? Assume that using cudaMemset or cudaMemsetAsync from the CPU results in synchronization issues with other CUDA API calls, forcing you to do something else.
Edit 1
In the first picture below, you can see that Thread 1291997440 has issued a cudaMemcpyAsync that takes a while to execute. For some reason, this cudaMemcpyAsync seems to be blocking the cudaMemsetAsync, as shown in the second picture below. Note that each CPU thread queues up these operations in its own stream. Someone of reputation mentioned offsite that using a kernel instead of a cudaMemsetAsync call could clear the memory sooner -- that's why I'm pursuing this question.
Edit 2
At this point, I've improved the code (by reducing the size of the HtoD and DtoH copies) enough to prevent this issue from showing up. The pictures above are from the previous night. If the comments are 100% true, then there must be some other sliver in the profiling report that I didn't notice. In the newer version of the code, there is no observable difference between using cudaMemsetAsync and calling a kernel to clear the memory.

Here I show the results of zeroing out ~5.5 GB of data with three different approaches. The code was compiled with -O3 and run on a V100 with 16 GB of memory.
Approach A: cudaMemset
To establish a baseline, I zero out the data from the CPU using cudaMemset. This is very fast, but even the cudaMemsetAsync version can be serialized at run time if lots of cudaMemcpys are in flight.
Result: 6 ms
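For reference, the asynchronous variant issued on its own stream might look like the sketch below. The stream setup is my illustration, not part of the test program; data_on_gpu and SIZE_OF_DATA are defined in the test implementation further down.

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
// Enqueue the clear on a dedicated stream so it can overlap other work...
cudaMemsetAsync(data_on_gpu, 0, SIZE_OF_DATA, stream);
// ...though, as described in Edit 1, it can still end up serialized behind
// in-flight cudaMemcpyAsync calls at run time.
cudaStreamSynchronize(stream);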
Approach B: memset
Calling memset from inside the kernel may invoke the worst of both the CPU and GPU worlds. memset has this feeling to it that once you exit it, the data is exactly as you indicated it should be. Of course that isn't true in the presence of other kernels, race conditions, etc., but that's my guess as to why it's so slow.
Result: 241 ms
Approach C: coalesced writes
Writing in a coalesced manner seems to give the best of both the CPU and GPU worlds. It's as fast as the CPU-issued cudaMemset, and it's also clear to any programmer making coalesced writes that, of course, race conditions and the like are possible.
Result: 6 ms
Conclusion
If you can't use cudaMemset[Async] from the CPU, then use coalesced writes with 32 threads or more per block.
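A compact way to get such coalesced writes, without the per-thread chunk arithmetic used in the test implementation below, is a grid-stride loop. This is a sketch in the same spirit as Approach C, not the code that produced the timings above:

__global__ void zero_kernel(int * data, long count) {
    // Consecutive threads write consecutive ints, so every warp's writes
    // coalesce; the grid-wide stride walks the whole buffer regardless of
    // launch shape.
    const long stride = (long)blockDim.x * gridDim.x;
    for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < count; i += stride) {
        data[i] = 0;
    }
}

// Example launch: zero_kernel<<<80, 128>>>(data_on_gpu, COUNT);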
Program Output
Starting timer for calling cudaMemset from CPU
Stopping timer for calling cudaMemset from CPU took 0.006015s
Starting timer for calling kernel<80,1> that uses memset
Stopping timer for calling kernel<80,1> that uses memset took 0.393921s
Starting timer for calling kernel<80,2> that uses memset
Stopping timer for calling kernel<80,2> that uses memset took 0.300473s
Starting timer for calling kernel<80,4> that uses memset
Stopping timer for calling kernel<80,4> that uses memset took 0.269686s
Starting timer for calling kernel<80,8> that uses memset
Stopping timer for calling kernel<80,8> that uses memset took 0.241374s
Starting timer for calling kernel<80,16> that uses memset
Stopping timer for calling kernel<80,16> that uses memset took 0.645509s
Starting timer for calling kernel<80,32> that uses memset
Stopping timer for calling kernel<80,32> that uses memset took 0.611437s
Starting timer for calling kernel<80,64> that uses memset
Stopping timer for calling kernel<80,64> that uses memset took 0.611276s
Starting timer for calling kernel<80,128> that uses memset
Stopping timer for calling kernel<80,128> that uses memset took 0.459663s
Starting timer for calling kernel<80,256> that uses memset
Stopping timer for calling kernel<80,256> that uses memset took 0.308788s
Starting timer for calling kernel<80,512> that uses memset
Stopping timer for calling kernel<80,512> that uses memset took 0.595893s
Starting timer for calling kernel<80,1024> that uses memset
Stopping timer for calling kernel<80,1024> that uses memset took 2.552866s
Starting timer for calling kernel<80,1> that performs coalesced writes
Stopping timer for calling kernel<80,1> that performs coalesced writes took 0.136967s
Starting timer for calling kernel<80,2> that performs coalesced writes
Stopping timer for calling kernel<80,2> that performs coalesced writes took 0.068426s
Starting timer for calling kernel<80,4> that performs coalesced writes
Stopping timer for calling kernel<80,4> that performs coalesced writes took 0.039974s
Starting timer for calling kernel<80,8> that performs coalesced writes
Stopping timer for calling kernel<80,8> that performs coalesced writes took 0.017121s
Starting timer for calling kernel<80,16> that performs coalesced writes
Stopping timer for calling kernel<80,16> that performs coalesced writes took 0.008586s
Starting timer for calling kernel<80,32> that performs coalesced writes
Stopping timer for calling kernel<80,32> that performs coalesced writes took 0.006139s
Starting timer for calling kernel<80,64> that performs coalesced writes
Stopping timer for calling kernel<80,64> that performs coalesced writes took 0.006075s
Starting timer for calling kernel<80,128> that performs coalesced writes
Stopping timer for calling kernel<80,128> that performs coalesced writes took 0.006093s
Starting timer for calling kernel<80,256> that performs coalesced writes
Stopping timer for calling kernel<80,256> that performs coalesced writes took 0.006479s
Starting timer for calling kernel<80,512> that performs coalesced writes
Stopping timer for calling kernel<80,512> that performs coalesced writes took 0.006972s
Starting timer for calling kernel<80,1024> that performs coalesced writes
Stopping timer for calling kernel<80,1024> that performs coalesced writes took 0.007354s
Test Implementation
memset_timing.cu
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
#define round_up(x, multiple) (((x + multiple - 1) / multiple) * multiple)
const long COUNT = 80 << 24;
const int GPU_CACHE_LINE_SIZE_IN_BYTES = 32;
const long SIZE_OF_DATA = sizeof(int) * COUNT;
__global__ void clear_scratch_space_kernel(int * data, int blocks, int threads) {
    // BOZO: change the code to just error out if we hit any of the border cases below
    const int idx = blockIdx.x * threads + threadIdx.x;
    long size = sizeof(int) * COUNT;
    long size_of_typical_chunk = round_up(size / (blocks * threads), GPU_CACHE_LINE_SIZE_IN_BYTES);
    // Due to truncation, the threads at the end won't have anything to do. This is a little sloppy
    // but costs us hardly anything in performance, so we do the simpler thing.
    long this_threads_offset = idx * size_of_typical_chunk;
    if (this_threads_offset > SIZE_OF_DATA) {
        return;
    }
    long size_of_this_threads_chunk;
    if (this_threads_offset + size_of_typical_chunk >= SIZE_OF_DATA) {
        // We are the last thread, so we do a partial write
        size_of_this_threads_chunk = SIZE_OF_DATA - this_threads_offset;
    } else {
        size_of_this_threads_chunk = size_of_typical_chunk;
    }
    void * starting_address = reinterpret_cast<void *>(reinterpret_cast<char *>(data) + this_threads_offset);
    memset((void *) starting_address, 0, size_of_this_threads_chunk);
}
__global__ void clear_scratch_space_with_coalesced_writes_kernel(int * data, int blocks, int threads) {
    if (COUNT % (blocks * threads) != 0) {
        printf("Adjust the SIZE_OF_DATA so it's divisible by the number of (blocks * threads)\n");
    }
    const long count_of_ints_in_each_blocks_chunk = COUNT / blocks;
    int block = blockIdx.x;
    int thread = threadIdx.x;
    const long rounds_needed = count_of_ints_in_each_blocks_chunk / threads;
    const long this_blocks_starting_offset = block * count_of_ints_in_each_blocks_chunk;
    //printf("Clearing %li ints starting at offset %li\n", count_of_ints_in_each_blocks_chunk, this_blocks_starting_offset);
    int * this_threads_base_pointer = &data[this_blocks_starting_offset + thread];
    for (int round = 0; round < rounds_needed; ++round) {
        *this_threads_base_pointer = 0;
        this_threads_base_pointer += threads;
    }
}
void set_gpu_data_to_ones(int * data_on_gpu) {
    cudaMemset(data_on_gpu, 1, SIZE_OF_DATA);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
}

void check_gpu_data_is_zeroes(int * data_on_gpu, char * data_on_cpu) {
    cudaMemcpy(data_on_cpu, data_on_gpu, SIZE_OF_DATA, cudaMemcpyDeviceToHost);
    for (long i = 0; i < SIZE_OF_DATA; ++i) {
        if (data_on_cpu[i] != 0) {
            printf("Failed to zero-out byte offset %li in the data\n", i);
        }
    }
}
int main(void)
{
    const long count = COUNT;
    int * data_on_gpu;
    char * data_on_cpu = (char *) malloc(SIZE_OF_DATA);
    if (data_on_cpu == NULL) {
        printf("Failed to allocate data on cpu\n");
        return 1;
    }
    CUDA_CHECK_RETURN(cudaMalloc(&data_on_gpu, sizeof(int) * count));
    {
        Timer memset_timer("calling cudaMemset from CPU");
        memset_timer.start();
        CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 0, SIZE_OF_DATA));
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
    }
    for (int threads = 1; threads <= 1024; threads *= 2) {
        set_gpu_data_to_ones(data_on_gpu);
        char buffer[200];
        sprintf(buffer, "calling kernel<80,%i> that uses memset", threads);
        Timer memset_timer(buffer);
        memset_timer.start();
        clear_scratch_space_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
        check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
    }
    for (int threads = 1; threads <= 1024; threads *= 2) {
        set_gpu_data_to_ones(data_on_gpu);
        char buffer[200];
        sprintf(buffer, "calling kernel<80,%i> that performs coalesced writes", threads);
        Timer memset_timer(buffer);
        memset_timer.start();
        clear_scratch_space_with_coalesced_writes_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        memset_timer.stop_and_report();
        check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
    }
    free(data_on_cpu);
}
/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << " (" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
Timer.h
#include <string>
#include <chrono>

class Timer {
public:
    Timer(std::string name_, bool allow_output = true);
    virtual ~Timer();
    void start();
    void start_or_restart();
    void stop(bool force_no_output = false);
    void report(const int count = 0, bool preface_with_spaces = true);
    void stop_and_report(const int count = 0);
    double duration_in_seconds();
    long duration_in_microseconds();
private:
    std::string name;
    // even though we call report, we still might suppress output since the output is often a type of debugging info
    bool allow_output;
    std::chrono::time_point<std::chrono::system_clock> start_time;
    std::chrono::time_point<std::chrono::system_clock> end_time;
    bool started_before = false;
    bool currently_rolling = false; // if the timer (i.e., the clock) is currently rolling
    double duration = -1.0;
};
Timer.cpp
#include <stdexcept>
#include "timer.h"

Timer::Timer(std::string name_, bool allow_output_) {
    name = name_;
    allow_output = allow_output_;
}

Timer::~Timer() {
}

void Timer::start() {
#ifdef DEBUG
    if (started_before) {
        printf("Attempting to start same timer twice. Exiting.\n");
        throw std::runtime_error("Attempting to start timer that was previously started");
    }
#endif
    if (allow_output) {
        printf("Starting timer for %s\n", name.c_str());
    }
    start_time = std::chrono::system_clock::now();
    currently_rolling = true;
    started_before = true;
    duration = 0.0;
}

void Timer::start_or_restart() {
    if (currently_rolling) {
        throw std::runtime_error("Can't start or restart a timer that's already rolling.");
    }
    if (!started_before && allow_output) {
        printf("Starting timer for %s\n", name.c_str());
    }
    started_before = true;
    start_time = std::chrono::system_clock::now();
    currently_rolling = true;
    if (duration < 0.0) {
        duration = 0.0;
    }
}

void Timer::stop(bool force_no_output) {
    if (!force_no_output) { // Slight style violation: I prefer nested if's over statements with two && operators
        if (allow_output && duration <= 0.0) {
            printf("Stopping timer for %s\n", name.c_str());
        }
    }
    end_time = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end_time - start_time;
    currently_rolling = false;
    duration += elapsed_seconds.count();
}

void Timer::stop_and_report(const int count) {
    stop(true);
    report(count, false);
}

double Timer::duration_in_seconds() {
    return duration;
}

long Timer::duration_in_microseconds() {
    return static_cast<long>(duration * 1000000);
}

void Timer::report(const int count, bool preface_with_spaces) {
    std::string preface;
    if (preface_with_spaces) {
        preface = " ";
    } else {
        preface = "Stopping ";
    }
    if (allow_output) {
        if (!started_before) {
            printf("%stimer for %s was never started\n", preface.c_str(), name.c_str());
        } else if (count > 0) {
            double average = (duration / static_cast<double>(count)) * 1000.0;
            printf("%stimer for %s took %fs, %.3lfus each\n", preface.c_str(), name.c_str(), duration, average * 1000.0);
        } else {
            printf("%stimer for %s took %fs\n", preface.c_str(), name.c_str(), duration);
        }
    }
}

Related

Cuda C threads synchronization with printf or other functions

I have a problem with thread IDs while the block executes.
I would like a sentence like "My temporary string is printed via GPU!"; as you can see (in the photo attached earlier), the sentence is displayed wrongly and I don't know how to fix it.
Code:
__global__ void Print(const char* const __string, const size_t* const loop_repeat)
{
    int id_x = threadIdx.x + blockIdx.x * blockDim.x;
    while (id_x < static_cast<int>(*loop_repeat))
    {
        printf("%c", __string[id_x]);
        __syncthreads();
        id_x += blockDim.x * gridDim.x;
    }
}
int main()
{
    const char* my_string = "My temporary string is printed via GPU!";
    size_t temp{};
    temp = Get_String_Length(my_string); //get the string length
    //GPU MEMORY ALLOCATION
    size_t* my_string_length{};
    cudaMalloc((void**)&my_string_length, sizeof(size_t));
    //COPY VALUE FROM CPU(RAM) TO GPU
    cudaMemcpy(my_string_length, &temp, sizeof(size_t), HostToDevice);
    char* string_GPU{};
    cudaMalloc((void**)&string_GPU, (temp) * sizeof(char));
    //COPY VALUE FROM CPU(RAM) TO GPU
    cudaMemcpy(string_GPU, my_string, (temp) * sizeof(char), HostToDevice);
    dim3 grid_size(1);
    dim3 block_size((temp));
    Print <<< grid_size, temp >>> (string_GPU, my_string_length);
    cudaError_t final_error = cudaDeviceSynchronize(); //for synchronization e.g Hello_World then printf
    if (final_error == cudaSuccess)
    {
        printf("%cKernel executed successfully with code: %d !%\n", NEW_LINE, final_error);
    }
    else
    {
        printf("%cKernel executed with code error: %d !\n", NEW_LINE, final_error);
    }
    cudaFree(my_string_length);
    cudaFree(string_GPU);
    return 0;
}
I will be grateful for any help given.
The main issue here is that you are expecting the thread or warp execution order to be predictable. Actually, it is not. Your usage of __syncthreads() doesn't fix or address this issue.
If you want the warps to execute in a predictable order (not recommended) you would need to impose that order yourself. Here is an example that demonstrates that for this very simple code. It is not extensible without modification to larger strings, and this method will completely break down if you introduce more than 1 threadblock.
$ cat t1543.cu
#include <stdio.h>
#include <stdlib.h>

__global__ void Print(const char* const __string, const size_t* const loop_repeat)
{
    int id_x = threadIdx.x + blockIdx.x * blockDim.x;
    int warp_ID = threadIdx.x >> 5;
    while (id_x < static_cast<int>(*loop_repeat))
    {
        if (warp_ID == 0)
            printf("%c", __string[id_x]);
        __syncthreads();
        if (warp_ID == 1)
            printf("%c", __string[id_x]);
        __syncthreads();
        id_x += blockDim.x * gridDim.x;
    }
}

int main()
{
    const char* my_string = "My temporary string is printed via GPU!";
    size_t temp;
    temp = 40; //get the string length
    //GPU MEMORY ALLOCATION
    size_t* my_string_length;
    cudaMalloc((void**)&my_string_length, sizeof(size_t));
    //COPY VALUE FROM CPU(RAM) TO GPU
    cudaMemcpy(my_string_length, &temp, sizeof(size_t), cudaMemcpyHostToDevice);
    char* string_GPU;
    cudaMalloc((void**)&string_GPU, (temp) * sizeof(char));
    //COPY VALUE FROM CPU(RAM) TO GPU
    cudaMemcpy(string_GPU, my_string, (temp) * sizeof(char), cudaMemcpyHostToDevice);
    dim3 grid_size(1);
    dim3 block_size((temp));
    Print <<< grid_size, temp >>> (string_GPU, my_string_length);
    cudaError_t final_error = cudaDeviceSynchronize(); //for synchronization e.g Hello_World then printf
    if (final_error == cudaSuccess)
    {
        printf("\nKernel executed successfully with code: %d !%\n", final_error);
    }
    else
    {
        printf("\nKernel executed with code error: %d !\n", final_error);
    }
    cudaFree(my_string_length);
    cudaFree(string_GPU);
    return 0;
}
$ nvcc -o t1543 t1543.cu
$ cuda-memcheck ./t1543
========= CUDA-MEMCHECK
My temporary string is printed via GPU!
Kernel executed successfully with code: 0 !%
========= ERROR SUMMARY: 0 errors
$
Note that I'm not suggesting the above is good coding style. It's provided for understanding of the issue. Even this code is relying on the idea that the threads within a warp will call the printf function in a predictable order, which is not guaranteed by the CUDA programming model. So the code is really still broken.
This happens because "the multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps," as you can see in the CUDA Programming Guide. So the first 32 threads cover "My temporary string is printed v" and the remaining threads cover "ia GPU!", and it seems that the kernel executed the latter warp before the first one.
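If ordered output is all that's needed, a simpler (though fully serial) alternative to imposing a warp order is to let a single thread print the whole string. A minimal sketch, not from the answer above:

__global__ void PrintOrdered(const char* const __string, const size_t* const loop_repeat)
{
    // One thread prints everything, so the order is trivially correct,
    // at the cost of serializing the work.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (size_t i = 0; i < *loop_repeat; ++i)
            printf("%c", __string[i]);
    }
}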

Atomic operation on the circular global buffer in cuda

I am implementing a circular buffer in global memory to enable all threads to read/write data to the same buffer simultaneously. It is a very simple producer/consumer algorithm on the CPU. But I found something wrong in my CUDA code.
The circular buffer was defined as follows:
#define BLOCK_NUM 1024
#define THREAD_NUM 64
#define BUFFER_SIZE BLOCK_NUM*THREAD_NUM*10

struct Stack {
    bool bDirty[BUFFER_SIZE];
    unsigned int index;
    unsigned int iStackSize;
};
The read device function is implemented as:
__device__ void read(Stack *pStack) {
    unsigned int index = atomicDec(&pStack->index, BUFFER_SIZE-1);
    if (--index >= BUFFER_SIZE)
        index = BUFFER_SIZE - 1;
    // check
    if (pStack->bDirty[index] == false) {
        printf("no data\n");
        return;
    }
    // set read flag
    pStack->bDirty[index] = false;
    atomicSub(&pStack->iStackSize, 1);
}
The write device function is:
__device__ void write(Stack *pStack) {
    unsigned int index = atomicInc(&pStack->index, BUFFER_SIZE - 1);
    // check
    if (pStack->bDirty[index] == true) {
        printf("why dirty\n");
        return;
    }
    pStack->bDirty[index] = true;
    atomicAdd(&pStack->iStackSize, 1);
}
In order to test the read/write functions in a more robust way, I wrote the following kernels:
__global__ void kernelWrite(Stack *pStack) {
    if (threadIdx.x != 0) // make writes fewer than the thread count, for testing purposes
        write(pStack);
}

__global__ void kernelRead(Stack *pStack) {
    read(pStack);
    __syncthreads();
    if (threadIdx.x % 3 != 0) // make writes fewer than reads
        write(pStack);
    __syncthreads();
}
In the main function, I used a dead loop to test if the read/write is atomic.
int main() {
    Stack *pHostStack = (Stack*)malloc(sizeof(Stack));
    Stack *pStack;
    cudaMalloc(&pStack, sizeof(Stack));
    cudaMemset(pStack, 0, sizeof(Stack));
    while (true) { // dead loop
        kernelWrite<<<BLOCK_NUM, THREAD_NUM>>>(pStack);
        cudaDeviceSynchronize();
        cudaMemcpy(pHostStack, pStack, sizeof(Stack), cudaMemcpyDeviceToHost);
        while (pHostStack->iStackSize >= BLOCK_NUM*THREAD_NUM) {
            kernelRead<<<BLOCK_NUM, THREAD_NUM>>>(pStack);
            cudaDeviceSynchronize();
            cudaMemcpy(pHostStack, pStack, sizeof(Stack), cudaMemcpyDeviceToHost);
        }
    }
    return 0;
}
When I execute the above code, I get the error messages "why dirty" and "no data". What is wrong with the read/write logic?
By the way, I do not map the thread ID to a linear buffer address because, in my application, maybe only 10% of threads write to the buffer; it is unpredictable/random.
The key problem is that the operation as a whole is not truly atomic: reserving an index and reading/writing the corresponding buffer slot are separate, unsynchronized steps. The weird thing is that when the total thread count is less than 4096, no error message is shown.
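One way to make the publish step explicit, sketched below under the assumption that bDirty is changed to int bDirty[BUFFER_SIZE] (atomicCAS does not operate on bool), is to claim a slot's flag atomically rather than with a plain load and store. This is my illustration, not the poster's code:

__device__ void write_published(Stack *pStack) {
    unsigned int index = atomicInc(&pStack->index, BUFFER_SIZE - 1);
    // Claim the slot only when it is clean (0 -> 1). Until the flag flips,
    // no reader observes a half-written slot. A grid-wide spin like this can
    // livelock, so a real design would bound the loop or fail gracefully.
    while (atomicCAS(&pStack->bDirty[index], 0, 1) != 0) { }
    atomicAdd(&pStack->iStackSize, 1);
}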

thrust::device_vector in constant memory

I have a float array that needs to be referenced many times on the device, so I believe the best place to store it is in __constant__ memory (using this reference). The array (or vector) will need to be written once at run-time when initializing, but read by multiple different functions many millions of times, so repeatedly copying it to the kernel on each function call seems like A Bad Idea.
const int n = 32;
__constant__ float dev_x[n]; //the array in question

struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    __host__ __device__ float operator()(const float& x) const { return fmax(x,C); }
};

void foo(const thrust::host_vector<float> &, const float &);

int main() {
    thrust::host_vector<float> x(n);
    //magic happens: populate x
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.0);
    return 0;
}

void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    //this method works fine, but the memory transfer is unacceptable
    thrust::device_vector<float> input_dev_vec(n);
    input_dev_vec = input_host_x; //I want to avoid this
    thrust::transform(input_dev_vec.begin(), input_dev_vec.end(), dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; //this memory transfer for debugging

    //this method compiles fine, but crashes at runtime
    thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x);
    thrust::transform(dev_ptr, dev_ptr+n, dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; //this line crashes
}
I tried adding a global thrust::device_vector dev_x(n), but that also crashed at run-time, and it would be in __global__ memory rather than __constant__ memory anyway.
This can all be made to work if I just discard the thrust library, but is there a way to use the thrust library with globals and device constant memory?
Good question! You can't cast a __constant__ array as if it's a regular device pointer.
I will answer your question (after the line below), but first: this is a bad use of __constant__, and it isn't really what you want. The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time. If each thread of the warp accesses a different constant memory location, then the accesses get serialized. So your access pattern, where consecutive threads access consecutive memory locations, will be 32 times slower than a uniform access. You should really just use device memory. If you need to write the data once, but read it many times, then just use a device_vector: initialize it once, and then read it many times.
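That write-once/read-many pattern might look like the following sketch (the names and the placeholder transform are my illustration, not from the question):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    const int n = 32;
    thrust::host_vector<float> x(n, 1.0f);  // populate once on the host
    thrust::device_vector<float> dev_x = x; // pay the HtoD transfer exactly once
    thrust::device_vector<float> dev_sol(n);
    // ...then read dev_x on the device as many times as you like.
    for (int pass = 0; pass < 1000; ++pass) {
        thrust::transform(dev_x.begin(), dev_x.end(), dev_sol.begin(),
                          thrust::placeholders::_1 * 2.0f);
    }
    return 0;
}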
To do what you asked, you can use a thrust::counting_iterator as the input to thrust::transform to generate a range of indices into your __constant__ array. Then your functor's operator() takes an int index operand rather than a float value operand, and does the lookup into constant memory.
(Note that this means your functor is now __device__ code only. You could easily overload the operator to take a float and call it differently on host data if you need portability.)
I modified your example to initialize the data and print the result to verify that it is correct.
#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>

const int n = 32;
__constant__ float dev_x[n]; //the array in question

struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}

    // only works as a device function
    __device__ float operator()(const int& i) const {
        // use index into constant array
        return fmax(dev_x[i], C);
    }
};

void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);

    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(n),
                      dev_sol.begin(),
                      struct_max(x0));
    host_sol = dev_sol; //this no longer crashes

    for (int i = 0; i < n; i++)
        printf("%f\n", host_sol[i]);
}

int main() {
    thrust::host_vector<float> x(n);
    //magic happens: populate x
    for (int i = 0; i < n; i++) x[i] = rand() / (float)RAND_MAX;
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.5);
    return 0;
}

CUDA program giving garbage value

I really do not understand why the output of the code below is not a and b.
#include <cutil.h>
#include <iostream>

__global__ void p(unsigned char **a) {
    unsigned char temp[2];
    temp[0] = 'a';
    temp[1] = 'b';
    a[0] = temp;
}

void main() {
    unsigned char **a;
    cudaMalloc((void**)&a, sizeof(unsigned char*));
    p<<<1,1>>>(a);
    unsigned char **c;
    unsigned char b[2];
    cudaMemcpy(c, a, sizeof(unsigned char *), cudaMemcpyDeviceToHost);
    cudaMemcpy(b, c[0], 2*sizeof(unsigned char), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 2; i++) {
        printf("%c\n", b[i]);
    }
    getchar();
}
What is wrong with my logic?
Let's leave out CUDA for now. Let's just make a function that writes data to a user-provided array. The user passes the array via a pointer:
void fill_me_up(int * dst)
{
    // We sure hope that `dst` points to a large enough area of memory!
    dst[0] = 28;
    dst[1] = 75;
}
Now, what you're doing with the local variable doesn't make sense, because you want to use the address of a local variable, which becomes invalid after you leave the function scope. The next best thing you could do is memcpy(), or some equivalent C++ algorithm:
void fill_me_up_again(int * dst)
{
    int temp[] = { 28, 75 };
    memcpy((void *)dst, (const void *)temp, sizeof(temp));
}
OK, now on to calling that function: We first must provide the target memory, and then pass a pointer:
int main()
{
    int my_memory[2];          // here's our memory -- automatic local storage
    fill_me_up(my_memory);     // OK, array decays to pointer-to-beginning
    fill_me_up(&my_memory[0]); // A bit more explicit

    int * your_memory = malloc(sizeof(int) * 2); // more memory, this time dynamic
    fill_me_up_again(your_memory);
    /* ... */
    free(your_memory);
}
(In C++ you would probably have used new int[2] and delete[] instead, but by using C's malloc() the connection to CUDA hopefully becomes clear.)
When you're moving fill_me_up to the CUDA device, you have to give it a device pointer rather than a host pointer, so you have to set that one up first and afterwards copy the results back out, but that's about the only change.
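A sketch of that device version, mirroring fill_me_up above (the names are my illustration):

__global__ void fill_me_up_gpu(int * dst)
{
    // Same contract as fill_me_up(), but dst must be a device pointer.
    dst[0] = 28;
    dst[1] = 75;
}

int main()
{
    int * device_memory;
    cudaMalloc((void **)&device_memory, sizeof(int) * 2); // device-side target memory
    fill_me_up_gpu<<<1, 1>>>(device_memory);

    int host_memory[2];
    cudaMemcpy(host_memory, device_memory, sizeof(int) * 2, cudaMemcpyDeviceToHost);
    printf("%d %d\n", host_memory[0], host_memory[1]); // prints: 28 75
    cudaFree(device_memory);
    return 0;
}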

mips _Unwind_Backtrace on SIGSEGV

On a MIPS platform, I am trying to get _Unwind_Backtrace to work. Currently, if I call print_trace manually, the stack trace is shown correctly, as below:
backtrace_helper 0x4b6958
backtrace_helper 0x4b6ab4
backtrace_helper 0x2ac2f628
Obtained 3 stack frames.
./v(print_trace+0x38) [0x4b6958]
./v(main+0x90) [0x4b6ab4]
/lib/libc.so.0(__uClibc_main+0x24c) [0x2ac2f628]
But when a SIGSEGV occurs, the stack trace does not show the correct function call sequence.
backtrace_helper 0x4b7a74
backtrace_helper 0x2ab9b84c
Obtained 2 stack frames.
./v(getLineIDByPhyIdx+0x3d8) [0x4b7a74]
/lib/libpthread.so.0(__new_sem_post+0x2c8) [0x2ab9b84c]
I am compiling with -g -fexceptions -rdynamic. I have also seen "How to generate a stacktrace when my gcc C++ app crashes", whose second answer mentions a wrong address; but when I set it up as he does, it only changes the second frame and the rest stays the same. The code snippet is below:
caller_address = (void *) uc->uc_mcontext.gregs[30]; // Frame pointer (from wikipedia here)
fprintf(stderr, "signal %d (%s), address is %p from %p\n",
        sig_num, strsignal(sig_num), info->si_addr,
        (void *)caller_address);
size = backtrace(array, 50);
/* overwrite sigaction with caller's address */
array[1] = caller_address;
messages = backtrace_symbols(array, size);
Code:
int main(int argc, char *argv[]) {
    registerSignalHandler(signalHandler);
    print_trace();
    {
        // Seg Fault
        int *p = NULL;
        *p = 54;
    }
}

void print_trace(void) {
    void *array[10];
    size_t size;
    char **strings;
    size_t i;
    /* Get the address at the time the signal was raised from the EIP (x86) */
    size = backtrace(array, 10);
    strings = backtrace_symbols(array, size);
    printf("Obtained %zd stack frames.\n", size);
    for (i = 0; i < size; i++)
        printf("%s\n", strings[i]);
    free(strings);
}

static _Unwind_Reason_Code
backtrace_helper (struct _Unwind_Context *ctx, void *a)
{
    struct trace_arg *arg = a;
    assert (unwind_getip != NULL);
    /* We are first called with address in the __backtrace function. Skip it. */
    if (arg->cnt != -1) {
        arg->array[arg->cnt] = (void *) unwind_getip (ctx);
        printf("backtrace_helper %p \n", arg->array[arg->cnt]);
    }
    if (++arg->cnt == arg->size)
        return _URC_END_OF_STACK;
    return _URC_NO_REASON;
}

/*
 * Perform stack unwinding by using the _Unwind_Backtrace.
 *
 * User application that wants to use backtrace needs to be
 * compiled with -fexceptions option and -rdynamic to get full
 * symbols printed.
 */
int backtrace (void **array, int size)
{
    struct trace_arg arg = { .array = array, .size = size, .cnt = -1 };
    if (unwind_backtrace == NULL)
        backtrace_init();
    if (size >= 1)
        unwind_backtrace (backtrace_helper, &arg);
    return arg.cnt != -1 ? arg.cnt : 0;
}

void signalHandler(int sig, siginfo_t* siginfo, void* notused)
{
    /* Print out the signal info */
    signalInfo(sig, siginfo);
    switch (sig) {
    case SIGSEGV:
    {
        print_trace();
        abort();
    }
    }
}
The frame pointer is practically never used on MIPS, and obtaining a backtrace without digging into symbols requires some heuristics.
The typical approach is to analyze the code preceding the current instruction pointer and try to find the function prologue that adjusts SP. Using that info, one can figure out the location of the preceding frame, and so on.
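As a rough illustration of that heuristic (a sketch only; a real unwinder must also track where $ra was saved, handle leaf functions, and cope with prologues it cannot find), one can scan backwards from the PC for the addiu $sp, $sp, -N instruction that opens the frame:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: walk backwards from 'pc' looking for the prologue
 * instruction "addiu $sp, $sp, -N", whose top halfword encodes as 0x27bd.
 * Returns the frame size N, or 0 if no prologue is found within 'limit'
 * instructions. */
static size_t find_frame_size(const uint32_t *pc, int limit)
{
    for (int i = 0; i < limit; ++i, --pc) {
        uint32_t insn = *pc;
        if ((insn >> 16) == 0x27bd) {          /* addiu sp, sp, imm16 */
            int16_t imm = (int16_t)(insn & 0xffff);
            if (imm < 0)                       /* negative: stack grows */
                return (size_t)(-imm);
        }
    }
    return 0; /* not found -- caller must fall back to symbol info */
}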
See these slides for some of the gory details:
http://elinux.org/images/0/07/Intricacies_of_a_MIPS_Stack_Backtrace_Implementation.pdf