I have a problem with thread IDs while a block executes.
I would like to print a sentence such as "My temporary string is printed via GPU!". As you can see in the photo attached earlier, the sentence is displayed incorrectly, and I don't know how to fix it.
Code:
__global__ void Print(const char* const __string, const size_t* const loop_repeat)
{
int id_x = threadIdx.x + blockIdx.x * blockDim.x;
while (id_x < static_cast<int>(*loop_repeat))
{
printf("%c", __string[id_x]);
__syncthreads();
id_x += blockDim.x * gridDim.x;
}
}
int main()
{
const char* my_string = "My temporary string is printed via GPU!";
size_t temp{};
temp = Get_String_Length(my_string); //get the string length
//GPU MEMORY ALLOCATION
size_t* my_string_length{};
cudaMalloc((void**)&my_string_length, sizeof(size_t));
//COPY VALUE FROM CPU(RAM) TO GPU
cudaMemcpy(my_string_length, &temp, sizeof(size_t), HostToDevice);
char* string_GPU{};
cudaMalloc((void**)&string_GPU, (temp) * sizeof(char));
//COPY VALUE FROM CPU(RAM) TO GPU
cudaMemcpy(string_GPU, my_string, (temp) * sizeof(char), HostToDevice);
dim3 grid_size(1);
dim3 block_size((temp));
Print <<< grid_size, temp >>> (string_GPU, my_string_length);
cudaError_t final_error = cudaDeviceSynchronize(); //for synchronization e.g Hello_World then printf
if (final_error == cudaSuccess)
{
printf("%cKernel executed successfully with code: %d !%\n", NEW_LINE, final_error);
}
else
{
printf("%cKernel executed with code error: %d !\n", NEW_LINE, final_error);
}
cudaFree(my_string_length);
cudaFree(string_GPU);
return 0;
}
I will be grateful for any help given.
The main issue here is that you expect threads or warps to execute in some predictable order. They do not. Your use of __syncthreads() does not address this issue.
If you want the warps to execute in a predictable order (not recommended) you would need to impose that order yourself. Here is an example that demonstrates that for this very simple code. It is not extensible without modification to larger strings, and this method will completely break down if you introduce more than 1 threadblock.
$ cat t1543.cu
#include <stdio.h>
#include <stdlib.h>
__global__ void Print(const char* const __string, const size_t* const loop_repeat)
{
int id_x = threadIdx.x + blockIdx.x * blockDim.x;
int warp_ID = threadIdx.x>>5;
while (id_x < static_cast<int>(*loop_repeat))
{
if (warp_ID == 0)
printf("%c", __string[id_x]);
__syncthreads();
if (warp_ID == 1)
printf("%c", __string[id_x]);
__syncthreads();
id_x += blockDim.x * gridDim.x;
}
}
int main()
{
const char* my_string = "My temporary string is printed via GPU!";
size_t temp;
temp = 40; //get the string length
//GPU MEMORY ALLOCATION
size_t* my_string_length;
cudaMalloc((void**)&my_string_length, sizeof(size_t));
//COPY VALUE FROM CPU(RAM) TO GPU
cudaMemcpy(my_string_length, &temp, sizeof(size_t), cudaMemcpyHostToDevice);
char* string_GPU;
cudaMalloc((void**)&string_GPU, (temp) * sizeof(char));
//COPY VALUE FROM CPU(RAM) TO GPU
cudaMemcpy(string_GPU, my_string, (temp) * sizeof(char), cudaMemcpyHostToDevice);
dim3 grid_size(1);
dim3 block_size((temp));
Print <<< grid_size, temp >>> (string_GPU, my_string_length);
cudaError_t final_error = cudaDeviceSynchronize(); //for synchronization e.g Hello_World then printf
if (final_error == cudaSuccess)
{
printf("\nKernel executed successfully with code: %d !%\n", final_error);
}
else
{
printf("\nKernel executed with code error: %d !\n", final_error);
}
cudaFree(my_string_length);
cudaFree(string_GPU);
return 0;
}
$ nvcc -o t1543 t1543.cu
$ cuda-memcheck ./t1543
========= CUDA-MEMCHECK
My temporary string is printed via GPU!
Kernel executed successfully with code: 0 !%
========= ERROR SUMMARY: 0 errors
$
Note that I'm not suggesting the above is good coding style. It's provided for understanding of the issue. Even this code is relying on the idea that the threads within a warp will call the printf function in a predictable order, which is not guaranteed by the CUDA programming model. So the code is really still broken.
This happens because the multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps, as described in the CUDA Programming Guide. The first 32 threads cover "My temporary string is printed v" and the remaining threads cover "ia GPU!". It seems that the second warp was executed before the first one.
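To make the warp split concrete, here is a small host-side C++ sketch (plain C++, not CUDA). It groups the characters of the string by the warp that would own them if thread i handles character i in a single block. The helper name split_by_warp is made up for illustration, and the warp size of 32 is the value quoted above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of CUDA): groups the characters of `text`
// by the warp that would own them when thread i handles character i.
std::vector<std::string> split_by_warp(const std::string& text,
                                       std::size_t warp_size = 32)
{
    std::vector<std::string> warps;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // For warp_size == 32 this is the same as threadIdx.x >> 5.
        const std::size_t warp_id = i / warp_size;
        if (warp_id >= warps.size()) {
            warps.resize(warp_id + 1);
        }
        warps[warp_id] += text[i];
    }
    return warps;
}
```

For the string in the question this gives two groups, "My temporary string is printed v" and "ia GPU!". Since nothing orders the two warps relative to each other, either group may reach printf first.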
What's a good way to zero out cudaMalloc'd data? Assume that using cudaMemset or cudaMemsetAsync from the CPU results in synchronization issues with other CUDA API calls, forcing you to do something else.
Edit 1
In the first picture below, you can see that Thread 1291997440 has issued a cudaMemcpyAsync and takes a while to execute it. For some reason, this cudaMemcpyAsync seems to be blocking the cudaMemsetAsync, as shown in the second picture below. Note that each CPU thread is queuing up these operations in its own stream. Someone of reputation mentioned offsite that using a kernel instead of using a cudaMemsetAsync call could result in clearing the memory sooner -- that's the reason I pursue this question.
Edit 2
At this point, I've improved the code (by reducing the size of the HtoD and DtoH copies) enough to prevent this issue from showing up. The above pictures are from the previous night. If the comments are 100% true, then there must be some other sliver in the profiling report that I didn't notice. In the newer version of the code, there is no observable difference between using cudaMemsetAsync and calling a kernel to clear out the memory.
Here I show the results of zeroing out ~5.5 GB of data with three different approaches. The code was compiled with -O3 and run on a V100 with 16 GB of memory.
Approach A: cudaMemset
To establish a baseline, I zero out the data from the CPU using cudaMemset. This is very fast, but even the cudaMemsetAsync version can be serialized at run time if lots of cudaMemcpys are in flight.
Result: 6 ms
Approach B: memset
Calling memset inside a kernel may invoke the worst of both the CPU and GPU worlds. memset gives the impression that once it returns, the data is exactly as you requested. Of course, this isn't true in the presence of other kernels, race conditions, etc. Still, that's my guess as to why it's so slow.
Result: 241 ms
Approach C: coalesced writes
Writing in a coalesced manner seems to have the best of both the CPU and GPU worlds. It's as fast as the CPU-issued cudaMemset, and it's also clear to any programmer who makes coalesced writes that of course there are race conditions, etc.
Result: 6 ms
Conclusion
If you can't use cudaMemset[Async] from the CPU, then use coalesced writes with 32 threads or more per block.
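The difference between the two kernels is which addresses neighbouring threads touch at the same moment. The host-side C++ sketch below (plain C++, not CUDA) computes the int index a thread writes in a given round under each scheme. The function names are invented for illustration, and the chunked version is a simplification of Approach B: the real kernel rounds chunk sizes up to 32 bytes and calls memset.

```cpp
#include <cstddef>

// Approach C: each block owns a contiguous range of ints and its threads
// stride through that range together, one int per thread per round.
std::size_t coalesced_index(std::size_t block, std::size_t thread,
                            std::size_t round, std::size_t threads,
                            std::size_t ints_per_block)
{
    return block * ints_per_block + round * threads + thread;
}

// Simplified Approach B: each thread owns its own contiguous chunk and
// walks through it alone.
std::size_t chunked_index(std::size_t block, std::size_t thread,
                          std::size_t round, std::size_t threads,
                          std::size_t ints_per_thread)
{
    return (block * threads + thread) * ints_per_thread + round;
}
```

In the coalesced scheme, threads 0 and 1 of a block write neighbouring ints in every round. In the chunked scheme they are a whole chunk apart, so one warp's writes are scattered over many memory segments. That is probably at least part of the reason Approach B is slower.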
Program Output
Starting timer for calling cudaMemset from CPU
Stopping timer for calling cudaMemset from CPU took 0.006015s
Starting timer for calling kernel<80,1> that uses memset
Stopping timer for calling kernel<80,1> that uses memset took 0.393921s
Starting timer for calling kernel<80,2> that uses memset
Stopping timer for calling kernel<80,2> that uses memset took 0.300473s
Starting timer for calling kernel<80,4> that uses memset
Stopping timer for calling kernel<80,4> that uses memset took 0.269686s
Starting timer for calling kernel<80,8> that uses memset
Stopping timer for calling kernel<80,8> that uses memset took 0.241374s
Starting timer for calling kernel<80,16> that uses memset
Stopping timer for calling kernel<80,16> that uses memset took 0.645509s
Starting timer for calling kernel<80,32> that uses memset
Stopping timer for calling kernel<80,32> that uses memset took 0.611437s
Starting timer for calling kernel<80,64> that uses memset
Stopping timer for calling kernel<80,64> that uses memset took 0.611276s
Starting timer for calling kernel<80,128> that uses memset
Stopping timer for calling kernel<80,128> that uses memset took 0.459663s
Starting timer for calling kernel<80,256> that uses memset
Stopping timer for calling kernel<80,256> that uses memset took 0.308788s
Starting timer for calling kernel<80,512> that uses memset
Stopping timer for calling kernel<80,512> that uses memset took 0.595893s
Starting timer for calling kernel<80,1024> that uses memset
Stopping timer for calling kernel<80,1024> that uses memset took 2.552866s
Starting timer for calling kernel<80,1> that performs coalesced writes
Stopping timer for calling kernel<80,1> that performs coalesced writes took 0.136967s
Starting timer for calling kernel<80,2> that performs coalesced writes
Stopping timer for calling kernel<80,2> that performs coalesced writes took 0.068426s
Starting timer for calling kernel<80,4> that performs coalesced writes
Stopping timer for calling kernel<80,4> that performs coalesced writes took 0.039974s
Starting timer for calling kernel<80,8> that performs coalesced writes
Stopping timer for calling kernel<80,8> that performs coalesced writes took 0.017121s
Starting timer for calling kernel<80,16> that performs coalesced writes
Stopping timer for calling kernel<80,16> that performs coalesced writes took 0.008586s
Starting timer for calling kernel<80,32> that performs coalesced writes
Stopping timer for calling kernel<80,32> that performs coalesced writes took 0.006139s
Starting timer for calling kernel<80,64> that performs coalesced writes
Stopping timer for calling kernel<80,64> that performs coalesced writes took 0.006075s
Starting timer for calling kernel<80,128> that performs coalesced writes
Stopping timer for calling kernel<80,128> that performs coalesced writes took 0.006093s
Starting timer for calling kernel<80,256> that performs coalesced writes
Stopping timer for calling kernel<80,256> that performs coalesced writes took 0.006479s
Starting timer for calling kernel<80,512> that performs coalesced writes
Stopping timer for calling kernel<80,512> that performs coalesced writes took 0.006972s
Starting timer for calling kernel<80,1024> that performs coalesced writes
Stopping timer for calling kernel<80,1024> that performs coalesced writes took 0.007354s
Test Implementation
memset_timing.cu
#include <cstdio> // printf and sprintf are used below
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
#define round_up(x, multiple) ((((x) + (multiple) - 1) / (multiple)) * (multiple))
const long COUNT = 80 << 24;
const int GPU_CACHE_LINE_SIZE_IN_BYTES = 32;
const long SIZE_OF_DATA = sizeof(int) * COUNT;
__global__ void clear_scratch_space_kernel(int * data, int blocks, int threads) {
// BOZO: change the code to just error out if we're any of the border cases below
const int idx = blockIdx.x * threads + threadIdx.x;
long size = sizeof(int) * COUNT;
long size_of_typical_chunk = round_up(size / (blocks * threads), GPU_CACHE_LINE_SIZE_IN_BYTES);
// Due to truncation, the threads at the end won't have anything to do. This is a little sloppy but costs us
// hardly anything in performance, so we do the simpler thing.
long this_threads_offset = idx * size_of_typical_chunk;
if (this_threads_offset > SIZE_OF_DATA) {
return;
}
long size_of_this_threads_chunk;
if (this_threads_offset + size_of_typical_chunk >= SIZE_OF_DATA) {
// We are the last thread, so we do a partial write
size_of_this_threads_chunk = SIZE_OF_DATA - this_threads_offset;
} else {
size_of_this_threads_chunk = size_of_typical_chunk;
}
void * starting_address = reinterpret_cast<void *>(reinterpret_cast<char *>(data) + this_threads_offset);
memset((void *) starting_address, 0, size_of_this_threads_chunk);
}
__global__ void clear_scratch_space_with_coalesced_writes_kernel(int * data, int blocks, int threads) {
if (COUNT % (blocks * threads) != 0) {
printf("Adjust the SIZE_OF_DATA so it's divisible by the number of (blocks * threads)\n");
}
const long count_of_ints_in_each_blocks_chunk = COUNT / blocks;
int block = blockIdx.x;
int thread = threadIdx.x;
const long rounds_needed = count_of_ints_in_each_blocks_chunk / threads;
const long this_blocks_starting_offset = block * count_of_ints_in_each_blocks_chunk;
//printf("Clearing %li ints starting at offset %li\n", count_of_ints_in_each_blocks_chunk, this_blocks_starting_offset);
int * this_threads_base_pointer = &data[this_blocks_starting_offset + thread];
for (int round = 0; round < rounds_needed; ++round) {
*this_threads_base_pointer = 0;
this_threads_base_pointer += threads;
}
}
void set_gpu_data_to_ones(int * data_on_gpu) {
// Sets every byte to 0x01, so a later check for zero bytes is meaningful
CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 1, SIZE_OF_DATA));
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
}
void check_gpu_data_is_zeroes(int * data_on_gpu, char * data_on_cpu) {
CUDA_CHECK_RETURN(cudaMemcpy(data_on_cpu, data_on_gpu, SIZE_OF_DATA, cudaMemcpyDeviceToHost));
for (long i = 0; i < SIZE_OF_DATA; ++i) {
if (data_on_cpu[i] != 0) {
// i is a long, so it needs %li rather than %i
printf("Failed to zero-out byte offset %li in the data\n", i);
}
}
}
int main(void)
{
const long count = COUNT;
int * data_on_gpu;
char * data_on_cpu = (char *) malloc(SIZE_OF_DATA);
if (data_on_cpu == NULL) {
printf("Failed to allocate data on cpu\n");
return 1; // nothing can be verified without the host buffer
}
CUDA_CHECK_RETURN(cudaMalloc(&data_on_gpu, sizeof(int) * count));
{
Timer memset_timer("calling cudaMemset from CPU");
memset_timer.start();
CUDA_CHECK_RETURN(cudaMemset(data_on_gpu, 0, SIZE_OF_DATA));
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
memset_timer.stop_and_report();
}
for (int threads = 1; threads <= 1024; threads *= 2) {
set_gpu_data_to_ones(data_on_gpu);
char buffer[200];
sprintf(buffer, "calling kernel<80,%i> that uses memset", threads);
Timer memset_timer(buffer);
memset_timer.start();
clear_scratch_space_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
memset_timer.stop_and_report();
check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
}
for (int threads = 1; threads <= 1024; threads *= 2) {
set_gpu_data_to_ones(data_on_gpu);
char buffer[200];
sprintf(buffer, "calling kernel<80,%i> that performs coalesced writes", threads);
Timer memset_timer(buffer);
memset_timer.start();
clear_scratch_space_with_coalesced_writes_kernel<<<80, threads>>>(data_on_gpu, 80, threads);
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
memset_timer.stop_and_report();
check_gpu_data_is_zeroes(data_on_gpu, data_on_cpu);
}
CUDA_CHECK_RETURN(cudaFree(data_on_gpu));
free(data_on_cpu);
return 0;
}
/**
* Check the return value of the CUDA runtime API call and exit
* the application if the call has failed.
*/
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
timer.h
#pragma once
#include <string>
#include <chrono>
class Timer {
public:
Timer(std::string name_, bool allow_output = true);
virtual ~Timer();
void start();
void start_or_restart();
void stop(bool force_no_output = false);
void report(const int count = 0, bool preface_with_spaces = true);
void stop_and_report(const int count = 0);
double duration_in_seconds();
long duration_in_microseconds();
private:
std::string name;
// even though we call report, we still might suppress output since the output is often a type of debugging info
bool allow_output;
std::chrono::time_point<std::chrono::system_clock> start_time;
std::chrono::time_point<std::chrono::system_clock> end_time;
bool started_before = false;
bool currently_rolling = false; // if timer (i.e., the clock) is currently rolling
double duration = -1.0;
};
Timer.cpp
#include <cstdio> // printf
#include <stdexcept>
#include "timer.h"
Timer::Timer(std::string name_, bool allow_output_) {
name = name_;
allow_output = allow_output_;
}
Timer::~Timer() {
}
void Timer::start() {
#ifdef DEBUG
if(started_before) {
printf("Attempting to start same timer twice. Exiting.\n");
throw std::runtime_error("Attempting to start timer that was previously started");
}
#endif
if (allow_output) {
printf("Starting timer for %s\n", name.c_str());
}
start_time = std::chrono::system_clock::now();
currently_rolling = true;
started_before = true;
duration = 0.0;
}
void Timer::start_or_restart() {
if (currently_rolling) {
throw std::runtime_error("Can't start or restart a timer that's already rolling.");
}
if (!started_before && allow_output) {
printf("Starting timer for %s\n", name.c_str());
}
started_before = true;
start_time = std::chrono::system_clock::now();
currently_rolling = true;
if (duration < 0.0) {
duration = 0.0;
}
}
void Timer::stop(bool force_no_output) {
if (!force_no_output) { // Slight style violation: I prefer nested if's over && statements with two && operators
if (allow_output && duration <= 0.0) {
printf("Stopping timer for %s\n", name.c_str());
}
}
end_time = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end_time - start_time;
currently_rolling = false;
duration += elapsed_seconds.count();
}
void Timer::stop_and_report(const int count) {
stop(true);
report(count, false);
}
double Timer::duration_in_seconds() {
return duration;
}
long Timer::duration_in_microseconds() {
return static_cast<long>(duration * 1000000);
}
void Timer::report(const int count, bool preface_with_spaces) {
std::string preface;
if (preface_with_spaces) {
preface = " ";
} else {
preface = "Stopping ";
}
if (allow_output) {
if (!started_before) {
printf("%stimer for %s was never started\n", preface.c_str(), name.c_str());
} else if (count > 0) {
double average = (duration / static_cast<double>(count)) * 1000.0;
printf("%stimer for %s took %fs, %.3lfus each\n", preface.c_str(), name.c_str(), duration, average * 1000.0);
} else {
printf("%stimer for %s took %fs\n", preface.c_str(), name.c_str(), duration);
}
}
}
This seems like it should be simple, but I can't find any references, so I'm asking here.
I have the following CUDA kernel, which I am launching in a grid of 2-D thread blocks:
__global__ void kernel(){
if (threadIdx.x == 0 && threadIdx.y == 0) {
__shared__ int test = 100;
}
__syncthreads();
// Do more stuff
}
When I try to compile, I get the error "initializer not allowed for shared variable"
What am I doing wrong? It seems to me like I have just one thread doing the initializing...
Thanks!
Do this instead:
__global__ void kernel(){
__shared__ int test;
if (threadIdx.x == 0 && threadIdx.y == 0) {
test = 100;
}
__syncthreads();
// Do more stuff
}
A __shared__ variable cannot be declared with an initializer, so its declaration must be separate from the code that assigns to it.
I really do not understand why the output for the below code is not a and b.
#include<cutil.h>
#include<iostream>
__global__ void p(unsigned char **a){
unsigned char temp[2];
temp[0] = 'a';
temp[1] = 'b';
a[0] = temp;
}
void main(){
unsigned char **a ;
cudaMalloc((void**)&a, sizeof(unsigned char*));
p<<<1,1>>>(a);
unsigned char **c;
unsigned char b[2];
cudaMemcpy(c, a, sizeof(unsigned char *), cudaMemcpyDeviceToHost);
cudaMemcpy(b, c[0], 2*sizeof(unsigned char), cudaMemcpyDeviceToHost);
for( int i=0 ; i < 2; i++){
printf("%c\n", b[i]);
}
getchar();
}
What is wrong with my logic?
Let's leave out CUDA for now. Let's just make a function that writes data to a user-provided array. The user passes the array via a pointer:
void fill_me_up(int * dst)
{
// We sure hope that `dst` points to a large enough area of memory!
dst[0] = 28;
dst[1] = 75;
}
Now, what you're doing with the local variable doesn't make sense, because you want to use the address of a local variable, which becomes invalid after you leave the function scope. The next best thing you could do is memcpy(), or some equivalent C++ algorithm:
#include <string.h> /* memcpy */
void fill_me_up_again(int * dst)
{
int temp[] = { 28, 75 };
memcpy((void *)dst, (const void *)temp, sizeof(temp));
}
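For the "equivalent C++ algorithm" mentioned above, a minimal sketch using std::copy could look like the following. The function name fill_me_up_cpp is made up here. As before, the caller is assumed to provide room for at least two ints.

```cpp
#include <algorithm>
#include <iterator>

// Same idea as fill_me_up_again, but with std::copy instead of memcpy.
// The caller must still provide room for at least two ints at `dst`.
void fill_me_up_cpp(int * dst)
{
    const int temp[] = { 28, 75 };
    // Copies the values out of the local array; `temp` itself may go away
    // afterwards because `dst` now holds its own copy.
    std::copy(std::begin(temp), std::end(temp), dst);
}
```

The point is the same in both versions: the values are copied into memory the caller owns, and no pointer to the local array escapes the function.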
OK, now on to calling that function: We first must provide the target memory, and then pass a pointer:
#include <stdlib.h> /* malloc, free */
int main()
{
int my_memory[2]; // here's our memory -- automatic local storage
fill_me_up(my_memory); // OK, array decays to pointer-to-beginning
fill_me_up(&my_memory[0]); // A bit more explicit
int * your_memory = malloc(sizeof(int) * 2); // more memory, this time dynamic
fill_me_up_again(your_memory);
/* ... */
free(your_memory);
}
(In C++ you would probably have used new int[2] and delete[] your_memory instead, but by using C malloc() the connection to CUDA hopefully becomes clear.)
When you're moving fill_me_up to the CUDA device, you have to give it a device pointer rather than a host pointer, so you have to set that one up first and afterwards copy the results back out, but that's about the only change.
On a MIPS platform, I am trying to get _Unwind-based backtraces to work. Currently, if I call print_trace manually, the stack trace is shown correctly, as below:
backtrace_helper 0x4b6958
backtrace_helper 0x4b6ab4
backtrace_helper 0x2ac2f628
Obtained 3 stack frames.
./v(print_trace+0x38) [0x4b6958]
./v(main+0x90) [0x4b6ab4]
/lib/libc.so.0(__uClibc_main+0x24c) [0x2ac2f628]
But when a SIGSEGV occurs, the stack trace does not show the correct function call sequence.
backtrace_helper 0x4b7a74
backtrace_helper 0x2ab9b84c
Obtained 2 stack frames.
./v(getLineIDByPhyIdx+0x3d8) [0x4b7a74]
/lib/libpthread.so.0(__new_sem_post+0x2c8) [0x2ab9b84c]
I am compiling with -g -fexceptions -rdynamic. I have also seen "How to generate a stacktrace when my gcc C++ app crashes", in which the second answer mentions the wrong address. When I set it the way that answer does, only the second frame changes and the rest stays the same. The code snippet is below:
caller_address = (void *) uc->uc_mcontext.gregs[30]; // Frame pointer (from wikipedia here)
fprintf(stderr, "signal %d (%s), address is %p from %p\n",
sig_num, strsignal(sig_num), info->si_addr,
(void *)caller_address);
size = backtrace(array, 50);
/* overwrite sigaction with caller's address */
array[1] = caller_address;
messages = backtrace_symbols(array, size);
Code:
int main(int argc, char *argv[]) {
registerSignalHandler(signalHandler);
print_trace();
{
// Seg Fault
int *p = NULL;
*p = 54;
}
}
void print_trace(void) {
void *array[10];
size_t size;
char **strings;
size_t i;
/* Get the address at the time the signal was raised from the EIP (x86) */
size = backtrace(array, 10);
strings = backtrace_symbols(array, size);
printf("Obtained %zd stack frames.\n", size);
for (i = 0; i < size; i++)
printf("%s\n", strings[i]);
free(strings);
}
static _Unwind_Reason_Code
backtrace_helper (struct _Unwind_Context *ctx, void *a)
{
struct trace_arg *arg = a;
assert (unwind_getip != NULL);
/* We are first called with address in the __backtrace function. Skip it. */
if (arg->cnt != -1) {
arg->array[arg->cnt] = (void *) unwind_getip (ctx);
printf("backtrace_helper %p \n", arg->array[arg->cnt]);
}
if (++arg->cnt == arg->size)
return _URC_END_OF_STACK;
return _URC_NO_REASON;
}
/*
* Perform stack unwinding by using the _Unwind_Backtrace.
*
* User application that wants to use backtrace needs to be
* compiled with -fexceptions option and -rdynamic to get full
* symbols printed.
*/
int backtrace (void **array, int size)
{
struct trace_arg arg = { .array = array, .size = size, .cnt = -1 };
if (unwind_backtrace == NULL)
backtrace_init();
if (size >= 1)
unwind_backtrace (backtrace_helper, &arg);
return arg.cnt != -1 ? arg.cnt : 0;
}
void signalHandler( int sig, siginfo_t* siginfo, void* notused)
{
/* Print out the signal info */
signalInfo(sig, siginfo);
switch (sig) {
case SIGSEGV:
{
print_trace();
abort();
}
}
}
The frame pointer is practically never used on MIPS, so obtaining a backtrace without digging into symbols requires some heuristics.
The typical approach is to analyze the code preceding the current instruction pointer and try to find the function prologue that adjusts SP. Using that information, one can figure out the location of the preceding frame, and so on.
See these slides for some of the gory details:
http://elinux.org/images/0/07/Intricacies_of_a_MIPS_Stack_Backtrace_Implementation.pdf
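As a rough illustration of the prologue-scanning heuristic described above, here is a minimal C++ sketch. It walks backwards over 32-bit instruction words looking for "addiu sp, sp, -N" (encoded as 0x27BD in the upper half-word) and "sw ra, off(sp)" (0xAFBF in the upper half-word). The struct and function names are made up, the words are assumed to be already in host byte order, and real unwinders have to handle many cases this ignores (leaf functions, several stack adjustments, running past the start of the function).

```cpp
#include <cstddef>
#include <cstdint>

// Result of scanning one function's prologue (hypothetical type).
struct FrameInfo {
    bool found_stack_adjust = false; // saw "addiu sp, sp, -N"
    int frame_size = 0;              // N, in bytes
    bool found_ra_save = false;      // saw "sw ra, off(sp)"
    int ra_offset = 0;               // off, in bytes
};

// Scans backwards from instruction index `pc_index` towards index 0.
// `code` holds 32-bit MIPS instruction words in host byte order.
FrameInfo scan_prologue(const std::uint32_t* code, std::size_t pc_index)
{
    FrameInfo info;
    for (std::size_t i = pc_index + 1; i-- > 0; ) {
        const std::uint32_t insn = code[i];
        const std::uint32_t upper = insn & 0xFFFF0000u;
        int imm = static_cast<int>(insn & 0xFFFFu);
        if (imm >= 0x8000) {
            imm -= 0x10000; // sign-extend the 16-bit immediate
        }
        if (upper == 0xAFBF0000u && !info.found_ra_save) {
            info.found_ra_save = true; // sw ra, imm(sp)
            info.ra_offset = imm;
        } else if (upper == 0x27BD0000u && imm < 0) {
            info.found_stack_adjust = true; // addiu sp, sp, -N
            info.frame_size = -imm;
            break; // treat this as the start of the prologue
        }
    }
    return info;
}
```

With these two values, the caller's return address would be read from sp + ra_offset and the caller's stack pointer would be sp + frame_size. Repeating the scan from that return address walks one frame further up.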