CUDA - device memory,searching a string in a text - cuda

I have a string of length maximum 500 characters and a text file of size 200MB. I want to write a program in CUDA to search the string in the text file. My text file is too large and I think I have to put it in a global memory of the device, but what about my string? Which is the best among the shared, constant and texture memory? and why?
Also I have an array of size maximum 2500. Which types of device memory is suitable for storing it?

For a naive implementation on Fermi:
Store the text file in global memory and the search string in constant memory. Set up a result buffer of the same size as the text file. Fill the result buffer with zeroes.
Let the number of threads per block, t, be the same as the length of the search string. To determine grid dimensions, consider the size of your text file and the grid dimension limit of 64K. To cover your whole file, select the dimension for x to be, for instance, 10K. Then find the dimension for y by dividing the size of your text file with x and rounding up the result. So 200M / 10K = 20K (which is within 64K). Launch the kernel with t threads and an (x, y) grid.
In the kernel:
Calculate the offset into the text file as d = x + 1024 * y.
Since the y dimension was rounded up above, some kernels at the end of the run must be aborted. Abort the thread if d + t is higher than the size of the text file.
Else, have the thread load one character at index t from the search string and compare it with one character at index t + d in the text file. If the characters didn't match, store a "1" in the result buffer at index d, else do nothing.
When the kernel completes, scan through the result buffer with Thrust. Each location that is 0 marks the starting point of a match.

Assuming you write the kernel so that all threads in a half-warp are accessing the same element of the search string simultaneously, constant memory is likely to yield good results. It's optimized for that case.
Here's some pseudo-code for a simple baseline implementation
...load blocksize+strlen bytes of the file into shared memory...
__syncthreads();
bool found = true;
for (int i = 0; i < strlen; i++) {
if (file_chunk_in_sharedmem[threadIdx.x + i] !=
search_str_in_constantmem[i])
{
found = false;
break;
}
}
if (found) {
...output the result...
}
If the loop is structured such that each thread is accessing a different element of the search string, 1d texture memory might be faster.
The profiler and/or cuda timing functions are your friends.

Related

How to debug numba error involving synchronization [duplicate]

I'm having trouble here. I launch two kernels , check if some value is the one expected (memcpy to the host), if it is I stop, if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances, when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But 8 blocks and 1024 threads, and this happens, the kernel doesn't work, neither 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. Still doesnt work, nothing gets printed, even if the printf is right after the three initial lines and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't due to the problem mentioned above, so the output never shows up. You could have figured out this by testing the return codes all CUDA function calls for errors.

Copying 2D arrays to GPU of known variable width

I am looking into how to copy a 2D array of variable width for each row into the GPU.
int rows = 1000;
int cols;
int** host_matrix = malloc(sizeof(*int)*rows);
int *d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know length[i], and there is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, that won't be a clever way of doing it:
cudaMalloc(d_array, rows*sizeof(int*));
for(int i = 0 ; i < rows ; i++) {
cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}
But I cannot think of any other method. Is there any other smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemCpy2D ??
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows*sizeof(int*));
cudaMalloc(d_array, nrows*sizeof(int*));
for(int i = 0 ; i < nrows ; i++) {
cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue of the mother of all jagged data problems - large, unstructured sparse matrices - and copy one of the sparse matrix formats for your data instead. Using the classic compressed sparse row format as a model you could do something like this:
int * data, * rows, * lengths;
cudaMalloc(rows, nrows*sizeof(int));
cudaMalloc(lengths, nrows*sizeof(int));
cudaMalloc(data, N*sizeof(int));
In this scheme, store all the data in a single, linear memory allocation data. The ith row of the jagged array starts at data[rows[i]] and each row has a length of length[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows in your current scheme, ie. it reduces the overheads from O(N) to O(1).
I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0 and so on. so A[i] = length[i]
Then you need just to allocate 2 arrays on the card and call memcopy twice.
Of course it's a little bit of extra work, but i think performance wise it will be an improvement (depending of course on how you use the data on the card)

How best to transfer a large number of arrays of chars to the GPU?

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 gb of data to be transfered. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I currently made a first go at it with thrust, and based my solution kind of on this message ... I made a struct with a static call that allocates all the memory, and then each individual constructor copies that array, and I have a transform call which takes in the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't see immediately how these would work, since neither seem to take pointers to pointers as input...
Any suggestions are welcome...
One way to do this is as marina.k suggested, batching your transfers only as you need them. Since you said each array only contains about 1000 chars, you could assign each char to a thread (since on Fermi we can allocate 1024 threads per block) and have each array handled by one block. In this case you may be able to transfer all the arrays for one "round" in one call - can you use a FORTRAN style, where you make one gigantic array and to get the 5th element of the "third" 1000 char array you would go:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000 char array makes up the first 1000 elements of big_array, the second 1000 char array makes up the second 1000 elements of big_array, etc. ? In this case your chars would be continuous in memory and you could move the set you were going to process with one kernel launch in only one memcpy. Then as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel, you could simply handle each array within 1 block, so that block N handles the (N-1)-thousandth element up to the N-thousandth of d_big_array (where you copied all those chars to).
Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.
Take try of async, you can assign the same job to different streams, each stream process a small part of date, make tranfer and computation at the same time
here is code:
cudaMemcpyAsync(
inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]
);
MyKernel<<<100, 512, 0, stream[i]>>> (outputDevPtr + i * size, inputDevPtr + i * size, size);
cudaMemcpyAsync(
hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]
);

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (it's an array of 77.500 doubles) to be stored somewhere in cuda. Now, I need a big set of threads to sequentially do a bunch of operations with that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory and read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading many times would be quite useless... Any idea??
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;
__global__ void kernel(double *ptr)
{
__shared__ double window[block_size];
// cooperate with my block to load block_size elements
window[threadIdx.x] = ptr[threadIdx.x];
// wait until the window is full
__syncthreads();
// operate on the data
...
}
You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.

NPP library function argument *pDeviceBuffer

I notice some of the npp functions has an argument *pDeviceBuffer. I am wondering what this argument is for and how I shall set it while using the functions. Also, the results of functions,such as nppsMax_32f, are written back to a pointer. Is the memory on host or device memory? Thank you.
pDeviceBuffer is used as scratch space inside npp. Scratch space is usually allocated internally (like in CUFFT). But some of these operations (sum, min, max) are so quick that allocating the scratch space itself might become a bottleneck. Querying for the scratch space required and then allocating it once before re-using it multiple times would be a good idea.
Example:
Let's say you have a very large array that you want to get min max and sum from, you will need to do the following.
int n = 1e6, bytes = 0;
nppsReductionGetBufferSize_32f(n, &bytes);
Npp8u *scratch = nppsMalloc_8f(bytes);
nppsMax_32f(in, n, max_val, nppAlgHintNone, scratch);
// Reusing scratch space for input of same size
nppsMin_32f(in, n, min_val, nppAlgHintNone, scratch);
// Reusing scratch space for input of smaller size
nppsSum_32f(in, 1e4, sum_val, nppAlgHintNone, scratch);
// Larger inputs may require more scratch space.
// So you may need to check and allocate appropriate space
int newBytes = 0; nppsReductionGetBufferSize_32f(5e6, &newBytes);
if (bytes != newBytes) {
nppsFree(scratch);
scratch = nppsMalloc_8u(bytes);
}
nppsSum_32f(in, 5e6, sum_val, nppAlgHintNone, scratch);