CUDA run leads to display driver stop - cuda

I wrote a function. When I run it on the CPU I get the right result. This is part of the CPU code:
for (int x = startx; x < endx; x += SampleStep)
    for (int y = starty; y < endy; y += SampleMin)
    {
        int idoff = Width;
        ...
Then I ported it to the GPU, like this:
int x = threadIdx.x + blockIdx.x * blockDim.x + startx;
int y = threadIdx.y + blockIdx.y * blockDim.y + starty;
int idoff = blockDim.x * gridDim.x;
When I run the code, the screen goes black and then recovers after a little while. At the same time, the system shows a message like: Display driver stopped responding.
Also, the CUDA event timing reports a cost of 0 ms, and the result is wrong.
for (int k = CircleBegin; k < CircleEnd; k++)
{
    bool Isright = (k - ww >= 0) && (k + ww < Width);
    if (Isright)
    {
        float AverR = 0;
        for (int i = -ww; i <= ww; i++)
        {
            for (int j = -wh; j <= wh; j++)
            {
                AverR += ImgR[(k + i) + (y + j) * idoff];
            }
        }
When I comment out the line AverR += ImgR[(k+i)+(y+j)*idoff];, the code runs without the black screen. I want to know why. Is this related to my display device (my device is an NVIDIA GT 240), or did some access violation happen? How can I solve this problem?

Your screen is turning black because you are hitting the Windows TDR event. For a further description of this and possible solutions, see my answer here.
Since you have nested for-loops, and you haven't told us the size of the data set, it's certainly possible that your code is taking too long to execute, if your for-loops are operating over a large enough range.
When you comment out that line of code, the compiler can completely optimize away the loops, so that section of code takes essentially zero time to run. As a result, your kernel no longer takes too long and you don't hit the TDR event.
There's no reason based on any of the above to assume an access violation is occurring. In fact, I would say it is unlikely because an access violation will often cause an unspecified launch failure, which will terminate a running kernel.
So you'll need to investigate some of the ideas I mentioned in the answer I linked above.
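A quick way to confirm what is happening is to check the runtime's error status around the launch; a 0 ms event time together with a wrong result usually means the kernel never ran to completion. Here is a minimal sketch of that pattern (myKernel, grid, and block are placeholders, not the asker's actual code):
myKernel<<<grid, block>>>(/* arguments */);

// Errors detected at launch time (bad configuration, etc.):
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

// Errors that only surface while the kernel runs, such as the
// "launch timed out" status left behind by a TDR reset:
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));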

Related

Matrix Multiplication of matrix and its transpose in Cuda

I am relatively new to CUDA programming, so there are some unsolved issues for which I hope I can get some hints in the right direction.
The case is that I want to multiply a 2D array with its transpose, and to be precise I want to execute the operation ATA.
I have already used the cuBLAS Dgemm function, and now I am trying to do the same operation with a tiled algorithm, very similar to the one from the CUDA guide.
While the initial algorithm runs properly, I want to calculate only the upper triangular matrix of the product, hoping to achieve a better time for the operation, and I am not sure how to extract the tiles/blocks which will have the respective elements.
So if you could enlighten me on this, or give me any hint, I would be grateful, because I have been stuck on that for a while.
This is the code of the kernel:
__shared__ double Ads1[TILE_WIDTH][TILE_WIDTH];
__shared__ double Ads2[TILE_WIDTH][TILE_WIDTH];

// block row and column
// we save in registers for faster access
int by = blockIdx.y;
int bx = blockIdx.x;
int ty = threadIdx.y;
int tx = threadIdx.x;

int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
double Rvalue = 0;

if (row >= width || col >= width) return;

// Each thread block computes one sub-matrix Rsub of the result R
for (int i = 0; i < (int) ceil((double) height / TILE_WIDTH); ++i)
{
    Ads1[tx][ty] = Ad[(i * TILE_WIDTH + ty) * width + col];
    Ads2[tx][ty] = Ad[(i * TILE_WIDTH + tx) * width + row];
    __syncthreads();

    for (int j = 0; j < TILE_WIDTH; ++j)
    {
        if ((i * TILE_WIDTH + j) > height) break; // in order not to exceed the matrix's height
        Rvalue += Ads1[j][tx] * Ads2[ty][j];
    }
    __syncthreads();
}

Rd[row * width + col] = Rvalue;
You may want to use the batched dgemm API function described here, recursively dividing your output matrix into diagonal and corner blocks. You also want to balance the smallest block size against the compute overhead, to avoid launching many tiny operations. Finally, note that matrix multiplication becomes memory bound at some stage, which on a modern GPU can happen at a fairly large size.
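As a rough illustration of the batched approach for the upper triangle only, here is a minimal sketch (not tested against the question's code; it assumes A is column-major with lda = height, that width is divisible by the tile size B, and all names are illustrative):
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: compute only the upper-triangular B x B tiles of C = A^T * A.
// A is height x width (column-major, lda = height); C is width x width (ldc = width).
void ata_upper_sketch(cublasHandle_t h, const double *A, double *C,
                      int height, int width, int B)
{
    std::vector<const double*> hA, hB;
    std::vector<double*> hC;
    for (int bi = 0; bi < width / B; ++bi)
        for (int bj = bi; bj < width / B; ++bj)  // upper triangle only
        {
            hA.push_back(A + (size_t)bi * B * height);                 // column block bi
            hB.push_back(A + (size_t)bj * B * height);                 // column block bj
            hC.push_back(C + (size_t)bj * B * width + (size_t)bi * B); // tile (bi, bj)
        }
    int batch = (int)hC.size();

    const double **dA, **dB;
    double **dC;
    cudaMalloc((void**)&dA, batch * sizeof(*dA));
    cudaMalloc((void**)&dB, batch * sizeof(*dB));
    cudaMalloc((void**)&dC, batch * sizeof(*dC));
    cudaMemcpy(dA, hA.data(), batch * sizeof(*dA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(*dB), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(*dC), cudaMemcpyHostToDevice);

    double alpha = 1.0, beta = 0.0;
    // Each batch entry computes a (B x B) tile as (B x height) * (height x B).
    cublasDgemmBatched(h, CUBLAS_OP_T, CUBLAS_OP_N, B, B, height,
                       &alpha, dA, height, dB, height, &beta, dC, width, batch);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
If the full symmetric product is needed afterwards, the off-diagonal tiles can simply be mirrored into the lower triangle.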

thread management in nbody code of cuda-sdk

When I read the nbody code in the CUDA SDK, I went through some lines in the code and found that it is a little bit different from the paper in GPU Gems 3, "Fast N-Body Simulation with CUDA".
My questions are: First, why is blockIdx.x still involved in loading memory from global to shared memory, as written in the following code?
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : // this line
        positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];                    // this line
    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);
    __syncthreads();
}
Isn't it supposed to be like this, according to the paper? I wonder why:
sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
    multithreadBodies ?
    positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
    positions[WRAP(tile, gridDim.x) * p + threadIdx.x];
Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Isn't it supposed to be a fixed value, or not involved at all, because the sum is only over threadIdx.y?
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; // this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; // this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; // this line
    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x, i).x; // this line
            acc.y += SX_SUM(threadIdx.x, i).y; // this line
            acc.z += SX_SUM(threadIdx.x, i).z; // this line
        }
    }
}
Can anyone explain this to me? Is it some kind of optimization for faster code?
I am an author of this code and the paper. Numbered answers correspond to your numbered questions.
1. The blockIdx.x offset to the WRAP macro is not mentioned in the paper because this is a micro-optimization. I'm not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without the blockIdx.x offset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.
2. The sum is across threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to use threadIdx.x to index the right column of the (conceptually 2D) shared memory array.
To answer SystmD's question in his (not really correct) answer: gridDim.y is only 1 in the (default/common) 1D block case.
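To make the column indexing in answer 2 concrete, here is a stripped-down sketch of the reduction pattern (illustrative only, not the SDK source; BLOCK_X and BLOCK_Y stand for the block dimensions):
// A blockDim.x x blockDim.y block where column threadIdx.x works on one body,
// and the row-0 thread of each column folds that column's partial sums together.
__shared__ float3 partial[BLOCK_X][BLOCK_Y];

partial[threadIdx.x][threadIdx.y] = acc;  // every thread stores its partial sum
__syncthreads();

if (threadIdx.y == 0)                     // one thread per body (per column)
{
    for (int i = 1; i < blockDim.y; i++)
    {
        acc.x += partial[threadIdx.x][i].x;
        acc.y += partial[threadIdx.x][i].y;
        acc.z += partial[threadIdx.x][i].z;
    }
    // acc now holds the complete sum for body blockIdx.x * blockDim.x + threadIdx.x
}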
1)
The array sharedPos is loaded into the shared memory of each block (i.e. each tile) before the threads of each block synchronize (with __syncthreads()). According to the algorithm, blockIdx.x is the index of the tile.
Each thread (indices threadIdx.x, threadIdx.y) loads a part of the shared array sharedPos; blockIdx.x refers to the index of the tile (without multithreading).
2)
acc is the float3 for the body with index blockIdx.x * blockDim.x + threadIdx.x (see the beginning of the integrateBodies function).
I ran into some trouble with multithreadBodies=true during this sum with q > 4 (128 bodies, p = 16, q = 8, gridx = 8) on a GTX 680: some sums were not done over the whole blockDim.y...
I changed the code to avoid that. It works, but I don't really know why...
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x;
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y;
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z;
    __syncthreads();

    for (int i = 0; i < blockDim.y; i++)
    {
        acc.x += SX_SUM(threadIdx.x, i).x;
        acc.y += SX_SUM(threadIdx.x, i).y;
        acc.z += SX_SUM(threadIdx.x, i).z;
    }
}
Another question:
In the first loop:
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
}
I don't know why blockIdx.y is used, since grid.y = 1.
3) For faster code, I use asynchronous H2D and D2D memory copies (my code only uses the gravitation kernel), as sketched below.
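For reference, that asynchronous copy pattern might look roughly like this (a generic sketch, not SystmD's actual code; d_pos, d_acc, n, grid, block, and gravitation_kernel are placeholder names). Overlap requires pinned host memory and a non-default stream:
cudaStream_t stream;
cudaStreamCreate(&stream);

float4 *h_pos; // pinned host buffer, required for truly asynchronous copies
cudaMallocHost((void**)&h_pos, n * sizeof(float4));

// Asynchronous H2D copy: the call returns immediately and the copy
// is queued into 'stream'.
cudaMemcpyAsync(d_pos, h_pos, n * sizeof(float4),
                cudaMemcpyHostToDevice, stream);

// A kernel launched into the same stream only starts after the copy is done.
gravitation_kernel<<<grid, block, 0, stream>>>(d_pos, d_acc, n);

cudaStreamSynchronize(stream); // wait before touching h_pos again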

CUDA: Unspecified Launch Failure

I was using CUDA-GDB to find out what the problem was with my kernel execution. It would always output: Cuda error: kernel execution: unspecified launch failure. That's probably the worst error anyone could possibly get, because there is no indication whatsoever of what is going on!
Back to CUDA-GDB... When I was using the debugger, it would arrive at the kernel and output:
Breakpoint 1, myKernel (__cuda_0=0x200300000, __cuda_1=0x200400000, __cuda_2=320, __cuda_3=7872, __cuda_4=0xe805c0, __cuda_5=0xea05e0, __cuda_6=0x96dfa0, __cuda_7=0x955680, __cuda_8=0.056646065580379823, __cuda_9=-0.0045986640087569072, __cuda_10=0.125,
__cuda_11=18.598229033761132, __cuda_12=0.00048828125, __cuda_13=5.9604644775390625e-08)
at myFunction.cu:60
Then I would type: next.
Output:
0x00007ffff7f7a790 in __device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd ()
from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
The notable part in that section is that it has a tag to a typedef'd datatype. COMPLEX16 is defined as: typedef double complex COMPLEX16;
Then I would type: next.
Output:
Single stepping until exit from function Z84_device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_ddddddPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd#plt,
which has no line number information.
0x00007ffff7f79560 in ?? () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
Type next...
Output:
Cannot find bounds of current function
Type continue...
Cuda error: kernel execution: unspecified launch failure.
Which is the error I get without debugging. I have seen some forum topics on something similar, where the debugger cannot find the bounds of the current function, possibly because the library is somehow not linked, or something along those lines? The ?? was said to appear because the debugger is somewhere in a shell for some reason and not in any function.
I believe the problem lies deeper, in the fact that I have these interesting data types in my code: COMPLEX16 and REAL8.
Here is my kernel...
__global__ void chisquared_LogLikelihood_Kernel(REAL8 *d_temp, double *d_sum, int lower, int dataSize,
                                                COMPLEX16 *freqModelhPlus_Data,
                                                COMPLEX16 *freqModelhCross_Data,
                                                COMPLEX16 *freqData_Data,
                                                REAL8 *oneSidedNoisePowerSpectrum_Data,
                                                double FplusScaled,
                                                double FcrossScaled,
                                                double deltaF,
                                                double twopit,
                                                double deltaT,
                                                double TwoDeltaToverN)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ REAL8 ssum[MAX_THREADS];

    if (idx < dataSize)
    {
        idx += lower; // accounts for the shift that was made in the original loop

        memset(ssum, 0, MAX_THREADS * sizeof(*ssum));

        int tid = threadIdx.x;
        int bid = blockIdx.x;

        REAL8 plainTemplateReal = FplusScaled * freqModelhPlus_Data[idx].re
                                  + freqModelhCross_Data[idx].re;
        REAL8 plainTemplateImag = FplusScaled * freqModelhPlus_Data[idx].im
                                  + freqModelhCross_Data[idx].im;

        /* do time-shifting... */
        /* (also un-do 1/deltaT scaling): */
        double f = ((double) idx) * deltaF;

        /* real & imag parts of exp(-2*pi*i*f*deltaT): */
        double re = cos(twopit * f);
        double im = - sin(twopit * f);

        REAL8 templateReal = (plainTemplateReal*re - plainTemplateImag*im) / deltaT;
        REAL8 templateImag = (plainTemplateReal*im + plainTemplateImag*re) / deltaT;
        double dataReal = freqData_Data[idx].re / deltaT;
        double dataImag = freqData_Data[idx].im / deltaT;

        /* compute squared difference & 'chi-squared': */
        double diffRe = dataReal - templateReal;            // Difference in real parts...
        double diffIm = dataImag - templateImag;            // ...and imaginary parts, and...
        double diffSquared = diffRe*diffRe + diffIm*diffIm; // ...squared difference of the 2 complex figures.

        //d_temp[idx - lower] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);
        //ssum[tid] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);

        /***** REDUCTION *****/
        //__syncthreads(); // all the temps should have data before we add them up
        //for (int i = blockDim.x / 2; i > 0; i >>= 1) { /* per block */
        //    if (tid < i)
        //        ssum[tid] += ssum[tid + i];
        //    __syncthreads();
        //}
        //d_sum[bid] = ssum[0];
    }
}
When I'm not debugging (-g -G not included in the command), the kernel only runs fine if I don't include the line(s) that begin with d_temp[idx - lower] and ssum[tid]. I only tried d_temp to make sure that it wasn't a shared memory error; it ran fine. I also tried running with ssum[tid] = 20.0 and various other values to make sure it wasn't that sort of problem; that ran fine too. When I run with either of them included, the kernel exits with the CUDA error above.
Please ask me if something is unclear or confusing.
There was a lack of context in my question. The assumption was probably that I had done cudaMalloc and other such preliminary things before the kernel execution for ALL the pointers involved. However, I had only done it for d_temp and d_sum (I was making tons of changes and barely realized I hadn't set up the other four pointers). Once I did cudaMalloc and cudaMemcpy for the data needed, everything ran perfectly.
Thanks for the insight.
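For anyone else hitting this: every pointer the kernel dereferences must be backed by a device allocation, and the inputs must be copied over before the launch. A minimal sketch of the missing setup (nData, nBlocks, and the h_* host buffers are placeholders for whatever the real code uses):
COMPLEX16 *d_hPlus, *d_hCross, *d_data;
REAL8 *d_psd, *d_temp;
double *d_sum;

cudaMalloc((void**)&d_hPlus,  nData * sizeof(COMPLEX16));
cudaMalloc((void**)&d_hCross, nData * sizeof(COMPLEX16));
cudaMalloc((void**)&d_data,   nData * sizeof(COMPLEX16));
cudaMalloc((void**)&d_psd,    nData * sizeof(REAL8));
cudaMalloc((void**)&d_temp,   nData * sizeof(REAL8));
cudaMalloc((void**)&d_sum,    nBlocks * sizeof(double));

// Copy every host-side input to the device; dereferencing an unallocated
// pointer inside the kernel is exactly what produces an unspecified launch failure.
cudaMemcpy(d_hPlus,  h_hPlus,  nData * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_hCross, h_hCross, nData * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_data,   h_data,   nData * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_psd,    h_psd,    nData * sizeof(REAL8),     cudaMemcpyHostToDevice);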

How to synchronize global memory between multiple kernel launches?

I want to launch the following kernel multiple times in a FOR LOOP (pseudo):
__global__ void kernel(t_dev is input array in global mem) {

    __shared__ PREC tt[BLOCK_DIM];

    if (thid < m) {
        tt[thid] = t_dev.data[ii]; // MEM READ!
    }

    ... // MODIFY
    __syncthreads();

    if (thid < m) {
        t_dev.data[thid] = tt[thid]; // MEM WRITE!
    }

    __threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
What I do conceptually is: I read values in from t_dev, modify them, and write them out to global mem again, and then I start the same kernel again!
Why do I apparently need the __threadfence() or __syncthreads()?
Otherwise the results come out wrong, because the memory writes are not finished when the same kernel starts again. That's what happens here; my GTX 580 has device overlap enabled.
But why are global mem writes not finished when the next kernel starts? Is this because of the device overlap, or is it always like that? I thought that when we launch kernel after kernel, mem writes/reads are finished after one kernel... :-)
Thanks for your answers!
SOME CODE:
for (int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++) {

    proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
        mu_dev, x_new_dev, T_dev, x_old_dev, d_dev,
        t_dev,
        kernelAIdx,
        pConvergedFlag_dev,
        m_absTOL, m_relTOL);

    proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
        t_dev,
        T_dev,
        x_new_dev,
        kernelAIdx
    );
}
These are the two kernels which are in the loop; t_dev and x_new_dev are moved from Step A to Step B.
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
    utilCuda::Matrix<PREC> mu_dev,
    utilCuda::Matrix<PREC> y_dev,
    utilCuda::Matrix<PREC> T_dev,
    utilCuda::Matrix<PREC> x_old_dev,
    utilCuda::Matrix<PREC> d_dev,
    utilCuda::Matrix<PREC> t_dev,
    int kernelAIdx,
    int maxNContacts,
    bool * convergedFlag_dev,
    PREC _absTOL, PREC _relTOL){

    //__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
    // Assumed 1 block, with THREADS_PER_BLOCK threads and column-major matrix T_dev

    int thid = threadIdx.x;
    int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
    int i = kernelAIdx * BLOCK_DIM;
    int ii = i + thid;

    // First copy x_old_dev into shared
    __shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if it is within the limit!!
    __shared__ PREC tt[BLOCK_DIM];

    if(thid < m){
        xx[thid] = x_old_dev.data[ii];
        tt[thid] = t_dev.data[ii];
    }
    __syncthreads();

    PREC absTOL = _absTOL;
    PREC relTOL = _relTOL;

    int jj;
    //PREC T_iijj;

    // Offset the T_dev_ptr to the start of the block
    PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
    PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];

    __syncthreads();
    for(int j_t = 0; j_t < m; j_t += PROX_PACKAGE_SIZE){

        // Select the number of threads we need!
        // Here we process one [m x PROX_PACKAGE_SIZE] block
        // First normal direction ==========================================================
        jj = i + j_t;
        __syncthreads();
        if( ii == jj ){ // select thread on the diagonal ...
            PREC x_new_n = (d_dev.data[ii] + tt[thid]);

            // Prox normal!
            if(x_new_n <= 0.0){
                x_new_n = 0.0;
            }
            /* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
                *convergedFlag_dev = 0;
            }*/
            xx[thid] = x_new_n;
            tt[thid] = 0.0;
        }
        // all threads not on the diagonal fall into this sync!
        __syncthreads();

        // Select only m threads!
        if(thid < m){
            tt[thid] += T_dev_ptr[thid] * xx[j_t];
        }
        // ====================================================================================

        // we need to synchronize here because one thread finishes lambda_t2 with shared mem tt, which is updated by another thread!
        __syncthreads();

        // Second tangential direction ==========================================================
        jj++;
        __syncthreads();
        if( ii == jj ){ // select thread on the diagonal; one thread finishes the T1 and T2 directions.

            // Prox tangential
            PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
            PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
            PREC radius = (*mu_dev_ptr) * xx[thid-1];
            PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);

            if(absvalue > radius){
                lambda_T1 = (lambda_T1 * radius ) / absvalue;
                lambda_T2 = (lambda_T2 * radius ) / absvalue;
            }
            /*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
                *convergedFlag_dev = 0;
            }
            if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
                *convergedFlag_dev = 0;
            }*/

            // Write the two values back!
            xx[thid] = lambda_T1;
            tt[thid] = 0.0;
            xx[thid+1] = lambda_T2;
            tt[thid+1] = 0.0;
        }
        // all threads not on the diagonal fall into this sync!
        __syncthreads();

        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        __syncthreads();
        if(thid < m){
            tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
        }

        __syncthreads();
        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        __syncthreads();
        if(thid < m){
            tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
        }
        // ====================================================================================

        __syncthreads();
        // move T_dev_ptr 1 column
        T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
        // move mu_ptr to the next contact
        __syncthreads();
        mu_dev_ptr = &mu_dev_ptr[1];
        __syncthreads();
    }
    __syncthreads();

    // Write back the results; we shouldn't need to synchronize, but
    // do it anyway to be safe for testing first!
    if(thid < m){
        y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
        t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
    }
    //__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORK!
}
I compare the solution at the end with the CPU version, and HERE I put a __syncthreads() everywhere I can, just to be safe for a start! (This code does Gauss-Seidel stuff.)
But it does not work at all without the THREADFENCE at the END, or at the BEGINNING where it does not make sense...
Sorry for so much code, but you can probably guess where the problem comes from, because I am a bit at my wits' end explaining why this happens.
We checked the algorithm several times; there is no memory error (reported by Nsight) or other stuff, everything works fine... Kernel A is launched with ONE block only!
If you launch successive instances of the kernel into the same stream, each kernel launch is serialized with respect to the kernel instance before and after it. The programming model guarantees it. CUDA only permits simultaneous kernel execution for kernels launched into different streams of the same context, and even then overlapping kernel execution only happens if the scheduler determines that sufficient resources are available to do so.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking of: __threadfence works only at the scope of all active threads, and __syncthreads is an intra-block barrier operation. If you really want kernel-to-kernel synchronization, you need to use one of the host side synchronization calls, like cudaThreadSynchronize (pre CUDA 4.0) or cudaDeviceSynchronize (CUDA 4.0 and later), or the per-stream equivalent if you are using streams.
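Applied to the loop in the question, the host-side version would look something like this (a sketch reusing the question's wrapper names; the explicit sync is only needed when the host must observe the results, since same-stream launches are already ordered):
for (int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++) {
    // Assuming both wrappers launch into the same (default) stream,
    // Step B never starts before every global write of Step A completes.
    proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
        mu_dev, x_new_dev, T_dev, x_old_dev, d_dev, t_dev,
        kernelAIdx, pConvergedFlag_dev, m_absTOL, m_relTOL);

    proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
        t_dev, T_dev, x_new_dev, kernelAIdx);
}
// Block the host until all queued kernels have finished
// (cudaThreadSynchronize() on pre-4.0 toolkits).
cudaDeviceSynchronize();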
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with the exception of atomic functions, are not guaranteed to be immediately visible to other threads (from the same, or from different blocks). By putting in __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with your current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note that using synchronous memory access (e.g. cudaMemcpy, without the "Async" suffix) essentially behaves like cudaThreadSynchronize too.

CUDA - specifying <<<x,y>>> for a for loop

Hey,
I have two arrays of size 2000. I want to write a kernel to copy one array to the other. The arrays represent 1000 particles: indices 0-999 contain the x values and 1000-1999 the y values of their positions.
I need a for loop to copy up to N particles from one array to the other, e.g.
int halfway = 1000;
for (int i = 0; i < N; i++) {
    array1[i] = array2[i];
    array1[halfway + i] = array2[halfway + i];
}
Since N is always less than 2000, can I just create 2000 threads? Or do I have to create several blocks?
I was thinking about doing this inside a kernel:
int tid = threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
and calling it as follows:
kernel<<<1,2000>>>(...);
Would this work? Will it be fast? Or will I be better off splitting the problem into blocks? I'm not sure how to do this; perhaps the following (is this correct?):
int tid = blockDim.x*blockIdx.x + threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
kernel<<<4,256>>>(...);
Would this work?
Have you actually tried it?
It will fail to launch, because you are allowed a maximum of 512 threads per block (the value may vary across architectures; mine is one of the GTX 200 series). You will either need more blocks, or you must have fewer threads and a for-loop inside the kernel with a blockDim.x increment (see the sketch below).
Your multi-block solution should work as well.
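For completeness, the loop-inside-the-kernel variant mentioned above might look like this, assuming float arrays (for several blocks you would start at blockDim.x * blockIdx.x + threadIdx.x and stride by blockDim.x * gridDim.x):
__global__ void copyParticles(float *array1, const float *array2,
                              int N, int halfway)
{
    // Single-block version: each thread handles indices
    // tid, tid + blockDim.x, tid + 2*blockDim.x, ... up to N.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
    {
        array1[i] = array2[i];                     // x values
        array1[halfway + i] = array2[halfway + i]; // y values
    }
}

// e.g. copyParticles<<<1, 256>>>(d_array1, d_array2, N, 1000);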
Other approach
If this is the only purpose of the kernel, you might as well try using cudaMemcpy with cudaMemcpyDeviceToDevice as the last parameter.
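For the two halves of the particle arrays that could look like this (a sketch assuming float elements; note that cudaMemcpy counts bytes, not elements):
cudaMemcpy(array1, array2, N * sizeof(float), cudaMemcpyDeviceToDevice);
cudaMemcpy(array1 + halfway, array2 + halfway, N * sizeof(float),
           cudaMemcpyDeviceToDevice);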
The only way to answer questions about configurations is to test them. To do this, write your kernels so that they work regardless of the configuration. Often, I will assume that I will launch enough threads, which makes the kernel easier to write. Then, I will do something like this:
threads_per_block = 512;
num_blocks = SIZE_ARRAY / threads_per_block;
if (num_blocks * threads_per_block < SIZE_ARRAY)
    num_blocks++;

my_kernel<<<num_blocks, threads_per_block>>>( ... );
(except, of course, threads_per_block might be a define, or a command line argument, or iterated to test many configurations)
It is better to use more than one block for any kernel.
It seems to me that you are simply copying from one array to another as a sequence of values with an offset.
If this is the case you can simply use the cudaMemcpy API call and specify
cudaMemcpyDeviceToDevice
cudaMemcpy(array1 + halfway, array2 + halfway, 1000 * sizeof(float), cudaMemcpyDeviceToDevice); // count is in bytes, assuming float elements
The API will figure out the best partition of blocks/threads internally.