I have a problem with Cassandra 2.1.8, which I access through cassandra-driver-core-2.1.6.jar.
I run 10 threads that insert into Cassandra at the same time, about 50000 rows per thread.
Some threads are very slow: the time taken to insert 50000 rows differs widely from one thread to another.
I think Cassandra needs tuning.
This is part of my code:
for(int i = 0; i < 50000; i++) {
    bind = statement.bind("KEY" + i + thread + j, param);
    resultSetFuture = session.executeAsync(bind);
    if(i % 1000 == 0) {
        futures.add(resultSetFuture);
        for(ResultSetFuture future : futures) {
            future.getUninterruptibly();
        }
    }
}
Can you help me?
Do I need to change any settings?
1) How many cassandra nodes do you have?
2) What are they doing when you run that code?
3) What's your schema?
4) How many requests per second do you think you should be able to do?
5) What are you basing the answer to #4 on?
I have read that comparisons and branching are slow on GPUs. I would like to know how slow. (I'm familiar with OpenCL, but the question applies equally to CUDA, AMP, ...)
I would like to know this before I start porting my code to the GPU. In particular, I'm interested in finding the lowest value in the neighborhood (4 or 9 nearest neighbors) of each point in a 2D array, i.e. something like a convolution, but with comparisons and branching instead of summing and multiplying.
For example, code like this (NOTE: this example is not yet optimized for the GPU, to keep it readable, so partitioning into workgroups, prefetching into local memory, etc. are missing):
for(int i=1;i<n-1;i++){ for(int j=1;j<n-1;j++){ // iterate over 2D array
    float hij = h[i][j];
    int imin = 0,jmin = 0;
    float dh,dhmin=0;
    // find lowest neighboring element h[i+imin][j+jmin] of h[i][j]
    dh = h[i-1][j  ]-hij; if( dh<dhmin ){ imin = -1; jmin =  0; dhmin = dh; }
    dh = h[i+1][j  ]-hij; if( dh<dhmin ){ imin = +1; jmin =  0; dhmin = dh; }
    dh = h[i  ][j-1]-hij; if( dh<dhmin ){ imin =  0; jmin = -1; dhmin = dh; }
    dh = h[i  ][j+1]-hij; if( dh<dhmin ){ imin =  0; jmin = +1; dhmin = dh; }
    if( dhmin<-0.00001 ){ // if lower
        // ... Do something with hij, dhmin and save to h[i+imin][j+jmin] ...
    }
} }
Would it be worth porting this to the GPU despite the many if branches and comparisons? (I.e. if these 4-5 comparisons per element were 10x slower than the same 4-5 comparisons on the CPU, they would be a bottleneck.)
Is there any optimization trick to minimize the slowdown from if branching and comparisons?
I use this in the following hydraulic erosion code:
http://www.openprocessing.org/sketch/146982
Branching itself is not slow; divergence is what gets you. GPUs compute multiple work items (typically 16 or 32) in lock-step in "warps" or "wavefronts". If different work items take different paths, they all take all paths, but writes are gated by predicate flags according to the path each item is on. So if your work items always (or mostly) branch the same way, you're good. If they don't, the penalty can rob performance.
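As a rough illustration of keeping all work items on one path, the neighbour search from the question can be written so that each comparison feeds a select rather than an if block. This is only a host-side C++ sketch: the names Lowest and lowestNeighbour and the flat row-major grid layout are assumptions made for the example, and whether a given GPU compiler turns the selects into predicated instructions is not guaranteed.

```cpp
#include <cassert>
#include <vector>

// Offset of the lowest neighbour and the (negative) height difference to it.
// di == dj == 0 means no neighbour is lower than the centre cell.
struct Lowest {
    int di;
    int dj;
    float dh;
};

// Hypothetical helper: h is an n x n grid stored row-major in a flat vector.
Lowest lowestNeighbour(const std::vector<float>& h, int n, int i, int j) {
    assert(i >= 1 && i < n - 1 && j >= 1 && j < n - 1);
    const int di[4] = {-1, +1, 0, 0};
    const int dj[4] = {0, 0, -1, +1};
    const float hij = h[i * n + j];
    Lowest best = {0, 0, 0.0f};
    for (int k = 0; k < 4; ++k) {
        const float dh = h[(i + di[k]) * n + (j + dj[k])] - hij;
        const bool lower = dh < best.dh;
        // Every cell runs the same three selects on the same predicate.
        best.di = lower ? di[k] : best.di;
        best.dj = lower ? dj[k] : best.dj;
        best.dh = lower ? dh : best.dh;
    }
    return best;
}
```

The final "if lower, do something" step still depends on the data, so some divergence can remain there; the sketch only removes it from the search itself.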
If you need to do comparisons and the array length 'n' is really big, you can use a reduction instead of sequential comparison. A reduction does the comparisons in parallel in O(log n) time, as opposed to O(n) when done sequentially.
When each GPU thread reads its own run of consecutive elements, the accesses are effectively serialized. It is better to use coalesced reads, where neighbouring threads read neighbouring addresses. You can find a plethora of examples of this.
On GPUs, avoid reading the same global memory location several times, since GPU memory management and caching do not work exactly as on a CPU. Instead, cache global memory elements in the thread's private variables or in shared memory as much as possible.
This question lacks details, so I decided to create another question instead of editing this one. The new question is here: Can i parallelize my code or it is not worth?
I have a program running in CUDA in which one piece of code runs within a loop (serialized, as you can see below). This piece of code searches an array that contains addresses and/or NULL pointers. All the threads execute the code below.
while (i < n) {
    if (array[i] != NULL) {
        return array[i];
    }
    i++;
}
return NULL;
Here n is the size of the array, and the array is in shared memory. I'm only interested in the first address that is not NULL (the first match).
The whole code (I've posted only a piece; the whole code is big) runs fast, but the "heart" of the code (i.e., the part that is repeated most often) is serialized, as you can see. I want to know if I can parallelize this part (the search) with some optimized algorithm.
As I said, the program is already in CUDA (and the array is on the device), so there are no memory transfers from host to device or vice versa.
My problem is that n is not big. It will rarely be greater than 8.
I've tried to parallelize it, but my "new" code took more time than the code above.
I studied reduction and min operations, but found that they are only useful when n is big.
So, any tips? Can I parallelize it efficiently, i.e., with low overhead?
Keeping things simple, one of the major limiting factors of GPGPU code is memory management. In most computers copying memory to the device (GPU) is a slow process.
As illustrated by http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf:
"The key requirement for obtaining effective
acceleration from GPU subroutine libraries is minimization of
I/O between the host and the GPU."
This is because I/O operations between host and device are SLOW!
Tying this back to your problem, it doesn't really make sense to run on the GPU since the amount of data you mention is so small. You would spend more time running the memcpy routines than it would take to run on the CPU in the first place - especially since you mention you are only interested in the first match.
One common misconception that many people have is that 'if I run it on the GPU, it has more cores so will run faster' and this just isn't the case.
When deciding if it is worth porting to CUDA or OpenCL you must think about if the process is inherently parallel or not - are you processing very large amounts of data etc.?
Since you say the array is a shared memory resource, the result of this search is the same for each thread of a block. This means a first and simple optimization would be to only let a single thread do the search. This will free all but the first warp of the block from doing any work (they still need to wait for the result, yet don't have to waste any computing resources):
__shared__ void *result;   // __shared__ variables cannot have an initializer
if(tid == 0)
{
    result = NULL;
    for(unsigned int i=0; i<n; ++i)
    {
        if (array[i] != NULL)
        {
            result = array[i];
            break;
        }
    }
}
__syncthreads();
return result;
A step further would then be to let the threads perform the search in parallel as a classic intra-block reduction. If you can guarantee n to always be <= 64, you can do this in a single warp and don't need any synchronization during the search (except for the complete synchronization at the end, of course).
for(unsigned int i=n/2; i>32; i>>=1)
{
    if(tid < i && !array[tid])
        array[tid] = array[tid+i];
    __syncthreads();
}
if(tid < 32)
{
    if(n > 32 && !array[tid]) array[tid] = array[tid+32];
    if(n > 16 && !array[tid]) array[tid] = array[tid+16];
    if(n > 8 && !array[tid]) array[tid] = array[tid+8];
    if(n > 4 && !array[tid]) array[tid] = array[tid+4];
    if(n > 2 && !array[tid]) array[tid] = array[tid+2];
    if(n > 1 && !array[tid]) array[tid] = array[tid+1];
}
__syncthreads();
return array[0];
Of course the example assumes n to be a power of two (and the array to be padded with NULLs accordingly), but feel free to tune it to your needs and optimize this further.
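One thing to watch when tuning: pairing element tid with element tid+i, as above, leaves some non-NULL entry in array[0], but not necessarily the one with the lowest index. For {NULL, X, Y, NULL} the first step copies Y into slot 0 before X is ever considered. If the first match in index order matters, a variant that pairs adjacent segments keeps that order. The following is a host-side C++ simulation of such a reduction, not device code; firstNonNull is a name made up for this sketch, and the inner loop stands in for the threads that would run concurrently in one step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Order-preserving tree reduction, simulated on the host.
// slots.size() must be a power of two (pad with nullptr).
// Each pass of the outer loop corresponds to one parallel step.
const void* firstNonNull(std::vector<const void*> slots) {
    const std::size_t n = slots.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t tid = 0; tid + stride < n; tid += 2 * stride) {
            // slots[tid] summarizes [tid, tid + stride); it only takes the
            // value of the segment to its right when it is still empty,
            // so lower indices keep priority.
            if (slots[tid] == nullptr) {
                slots[tid] = slots[tid + stride];
            }
        }
    }
    return slots[0];
}
```

With n rarely above 8, this is at most three steps, but each step on a GPU still needs some form of synchronization, so the single-thread search may well remain the faster option; measuring is the only way to tell.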
I've got a structure like this, described as a "binomial tree". Let's see a drawing:
What is the best way to represent this in memory? Just to clarify, it is not a simple binary tree, since node N4 is both the left child of N1 and the right child of N2; the same sharing happens for N7 and N8, and so on. I need a construction algorithm that easily avoids duplicating such nodes and just references them instead.
UPDATE
Many of us do not agree with the "binomial tree" definition, but this comes from finance (especially derivative pricing); have a look here, for example: http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter45.html. So I used the definition accepted in that domain.
You could generate the structure level by level. In each iteration, create one level of nodes, put them in an array, and connect the previous level to them. Something like this (C#):
Node GenerateStructure(int levels)
{
    Node root = null;
    Node[] previous = null;
    for (int level = 1; level <= levels; level++)
    {
        int count = level;
        var current = new Node[count];
        for (int i = 0; i < count; i++)
            current[i] = new Node();
        if (level == 1)
            root = current[0];
        for (int i = 0; i < count - 1; i++)
        {
            previous[i].Left = current[i];
            previous[i].Right = current[i + 1];
        }
        previous = current;
    }
    return root;
}
The whole structure requires O(N^2) memory, where N is the number of levels. This approach requires O(N) additional memory for the two arrays. Another approach would be to generate the graph from left to right, but that would require O(N) additional memory too.
The time complexity is obviously O(N^2).
More than a tree, which I would define as a 'connected graph with N vertices and N-1 edges', that structure looks like a Pascal (or Tartaglia, as it is taught in Italy) triangle. As such, an array with suitable indexing suffices.
Details of the construction depend on your input data: please give some more hints.
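One possible indexing scheme, sketched in C++ under a few assumptions: levels are counted from 0, level k holds k + 1 nodes, and each node carries a single double (the class name Lattice and the payload type are made up for the example). Node i of level k is stored at offset k * (k + 1) / 2 + i, and children are found by arithmetic, so a shared node exists exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recombining lattice stored as one flat array.
class Lattice {
public:
    explicit Lattice(std::size_t levels)
        : levels_(levels), values_(levels * (levels + 1) / 2, 0.0) {}

    // Node i (0 <= i <= level) of the given 0-based level.
    double& at(std::size_t level, std::size_t i) {
        assert(level < levels_ && i <= level);
        return values_[level * (level + 1) / 2 + i];
    }

    // Children live on the next level at positions i and i + 1.
    double& left(std::size_t level, std::size_t i) { return at(level + 1, i); }
    double& right(std::size_t level, std::size_t i) { return at(level + 1, i + 1); }

    std::size_t size() const { return values_.size(); }

private:
    std::size_t levels_;
    std::vector<double> values_;
};
```

For instance, with Lattice t(4), the expressions t.right(1, 0) and t.left(1, 1) refer to the same element, which is the sharing described in the question.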
Hey,
I have two arrays of size 2000. I want to write a kernel to copy one array to the other. The array represents 1000 particles. Indices 0-999 contain the x values and indices 1000-1999 the y values of their positions.
I need a for loop to copy up to N particles from one array to the other, e.g.:
int halfway = 1000;
for(int i = 0; i < N; i++){
    array1[i] = array2[i];
    array1[halfway + i] = array2[halfway + i];
}
Since N is always less than 2000, can I just create 2000 threads, or do I have to create several blocks?
I was thinking about doing this inside a kernel:
int tid = threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
and calling it as follows:
kernel<<<1,2000>>>(...);
Would this work? Will it be fast? Or would I be better off splitting the problem into blocks? I'm not sure how to do this; perhaps like this (is this correct?):
int tid = blockDim.x*blockIdx.x + threadIdx.x;
if (tid >= N) return;
array1[tid] = array2[tid];
array1[halfway + tid] = array2[halfway + tid];
kernel<<<4,256>>>(...);
Would this work?
Have you actually tried it?
It will fail to launch, because a block may have at most 512 threads (the value varies between architectures; mine is a GTX 200-series card). You will need either more blocks, or fewer threads and a for-loop inside the kernel that increments by blockDim.x.
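The "fewer threads plus a for-loop" option can be pictured with a host-side C++ stand-in for a single-block kernel. This is a sketch, not device code: copyParticles is a made-up name, float is an assumed element type, and the outer loop merely plays the role of the threads that the GPU would run concurrently.

```cpp
#include <cassert>
#include <vector>

// Simulates one block of blockDim threads copying n particles.
void copyParticles(std::vector<float>& dst, const std::vector<float>& src,
                   int n, int halfway, int blockDim) {
    assert(blockDim > 0 && n >= 0 && n <= halfway);
    assert(static_cast<int>(dst.size()) >= halfway + n);
    assert(static_cast<int>(src.size()) >= halfway + n);
    for (int tid = 0; tid < blockDim; ++tid) {        // stands in for threadIdx.x
        for (int i = tid; i < n; i += blockDim) {     // stride by the block size
            dst[i] = src[i];                          // x component
            dst[halfway + i] = src[halfway + i];      // y component
        }
    }
}
```

In a real kernel the outer loop disappears and tid comes from threadIdx.x; the inner loop is what lets a block smaller than N still cover every particle.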
Your multi-block solution should work as well.
Other approach
If this is the only purpose of the kernel, you might as well try using cudaMemcpy with cudaMemcpyDeviceToDevice as the last parameter.
The only way to answer questions about configurations is to test them. To do this, write your kernels so that they work regardless of the configuration. Often, I will assume that I will launch enough threads, which makes the kernel easier to write. Then, I will do something like this:
int threads_per_block = 512;
int num_blocks = SIZE_ARRAY/threads_per_block;
if(num_blocks*threads_per_block<SIZE_ARRAY)
    num_blocks++;   // round up so every element gets a thread
my_kernel <<< num_blocks, threads_per_block >>> ( ... );
(except, of course, threads_per_block might be a define, or a command line argument, or iterated to test many configurations)
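The round-up can also be folded into one expression, which is handy when iterating over many configurations. A small C++ sketch (blocksFor is a hypothetical helper name, not part of any CUDA API):

```cpp
#include <cassert>

// Smallest number of blocks such that blocks * threadsPerBlock >= count.
int blocksFor(int count, int threadsPerBlock) {
    assert(count >= 0 && threadsPerBlock > 0);
    return (count + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the 2000-element array with 512 threads per block this gives 4 blocks, so the last block has surplus threads and the kernel still needs its bounds check.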
It is better to use more than one block for any kernel.
It seems to me that you are simply copying from one array to another as a sequence of values with an offset.
If this is the case you can simply use the cudaMemcpy API call and specify
cudaMemcpyDeviceToDevice
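cudaMemcpy(array1, array2, N * sizeof(*array1), cudaMemcpyDeviceToDevice);                     // x values; the count is in bytes
cudaMemcpy(array1 + halfway, array2 + halfway, N * sizeof(*array1), cudaMemcpyDeviceToDevice); // y values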
The API will figure out the best partition of block / threads.