How to create arrow shoot functionality in cocos2d / cocos2d-x

In my game I have some archers shooting arrows. I currently apply a CCJumpBy action to the arrows, spawned together with a CCRotateBy action. My problem is that the motion does not look realistic: the arrows are fast on the way up and fast on the way down, but slow in between. I also tried wrapping the action in CCEase actions, but then the arrows behaved the opposite of what I wanted. I know a jump action can't be broken into two parts directly. I also tried a CCBezier action, but that did not help either. Here is my current code:
float angle = kShootRotationAngle;
CCPoint pos = ccp(troops->getSprite()->getContentSize().width/2 + gameLayer->winSize.width * 0.48, - winSize.height * 0.15);
float jumpHeight = winSize.height * 0.28f;
if (weaponType == kArrowTypeEnemyNormal || weaponType == kArrowTypeEnemyRapid || weaponType == kArrowTypeEnemyFire) {
angle = - kShootRotationAngle;
pos = ccp(troops->getSprite()->getContentSize().width/2 - gameLayer->winSize.width * 0.57, - winSize.height * 0.15);
}
jumpAction = new CCJumpBy;
jumpAction->initWithDuration(kNormalArrowMovementTime, pos, jumpHeight, 1);// Allocated Memory
rotateAction = new CCRotateBy;
rotateAction->initWithDuration(kNormalArrowMovementTime, angle);// Allocated Memory
spawnActions = new CCSpawn;
spawnActions->initWithTwoActions(jumpAction, rotateAction);// Allocated memory
removeWeaponAction = CCCallFunc::create(this, callfunc_selector(Weapons::removeArrow));// AutoRelease Type
delayAction = new CCDelayTime;// Allocated memory
delayAction->initWithDuration((kFrameCountElvesShoot - 3) * kArrowShootTimeElves + delay);
showSprite = new CCFadeIn;// Allocated Memory
showSprite->initWithDuration(0.0f);
delayBeforeDeletion = new CCDelayTime; // Allocated Memory
delayBeforeDeletion->initWithDuration(kDelayBeforeArrowDestroy);
sequenceAction = new CCSequence; // Allocated Memory
sequenceAction->initWithTwoActions(delayAction, showSprite);
sequenceAction2 = new CCSequence; // Allocated Memory
sequenceAction2->initWithTwoActions(spawnActions, delayBeforeDeletion);
sequenceAction3 = new CCSequence;// Allocated memory
sequenceAction3->initWithTwoActions(sequenceAction, sequenceAction2);
sequenceAction4 = new CCSequence; // Allocated memory
sequenceAction4->initWithTwoActions(sequenceAction3, removeWeaponAction);
arrowSprite->runAction(sequenceAction4);
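One direction that could be tried (this is not from the original post, only a sketch under assumptions) is to compute the arrow's flight from constant-gravity projectile motion and derive the angle from the velocity, so that the arrow always points along its path instead of turning at the constant rate that CCRotateBy gives. The names ArrowState and arrowAt and all parameters are hypothetical, and no cocos2d-x calls are used; the result would be applied to the sprite from a scheduled update.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type, not part of cocos2d-x: arrow state at time t.
struct ArrowState {
    float x;         // horizontal offset from the launch point
    float y;         // vertical offset from the launch point
    float angleDeg;  // direction of travel, degrees counter-clockwise from +x
};

// Constant-gravity flight that covers `dx` horizontally in `duration` seconds
// and peaks at `peakHeight` above the launch point. Launch and landing are
// assumed to be at the same height.
ArrowState arrowAt(float t, float duration, float dx, float peakHeight) {
    assert(duration > 0.0f && peakHeight > 0.0f);
    const float pi = 3.14159265358979f;

    // From y(t) = vy0*t - g*t^2/2 with y(duration) = 0 and
    // y(duration/2) = peakHeight:
    const float g   = 8.0f * peakHeight / (duration * duration);
    const float vy0 = 0.5f * g * duration;
    const float vx  = dx / duration;

    ArrowState s;
    s.x = vx * t;
    s.y = vy0 * t - 0.5f * g * t * t;
    // The arrow points along its velocity vector (vx, vy0 - g*t).
    s.angleDeg = std::atan2(vy0 - g * t, vx) * 180.0f / pi;
    return s;
}
```

In cocos2d-x, setRotation treats positive angles as clockwise, so the angle from this sketch would normally be negated before being applied, and x and y would be added to the arrow's launch position.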

Related

Concurrent Writing CUDA

I am new to CUDA and I am facing a problem with a basic projection kernel. What I am trying to do is to project a 3D point cloud into a 2D image. In case multiple points project to the same pixel, only the point with the smallest depth (the closest one) should be written on the matrix.
Suppose two 3D points fall on image pixel (0, 0). The depth check as I implement it here, if (depth > entry.depth), does not work, because the two threads (from two different blocks) execute it in parallel. In fact, in the printf statement both threads print the numeric limit (the initialization value) for entry.depth.
To solve this problem I thought of using a tensor-like structure, in which each image pixel corresponds to an array of values. Afterwards the array is reduced and only the point with the smallest depth is kept. Are there any smarter and more efficient ways of solving this problem?
__global__ void kernel_project(CUDAWorkspace* workspace_, const CUDAMatrix* matrix_) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid >= matrix_->size())
return;
const Point3& full_point = matrix_->at(tid);
float depth = 0.f;
Point2 image_point;
// full point as input, depth and image point as output
const bool& is_good = project(image_point, depth, full_point); // dst, dst, src
if (!is_good)
return;
const int irow = (int) image_point.y();
const int icol = (int) image_point.x();
if (!workspace_->inside(irow, icol)) {
return;
}
// get pointer to entry
WorkspaceEntry& entry = (*workspace_)(irow, icol);
// entry.depth is set initially to a numeric limit
if (depth > entry.depth) // PROBLEM HERE
return;
printf("entry depth %f\n", entry.depth); // BOTH PRINT THE NUMERIC LIMIT
entry.point = full_point;
entry.depth = depth;
}
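One commonly used way around this kind of race (not taken from the post, only a sketch under assumptions) is to turn the compare and the write into a single atomic operation: pack the depth and a point index into one 64-bit key, with the depth in the high bits, and keep the per-pixel minimum of that key. The sketch below is host-side C++ with std::atomic so it can be run without a GPU; all names are hypothetical. On the device the same idea would be expressed with an atomic min on a 64-bit unsigned value where the hardware supports it, or with an atomicCAS loop like the one shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; none of these names come from the post.

// Pack a non-negative depth and a point index into one 64-bit key.
// For non-negative IEEE-754 floats the bit pattern, read as an unsigned
// integer, sorts in the same order as the float, so comparing keys compares
// the depth first and uses the index only as a tie-breaker.
inline std::uint64_t packDepthIndex(float depth, std::uint32_t index) {
    assert(!std::signbit(depth));
    std::uint32_t bits;
    std::memcpy(&bits, &depth, sizeof bits);
    return (static_cast<std::uint64_t>(bits) << 32) | index;
}

inline float unpackDepth(std::uint64_t key) {
    const std::uint32_t bits = static_cast<std::uint32_t>(key >> 32);
    float depth;
    std::memcpy(&depth, &bits, sizeof depth);
    return depth;
}

inline std::uint32_t unpackIndex(std::uint64_t key) {
    return static_cast<std::uint32_t>(key & 0xFFFFFFFFull);
}

// Atomic "store if smaller", built from a compare-and-swap loop.
// Pixels are initialised to UINT64_MAX, which plays the role of the
// numeric-limit depth in the original kernel.
inline void atomicMinKey(std::atomic<std::uint64_t>& pixel, std::uint64_t key) {
    std::uint64_t seen = pixel.load();
    while (key < seen && !pixel.compare_exchange_weak(seen, key)) {
        // A failed exchange reloads `seen`; retry while our key is still smaller.
    }
}
```

After all points have been projected, each pixel holds the key of its closest point, and the index half of the key can be used to look up the full point in a second pass.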

html5 devicemotion acceleration always changes

I am developing a web application and want to use the devicemotion event to read the acceleration, in order to measure speed and distance. But I noticed that even when the device is static on a flat surface, the acceleration values on the axes (y included) keep changing.
var clock = null, prevClock = new Date().getTime();
window.addEventListener("devicemotion", function(e) {
if (e.acceleration.x) {
clock = new Date().getTime();
var d = (clock - prevClock) / 1000;
d *= d;
motion.x = (e.acceleration.x);
motion.y = (e.acceleration.y);
motion.z = (e.acceleration.z);
distance.x += (motion.x) * d;
distance.y += (motion.y) * d;
distance.z += (motion.z) * d;
prevMotion = motion;
prevClock = new Date().getTime();
}
}, true);
How can I measure the acceleration accurately?
You are already doing it the correct way, and you won't get a more accurate acceleration than this. The sensors are either very sensitive or simply not accurate enough, so you will never get an acceleration of exactly 0, even when your device is just lying on a flat surface. Newer phones have more accurate sensors: for example, an iPhone 6+ gets much closer to 0 than my Galaxy S4, but even with the iPhone 6+ you will never get an acceleration of 0 on a flat surface.
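Since the readings never settle at exactly 0, one common workaround (not part of the answer above, only a sketch under assumptions) is to treat small readings as noise and to smooth the rest before using them. The logic is language-neutral and is written here in C++; it would have to be ported into the devicemotion handler. The function names and the threshold and alpha values are hypothetical and would need tuning per device.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: treat readings inside +/- threshold as sensor noise.
// `threshold` is an assumed, device-dependent value in m/s^2.
inline double applyDeadZone(double acceleration, double threshold) {
    assert(threshold >= 0.0);
    return std::fabs(acceleration) < threshold ? 0.0 : acceleration;
}

// Exponential moving average to smooth jitter; alpha is in (0, 1],
// and a smaller alpha means stronger smoothing.
inline double smooth(double previous, double sample, double alpha) {
    assert(alpha > 0.0 && alpha <= 1.0);
    return previous + alpha * (sample - previous);
}
```

For example, applyDeadZone(e.acceleration.x, 0.1) would report 0 for a resting device whose x reading jitters within ±0.1 m/s², at the cost of also discarding very gentle real movement.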

Sharing highly irregular job among CUDA threads

I'm working on a task related to graph traversal (the Viterbi algorithm).
At each time step I have a compacted set of active states. Some work is done in each state, and then the results are propagated through the outgoing arcs to each arc's destination state, so that a new active set of states is built.
The problem is that the number of outgoing arcs varies very heavily, from two or three to several thousand, so the compute threads are loaded very inefficiently.
I try to share the work through a queue in shared memory:
int tx = threadIdx.x;
extern __shared__ int smem[];
int *stateSet_s = smem; //new active set
int *arcSet_s = &(smem[Q_LEN]); //local shared queue
float *scores_s = (float*)&(smem[2*Q_LEN]);
__shared__ int arcCnt;
__shared__ int stateCnt;
if ( tx == 0 )
{
arcCnt = 0;
stateCnt = 0;
}
__syncthreads();
//load state index from compacted list of state indexes
int stateId = activeSetIn_g[gtx];
float srcCost = scores_g[ stateId ];
int startId = outputArcStartIds_g[stateId];
int nArcs = outputArcCounts_g[stateId]; //number of outgoing arcs to be propagated (2-3 to thousands)
/////////////////////////////////////////////
/// prepare arc set
/// !!!! that is the troubled code I think !!!!
/// bank conflicts? uncoalesced access?
int myPos = atomicAdd ( &arcCnt, nArcs );
while ( ( nArcs > 0 ) && ( myPos < Q_LEN ) )
{
scores_s[myPos] = srcCost;
arcSet_s[myPos] = startId + nArcs - 1;
myPos++;
nArcs--;
}
__syncthreads();
//////////////////////////////////////
/// parallel propagate arc set
if ( arcSet_s[tx] > 0 )
{
FstArc arc = arcs_g[ arcSet_s[tx] ];
float srcCost_ = scores_s[tx];
DoSomeJob ( &srcCost_ );
int *dst = &(transitionData_g[arc.dst]);
int old = atomicMax( dst, FloatToInt ( srcCost_ ) );
////////////////////////////////
//// new active set
if ( old == ILZERO )
{
int pos = atomicAdd ( &stateCnt, 1 );
stateSet_s[ pos ] = arc.dst;
}
}
/////////////////////////////////////////////
/// transfer new active set from smem to gmem
__syncthreads();
__shared__ int gPos;
if ( tx == 0 )
{
gPos = atomicAdd ( activeSetOutSz_g, stateCnt );
}
__syncthreads();
if ( tx < stateCnt )
{
activeSetOut_g[gPos + tx] = stateSet_s[tx];
}
__syncthreads();
But it runs very slowly, that is, slower than when no active set is used (active set = all states), even though the active set is only 10 to 15 percent of all states. Register pressure rose heavily and occupancy is low, but I don't think anything can be done about that.
Maybe there are more effective ways of sharing work among threads?
I thought about warp-shuffle ops on compute capability 3.0, but I have to use 2.x devices.
Problems with uneven workload and dynamic work creation are usually addressed with multiple CUDA kernel calls. This can be done with a CPU loop like the following:
//CPU pseudocode
while ( job not done) {
doYourComputationKernel();
loadBalanceKernel();
}
doYourComputationKernel() must have a heuristic to know when it is a good time to stop and hand control back to the CPU to balance the workload. This can be done with a global counter for the number of idle blocks. The counter is incremented every time a block finishes its work or cannot create more work. When the number of idle blocks exceeds a threshold, the work in all blocks is saved to global memory and all blocks finish.
loadBalanceKernel() should receive the global array with all saved work and another global array of work counters, one per block. A reduce operation on the latter calculates the total number of work items, and from that the number of work items per block can be found. Finally, the kernel should copy the work so that every block receives the same number of elements.
The loop continues until all computation is done. There's a good paper about this: http://gamma.cs.unc.edu/GPUCOL/. The idea there is to balance the load of continuous collision detection, which is very uneven.
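The redistribution step described for loadBalanceKernel() can be sketched as follows. This is a host-side C++ model under assumptions, not the kernel from the paper: the function name rebalance, the use of int work items, and the choice to hand leftover items to the first blocks are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the redistribution step: perBlock[b] holds
// the work items that block b saved to global memory when the kernel stopped.
std::vector<std::vector<int>> rebalance(const std::vector<std::vector<int>>& perBlock) {
    assert(!perBlock.empty());
    const std::size_t numBlocks = perBlock.size();

    // The "reduce" over the per-block counters, done here while flattening
    // the saved work into one array.
    std::vector<int> all;
    for (const auto& items : perBlock) {
        all.insert(all.end(), items.begin(), items.end());
    }
    const std::size_t total = all.size();
    const std::size_t base  = total / numBlocks;  // items every block receives
    const std::size_t extra = total % numBlocks;  // first `extra` blocks get one more

    // Copy an equal share (up to one item of difference) to every block.
    std::vector<std::vector<int>> balanced(numBlocks);
    std::size_t next = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t count = base + (b < extra ? 1 : 0);
        const auto first = all.begin() + static_cast<std::ptrdiff_t>(next);
        balanced[b].assign(first, first + static_cast<std::ptrdiff_t>(count));
        next += count;
    }
    return balanced;
}
```

On the GPU the flattening would be replaced by a prefix sum over the per-block counters, so that each block knows the offset at which its share starts.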

CUDA binary search implementation

I am trying to speed up a CPU binary search. Unfortunately, the GPU version is always much slower than the CPU version. Perhaps the problem is not suitable for the GPU, or am I doing something wrong?
CPU version (approx. 0.6 ms):
Using a sorted array of length 2000, do a binary search for a specific value:
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
int l(0), r(length-1);
while ( l <= r )
{
m = (l+r)/2;
if ( search < arr[m] )
r = m-1;
else if ( search > arr[m] )
l = m+1;
else
{
return index[m];
}
}
if ( arr[m] >= search )
return m;
return (m+1);
}
GPU version (approx. 20 ms):
Using a sorted array of length 2000, do a binary search for a specific value:
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
const int num_threads = blockDim.x * gridDim.x;
const int thread = blockIdx.x * blockDim.x + threadIdx.x;
int set_size = array_length;
ret_val[0] = -1; // return value
ret_val[1] = 0; // offset
while(set_size != 0)
{
// Get the offset of the array, initially set to 0
int offset = ret_val[1];
// I think this is necessary in case a thread gets ahead, and resets offset before it's read
// This isn't necessary for the unit tests to pass, but I still like it here
__syncthreads();
// Get the next index to check
int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
// If the index is outside the bounds of the array then lets not check it
if (index_to_check < array_length)
{
// If the next index is outside the bounds of the array, then set it to maximum array size
int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
if (next_index_to_check >= array_length)
{
next_index_to_check = array_length - 1;
}
// If we're at the mid section of the array reset the offset to this index
if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
{
ret_val[1] = index_to_check;
}
else if (search == arr[index_to_check])
{
// Set the return var if we hit it
ret_val[0] = index_to_check;
}
}
// Since this is a p-ary search divide by our total threads to get the next set size
set_size = set_size / num_threads;
// Sync up so no threads jump ahead and get a bad offset
__syncthreads();
}
}
Even if I try bigger arrays, the time ratio is not any better.
You have far too many divergent branches in your code, so you are essentially serializing the entire process on the GPU. You want to break up the work so that all threads in the same warp take the same path through each branch. See page 47 of the CUDA Best Practices Guide.
I must admit I'm not entirely sure what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so, have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What you are doing is essentially reducing to the index closest to your query.)
Some quick pointers, though:
First, you are performing an awful lot of reads and writes to global memory, which is very slow. Try using shared memory instead.
Second, remember that __syncthreads() only syncs threads in the same block, so your reads and writes to global memory won't necessarily be synchronized across all threads (though the latency of your global memory writes may make it appear as if they are).

How to synchronize global memory between multiple kernel launches?

I want to launch the following kernel multiple times in a for loop (pseudocode):
__global__ void kernel(t_dev is input array in global mem) {
__shared__ PREC tt[BLOCK_DIM];
if (thid < m) {
tt[thid] = t_dev.data[ii]; // MEM READ!
}
... // MODIFY
__syncthreads();
if (thid < m) {
t_dev.data[thid] = tt[thid]; // MEM WRITE!
}
__threadfence(); // or __syncthreads(); //// NECESSARY!! but why?
}
Conceptually, I read values from t_dev, modify them, and write them back to global memory. Then I start the same kernel again.
Why do I apparently need the __threadfence or __syncthreads?
Without it the result is wrong, because the memory writes are not finished when the same kernel starts again. That is what happens here; my GTX580 has device overlap enabled.
But why are the global memory writes not finished when the next kernel starts? Is this because of the device overlap, or is it always like that? I thought that when we launch kernel after kernel, the memory writes and reads of one kernel are finished before the next one starts.
Thanks for your answers!
Some code:
for(int kernelAIdx = 0; kernelAIdx < loops; kernelAIdx++){
proxGPU::sorProxContactOrdered_1threads_StepA_kernelWrap<PREC,SorProxSettings1>(
mu_dev,x_new_dev,T_dev,x_old_dev,d_dev,
t_dev,
kernelAIdx,
pConvergedFlag_dev,
m_absTOL,m_relTOL);
proxGPU::sorProx_StepB_kernelWrap<PREC,SorProxSettings1>(
t_dev,
T_dev,
x_new_dev,
kernelAIdx
);
}
These are the two kernels in the loop; t_dev and x_new_dev are passed from Step A to Step B.
Kernel A looks as follows:
template<typename PREC, int THREADS_PER_BLOCK, int BLOCK_DIM, int PROX_PACKAGES, typename TConvexSet>
__global__ void sorProxContactOrdered_1threads_StepA_kernel(
utilCuda::Matrix<PREC> mu_dev,
utilCuda::Matrix<PREC> y_dev,
utilCuda::Matrix<PREC> T_dev,
utilCuda::Matrix<PREC> x_old_dev,
utilCuda::Matrix<PREC> d_dev,
utilCuda::Matrix<PREC> t_dev,
int kernelAIdx,
int maxNContacts,
bool * convergedFlag_dev,
PREC _absTOL, PREC _relTOL){
//__threadfence() HERE OR AT THE END; THEN IT WORKS???? WHY
// Assumed: 1 block with THREADS_PER_BLOCK threads and a column-major matrix T_dev
int thid = threadIdx.x;
int m = min(maxNContacts*PROX_PACKAGE_SIZE, BLOCK_DIM); // this is the actual size of the diagonal block!
int i = kernelAIdx * BLOCK_DIM;
int ii = i + thid;
//First copy x_old_dev in shared
__shared__ PREC xx[BLOCK_DIM]; // each thread writes one element, if its in the limit!!
__shared__ PREC tt[BLOCK_DIM];
if(thid < m){
xx[thid] = x_old_dev.data[ii];
tt[thid] = t_dev.data[ii];
}
__syncthreads();
PREC absTOL = _absTOL;
PREC relTOL = _relTOL;
int jj;
//PREC T_iijj;
//Offset the T_dev_ptr to the start of the Block
PREC * T_dev_ptr = PtrElem_ColM(T_dev,i,i);
PREC * mu_dev_ptr = &mu_dev.data[PROX_PACKAGES*kernelAIdx];
__syncthreads();
for(int j_t = 0; j_t < m ; j_t+=PROX_PACKAGE_SIZE){
//Select the number of threads we need!
// Here we process one [m x PROX_PACKAGE_SIZE] Block
// First Normal Direction ==========================================================
jj = i + j_t;
__syncthreads();
if( ii == jj ){ // select thread on the diagonal ...
PREC x_new_n = (d_dev.data[ii] + tt[thid]);
//Prox Normal!
if(x_new_n <= 0.0){
x_new_n = 0.0;
}
/* if( !checkConverged(x_new,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
xx[thid] = x_new_n;
tt[thid] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
// Select only m threads!
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t];
}
// ====================================================================================
// we need to synchronize here because one thread finishes lambda_t2 using shared mem tt, which is updated by another thread!
__syncthreads();
// Second Tangential Direction ==========================================================
jj++;
__syncthreads();
if( ii == jj ){ // select thread on diagonal, one thread finishs T1 and T2 directions.
// Prox tangential
PREC lambda_T1 = (d_dev.data[ii] + tt[thid]);
PREC lambda_T2 = (d_dev.data[ii+1] + tt[thid+1]);
PREC radius = (*mu_dev_ptr) * xx[thid-1];
PREC absvalue = sqrt(lambda_T1*lambda_T1 + lambda_T2*lambda_T2);
if(absvalue > radius){
lambda_T1 = (lambda_T1 * radius ) / absvalue;
lambda_T2 = (lambda_T2 * radius ) / absvalue;
}
/*if( !checkConverged(lambda_T1,xx[thid],absTOL,relTOL)){
*convergedFlag_dev = 0;
}
if( !checkConverged(lambda_T2,xx[thid+1],absTOL,relTOL)){
*convergedFlag_dev = 0;
}*/
//Write the two values back!
xx[thid] = lambda_T1;
tt[thid] = 0.0;
xx[thid+1] = lambda_T2;
tt[thid+1] = 0.0;
}
// all threads not on the diagonal fall into this sync!
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+1];
}
__syncthreads();
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
__syncthreads();
if(thid < m){
tt[thid] += T_dev_ptr[thid] * xx[j_t+2];
}
// ====================================================================================
__syncthreads();
// move T_dev_ptr 1 column
T_dev_ptr = PtrColOffset_ColM(T_dev_ptr,1,T_dev.outerStrideBytes);
// move mu_ptr to the next contact
__syncthreads();
mu_dev_ptr = &mu_dev_ptr[1];
__syncthreads();
}
__syncthreads();
// Write back the results. We don't need to synchronize here, but
// do it anyway to be safe for testing first!
if(thid < m){
y_dev.data[ii] = xx[thid]; // THIS IS UPDATED IN KERNEL B
t_dev.data[ii] = tt[thid]; // THIS IS UPDATED IN KERNEL B
}
//__threadfence(); /// THIS STUPID THREADFENCE MAKES IT WORK!
}
I compare the solution at the end with the CPU result. To be safe at the start, I put a __syncthreads() everywhere I could (this code does Gauss-Seidel iterations),
but it does not work at all without the __threadfence() at the end, or at the beginning, where it does not make sense.
Sorry for so much code, but perhaps you can guess where the problem comes from, because I am at my wits' end trying to explain why this happens.
We checked the algorithm several times. There is no memory error (reported by Nsight) or
anything else, and everything works fine. Kernel A is launched with ONE block only!
If you launch successive instances of the kernel into the same stream, each kernel launch is serialized with respect to the kernel instances before and after it. The programming model guarantees it. CUDA only permits simultaneous execution of kernels launched into different streams of the same context, and even then kernels only overlap if the scheduler determines that sufficient resources are available.
Neither __threadfence nor __syncthreads will have the effect you seem to be thinking about: __threadfence works only at the scope of all active threads, and __syncthreads is an intra-block barrier operation. If you really want kernel-to-kernel synchronization, you need to use one of the host-side synchronization calls, such as cudaThreadSynchronize (before CUDA 4.0) or cudaDeviceSynchronize (CUDA 4.0 and later), or the per-stream equivalent if you are using streams.
While I am a bit surprised by what you are experiencing, I believe your explanation may be correct.
Writes to global memory, with the exception of atomic functions, are not guaranteed to be immediately visible to other threads (from the same or from different blocks). By putting in __threadfence() you halt the current thread until the writes are in fact visible. This might be important in particular when you are using global memory with a cache (the Fermi series).
One thing to note: kernel calls are asynchronous. While your first kernel call is being handled by the GPU, the host may issue another call. The next kernel will not run in parallel with the current one, but will launch as soon as the current one finishes, essentially hiding the latency caused by the CPU->GPU communication.
Using cudaThreadSynchronize halts the host thread until all the CUDA tasks are done. It may help you, but it will also prevent you from hiding the CPU->GPU communication latency. Do note that synchronous memory copies (e.g. cudaMemcpy, without the "Async" suffix) essentially behave like cudaThreadSynchronize too.