constructor for SpriteBatch explained - libgdx

This is the definition of the SpriteBatch constructor from the docs:
SpriteBatch()
Constructs a new SpriteBatch with a size of 1000, one buffer, and the default shader.
A buffer is like temporary storage for data that needs to be drawn on screen. So does "one buffer" mean one piece of memory in RAM? And is the size parameter the number of bytes in that piece of memory?

Related

Shared memory size is limited to the maximum number of threads when using the atomicAdd function

I use atomic operations to calculate a summation of values, like a histogram.
So I first use shared memory to accumulate the values within each block, and then the values stored in shared memory for each block are added to global memory.
The whole code follows.
__global__ void KERNEL_RIGID_force_sum(part1*P1,part3*P3,int_t*nop_sol,Real*xcm,Real*ycm,Real*zcm,Real*sum_fx,Real*sum_fy,Real*sum_fz)
{
    int_t i=threadIdx.x+blockIdx.x*blockDim.x;
    if(i>=k_num_part2) return;
    if(P1[i].i_type==3) return;
    // if(P1[i].p_type<RIGID) return;
    // initialize accumulation array in shared memory
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128],tmp_ycm[128],tmp_zcm[128];
    __shared__ Real tmp_fx[128],tmp_fy[128],tmp_fz[128];
    tmp_nop[threadIdx.x]=0;
    tmp_xcm[threadIdx.x]=0;
    tmp_ycm[threadIdx.x]=0;
    tmp_zcm[threadIdx.x]=0;
    tmp_fx[threadIdx.x]=0;
    tmp_fy[threadIdx.x]=0;
    tmp_fz[threadIdx.x]=0;
    __syncthreads();
    Real xi,yi,zi;
    Real fxi,fyi,fzi;
    int_t ptypei;
    ptypei=P1[i].p_type;
    xi=P1[i].x;
    yi=P1[i].y;
    zi=P1[i].z;
    fxi=P3[i].ftotalx;
    fyi=P3[i].ftotaly;
    fzi=P3[i].ftotalz;
    // save values to shared memory
    atomicAdd(&tmp_nop[ptypei],1);
    atomicAdd(&tmp_xcm[ptypei],xi);
    atomicAdd(&tmp_ycm[ptypei],yi);
    atomicAdd(&tmp_zcm[ptypei],zi);
    atomicAdd(&tmp_fx[ptypei],fxi);
    atomicAdd(&tmp_fy[ptypei],fyi);
    atomicAdd(&tmp_fz[ptypei],fzi);
    __syncthreads();
    // save shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x],tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x],tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x],tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x],tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x],tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x],tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x],tmp_fz[threadIdx.x]);
}
But there are some problems.
Because the number of threads per block is 128 in my code, I allocate the shared memory and global memory arrays with a size of 128.
What can I do if I want to use shared memory arrays larger than the maximum block size of 1,024 threads (i.e., when there are more than 1,024 p_type values)?
If I allocate the shared memory arrays with a size of 1,024 or higher, the system says:
ptxas error : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)
To put it simply, I don't know what to do when the size of the reduction is more than 1,024.
Is it possible to index the calculation with something other than threadIdx.x?
Could you give me some advice?
Shared memory is limited in size. The default limit for most GPUs is 48KB. It has no direct connection to the number of threads in the threadblock. Some GPUs can go as high as 96KB, but you haven't indicated which GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
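If in doubt, you can query the per-block shared memory limits of your GPU at runtime; a minimal sketch using the standard runtime-API attribute queries:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0, smemDefault = 0, smemOptin = 0;
    cudaGetDevice(&dev);
    // Default per-block limit and the larger opt-in limit (using the opt-in
    // limit additionally requires cudaFuncSetAttribute on the kernel).
    cudaDeviceGetAttribute(&smemDefault, cudaDevAttrMaxSharedMemoryPerBlock, dev);
    cudaDeviceGetAttribute(&smemOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
    printf("shared memory per block: %d bytes (default), %d bytes (opt-in)\n",
           smemDefault, smemOptin);
    return 0;
}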
If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
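For illustration, a minimal sketch of that global-atomics variant (it reuses your types part1, part3, int_t, Real and the constant k_num_part2, and assumes int_t is a 32-bit int and Real is a type with an atomicAdd overload, e.g. float, or double on cc 6.0 and higher; the output arrays must be allocated with one slot per p_type value):

__global__ void KERNEL_RIGID_force_sum_global(part1* P1, part3* P3, int_t* nop_sol,
                                              Real* xcm, Real* ycm, Real* zcm,
                                              Real* sum_fx, Real* sum_fy, Real* sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;
    // The bin index may be arbitrarily large; there is no shared memory,
    // so the limit is only the size of the global output arrays.
    int_t ptypei = P1[i].p_type;
    atomicAdd(&nop_sol[ptypei], 1);
    atomicAdd(&xcm[ptypei], P1[i].x);
    atomicAdd(&ycm[ptypei], P1[i].y);
    atomicAdd(&zcm[ptypei], P1[i].z);
    atomicAdd(&sum_fx[ptypei], P3[i].ftotalx);
    atomicAdd(&sum_fy[ptypei], P3[i].ftotaly);
    atomicAdd(&sum_fz[ptypei], P3[i].ftotalz);
}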
Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.
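As a rough sketch of that idea (a hypothetical per-block binning kernel, assuming a cc 7.x or higher device and that reduced precision is acceptable; it also shows how a strided loop over the bins removes the dependence on threadIdx.x alone):

#include <cuda_fp16.h>

#define NUM_BINS 2048   // 2048 __half bins = 4 KB of shared memory

__global__ void bin_sum_fp16(const int* bin_of, const float* val, int n, float* out)
{
    __shared__ __half tmp[NUM_BINS];
    // Each thread initializes several bins, so the bin count is not tied
    // to the number of threads per block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        tmp[b] = __float2half(0.0f);
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&tmp[bin_of[i]], __float2half(val[i]));   // 16-bit atomicAdd, cc 7.x+
    __syncthreads();

    // Flush this block's partial sums to the global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&out[b], __half2float(tmp[b]));
}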

Is there any way to dynamically allocate constant memory? CUDA

I'm confused about copying arrays to constant memory.
According to the programming guide, there's at least one way to allocate constant memory and use it to store an array of values. This is called static memory allocation:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
According to the programming guide again, we can use:
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
It looks like dynamic constant memory allocation is used here, but I'm not sure about it. Also, no __constant__ qualifier is used.
So here are some questions:
Is this pointer stored in constant memory?
Is the memory it points to stored in constant memory too?
Is this pointer constant, so that it cannot be changed from device or host functions? And is changing the values of the array prohibited or not? If changing the values of the array is allowed, does that mean constant memory is not used to store these values?
The developer can declare up to 64K of constant memory at file scope. In SM 1.0, the constant memory used by the toolchain (e.g. to hold compile-time constants) was separate and distinct from the constant memory available to developers, and I don't think this has changed since. The driver dynamically manages switching between different views of constant memory as it launches kernels that reside in different compilation units. Although you cannot allocate constant memory dynamically, this pattern suffices because the 64K limit is not system-wide; it applies per compilation unit.
Use the first pattern cited in your question: statically declare the constant data and update it with cudaMemcpyToSymbol before launching kernels that reference it. In the second pattern, only reads of the pointer itself will go through constant memory. Reads using the pointer will be serviced by the normal L1/L2 cache hierarchy.
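For illustration, a minimal sketch of that first pattern (the kernel and coefficient array here are hypothetical):

__constant__ float constData[256];

__global__ void scaleAndShift(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Uniform reads like these are broadcast from the constant cache.
        out[i] = in[i] * constData[0] + constData[1];
}

void launch(const float* d_in, float* d_out, int n, const float* h_coeffs)
{
    // Refresh the constant data from the host before the launch that reads it.
    cudaMemcpyToSymbol(constData, h_coeffs, 256 * sizeof(float));
    scaleAndShift<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}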

cuda kernel for add(a,b,c) using texture objects for a & b - works correctly for 'increment operation' add(a,b,a)?

I want to implement a CUDA function 'add(a,b,c)' for adding (component-wise) two one-channel floating-point images 'a' and 'b' together and storing the result in the floating-point image 'c'. So 'c = a + b'.
The function will be implemented by first binding texture objects 'aTex' and 'bTex' to the pitch-linear images 'a' and 'b', and then accessing the image 'a' and 'b' inside the kernel only via the texture objects 'aTex' and 'bTex'. The sum is stored in 'c' via a simple write to global memory.
What happens now if I call the function to increment 'a' by 'b', i.e. I call 'add(a,b,a)'? Now the image 'a' is used in two places in the kernel: I read values from 'a' via the texture object 'aTex', and I also store values into 'a' via the write to global memory. Is it possible that this usage of the 'add' function leads to incorrect results?
The GPU's texture cache is not coherent. This means that a global memory write to a particular location of the global memory underlying a texture may or may not be reflected by a subsequent texture access to that same location. So there is a read-after-write hazard in such a scenario.
If, however, the code performs a global memory write to a particular location of the global memory underlying a texture, and that location subsequently is never read from via the texture during the lifetime of the kernel, there is no read-after-write hazard, and the code will behave as expected: The updated data in global memory can be accessed by a subsequent kernel in any manner desired, including texture access, as the texture cache is cleared upon a kernel launch.
I have personally used this approach to speed up in-place operations with small strides as the texture read path provided higher load performance. An example would be the BLAS-1 operation [D|S|Z|C]SCAL in CUBLAS, which scales each array element by a scalar.
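For concreteness, a rough sketch of such an add kernel (illustrative only; aTex and bTex are assumed to be cudaTextureObject_t objects created over the pitch-linear float images with unnormalized coordinates and point filtering, and cPitchElems is the pitch of 'c' in elements):

__global__ void addKernel(cudaTextureObject_t aTex, cudaTextureObject_t bTex,
                          float* c, size_t cPitchElems, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Each thread reads pixel (x, y) through the textures and writes only
    // that same pixel to global memory.
    float va = tex2D<float>(aTex, x + 0.5f, y + 0.5f);
    float vb = tex2D<float>(bTex, x + 0.5f, y + 0.5f);
    c[y * cPitchElems + x] = va + vb;
}

With this structure, a location written by the kernel is never read back through the texture afterwards, so the in-place call add(a,b,a) falls into the hazard-free case described above.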

How do I write data from Pixmap to Framebuffer's depth buffer?

I'm implementing texture masking using the framebuffer's depth buffer, following this example:
https://github.com/mattdesl/lwjgl-basics/wiki/LibGDX-Masking#masking-with-depth-buffer
I got it working with ShapeRenderer altering the depth buffer, but now I want to alter the depth buffer with a Pixmap. I need all non-opaque pixels from the pixmap to be written to the depth buffer with a depth value of 1.0, and all opaque ones with a depth value of 0.0.
One solution would be to write the individual pixels from the Pixmap with ShapeRenderer, but that seems rather inefficient. Is there a more appropriate and efficient way?
Texture myTex = new Texture(pixmap);
Should be pretty trivial to draw a texture to a FrameBuffer, no?

CUDA: Move content of a volume texture using only one kernel and a threadfence

I want to move the content of a volume texture along the vector vecShift. I'm thinking of a kernel like this:
__global__ void
moveVolume(int* vecShift)
{
    // Determine position of current voxel as ptDest
    // Determine position of voxel we copy the content from as ptSrc
    // Read value at ptSrc and store it to voxelColor
    // __threadfence()
    // Write voxelColor to voxel at position ptDest
}
The threadfence will ensure that ALL voxels have read the contents of their "partner", and that there will be no write to ptDest before every voxel has done its read operation, won't it?
If this is true, why do I (sometimes) get blurry artifacts? Or do I have a wrong understanding of how threadfence works?
As talonmies explains in the comments, using __threadfence() here is neither necessary nor sufficient. __threadfence() does not provide global barrier synchronization, it simply ensures that before the thread that calls __threadfence() proceeds, all writes by that thread before the fence are visible to all other active threads in the kernel launch.
What you really want here is to double buffer your volume data (i.e. write to a different array than you read). You cannot overwrite other parts of the array unless you can guarantee that they are only read by other threads in the same thread block. Otherwise you have a race condition and your program is incorrect.
Note: even in a sequential (CPU) implementation, you would need to double buffer your data for this type of computation!
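For illustration, a minimal sketch of the double-buffered approach (names and signature are hypothetical; srcTex is a 3D texture object over the current volume, dst is a separate output buffer, and the two are swapped between kernel launches):

__global__ void moveVolume(cudaTextureObject_t srcTex, float* dst,
                           int3 dim, int3 vecShift)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= dim.x || y >= dim.y || z >= dim.z) return;

    // Read the "partner" voxel from the input volume...
    float voxelColor = tex3D<float>(srcTex,
                                    x - vecShift.x + 0.5f,
                                    y - vecShift.y + 0.5f,
                                    z - vecShift.z + 0.5f);

    // ...and write it to a *separate* output volume. No fence or global
    // barrier is needed, because reads and writes never touch the same buffer.
    dst[(z * dim.y + y) * dim.x + x] = voxelColor;
}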
What you are implementing is very similar to an advection kernel, as would be used in fluid dynamics simulations, and I'm sure there are multiple examples of what you want on the web (parallel or sequential).
Mark