The nSight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even
assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which
case using shared memory as I did doesn't make sense). I would
like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to better understand the register usage, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch, and inspect the resultant .ptx file directly. To do this you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20 which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, however, or you may need to tune the parameter to a value which doesn't sacrifice too much performance. However you may find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill it's needed register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
If you want to increase the occupancy, a direct way is using compiler flag: maxregcount to restrict the usage of registers, but it may suffer a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to there.
In Debug Perspective, inside the CUDA Thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The window "Disassembly" will open your kernel PTX Assembly. You can continue stepping in your kernel to track the correlation of your source code and the assembly. So you can discover which register is used for.
Related
I have a C++ class container that allocates, lets say, 1GB of memory of plain objects (e.g. built-ins).
I need to copy part of the object to the GPU.
To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinning"), e.g. with cudaHostRegister(void*, size, ...) before copying.
(This seems to be a good way to copy further subsets of the memory with minimal logic. For example if plain cudaMemcpy is not enough.)
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.
I may want to register only part because of efficiency, but also because deep down in the call trace I might have lost information of the original allocated pointer.
In other words, can the pointer argument to cudaHostRegister be the something else other than an allocated pointer? in particular an arithmetic result deriving from allocated memory, but still within the allocated range.
It seems to work but I don't understand if, in general, "pinning" part of an allocation can corrupt somehow the allocated block.
UPDATE: My concern is that allocation is actually mentioned in the documentation for the cudaHostRegister flag options:
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system
with no unified virtual addressing, the memory will be neither mapped
nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one
that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling
cudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a
third-party PCIe device, and it will marked as non cache-coherent and
contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On
platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables,
this flag is required in order to register memory mapped to the CPU as
read-only. Support for the use of this flag can be queried from the
device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using
this flag with a current context associated with a device that does
not have this attribute set will cause cudaHostRegister to error with
cudaErrorNotSupported.
This is a rule-of-thumb answer rather than a proper one:
When the CUDA documentation does not guarantee something is guaranteed to work - you'll need to assume it doesn't. Because if it does happen to work - for you, right now, on the system you have - it might stop working in the future; or on another system; or in another usage scenario.
More specifically - memory pinning happens at page resolution, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for - which it could do, but it's going an extra mile to accommodate you, and I doubt that would happen without documentation.
I also suggest you file a bug report via developer.nvidia.com , asking that they clarify this point in the documentation. My experience is that there's... something like a 50% chance they'll do something about such a bug report.
Finally - you could just try it: Write a program which copies to the GPU with and without the pinning of the part-of-the-region, and see whether there's a throughput difference.
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.
While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.
Here's why: The alternative interpretation would be that only whole memory sections returned by, say, malloc should be allowed to be registered. However, this is unworkable, because malloc could, behind the scenes, have one big section allocated, and only give the user parts of it. So even if you (the user) were cudaHostRegistering those sections returned by malloc, they'd actually be fragments of some bigger previously allocated chunk of memory anyway.
By the way, Linux has a similar kernel call to lock memory called mlock. It accepts arbitrary memory ranges.
One of the other answers claimed (until this test was posted):
If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything
But this is incorrect: registering is worth it, if the chunk of memory being copied is big enough, even if the copying is done only once. I'm seeing about a 2x speed-up with this code (comment out the line indicated), or about 50% if unregistering is also done between the timers.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>
int main()
{
std::size_t giga = 1024*1024*1024;
std::vector<char> src(giga, 3);
char* dst = 0;
if(cudaMalloc((void**)&dst, giga)) return 1;
cudaDeviceSynchronize();
auto t0 = std::chrono::system_clock::now();
if(cudaHostRegister(src.data() + src.size()/2, giga/8, cudaHostRegisterDefault)) return 1; // comment out this line
if(cudaMemcpy(dst, src.data() + src.size()/2, giga/8, cudaMemcpyHostToDevice)) return 1;
cudaDeviceSynchronize();
auto t1 = std::chrono::system_clock::now();
auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
std::cout << (d / 1e6) << " seconds" << std::endl;
// un-register and free
}
Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
for (int i = 0; i < LOOPS; i++){
unsigned long time = clock64();
// while (clock64() < (time + (threadIdx.x * 1000)));
int start = test2;
atomicAdd((int *)&test2, 1);
int end = test2;
int diff = end - start;
atomicMin(&test3, diff);
}
}
int main() {
kernel<<<1, 1024>>>();
int result;
cudaMemcpyFromSymbol(&result, test3, sizeof(int));
printf("result = %d threads\n", result);
return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requres a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as a unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch warps on the same SM but different warp schedulers cannot issue the first instruction of a warp on the same cycle
All threads in the block that have the same clock time on CC1.0 - 3.5 devices (may change in the future) will have the same clock time.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of of 1 block with increasing thread count per launch. Determine when the thread count causing warps_launched to go from 1 to 2.
From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered
memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the data is observed being
written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the read instructions appear in
the program for instructions that are independent of each other
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen --- for example --- that store instruction is non-blocking and the load instruction that follows somehow is managed by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide simply stating, that you cannot predict in which order the threads is executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point being, that if you have several threads dooing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int size N){
x[threadId.x] = threadId.x;
result[threadId.x] = 0;
for(int i = 0; i < N; i++)
result[threadId.x] += x[threadID.x];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that you your lowest indexed thread is enumerated thread #0, you would expect that the first time your code runs this code x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure, if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically what you are asking is, if running your code on a GFX will change the functionality of your code.
The cores of a GPU are generally the same, as that of a CPU, just with less control-arithmetics, a smaller instructionset and typically only supporting single-precision.
In a CUDA-GPU there is 1 program counter for each Warp (section of 32 synchronous cores). Like a CPU, the program counter increases by magnitude of one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this can not be changed.
Branches and jumps can only be introduced by the software running on the core, and hence is determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler
So in short - Your code will always be executed in the order it is ordered in the memory, no matter if it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understood you're basically asking whether memory dependencies and alias analysis information are being respected in the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes because as Robert noted the CUDA compiler uses LLVM under the hood and two basic modules (which, at the moment, I really don't think they could be excluded by the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use live-analysis on variables (even out of the block scope) to avoid dangerous optimizations (e.g. you can't write in a live variable before its next read, the data may still be useful).
I don't know the compiler internals but assuming (as any other reasonably trusted compiler) that it will do its best to be bug-free, the analysis that take place in there should really not bother you at all and assure you that at least in theory what you just presented as an example (i.e. the dependent-load faster than the store) cannot happen.
What guarantee you that? Nothing but the fact that the company is giving a compiler to use, and there are disclaimers in case it doesn't for exceptional cases :)
Also: aside from the compiler topic, the instruction execution is also dependent on the hardware specification. In this case, a SIMT hardware instruction issuing unit
cfr. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information
I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.
__global__ void kernel(){
int x1,x2;
x1=5;
x2=1;
for (int k=0;k<=1000000;k++)
{
x1+=x2;
}
}
The kernel is very simple, it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all its 512 thread instances) contains 10^6 + 5
I am trying to measure the execution time of the kernel. using both visual studio parallel nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.
The issue is the following: I may increase largely the size of the loop eg to 10^8 and the time remains constant. Same if I decrease the loop size a lot. Why does this happen?
Please note that if I use shared memory or global memory inside the loop, the measurements reflect the work being performed (i.e. there is proportionality).
As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because x2 doesn't participate in a value which is written to memory, it and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace it with a constant.
To get around both of these problems, rewrite your code like this:
__global__
void kernel(int *out, int x0, bool flag)
{
int x1 = x0, x2 = 1;
for (int k=0; k<=1000000; k++) {
x1+=x2;
}
if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
}
and then run it like this:
kernel<<<1,512>>>((int *)0, 5, false);
By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and then memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, there is no store performed, so that doesn't effect the timing of the loop.
Because compiler eliminates the dead paths. You code doesn't actually do anything. Look at the assembly.
If you are actually seeing the value, then the compiler may have just optimized out the loop as it can know the value during compile time.
When you write out the register contents to shared memory, compiler cannot guarantee that the result will not be used, and hence the value will actually be computed. In other words, the value you compute must be used somewhere eventually or written to memory otherwise its computation will be dropped.
I know that there is function clock() in CUDA where you can put in kernel code and query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's tool kit).
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming her for help: Short introduction to profiling kernel runtime with OpenCL
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-goups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: Profiling of a single work-item (some work-items) isn't very helpful. With only some work-items you won't be able to hide memory latencies and the overhead leading to not meaningful measurements.
Try this (Only work with NVidia OpenCL of course) :
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
return clock_time;
}
The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
uint64_t n_clock;
asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query for device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0))). Also note that on older devices there are two frequencies (or three if you count the memory frequency which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
Now, 10 years later after the question was posted I did some tests on NVidia. I tried running the answers given by user 'Spectral' and 'the swine'. Answer given by 'Spectral' does not work. I always got same invalid values returned by clock_time function.
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
return clock_time;
}
After subtracting start and end time I got zero.
So had a look at the PTX assembly which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually work is this:
uint clock_time()
{
uint clock_time;
asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.