My understanding of memory bandwidth is that it is reported using SI units, kilo = 10^3, mega = 10^6, etc. (although memory size is obviously reported in 2^n form).
The CUDA bandwidthTest sample seems to flout this. The code below is straight from the SDK sample, where memSize is an integer describing the size of the array (32*2^20 by default) and MEMCOPY_ITERATIONS is an integer.
Let's say the elapsed time is 1000 ms and MEMCOPY_ITERATIONS = 1; the result would then be 64 MB/s, but with MB meaning 2^20 bytes. Is my assumption correct, and if so, is the binary notation of bandwidth accepted?
I thought it wasn't.
//calculate bandwidth in MB/s
bandwidthInMBs = 2.0f * (1e3f * memSize * (float)MEMCOPY_ITERATIONS) /
(elapsedTimeInMs * (float)(1 << 20));
EDIT: On the off chance that anyone ever searches for this again, an altered bandwidthTest that reports in SI MB/s is here, adapted from the CUDA 5.5 SDK and including the Visual Studio Projects.
bandwidthTest gives results in binary MB/s?
Yes.
is the binary notation of bandwidth accepted?
Perhaps not.
(those were the only 2 questions I could find.)
Since the binary megabyte is larger than the SI megabyte, it would seem that the bandwidthTest sample code is under-reporting if you interpret its results according to SI units. As sample code, its primary purpose is to educate and instruct, not to conform to a particular definition.
You have the source code -- you can make your version of it report any way that you want.
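For reference, here is a minimal stand-alone sketch of the SDK formula next to an SI-reporting variant, using the numbers from the question (32*2^20 bytes, one iteration, 1000 ms); the only change is dividing by 10^6 instead of 2^20:

#include <stdio.h>

int main(void)
{
    // values from the question: 32*2^20 bytes copied once in 1000 ms
    float memSize = 32.0f * (1 << 20);
    float elapsedTimeInMs = 1000.0f;
    int MEMCOPY_ITERATIONS = 1;

    // SDK formula: binary MB/s (2^20 bytes per MB)
    float binaryMBs = 2.0f * (1e3f * memSize * (float)MEMCOPY_ITERATIONS) /
                      (elapsedTimeInMs * (float)(1 << 20));

    // SI variant: divide by 10^6 instead of 2^20
    float siMBs = 2.0f * (1e3f * memSize * (float)MEMCOPY_ITERATIONS) /
                  (elapsedTimeInMs * 1e6f);

    printf("binary: %.1f MB/s, SI: %.1f MB/s\n", binaryMBs, siMBs); // 64.0 vs 67.1
    return 0;
}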
I'm using the vDSP framework for a real-time audio application based on FFT computation.
After having lots of problems trying to figure out why the algorithm was producing incorrect results, I found the following comment in the official vDSP FFT example code (DemonstrateFFT.c, lines 242, 416, 548):
/* Zero the signal before timing because repeated FFTs on non-zero
data can cause abnormalities such as infinities, NaNs, and
subnormal numbers.
*/
To reproduce the error, just comment out line 247 (so the signal is not zeroed) and add something similar to the following line at line 273 (just after the vDSP_fft_zrip call):
if (isnan(Observed.realp[0])) printf("Iteration %lu: NaN\n",i); // it would work with any of the components of Observed
It is interesting to observe that reducing N (i.e. increasing the number of FFTs per unit of time) makes the zrip algorithm fail sooner, which kind of makes sense since the comment warns about performing repeated FFTs.
The behavior is also observed with the vDSP_fft_zrop algorithm.
I'm really wondering what the point is of performing FFTs on "zero data" as the comment advises. Either I'm missing something important, or the vDSP framework is simply not suited for real-time audio processing.
Normal 16- and 24-bit "real time" audio samples will not see this issue.
But benchmarks can create bigger and smaller numbers that can exceed the range of double-precision floats when iterated enough times, and when using many functions, not just FFTs. Try iterating exp() fed back to itself; that will blow up even faster. It's a problem one encounters with any finite-precision computer arithmetic (not just on the ARM and x86 CPUs that vDSP uses).
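For illustration, a minimal stand-alone sketch of the exp() feedback idea (plain C, nothing vDSP-specific assumed):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0;
    for (int i = 0; i < 6; i++) {
        x = exp(x);                          // feed the result back into exp()
        printf("iteration %d: %g\n", i, x);  // overflows to inf within a few iterations
    }
    return 0;
}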
I'm doing matrix multiplication on a GTX1080 GPU using JCuda, version 0.8.0RC with CUDA 8.0. I load two matrices A and B into the device in row-major vector form, and read the product matrix from the device. But I'm finding that I run out of device memory earlier than I would expect. For example, if matrix A is dimensioned 100000 * 5000 = 500 million entries = 2GB worth of float values, then:
cuMemAlloc(MatrixA, 100000 * 5000 * Sizeof.FLOAT);
works fine. But if I increase the number of rows to 110000 from 100000, I get the following error on this call (which is made before the memory allocations for matrices B and C, so those are not part of the problem):
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_OUT_OF_MEMORY
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:344)
at jcuda.driver.JCudaDriver.cuMemAlloc(JCudaDriver.java:3714)
at JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:84) (my code)
The issue is that allocating a matrix of this size on the device should take only about 2.2GB, and the GTX1080 has 8GB of memory, so I don't see why I'm running out. It's true that I'm using JCuda 0.8.0RC with the release version of CUDA 8; I tried downloading the RC version of CUDA 8 (8.0.27) to use with JCuda 0.8.0RC but had some problems getting it to work. If version compatibility is likely to be the issue, however, I can try again.
Matrices of 100000 * 5000 are pretty big, of course, and I won't need to work with larger matrices for a while on my neural network project, but I would like to be confident that I can use all 8GB of memory on this new card. Thanks for any help.
tl;dr:
When calling
cuMemAlloc(MatrixA, (long)110000 * 5000 * Sizeof.FLOAT);
// ^ cast to long here
or alternatively
cuMemAlloc(MatrixA, 110000L * 5000 * Sizeof.FLOAT);
// ^ use the "long" literal suffix here
it should work.
The last argument to cuMemAlloc is of type size_t. This is an implementation-specific unsigned integer type for "arbitrary" sizes. The closest possible primitive type in Java for this is long. And in general, every size_t in CUDA is mapped to long in JCuda. In this case, the Java long is passed as a jlong into the JNI layer, and this is simply cast to size_t for the actual native call.
(The lack of unsigned types in Java and the odd plethora of integer types in C can still cause problems. Sometimes, the C types and the Java types just don't match. But as long as the allocation is not larger than 9 Million Terabytes (!), a long should be fine here...)
But the comment by havogt led to the right track. What happens here is indeed an integer overflow: the computation of the actual value
110000 * 5000 * Sizeof.FLOAT = 2200000000
is by default done using the int type in Java, and this is where the overflow happens: 2200000000 is larger than Integer.MAX_VALUE (2147483647), so the result is a negative value. When this is cast to the (unsigned) size_t value in the JNI layer, it becomes a ridiculously large positive value, which clearly causes the error.
When doing the computation using long values, either by explicitly casting to long or by appending the L suffix to one of the literals, the value is passed to CUDA as the proper long value of 2200000000.
I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn't matter here, so I won't explain much about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010, with CUDA 5.0, and my device is compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don't forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue what this algorithm does, let's simply say that it learns how to predict a target value based on 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.
I spent a really long time trying to run this code on the device, and I'm still unsuccessful. The last attempt was made by simply copying the code that initializes the arrays and runs the algorithm into one big kernel. This failed again. That code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the “w” array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory
I think I did the minimal amount of modification needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded into device memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating the t and x pointers in the device's Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
the CUDA API is not behaving like it should!
I've been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64-bit machine mode but not in 32-bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32-bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on iteration i=7, when the write to a[0][7] occurs; this write is out of bounds. a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, on a 64-bit machine the two types of storage are the same size, so it ends up working. But on a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, on a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
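If you want to convince yourself of the size difference, a trivial host-side check is enough (plain C, compiled once as a 32-bit and once as a 64-bit build):

#include <stdio.h>

int main(void)
{
    /* On a typical 32-bit build a pointer is 4 bytes and a double is 8,
       so N * sizeof(double *) provides only half the storage the double
       data needs; on a 64-bit build both are 8 bytes, hiding the bug. */
    printf("sizeof(double *) = %u\n", (unsigned)sizeof(double *));
    printf("sizeof(double)   = %u\n", (unsigned)sizeof(double));
    return 0;
}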
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code, too.
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.
I am very new to cuda and started reading about parallel programming and cuda just a few weeks ago. After I installed the cuda toolkit, I was browsing the sdk samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try a bigger one (for example 960x960 or 1024x1024). In this case, something crashes: I get a black screen, and then the message "display driver stopped responding and has recovered".
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and should I alter something else in this example for it to work properly? Thanks!
Edit: As some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60). My driver no longer crashes, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the memory allocation? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerances provided in the NVidia example. Your kernel is running correctly. It's just complaining that the accumulated error is greater than the limit they set for this example. This is simply because you're doing a lot more math operations, which all accumulate error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matrices were smaller and thus required far fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the number of operations increases much faster than the size of the matrix, and the accumulated error grows in proportion to the number of operations.
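As a side note, here is a minimal sketch (unrelated to the SDK code itself) showing how a single-precision result drifts from a double-precision reference as the number of accumulated operations grows:

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    float  accF = 0.0f;
    double accD = 0.0;
    for (int i = 0; i < n; i++) {
        accF += 0.1f;    // 0.1 is not exactly representable in binary
        accD += 0.1;
    }
    // the float sum is visibly off from 100000; more additions, more error
    printf("float sum  = %f\n", accF);
    printf("double sum = %f\n", accD);
    printf("difference = %g\n", accD - (double)accF);
    return 0;
}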
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase the size of this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison for that against the gold, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing that you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking going on in the kernel, so if you just change the kernel launch configuration, but not the data, then odds are you're writing past your data into some other memory
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matrices caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.
In reduction.pdf, the reduction method is introduced through 7 steps, with 16777216 elements. In the 1st step the effective bandwidth is 2.083 GB/s. How does that 2.083 GB/s come about? And how does the 2nd step's bandwidth of 4.854 GB/s come about?
The bandwidth figures are calculated using the number of bytes in the reduction input data divided by the execution time (note there are 2^22 integers = 16777216 bytes). The calculation is clearly shown on page 10 of the pdf that ships in the SDK in reduction/doc.
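As a rough illustration of that arithmetic (the execution time below is back-derived from the quoted 2.083 GB/s figure rather than taken from the pdf):

#include <stdio.h>

int main(void)
{
    const double numInts = 1 << 22;         // 4194304 input integers
    const double bytes   = numInts * 4.0;   // 16777216 bytes of input data
    const double timeSec = 8.054e-3;        // assumed execution time of the first kernel
    printf("%.3f GB/s\n", bytes / timeSec / 1e9);  // prints roughly 2.083
    return 0;
}

The later kernels process the same 16777216 bytes in less time, which is how the higher 4.854 GB/s figure for the second step arises.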