I am running into a CUDA "invalid resource handle" error when allocating buffers on the GPU.
1. I downloaded the code: git clone https://github.com/Funatiq/gossip.git.
2. I built the project in the gossip directory: git submodule update --init && make. This produced the compiled binary execute.
3. Then I generated a scatter and a gather plan for my main GPU (here it is GPU 0):
$python3 scripts/plan_from_topology_asynch.py gather 0
$python3 scripts/plan_from_topology_asynch.py scatter 0
This generates scatter_plan.json and gather_plan.json.
4. Finally, I executed the plan:
./execute scatter_gather scatter_plan.json gather_plan.json
The error was pointing to these lines of code:
std::vector<size_t> bufs_lens_scatter = scatter.calcBufferLengths(table[main_gpu]);
print_buffer_sizes(bufs_lens_scatter);
std::vector<data_t *> bufs(num_gpus);
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
cudaSetDevice(context.get_device_id(gpu)); CUERR
cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The detailed error output is:
RUN: scatter_gather
INFO: 32768 bytes (scatter_gather)
TIMING: 0.463872 ms (malloc_devices)
TIMING: 0.232448 ms (zero_gpu_buffers)
TIMING: 0.082944 ms (init_data)
TIMING: 0.637952 ms (multisplit)
Partition Table:
470 489 534 553 514 515 538 483
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Required buffer sizes:
0 538 717 604 0 344 0 687
TIMING: 3.94455e-31 ms (malloc_buffers)
CUDA error: invalid resource handle : executor.cuh, line 405
For reference, I have attached the complete error report here. The curious part is that the author cannot reproduce this error on his server, but when I run it on a DGX workstation with 8 GPUs, the error occurs. I am not sure whether this is a CUDA programming error or an environment-specific issue.
The code has a defect in its handling of cudaEventRecord() as used in the TIMERSTART and TIMERSTOP macros defined here and used here (with the malloc_buffers label).
CUDA events acquire an implicit device association when they are created: they belong to the device that was current (i.e. selected by the most recent cudaSetDevice() call) at creation time. As stated in the programming guide:
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
(note that each device has its own null stream - these events are being recorded into the null stream)
And if we run the code with cuda-memcheck, we observe that the invalid resource handle error is indeed being returned by a call to cudaEventRecord().
Specifically referring to the code here:
...
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
cudaSetDevice(context.get_device_id(gpu)); CUERR
cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The TIMERSTART macro defines and creates two CUDA events, one of which it immediately records (the start event). The TIMERSTOP macro uses the stop event that was created in the TIMERSTART macro. However, we can see that the intervening code has likely changed the current device from the one that was in effect when these two events were created (because of the cudaSetDevice call in the for-loop). Therefore the cudaEventRecord (and cudaEventElapsedTime) calls fail due to this invalid usage.
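To see the hazard in isolation, here is a minimal standalone sketch (not taken from the gossip sources); on a machine with at least two GPUs it is expected to report the same "invalid resource handle" error, because the event is created while device 0 is current but recorded into device 1's null stream:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) { printf("need at least 2 GPUs to reproduce\n"); return 0; }

    cudaSetDevice(0);
    cudaEvent_t ev;
    cudaEventCreate(&ev);                       // ev is implicitly associated with device 0

    cudaSetDevice(1);                           // like the cudaSetDevice call inside the loop
    cudaError_t err = cudaEventRecord(ev, 0);   // records into device 1's null stream
    printf("cudaEventRecord: %s\n", cudaGetErrorString(err));
    return 0;
}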
As a proof point, when I add cudaSetDevice calls to the macro definitions as follows:
#define TIMERSTART(label) \
cudaEvent_t timerstart##label, timerstop##label; \
float timerdelta##label; \
cudaSetDevice(0); \
cudaEventCreate(&timerstart##label); \
cudaEventCreate(&timerstop##label); \
cudaEventRecord(timerstart##label, 0);
#endif
#ifndef __CUDACC__
#define TIMERSTOP(label) \
timerstop##label = std::chrono::system_clock::now(); \
std::chrono::duration<double> \
timerdelta##label = timerstop##label-timerstart##label; \
std::cout << "# elapsed time ("<< #label <<"): " \
<< timerdelta##label.count() << "s" << std::endl;
#else
#define TIMERSTOP(label) \
cudaSetDevice(0); \
cudaEventRecord(timerstop##label, 0); \
cudaEventSynchronize(timerstop##label); \
cudaEventElapsedTime( \
&timerdelta##label, \
timerstart##label, \
timerstop##label); \
std::cout << \
"TIMING: " << \
timerdelta##label << " ms (" << \
#label << \
")" << std::endl;
#endif
The code runs without error for me. I'm not suggesting this is the correct fix. The correct fix may be to properly set the device before calling the macro. It seems evident that either the macro writer did not expect this kind of usage, or else was unaware of the hazard.
The only situation I could imagine where the error would not occur would be in a single-device system. When the code maintainer responded to your issue that they could not reproduce the issue, my guess is they have not tested the code on a multi-device system. As near as I can tell, the error would be unavoidable in a multi-device setup.
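A sketch of that call-site alternative (leaving the macros untouched) could look like this: remember which device was current when TIMERSTART created its events, and restore it before TIMERSTOP records the stop event.
int timer_device = 0;
cudaGetDevice(&timer_device);                       // device that is current at TIMERSTART
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
    cudaSetDevice(context.get_device_id(gpu)); CUERR
    cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
cudaSetDevice(timer_device); CUERR                  // back on the device the events belong to
TIMERSTOP(malloc_buffers)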
I added -fPIC to the QEMU source compilation options and -shared to the final link command, so that QEMU becomes a shared library that can be loaded dynamically; that is where my attempt to understand QEMU started. I use dlopen to load QEMU dynamically and dlsym to look up functions inside it. This is my code:
#include <iostream>
#include <cstdio>   // for printf
#include <dlfcn.h>
#include <stdint.h>
using namespace std;
int main(int argc, char* argv[], char* envp[])
{
    void* handle = dlopen("/home/jack/qemu/qemu-5.0.0/arm-softmmu/libqemu-system-arm.so", RTLD_NOW);
    if(handle == nullptr)
    {
        printf("%s\n", dlerror());
        return 0;
    }
    void    (* qemu_init )(int,char**,char**);
    void    (* qemu_main_loop )(void);
    void    (* qemu_cleanup )(void);
    bool    (* main_loop_should_exit )(void);
    void    (* main_loop_wait )(int);
    int64_t (* cpu_get_icount )(void);
    int64_t (* cpu_get_icount_raw )(void);
    int64_t (* cpu_icount_to_ns )(int64_t);
    int64_t (* cpu_get_clock )(void);
    int64_t (* cpu_get_ticks )(void);
#define GET_SYMBOL_AND_CHECK(X) *((void**)(&X)) = dlsym(handle,#X); if(nullptr == X){printf("lost symbol: "#X"\n"); return 0;}
    GET_SYMBOL_AND_CHECK(qemu_init);
    GET_SYMBOL_AND_CHECK(qemu_main_loop);
    GET_SYMBOL_AND_CHECK(qemu_cleanup);
    GET_SYMBOL_AND_CHECK(main_loop_should_exit);
    GET_SYMBOL_AND_CHECK(main_loop_wait);
    GET_SYMBOL_AND_CHECK(cpu_get_icount);
    GET_SYMBOL_AND_CHECK(cpu_get_icount_raw);
    GET_SYMBOL_AND_CHECK(cpu_icount_to_ns);
    GET_SYMBOL_AND_CHECK(cpu_get_clock);
    GET_SYMBOL_AND_CHECK(cpu_get_ticks);
#undef GET_SYMBOL_AND_CHECK
    char* _argv[] =
    {
        "qemu-system-arm",
        "-M",
        "vexpress-a9",
        "-nographic",
        "-kernel",
        "/home/jack/temp/u-boot-2015.01/u-boot",
        "-icount",
        "1",
        "-singlestep",
        "-S",
        "-s",
    };
    int _argc = sizeof(_argv) / sizeof(_argv[0]);
    qemu_init(_argc, _argv, envp);
    while(!main_loop_should_exit())
    {
        main_loop_wait(false);
        // my test code:
        int64_t icount_raw = cpu_get_icount_raw();
        int64_t icount = cpu_get_icount();
        int64_t ticks = cpu_get_ticks();
        int64_t clock = cpu_get_clock();
        printf("----------icount_raw: %jd\n", icount_raw);
        printf("----------icount: %jd\n", icount);
        printf("----------ticks: %jd\n", ticks);
        printf("----------clock: %jd\n", clock);
    }
    qemu_cleanup();
    dlclose(handle);
    return 0;
}
The output of the program is as follows:
----------icount_raw: 0
----------icount: 0
----------ticks: 0
----------clock: 27595
----------icount_raw: 0
----------icount: 0
----------ticks: 0
----------clock: 47394
U-Boot 2015.01 (May 25 2020 - 14:42:11)
DRAM: 128 MiB
WARNING: Caches not enabled
Flash: 128 MiB
MMC: MMC: 0
*** Warning - bad CRC, using default environment
In: serial
Out: serial
Err: serial
Net: smc911x-0
Warning: smc911x-0 using MAC address from net device
Warning: Your board does not use generic board. Please read
doc/README.generic-board and take action. Boards not
upgraded by the late 2014 may break or be removed.
Hit any key to stop autoboot: 2 ----------icount_raw: 60040125
----------icount: 120271139
----------ticks: 120271139
----------clock: 1001128004
----------icount_raw: 119738560
----------icount: 239668009
----------ticks: 239668009
----------clock: 2002239949
----------icount_raw: 180295711
----------icount: 360782311
----------ticks: 360782311
----------clock: 3003347066
----------icount_raw: 240405702
----------icount: 481002293
----------ticks: 481002293
----------clock: 4004446427
----------icount_raw: 300858002
----------icount: 601906893
----------ticks: 601906893
----------clock: 5005552419
----------icount_raw: 361297422
----------icount: 722785733
----------ticks: 722785733
----------clock: 6006625721
----------icount_raw: 420679210
----------icount: 841549309
----------ticks: 841549309
----------clock: 7007717838
----------icount_raw: 424900860
----------icount: 849992609
----------ticks: 849992609
----------clock: 7082080834
----------icount_raw: 424900883
----------icount: 849992655
----------ticks: 849992655
----------clock: 7082105752
----------icount_raw: 424900906
----------icount: 849992701
----------ticks: 849992701
----------clock: 7082120318
QEMU: Terminated
I modeled QEMU's main loop and wrote this while loop. I print the acquired data on every iteration, and I find that icount_raw indicates the number of instructions the CPU has executed so far; I am still confused about the other values. While this program runs, the U-Boot guest runs normally. The data is printed roughly once per second, and icount_raw increases by a large amount each time. However, when I control the program remotely with gdb and step with the "si" command, icount_raw increases by exactly 1 each time, which is what I want to achieve: QEMU executes only one instruction and then returns to the main loop.
I want to know how to modify QEMU so that it returns to the main loop after every executed instruction, without relying on gdb's "si" command. Later I would also like to control QEMU so that it returns to the main loop after every N instructions, where N can be set freely by me. I understand that QEMU's event loop is based on GLib, and I think my problem may require modifying the code in QEMU that calls GLib.
Trying to take the internals of QEMU and put them into a DLL is something that's entirely unsupported, and so you're on your own for trying to figure out how to fix bugs in it, I'm afraid.
In general the behaviour you describe is expected: QEMU compiles guest code into "translation blocks" which correspond to multiple guest instructions, and it then also tries to directly create jumps between these translation blocks where it can. This is important for performance: we don't want to return to the top level execution loop more often than we absolutely have to.
Some of these optimisations are controllable: in upstream QEMU, the -singlestep command line option means "put only one guest instruction in each TB", and the "-d nochain" option means "don't do the optimisation that links TBs together". Mostly it's useful for debugging purposes to do this kind of thing: the behaviour is more easily comprehensible and debug logs easier to read. The downside is that the performance goes through the floor.
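As a sketch only (this shows how the two flags would be passed in the question's _argv table; it is not a claim that these options make the loaded library return to your while loop per instruction):
char* _argv[] =
{
    "qemu-system-arm",
    "-M", "vexpress-a9",
    "-nographic",
    "-kernel", "/home/jack/temp/u-boot-2015.01/u-boot",
    "-icount", "1",
    "-singlestep",      // one guest instruction per translation block
    "-d", "nochain",    // do not chain translation blocks together
    "-S", "-s",
};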
I find that what Thrust can provide is quite limited. For the computation shown in the loop below, I end up with 9*9*2 (one multiply plus one reduce) Thrust calls, which is 162 kernel launches, whereas if I wrote my own kernel, only one kernel launch would be needed.
for(i=1;i<=9;i++)
{
    for(j=i;j<=9;j++)
    {
        ATA[i][j]=0;
        for(m=1;m<=50000;m++)
            ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
    }
}
I then end up with the following Thrust implementation:
for(i=1;i<=dim0;i++)
{
    for(j=i;j<=dim0;j++)
    {
        thrust::transform(t_d_X+(idx0[i]-1)*(1+iNumPaths)+1, t_d_X+(idx0[i]-1)*(1+iNumPaths)+iNumPaths+1, t_d_X+(idx0[j]-1)*(1+iNumPaths)+1, t_d_cdataMulti, thrust::multiplies<double>());
        ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti+iNumPaths, (double) 0, thrust::plus<double>());
    }
}
Some analysis:
1. transform_reduce: will NOT help, because there is an indirection through idx0[i] and basically two arrays are involved: the first is X[idx0[i]], the second is X[idx0[j]].
2. reduce_by_key: will help, but I need to store all the interim results in one big array and prepare a huge mapping-key table of the same size. I will try it out.
3. transform_iterator: will NOT help, for the same reason as 1.
So do you think I can't avoid writing my own kernel?
I'll bet #m.s. can provide a more efficient approach, but here is one possibility. In order to get the entire computation reduced to a single kernel call by thrust, it is necessary to handle everything with a single thrust algorithm call. At the heart of the operation, we are summing many computations together to fill a matrix, so I believe thrust::reduce_by_key is an appropriate algorithm to use. This means we must realize all the other transformations using various thrust "fancy iterators", which are mostly covered in the thrust getting started guide.
Attempting to do this (handle everything with a single kernel call) makes the code very dense and hard to read. I don't normally like to demonstrate thrust this way, but since it is the crux of your question, it cannot be avoided. So let's unpack the sequence of operations contained in the call to reduce_by_key, working approximately from the inside out. The general basis of this algorithm is to "flatten" all the data into a single long logical vector. For the sake of understanding, assume that our square matrix dimensions are only 2x2 and that the length of our m vector is 3. You can think of the "flattening", or linear-index conversion, like this:
linear index: 0 1 2 3 4 5 6 7 8 9 10 11
i index: 0 0 0 0 0 0 1 1 1 1 1 1
j index: 0 0 0 1 1 1 0 0 0 1 1 1
m index: 0 1 2 0 1 2 0 1 2 0 1 2
k index: 0 0 0 1 1 1 2 2 2 3 3 3
The "k index" above is our keys that will ultimately be used by reduce_by_key to collect product terms together, for each element of the matrix. Note that the code has EXT_I, EXT_J, EXT_M, and EXT_K helper macros which will define, using thrust placeholders, the operation to be performed on the linear index (created using a counting_iterator) to produce the various other "indices".
The first thing we will need to do is construct a suitable thrust operation to convert the linear index into the transformed value of idx0[i] (again, working from "inward to outward"). We can do this with a permutation iterator on idx0 vector, with a transform_iterator supplying the "map" for the permutation iterator - this transform iterator just converts the linear index (mb) to an "i" index:
thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I))
Now we need to combine the result from step 1 with the other index - m in this case - to generate a linearized version of the 2D index into X (d_X is the vector-linearized version of X). To do this, we combine the result of step 1 in a zip_iterator with another transform iterator that creates the m index. This zip_iterator is passed to a transform_iterator which takes the two indices and converts them into a linearized index to "look into" the d_X vector:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx()))
create_Xidx is the functor that takes the two computed indices and converts them into the linear index into d_X.
With the result from step 2, we can then use a permutation iterator to grab the appropriate value from d_X for the first term in the multiplication:
thrust::make_permutation_iterator(d_X.begin(), {code from step 2})
repeat steps 1,2,3, using EXT_J instead of EXT_I, to create the second term in the multiplication:
X[idx0[i]][m]*X[idx0[j]][m]
Place the terms created in steps 3 and 4 into a zip_iterator, for use by the transform_iterator that will multiply them together (using the my_mult functor) to create the actual product:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple({result from step 3}, {result from step 4})), my_mult())
The remainder of the reduce_by_key is fairly straightforward. We create the keys index as described previously, and then use it to sum together the various products for each element of the square matrix.
Here is a fully worked example:
$ cat t875.cu
#include <iostream>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
// rows
#define D1 9
// cols
#define D2 9
// size of m
#define D3 50
// helpers to convert linear indices to i,j,m or "key" indices
#define EXT_I (_1/(D2*D3))
#define EXT_J ((_1/(D3))%D2)
#define EXT_M (_1%D3)
#define EXT_K (_1/D3)
void test_cpu(float ATA[][D2], float X[][D3], int idx0[]){
    for(int i=0;i<D1;i++)
    {
        for(int j=0;j<D2;j++)
        {
            ATA[i][j]=0;
            for(int m=0;m<D3;m++)
                ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
        }
    }
}

using namespace thrust::placeholders;

struct create_Xidx : public thrust::unary_function<thrust::tuple<int, int>, int>{
    __host__ __device__
    int operator()(thrust::tuple<int, int> &my_tuple){
        return (thrust::get<0>(my_tuple) * D3) + thrust::get<1>(my_tuple);
    }
};

struct my_mult : public thrust::unary_function<thrust::tuple<float, float>, float>{
    __host__ __device__
    float operator()(thrust::tuple<float, float> &my_tuple){
        return thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple);
    }
};

int main(){
    //synthesize data
    float ATA[D1][D2];
    float X[D1][D3];
    int idx0[D1];
    thrust::host_vector<float> h_X(D1*D3);
    thrust::host_vector<int> h_idx0(D1);
    for (int i = 0; i < D1; i++){
        idx0[i] = (i + 2)%D1; h_idx0[i] = idx0[i];
        for (int j = 0; j < D2; j++) {ATA[i][j] = 0;}
        for (int j = 0; j < D3; j++) {X[i][j] = j%(i+1); h_X[i*D3+j] = X[i][j];}}
    thrust::device_vector<float> d_ATA(D1*D2);
    thrust::device_vector<float> d_X = h_X;
    thrust::device_vector<int> d_idx0 = h_idx0;
    // helpers
    thrust::counting_iterator<int> mb = thrust::make_counting_iterator(0);
    thrust::counting_iterator<int> me = thrust::make_counting_iterator(D1*D2*D3);
    // perform computation
    thrust::reduce_by_key(thrust::make_transform_iterator(mb, EXT_K), thrust::make_transform_iterator(me, EXT_K), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())), thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_J)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())))), my_mult()), thrust::make_discard_iterator(), d_ATA.begin());
    thrust::host_vector<float> h_ATA = d_ATA;
    test_cpu(ATA, X, idx0);
    std::cout << "GPU: CPU: " << std::endl;
    for (int i = 0; i < D1*D2; i++)
        std::cout << i/D1 << "," << i%D2 << ":" << h_ATA[i] << " " << ATA[i/D1][i%D2] << std::endl;
}
$ nvcc -o t875 t875.cu
$ ./t875
GPU: CPU:
0,0:81 81
0,1:73 73
0,2:99 99
0,3:153 153
0,4:145 145
0,5:169 169
0,6:219 219
0,7:0 0
0,8:25 25
1,0:73 73
1,1:169 169
1,2:146 146
1,3:193 193
1,4:212 212
1,5:313 313
1,6:280 280
1,7:0 0
1,8:49 49
2,0:99 99
2,1:146 146
2,2:300 300
2,3:234 234
2,4:289 289
2,5:334 334
2,6:390 390
2,7:0 0
2,8:50 50
3,0:153 153
3,1:193 193
3,2:234 234
3,3:441 441
3,4:370 370
3,5:433 433
3,6:480 480
3,7:0 0
3,8:73 73
4,0:145 145
4,1:212 212
4,2:289 289
4,3:370 370
4,4:637 637
4,5:476 476
4,6:547 547
4,7:0 0
4,8:72 72
5,0:169 169
5,1:313 313
5,2:334 334
5,3:433 433
5,4:476 476
5,5:841 841
5,6:604 604
5,7:0 0
5,8:97 97
6,0:219 219
6,1:280 280
6,2:390 390
6,3:480 480
6,4:547 547
6,5:604 604
6,6:1050 1050
6,7:0 0
6,8:94 94
7,0:0 0
7,1:0 0
7,2:0 0
7,3:0 0
7,4:0 0
7,5:0 0
7,6:0 0
7,7:0 0
7,8:0 0
8,0:25 25
8,1:49 49
8,2:50 50
8,3:73 73
8,4:72 72
8,5:97 97
8,6:94 94
8,7:0 0
8,8:25 25
$
Notes:
If you profile the above code with e.g. nvprof --print-gpu-trace ./t875, you will witness two kernel calls. The first is associated with the device_vector creation. The second kernel call handles the entire reduce_by_key operation.
I don't know if all this is slower or faster than your CUDA kernel, since you haven't provided it. Sometimes, expertly written CUDA kernels can be faster than thrust algorithms doing the same operation.
It's quite possible that what I have here is not precisely the algorithm you had in mind. For example, your code suggests you're only filling in a triangular portion of ATA. But your description (9*9*2) suggests you want to populate every position in ATA. Nevertheless, my intent is not to give you a black box but to demonstrate how you can use various thrust approaches to achieve whatever it is you want in a single kernel call.
My running config:
- CUDA Toolkit 5.5
- NVidia Nsight Eclipse edition
- Ubuntu 12.04 x64
- CUDA device is an NVIDIA GeForce GTX 560: compute capability 2.1 (built with cc=20, sm=21), so I can use blocks of up to 1024 threads
I render my display on iGPU (Intel HD Graphics), so I can use Nsight debugger.
However, I encountered some weird behaviour when I set the number of threads > 960.
Code:
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;
    // Here I run my kernel
    mytest<<<1, 961>>>();
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    printf("Done\n");
    return 0;
}
And... it doesn't work. The problem is in the last line of the kernel, the float division: every time I divide by a float, the code compiles but fails at runtime. The error output is:
error=too many resources requested for launch
Here's what I get in debug, when I step it over:
warning: Cuda API error detected: cudaLaunch returned (0x7)
Build output using -Xptxas -v:
12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -x cu -o "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
ptxas info : 4 bytes gmem, 8 bytes cmem[14]
ptxas info : Function properties for _ZN4dim3C1Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info : Function properties for _Z6mytestv
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info : Function properties for _ZN4dim3C2Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu
Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o "block_size_test" ./src/vectorAdd.o
Finished building target: block_size_test
12:57:41 Build Finished (took 1s.659ms)
When I add the -keep flag, the compiler generates a .cubin file, but I can't read it to find out the smem and reg values, even following this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-. Apparently this file has a different format nowadays.
Therefore I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.
Anyway. Any help will be appreciated.
I filled in the CUDA Occupancy Calculator spreadsheet with the current information:
Compute capability : 2.1
Threads per block : 961
Registers per thread : 34
Shared memory : 0
I got 0% occupancy, limited by the register count.
If you set the number of threads to 960, you get 63% occupancy, which explains why it works.
Try limiting the register count to 32 and setting the number of threads to 1024 to get 67% occupancy.
To limit the register count, use the following option:
nvcc [...] --maxrregcount=32
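Another way (not mentioned above) to cap registers per kernel rather than per compilation unit is the __launch_bounds__ qualifier; a minimal sketch applied to the kernel from the question:
__global__ void __launch_bounds__(1024) mytest() {
    // __launch_bounds__(1024) promises the kernel is never launched with more
    // than 1024 threads per block, which lets ptxas cap register usage so that
    // a 1024-thread launch stays within the per-block register limit
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
    (void)a;  // silences the "set but never used" warning
}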
I'm trying to figure out why cudaMemcpyToSymbol is not working for me. (But cudaMemcpy does.)
// symbols:
__constant__ float flt[480]; // 1920 bytes
__constant__ int ints[160]; // 640 bytes
// func code follows:
float* pFlts;
cudaMalloc((void**)&pFlts, 1920+640); // chunk of gpu mem (floats & ints)
// This does NOT work properly:
cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice); // first copy
cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice); // second copy
The second copy is trashing the contents of the first copy (flt), and the second copy does not happen. (If I remove the second copy, the first copy works fine.)
Results:
GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320) TotThrds=1 ** Source of 1st copy
0x500500500: float[320]= 1.000
0x500500504: float[321]= 0.866
0x500500508: float[322]= 0.500
0x50050050c: float[323]= -0.000
0x500500510: float[324]= -0.500
0x500500514: float[325]= -0.866
0x500500518: float[326]= -1.000
0x50050051c: float[327]= -0.866
0x500500520: float[328]= -0.500
0x500500524: float[329]= 0.000
0x500500528: float[330]= 0.500
0x50050052c: float[331]= 0.866
0x500500530: float[332]= 1.000
GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320) TotThrds=1 ** Dest of 1st copy
0x500100f98: float[320]= 0.000
0x500100f9c: float[321]= 0.500
0x500100fa0: float[322]= 0.866
0x500100fa4: float[323]= 1.000
0x500100fa8: float[324]= 0.866
0x500100fac: float[325]= 0.500
0x500100fb0: float[326]= -0.000
0x500100fb4: float[327]= -0.500
0x500100fb8: float[328]= -0.866
0x500100fbc: float[329]= -1.000
0x500100fc0: float[330]= -0.866
0x500100fc4: float[331]= -0.500
0x500100fc8: float[332]= 0.000
GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0) TotThrds=1 ** Source of 2nd copy
0x500500780: int[0]= 1
0x500500784: int[1]= 1
0x500500788: int[2]= 1
0x50050078c: int[3]= 1
0x500500790: int[4]= 1
0x500500794: int[5]= 1
0x500500798: int[6]= 1
0x50050079c: int[7]= 1
0x5005007a0: int[8]= 1
0x5005007a4: int[9]= 1
0x5005007a8: int[10]= 1
0x5005007ac: int[11]= 1
0x5005007b0: int[12]= 0
GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0) TotThrds=1 ** Dest of 2nd copy
0x500100818: int[0]= 0
0x50010081c: int[1]= 0
0x500100820: int[2]= 0
0x500100824: int[3]= 0
0x500100828: int[4]= 0
0x50010082c: int[5]= 0
0x500100830: int[6]= 0
0x500100834: int[7]= 0
0x500100838: int[8]= 0
0x50010083c: int[9]= 0
0x500100840: int[10]= 0
0x500100844: int[11]= 0
0x500100848: int[12]= 0
The following works properly:
cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice); // first copy
int* pTemp;
cudaGetSymbolAddress((void**) &pTemp, ints);
cudaMemcpy(pTemp,pFlts+480,640,cudaMemcpyDeviceToDevice); // second copy, via the symbol's device address
Results:
GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320) TotThrds=1 ** Source of first copy
0x500500500: float[320]= 1.000
0x500500504: float[321]= 0.866
0x500500508: float[322]= 0.500
0x50050050c: float[323]= -0.000
0x500500510: float[324]= -0.500
0x500500514: float[325]= -0.866
0x500500518: float[326]= -1.000
0x50050051c: float[327]= -0.866
0x500500520: float[328]= -0.500
0x500500524: float[329]= 0.000
0x500500528: float[330]= 0.500
0x50050052c: float[331]= 0.866
0x500500530: float[332]= 1.000
GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320) TotThrds=1 ** Dest of first copy
0x500100f98: float[320]= 1.000
0x500100f9c: float[321]= 0.866
0x500100fa0: float[322]= 0.500
0x500100fa4: float[323]= -0.000
0x500100fa8: float[324]= -0.500
0x500100fac: float[325]= -0.866
0x500100fb0: float[326]= -1.000
0x500100fb4: float[327]= -0.866
0x500100fb8: float[328]= -0.500
0x500100fbc: float[329]= 0.000
0x500100fc0: float[330]= 0.500
0x500100fc4: float[331]= 0.866
0x500100fc8: float[332]= 1.000
GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0) TotThrds=1 ** Source of 2nd copy
0x500500780: int[0]= 1
0x500500784: int[1]= 1
0x500500788: int[2]= 1
0x50050078c: int[3]= 1
0x500500790: int[4]= 1
0x500500794: int[5]= 1
0x500500798: int[6]= 1
0x50050079c: int[7]= 1
0x5005007a0: int[8]= 1
0x5005007a4: int[9]= 1
0x5005007a8: int[10]= 1
0x5005007ac: int[11]= 1
0x5005007b0: int[12]= 0
GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0) TotThrds=1 ** Destination of 2nd copy
0x500100818: int[0]= 1
0x50010081c: int[1]= 1
0x500100820: int[2]= 1
0x500100824: int[3]= 1
0x500100828: int[4]= 1
0x50010082c: int[5]= 1
0x500100830: int[6]= 1
0x500100834: int[7]= 1
0x500100838: int[8]= 1
0x50010083c: int[9]= 1
0x500100840: int[10]= 1
0x500100844: int[11]= 1
0x500100848: int[12]= 0
When I look at the bad case, it appears as though something has happened to the symbol table: the data at the destination of the first copy looks very familiar, not as though it has been overwritten, but as though it has been shifted, as if the pointer is wrong.
The second copy looks broken to me. You have defined this array:
__constant__ int ints[160]; // 640 bytes
which as correctly noted is 640 bytes long.
Your second copy is like this:
cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice); // second copy
Which says, "copy a total of 640 bytes, from pFlts array to ints array, with the storage location in the ints array beginning at 1920 bytes from the start of the array."
This won't work. The ints array is only 640 bytes long. You can't pick as your destination a location that is 1920 bytes into it.
From the documentation for cudaMemcpyToSymbol :
offset- Offset from start of symbol in bytes
In this case the symbol is ints
Probably what you want is:
cudaMemcpyToSymbol(ints,pFlts+480,640,0,cudaMemcpyDeviceToDevice); // second copy
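Putting it together, a sketch of both copies with the corrected offsets (assuming the same layout as in the question: 480 floats followed by 160 ints in one device allocation):
float* pFlts;
cudaMalloc((void**)&pFlts, 1920+640);                                    // 480 floats followed by 160 ints
// ... fill pFlts on the device ...
cudaMemcpyToSymbol(flt,  pFlts,      1920, 0, cudaMemcpyDeviceToDevice); // first copy
cudaMemcpyToSymbol(ints, pFlts+480,   640, 0, cudaMemcpyDeviceToDevice); // second copy, offset 0 into ints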
EDIT:
In response to the questions in the comments about error checking, I crafted this simple test program:
#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__constant__ int ints[160];

int main(){
    int *d_ints;
    cudaError_t mystatus;
    cudaMalloc((void **)&d_ints, sizeof(int)*160);
    cudaCheckErrors("cudamalloc fail");
    mystatus = cudaMemcpyToSymbol(ints, d_ints, 160*sizeof(int), 1920, cudaMemcpyDeviceToDevice);
    if (mystatus != cudaSuccess) printf("returned value was not cudaSuccess\n");
    cudaCheckErrors("cudamemcpytosymbol fail");
    printf("OK!\n");
    return 0;
}
When I compile and run this, I get the following output:
returned value was not cudaSuccess
Fatal error: cudamemcpytosymbol fail (invalid argument at t94.cu:26)
*** FAILED - ABORTING
This indicates that both the error return value from the cudaMemcpyToSymbol function call and the cudaGetLastError() method return an error in this case. If I change the 1920 parameter to zero in this test case, the error goes away.
I am curious to understand divide-by-zero exception handling in Linux. When a divide-by-zero operation is performed, a trap is generated, i.e. INT 0 is raised on the processor, and ultimately a SIGFPE signal is sent to the process that performed the operation.
As far as I can see, the divide-by-zero exception handler is registered in the trap_init() function as
set_trap_gate(0, &divide_error);
I want to know in detail what happens between the trap being generated and the SIGFPE being sent to the process.
The trap handler is registered in the trap_init function in arch/x86/kernel/traps.c:
void __init trap_init(void)
{
    ...
    set_intr_gate(X86_TRAP_DE, &divide_error);
    ...
}
set_intr_gate writes the address of the handler function into idt_table (declared in x86/include/asm/desc.h).
How is the divide_error function defined? As a macro in traps.c
DO_ERROR_INFO(X86_TRAP_DE, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip)
And the macro DO_ERROR_INFO is defined a bit above in the same traps.c:
193 #define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
194 dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
195 { \
196 siginfo_t info; \
197 enum ctx_state prev_state; \
198 \
199 info.si_signo = signr; \
200 info.si_errno = 0; \
201 info.si_code = sicode; \
202 info.si_addr = (void __user *)siaddr; \
203 prev_state = exception_enter(); \
204 if (notify_die(DIE_TRAP, str, regs, error_code, \
205 trapnr, signr) == NOTIFY_STOP) { \
206 exception_exit(prev_state); \
207 return; \
208 } \
209 conditional_sti(regs); \
210 do_trap(trapnr, signr, str, regs, error_code, &info); \
211 exception_exit(prev_state); \
212 }
(Actually it defines the do_divide_error function, which is called by a small asm-coded stub "entry point" with prepared arguments. The stub is defined in entry_32.S as ENTRY(divide_error) and in entry_64.S via the zeroentry macro: 1303 zeroentry divide_error do_divide_error.)
So, when a user program divides by zero (and the operation reaches retirement in the out-of-order pipeline), the hardware generates a trap and sets %eip to the divide_error stub; the stub sets up the frame and calls the C function do_divide_error. The function do_divide_error creates the siginfo_t struct describing the error (signo=SIGFPE, addr = address of the failed instruction, etc.), then tries to inform all notifiers registered with register_die_notifier (this is actually a hook, sometimes used by the in-kernel debugger "kgdb"; kprobes' kprobe_exceptions_notify - only for int3 or gpf; uprobes' arch_uprobe_exception_notify - again only for int3, etc.).
Because DIE_TRAP is usually not blocked by any notifier, the do_trap function is then called. Here is the (shortened) code of do_trap:
139 static void __kprobes
140 do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
141 long error_code, siginfo_t *info)
142 {
143 struct task_struct *tsk = current;
...
157 tsk->thread.error_code = error_code;
158 tsk->thread.trap_nr = trapnr;
170
171 if (info)
172 force_sig_info(signr, info, tsk);
...
175 }
do_trap sends a signal to the current process with force_sig_info, which will "force a signal that the process can't ignore". If there is an active debugger for the process (our current process is being ptrace-d by gdb or strace), then send_signal will turn the SIGFPE sent by do_trap into a SIGTRAP delivered to the debugger. If there is no debugger, the SIGFPE should kill our process while saving a core file, because that is the default action for SIGFPE (check man 7 signal, section "Standard signals", and look for SIGFPE in the table).
The process can't set SIGFPE to be ignored (I'm not sure here), but it can define its own signal handler to handle the signal (example of handling SIGFPE, and another). This handler may just print %eip from the siginfo, run backtrace() and die; or it may even try to recover the situation and return to the failed instruction. This can be useful, for example, in JITs like qemu, java, or valgrind, or in high-level languages like Java or GHC Haskell, which can turn a SIGFPE into a language exception so that programs in those languages can handle it (for example, the relevant spaghetti in OpenJDK is in hotspot/src/os/linux/vm/os_linux.cpp).
There is a list of SIGFPE handlers in Debian, via code search for sigaction SIGFPE or for signal SIGFPE.
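As a minimal illustration of the user-space side of this (a hypothetical standalone example, not taken from any of the projects mentioned above), a process can install its own SIGFPE handler with sigaction and inspect the siginfo it receives:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void fpe_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    // si_code distinguishes the cause (e.g. FPE_INTDIV for integer divide by zero),
    // si_addr holds the address of the faulting instruction
    fprintf(stderr, "caught SIGFPE: si_code=%d si_addr=%p\n", info->si_code, info->si_addr);
    _exit(1);   // returning here would normally re-execute the faulting instruction
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGFPE, &sa, nullptr);

    volatile int zero = 0;
    return 10 / zero;       // integer divide by zero -> #DE trap -> SIGFPE
}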