Removing runtime errors with the -G flag [duplicate]

I'm trying to figure out why cudaMemcpyToSymbol is not working for me. (But cudaMemcpy does.)
// symbols:
__constant__ float flt[480]; // 1920 bytes
__constant__ int ints[160]; // 640 bytes
// func code follows:
float* pFlts;
cudaMalloc((void**)&pFlts, 1920+640); // chunk of gpu mem (floats & ints)
// This does NOT work properly:
cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice); // first copy
cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice); // second copy
The second copy is trashing the contents of the first copy (flt), and the second copy does not happen. (If I remove the second copy, the first copy works fine.)
Results:
GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320) TotThrds=1 ** Source of 1st copy
0x500500500: float[320]= 1.000
0x500500504: float[321]= 0.866
0x500500508: float[322]= 0.500
0x50050050c: float[323]= -0.000
0x500500510: float[324]= -0.500
0x500500514: float[325]= -0.866
0x500500518: float[326]= -1.000
0x50050051c: float[327]= -0.866
0x500500520: float[328]= -0.500
0x500500524: float[329]= 0.000
0x500500528: float[330]= 0.500
0x50050052c: float[331]= 0.866
0x500500530: float[332]= 1.000
GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320) TotThrds=1 ** Dest of 1st copy
0x500100f98: float[320]= 0.000
0x500100f9c: float[321]= 0.500
0x500100fa0: float[322]= 0.866
0x500100fa4: float[323]= 1.000
0x500100fa8: float[324]= 0.866
0x500100fac: float[325]= 0.500
0x500100fb0: float[326]= -0.000
0x500100fb4: float[327]= -0.500
0x500100fb8: float[328]= -0.866
0x500100fbc: float[329]= -1.000
0x500100fc0: float[330]= -0.866
0x500100fc4: float[331]= -0.500
0x500100fc8: float[332]= 0.000
GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0) TotThrds=1 ** Source of 2nd copy
0x500500780: int[0]= 1
0x500500784: int[1]= 1
0x500500788: int[2]= 1
0x50050078c: int[3]= 1
0x500500790: int[4]= 1
0x500500794: int[5]= 1
0x500500798: int[6]= 1
0x50050079c: int[7]= 1
0x5005007a0: int[8]= 1
0x5005007a4: int[9]= 1
0x5005007a8: int[10]= 1
0x5005007ac: int[11]= 1
0x5005007b0: int[12]= 0
GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0) TotThrds=1 ** Dest of 2nd copy
0x500100818: int[0]= 0
0x50010081c: int[1]= 0
0x500100820: int[2]= 0
0x500100824: int[3]= 0
0x500100828: int[4]= 0
0x50010082c: int[5]= 0
0x500100830: int[6]= 0
0x500100834: int[7]= 0
0x500100838: int[8]= 0
0x50010083c: int[9]= 0
0x500100840: int[10]= 0
0x500100844: int[11]= 0
0x500100848: int[12]= 0
The following works properly:
cudaMemcpyToSymbol(flt,pFlts,1920,0,cudaMemcpyDeviceToDevice); // first copy
int* pTemp;
cudaGetSymbolAddress((void**) &pTemp, ints);
cudaMemcpy(pTemp,pFlts+480,640,cudaMemcpyDeviceToDevice); // second copy
Results:
GpuDumpFloatMemory<<<1,1>>>(0x500500000, 13, 320) TotThrds=1 ** Source of first copy
0x500500500: float[320]= 1.000
0x500500504: float[321]= 0.866
0x500500508: float[322]= 0.500
0x50050050c: float[323]= -0.000
0x500500510: float[324]= -0.500
0x500500514: float[325]= -0.866
0x500500518: float[326]= -1.000
0x50050051c: float[327]= -0.866
0x500500520: float[328]= -0.500
0x500500524: float[329]= 0.000
0x500500528: float[330]= 0.500
0x50050052c: float[331]= 0.866
0x500500530: float[332]= 1.000
GpuDumpFloatMemory<<<1,1>>>(0x500100a98, 13, 320) TotThrds=1 ** Dest of first copy
0x500100f98: float[320]= 1.000
0x500100f9c: float[321]= 0.866
0x500100fa0: float[322]= 0.500
0x500100fa4: float[323]= -0.000
0x500100fa8: float[324]= -0.500
0x500100fac: float[325]= -0.866
0x500100fb0: float[326]= -1.000
0x500100fb4: float[327]= -0.866
0x500100fb8: float[328]= -0.500
0x500100fbc: float[329]= 0.000
0x500100fc0: float[330]= 0.500
0x500100fc4: float[331]= 0.866
0x500100fc8: float[332]= 1.000
GpuDumpIntMemory<<<1,1>>>(0x500500780, 13, 0) TotThrds=1 ** Source of 2nd copy
0x500500780: int[0]= 1
0x500500784: int[1]= 1
0x500500788: int[2]= 1
0x50050078c: int[3]= 1
0x500500790: int[4]= 1
0x500500794: int[5]= 1
0x500500798: int[6]= 1
0x50050079c: int[7]= 1
0x5005007a0: int[8]= 1
0x5005007a4: int[9]= 1
0x5005007a8: int[10]= 1
0x5005007ac: int[11]= 1
0x5005007b0: int[12]= 0
GpuDumpIntMemory<<<1,1>>>(0x500100818, 13, 0) TotThrds=1 ** Destination of 2nd copy
0x500100818: int[0]= 1
0x50010081c: int[1]= 1
0x500100820: int[2]= 1
0x500100824: int[3]= 1
0x500100828: int[4]= 1
0x50010082c: int[5]= 1
0x500100830: int[6]= 1
0x500100834: int[7]= 1
0x500100838: int[8]= 1
0x50010083c: int[9]= 1
0x500100840: int[10]= 1
0x500100844: int[11]= 1
0x500100848: int[12]= 0
When I look at the bad case, it appears as though something has happened to the symbol table. The data at the first copy's destination looks familiar: not overwritten, just shifted, as if the pointer were wrong.

The second copy looks broken to me. You have defined this array:
__constant__ int ints[160]; // 640 bytes
which as correctly noted is 640 bytes long.
Your second copy is like this:
cudaMemcpyToSymbol(ints,pFlts,640,1920,cudaMemcpyDeviceToDevice); // second copy
Which says: "copy a total of 640 bytes from the pFlts array to the ints array, with the storage location in the ints array beginning 1920 bytes from the start of the array."
This won't work. The ints array is only 640 bytes long. You can't pick as your destination a location that is 1920 bytes into it.
From the documentation for cudaMemcpyToSymbol:
offset - Offset from start of symbol in bytes
In this case the symbol is ints.
Probably what you want is:
cudaMemcpyToSymbol(ints,pFlts+480,640,0,cudaMemcpyDeviceToDevice); // second copy
EDIT:
In response to the questions in the comments about error checking, I crafted this simple test program:
#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__constant__ int ints[160];

int main(){
    int *d_ints;
    cudaError_t mystatus;
    cudaMalloc((void **)&d_ints, sizeof(int)*160);
    cudaCheckErrors("cudamalloc fail");
    mystatus = cudaMemcpyToSymbol(ints, d_ints, 160*sizeof(int), 1920, cudaMemcpyDeviceToDevice);
    if (mystatus != cudaSuccess) printf("returned value was not cudaSuccess\n");
    cudaCheckErrors("cudamemcpytosymbol fail");
    printf("OK!\n");
    return 0;
}
When I compile and run this, I get the following output:
returned value was not cudaSuccess
Fatal error: cudamemcpytosymbol fail (invalid argument at t94.cu:26)
*** FAILED - ABORTING
This indicates that both the error return value from the cudaMemcpyToSymbol function call and the cudaGetLastError() method return an error in this case. If I change the 1920 parameter to zero in this test case, the error goes away.
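Putting the fix together with the same error checking, a minimal sketch of the corrected copy sequence (reusing the cudaCheckErrors macro above) might look like:
// offsets are relative to each destination symbol, so both are 0 here;
// the source offset is applied to the source pointer instead
cudaMemcpyToSymbol(flt, pFlts, 1920, 0, cudaMemcpyDeviceToDevice);       // 480 floats
cudaCheckErrors("copy to flt failed");
cudaMemcpyToSymbol(ints, pFlts + 480, 640, 0, cudaMemcpyDeviceToDevice); // pFlts+480 floats == 1920 bytes in
cudaCheckErrors("copy to ints failed");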

Why linking and calling a C function with variadic arguments in Rust causes random functions to be called? [duplicate]

I was playing with Rust FFI and I was trying to link the printf C function, which takes variadic arguments, into my Rust executable. After I ran the executable I witnessed some strange behavior.
This is my Rust code:
use cty::{c_char, c_int};

extern "C" {
    fn printf(format: *const c_char, ...) -> c_int;
}

fn main() {
    unsafe {
        printf("One number: %d\n".as_ptr() as *const i8, 69);
        printf("Two numbers: %d %d\n".as_ptr() as *const i8, 11, 12);
        printf(
            "Three numbers: %d %d %d\n".as_ptr() as *const i8,
            30,
            31,
            32,
        );
    }
}
This is the output after running cargo run:
cargo run
Compiling c-bindings v0.1.0 (/home/costin/Rust/c-bindings)
Finished dev [unoptimized + debuginfo] target(s) in 0.20s
Running `target/debug/c-bindings`
One number: 1
Two numbers: 1 11376
Three numbers: 0 0 0
Two numbers: 11 12
Three numbers: 0 0 56
Three numbers: 30 31 32
It looks like the first printf calls the second and the third with random parameters, and the second printf then calls the third with random parameters, because the expected output should have been:
One number: 1
Two numbers: 11 12
Three numbers: 30 31 32
Can anyone explain to me why this strange behavior is happening?
Rust strings are not null-terminated the way printf expects. Since the format string literals are stored back-to-back in the binary, printf reads past the end of one string into the next, which is why each call appears to invoke the later format strings with garbage arguments. You'll need to either manually include \0 at the end of your format strings:
printf("One number: %d\n\0".as_ptr() as *const i8, 69);
or use CString, which appends the terminating null byte for you:
use std::ffi::CString;
let s = CString::new("One number: %d\n").unwrap();
printf(s.as_ptr(), 69);
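One caveat about the CString variant (my note, not part of the original answer): keep the CString alive in a binding as shown; calling CString::new("...").unwrap().as_ptr() in a single expression would drop the temporary immediately and leave the pointer dangling.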

CUDA invalid resource handle error when allocating gpu memory buffer

I am faced with a CUDA invalid resource handle error when allocating a buffer on the GPU.
1. I downloaded the code with git clone https://github.com/Funatiq/gossip.git.
2. I built the project in the gossip directory with git submodule update --init && make. This produced the compiled binary execute.
3. Then I generated scatter and gather plans for my main GPU (here, GPU 0):
$ python3 scripts/plan_from_topology_asynch.py gather 0
$ python3 scripts/plan_from_topology_asynch.py scatter 0
This generates scatter_plan.json and gather_plan.json.
4. Finally, I executed the plan:
./execute scatter_gather scatter_plan.json gather_plan.json
The error was pointing to these lines of code:
std::vector<size_t> bufs_lens_scatter = scatter.calcBufferLengths(table[main_gpu]);
print_buffer_sizes(bufs_lens_scatter);
std::vector<data_t *> bufs(num_gpus);
std::vector<size_t> bufs_lens(bufs_lens_scatter);
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
    cudaSetDevice(context.get_device_id(gpu)); CUERR
    cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The detailed error is shown as:
RUN: scatter_gather
INFO: 32768 bytes (scatter_gather)
TIMING: 0.463872 ms (malloc_devices)
TIMING: 0.232448 ms (zero_gpu_buffers)
TIMING: 0.082944 ms (init_data)
TIMING: 0.637952 ms (multisplit)
Partition Table:
470 489 534 553 514 515 538 483
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Required buffer sizes:
0 538 717 604 0 344 0 687
TIMING: 3.94455e-31 ms (malloc_buffers)
CUDA error: invalid resource handle : executor.cuh, line 405
For reference, I attached the complete error report here. The curious part is that the author cannot reproduce this error on his server, but when I ran it on a DGX workstation with 8 GPUs, the error occurred. I am not sure whether this is a CUDA programming error or an environment-specific issue.
The code has a defect in it, in the handling of cudaEventRecord() as used in the TIMERSTART and TIMERSTOP macros defined here and used here (with the malloc_buffers label).
CUDA events have a device association, implicitly defined when they are created. That means they are associated with the device selected by the most recent cudaSetDevice() call. As stated in the programming guide:
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
(note that each device has its own null stream - these events are being recorded into the null stream)
And if we run the code with cuda-memcheck, we observe that the invalid resource handle error is indeed being returned by a call to cudaEventRecord().
Specifically referring to the code here:
...
std::vector<size_t> bufs_lens(bufs_lens_scatter);

TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
    cudaSetDevice(context.get_device_id(gpu)); CUERR
    cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
TIMERSTOP(malloc_buffers)
The TIMERSTART macro defines and creates 2 cuda events, one of which it immediately records (the start event). The TIMERSTOP macro uses the timer stop event that was created in the TIMERSTART macro. However, we can see that the intervening code has likely changed the device from the one that was in effect when these two events were created (due to the cudaSetDevice call in the for-loop). Therefore, the cudaEventRecord (and cudaEventElapsedTime) calls are failing due to this invalid usage.
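The hazard boils down to a few lines; here is a minimal hypothetical two-device sketch (mine, not taken from the gossip code):
cudaEvent_t start, stop;
cudaSetDevice(0);
cudaEventCreate(&start);      // both events are now associated with device 0
cudaEventCreate(&stop);
cudaEventRecord(start, 0);    // OK: records into device 0's null stream
cudaSetDevice(1);             // intervening code switches devices
cudaEventRecord(stop, 0);     // invalid resource handle: the event belongs to
                              // device 0, but this records into device 1's null stream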
As a proof point, when I add cudaSetDevice calls to the macro definitions as follows:
#define TIMERSTART(label)                                          \
    cudaEvent_t timerstart##label, timerstop##label;               \
    float timerdelta##label;                                       \
    cudaSetDevice(0);                                              \
    cudaEventCreate(&timerstart##label);                           \
    cudaEventCreate(&timerstop##label);                            \
    cudaEventRecord(timerstart##label, 0);
#endif

#ifndef __CUDACC__
#define TIMERSTOP(label)                                           \
    stop##label = std::chrono::system_clock::now();                \
    std::chrono::duration<double>                                  \
        timerdelta##label = timerstop##label-timerstart##label;    \
    std::cout << "# elapsed time ("<< #label <<"): "               \
        << timerdelta##label.count() << "s" << std::endl;
#else
#define TIMERSTOP(label)                                           \
    cudaSetDevice(0);                                              \
    cudaEventRecord(timerstop##label, 0);                          \
    cudaEventSynchronize(timerstop##label);                        \
    cudaEventElapsedTime(                                          \
        &timerdelta##label,                                        \
        timerstart##label,                                         \
        timerstop##label);                                         \
    std::cout <<                                                   \
        "TIMING: " <<                                              \
        timerdelta##label << " ms (" <<                            \
        #label <<                                                  \
        ")" << std::endl;
#endif
The code runs without error for me. I'm not suggesting this is the correct fix. The correct fix may be to properly set the device before calling the macro. It seems evident that either the macro writer did not expect this kind of usage, or else was unaware of the hazard.
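For illustration, the "set the device before calling the macro" variant could look like this sketch (hypothetical: it assumes the device that was current when TIMERSTART ran is known to the caller, here taken to be device 0):
int timer_device = 0;                        // hypothetical: device current at TIMERSTART
cudaSetDevice(timer_device); CUERR
TIMERSTART(malloc_buffers)
for (gpu_id_t gpu = 0; gpu < num_gpus; ++gpu) {
    cudaSetDevice(context.get_device_id(gpu)); CUERR
    cudaMalloc(&bufs[gpu], sizeof(data_t)*bufs_lens[gpu]); CUERR
}
cudaSetDevice(timer_device); CUERR           // restore before the stop event is recorded
TIMERSTOP(malloc_buffers)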
The only situation I could imagine where the error would not occur would be in a single-device system. When the code maintainer responded to your issue that they could not reproduce the issue, my guess is they have not tested the code on a multi-device system. As near as I can tell, the error would be unavoidable in a multi-device setup.

Why nvprof does not have metrics on floating point division operations?

Using nvprof to measure the floating point operations of my sample kernels, it seems that there is no flop_count_dp_div metric, and the actual double-precision division operations are measured in terms of double-precision add/mul/fma and even some single-precision fma operations.
I am wondering why this is the case, and how to deduce the dynamic number of division operations of a kernel from an nvprof report if I don't have the source code?
My simple test kernel:
#include <iostream>

__global__ void mul(double a, double* x, double* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
}

__global__ void div(double a, double* x, double* y) {
    y[threadIdx.x] = a / x[threadIdx.x];
}

int main(int argc, char* argv[]) {
    const int kDataLen = 4;
    double a = 2.0f;
    double host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    double host_y[kDataLen];

    // Copy input data to device.
    double* device_x;
    double* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(double));
    cudaMalloc(&device_y, kDataLen * sizeof(double));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(double),
               cudaMemcpyHostToDevice);

    // Launch the kernel.
    mul<<<1, kDataLen>>>(a, device_x, device_y);
    div<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(double),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
        std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
}
And nvprof output of the two kernels:
nvprof --metrics flop_count_sp \
--metrics flop_count_sp_add \
--metrics flop_count_sp_mul \
--metrics flop_count_sp_fma \
--metrics flop_count_sp_special \
--metrics flop_count_dp \
--metrics flop_count_dp_add \
--metrics flop_count_dp_mul \
--metrics flop_count_dp_fma \
./a.out
==14380== NVPROF is profiling process 14380, command: ./a.out
==14380== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "mul(double, double*, double*)" (done)
Replaying kernel "div(double, double*, double*)" (done)
y[0] = 2
y[1] = 1
y[2] = 0.666667
y[3] = 0.5
==14380== Profiling application: ./a.out
==14380== Profiling result:
==14380== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GeForce GTX 1080 Ti (0)"
Kernel: mul(double, double*, double*)
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 flop_count_dp Floating Point Operations(Double Precision) 4 4 4
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 4 4 4
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
Kernel: div(double, double*, double*)
1 flop_count_sp Floating Point Operations(Single Precision) 8 8 8
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 4 4 4
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 4 4 4
1 flop_count_dp Floating Point Operations(Double Precision) 44 44 44
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 4 4 4
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 20 20 20
it seems that there is no flop_count_dp_div metric
Because there are no floating point division instructions in CUDA hardware.
and the actual double-precision division operations are measured in terms of double-precision add/mul/fma and even some single-precision fma operations.
Because floating point division is implemented with a Newton-Raphson iterative method built from multiply-add and multiply operations, possibly even in mixed precision (hence the single-precision operations).
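As a sanity check against the metric table above (my arithmetic, not part of the original answer): the div kernel ran 4 threads, and flop_count_dp = 44 = 4 DP muls + 2 x 20 DP FMAs, since nvprof counts each FMA as two operations. That works out to roughly 1 DP mul and 5 DP FMAs per division, plus 1 SP FMA and 1 SP special operation per thread; the special operation is presumably the initial reciprocal approximation that the Newton-Raphson iterations then refine.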
how to deduce the dynamic number of division operations of a kernel from an nvprof report if I don't have the source code?
You really can't.

Can Thrust transform_reduce work with 2 arrays?

I found that what Thrust can provide is quite limited, as the code below shows:
I end up with 9*9*2 (1 multiply + 1 reduce) Thrust calls, which is 162 kernel launches.
Whereas if I write my own kernel, only 1 kernel launch is needed.
for(i=1;i<=9;i++)
{
    for(j=i;j<=9;j++)
    {
        ATA[i][j]=0;
        for(m=1;m<=50000;m++)
            ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
    }
}
Then I end up with the Thrust implementation below:
for(i=1;i<=dim0;i++)
{
    for(j=i;j<=dim0;j++)
    {
        thrust::transform(t_d_X+(idx0[i]-1)*(1+iNumPaths)+1, t_d_X+(idx0[i]-1)*(1+iNumPaths)+iNumPaths+1, t_d_X+(idx0[j]-1)*(1+iNumPaths)+1, t_d_cdataMulti, thrust::multiplies<double>());
        ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti+iNumPaths, (double) 0, thrust::plus<double>());
    }
}
Some analysis:
transform_reduce: will NOT help, as there is a pointer indirection via idx0[i], and basically there are 2 arrays involved: the 1st is X[idx0[i]], the 2nd is X[idx0[j]].
reduce_by_key: will help. But I need to store all interim results into one big array, and prepare a huge mapping key table of the same size. Will try it out.
transform_iterator: will NOT help, same reason as 1.
So I think I can't avoid writing my own kernel?
I'll bet @m.s. can provide a more efficient approach. But here is one possible approach. In order to get the entire computation reduced to a single kernel call by thrust, it is necessary to handle everything with a single thrust algorithm call. At the heart of the operation, we are summing many computations together, to fill a matrix. Therefore I believe thrust::reduce_by_key is an appropriate thrust algorithm to use. This means we must realize all other transformations using various thrust "fancy iterators", which are mostly covered in the thrust getting started guide.
Attempting to do this (handle everything with a single kernel call) makes the code very dense and hard to read. I don't normally like to demonstrate thrust this way, but since it is the crux of your question, it cannot be avoided. Therefore let's unpack the sequence of operations contained in the call to reduce_by_key, approximately from the inside out. The general basis of this algorithm is to "flatten" all data into a single long logical vector. Let's assume for understanding that our square matrix dimensions are only 2x2 and the length of our m vector is 3. You can think of the "flattening" or linear-index conversion like this:
linear index: 0 1 2 3 4 5 6 7 8 9 10 11
i index: 0 0 0 0 0 0 1 1 1 1 1 1
j index: 0 0 0 1 1 1 0 0 0 1 1 1
m index: 0 1 2 0 1 2 0 1 2 0 1 2
k index: 0 0 0 1 1 1 2 2 2 3 3 3
The "k index" above is our keys that will ultimately be used by reduce_by_key to collect product terms together, for each element of the matrix. Note that the code has EXT_I, EXT_J, EXT_M, and EXT_K helper macros which will define, using thrust placeholders, the operation to be performed on the linear index (created using a counting_iterator) to produce the various other "indices".
The first thing we will need to do is construct a suitable thrust operation to convert the linear index into the transformed value of idx0[i] (again, working from "inward to outward"). We can do this with a permutation iterator on idx0 vector, with a transform_iterator supplying the "map" for the permutation iterator - this transform iterator just converts the linear index (mb) to an "i" index:
thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I))
Now we need to combine the result from step 1 with the other index - m in this case - to generate a linearized version of the 2D index into X (d_X is the vector-linearized version of X). To do this, we will combine the result of step 1 in a zip_iterator with another transform iterator that creates the m index. This zip_iterator will be passed to a transform_iterator which takes the two indices and converts them into a linearized index to "look into" the d_X vector:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())
create_Xidx is the functor that takes the two computed indices and converts them into the linear index into d_X.
With the result from step 2, we can then use a permutation iterator to grab the appropriate value from d_X for the first term in the multiplication:
thrust::make_permutation_iterator(d_X.begin(), {code from step 2})
Repeat steps 1, 2 and 3, using EXT_J instead of EXT_I, to create the second term in the multiplication:
X[idx0[i]][m]*X[idx0[j]][m]
Place the terms created in steps 3 and 4 into a zip_iterator, for use by the transform_iterator that will multiply the two together (using the my_mult functor) to create the actual product:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple({result from step 3}, {result from step 4})), my_mult())
The remainder of the reduce_by_key is fairly straightforward. We create the keys index as described previously, and then use it to sum together the various products for each element of the square matrix.
Here is a fully worked example:
$ cat t875.cu
#include <iostream>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
// rows
#define D1 9
// cols
#define D2 9
// size of m
#define D3 50
// helpers to convert linear indices to i,j,m or "key" indices
#define EXT_I (_1/(D2*D3))
#define EXT_J ((_1/(D3))%D2)
#define EXT_M (_1%D3)
#define EXT_K (_1/D3)
void test_cpu(float ATA[][D2], float X[][D3], int idx0[]){
    for(int i=0;i<D1;i++)
    {
        for(int j=0;j<D2;j++)
        {
            ATA[i][j]=0;
            for(int m=0;m<D3;m++)
                ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
        }
    }
}

using namespace thrust::placeholders;

struct create_Xidx : public thrust::unary_function<thrust::tuple<int, int>, int>{
    __host__ __device__
    int operator()(thrust::tuple<int, int> &my_tuple){
        return (thrust::get<0>(my_tuple) * D3) + thrust::get<1>(my_tuple);
    }
};

struct my_mult : public thrust::unary_function<thrust::tuple<float, float>, float>{
    __host__ __device__
    float operator()(thrust::tuple<float, float> &my_tuple){
        return thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple);
    }
};

int main(){
    // synthesize data
    float ATA[D1][D2];
    float X[D1][D3];
    int idx0[D1];
    thrust::host_vector<float> h_X(D1*D3);
    thrust::host_vector<int> h_idx0(D1);
    for (int i = 0; i < D1; i++){
        idx0[i] = (i + 2)%D1; h_idx0[i] = idx0[i];
        for (int j = 0; j < D2; j++) {ATA[i][j] = 0;}
        for (int j = 0; j < D3; j++) {X[i][j] = j%(i+1); h_X[i*D3+j] = X[i][j];}}
    thrust::device_vector<float> d_ATA(D1*D2);
    thrust::device_vector<float> d_X = h_X;
    thrust::device_vector<int> d_idx0 = h_idx0;
    // helpers
    thrust::counting_iterator<int> mb = thrust::make_counting_iterator(0);
    thrust::counting_iterator<int> me = thrust::make_counting_iterator(D1*D2*D3);
    // perform computation
    thrust::reduce_by_key(
        thrust::make_transform_iterator(mb, EXT_K),
        thrust::make_transform_iterator(me, EXT_K),
        thrust::make_transform_iterator(
            thrust::make_zip_iterator(thrust::make_tuple(
                thrust::make_permutation_iterator(d_X.begin(),
                    thrust::make_transform_iterator(
                        thrust::make_zip_iterator(thrust::make_tuple(
                            thrust::make_permutation_iterator(d_idx0.begin(),
                                thrust::make_transform_iterator(mb, EXT_I)),
                            thrust::make_transform_iterator(mb, EXT_M))),
                        create_Xidx())),
                thrust::make_permutation_iterator(d_X.begin(),
                    thrust::make_transform_iterator(
                        thrust::make_zip_iterator(thrust::make_tuple(
                            thrust::make_permutation_iterator(d_idx0.begin(),
                                thrust::make_transform_iterator(mb, EXT_J)),
                            thrust::make_transform_iterator(mb, EXT_M))),
                        create_Xidx())))),
            my_mult()),
        thrust::make_discard_iterator(),
        d_ATA.begin());
    thrust::host_vector<float> h_ATA = d_ATA;
    test_cpu(ATA, X, idx0);
    std::cout << "GPU: CPU: " << std::endl;
    for (int i = 0; i < D1*D2; i++)
        std::cout << i/D1 << "," << i%D2 << ":" << h_ATA[i] << " " << ATA[i/D1][i%D2] << std::endl;
}
$ nvcc -o t875 t875.cu
$ ./t875
GPU: CPU:
0,0:81 81
0,1:73 73
0,2:99 99
0,3:153 153
0,4:145 145
0,5:169 169
0,6:219 219
0,7:0 0
0,8:25 25
1,0:73 73
1,1:169 169
1,2:146 146
1,3:193 193
1,4:212 212
1,5:313 313
1,6:280 280
1,7:0 0
1,8:49 49
2,0:99 99
2,1:146 146
2,2:300 300
2,3:234 234
2,4:289 289
2,5:334 334
2,6:390 390
2,7:0 0
2,8:50 50
3,0:153 153
3,1:193 193
3,2:234 234
3,3:441 441
3,4:370 370
3,5:433 433
3,6:480 480
3,7:0 0
3,8:73 73
4,0:145 145
4,1:212 212
4,2:289 289
4,3:370 370
4,4:637 637
4,5:476 476
4,6:547 547
4,7:0 0
4,8:72 72
5,0:169 169
5,1:313 313
5,2:334 334
5,3:433 433
5,4:476 476
5,5:841 841
5,6:604 604
5,7:0 0
5,8:97 97
6,0:219 219
6,1:280 280
6,2:390 390
6,3:480 480
6,4:547 547
6,5:604 604
6,6:1050 1050
6,7:0 0
6,8:94 94
7,0:0 0
7,1:0 0
7,2:0 0
7,3:0 0
7,4:0 0
7,5:0 0
7,6:0 0
7,7:0 0
7,8:0 0
8,0:25 25
8,1:49 49
8,2:50 50
8,3:73 73
8,4:72 72
8,5:97 97
8,6:94 94
8,7:0 0
8,8:25 25
$
Notes:
If you profile the above code with e.g. nvprof --print-gpu-trace ./t875, you will witness two kernel calls. The first is associated with the device_vector creation. The second kernel call handles the entire reduce_by_key operation.
I don't know if all this is slower or faster than your CUDA kernel, since you haven't provided it. Sometimes, expertly written CUDA kernels can be faster than thrust algorithms doing the same operation.
It's quite possible that what I have here is not precisely the algorithm you had in mind. For example, your code suggests you're only filling in a triangular portion of ATA. But your description (9*9*2) suggests you want to populate every position in ATA. Nevertheless, my intent is not to give you a black box but to demonstrate how you can use various thrust approaches to achieve whatever it is you want in a single kernel call.

divide by zero exception handling in Linux

I am curious to understand divide by zero exception handling in Linux. When a divide by zero operation is performed, a trap is generated, i.e. INT 0 is raised on the processor, and ultimately a SIGFPE signal is sent to the process that performed the operation.
As I see it, the divide by zero exception is registered in the trap_init() function as
set_trap_gate(0, &divide_error);
I want to know in detail what happens between INT 0 being generated and the SIGFPE being sent to the process.
The trap handler is registered in the trap_init function in arch/x86/kernel/traps.c:
void __init trap_init(void)
...
        set_intr_gate(X86_TRAP_DE, &divide_error);
set_intr_gate writes the address of the handler function into the idt_table (x86/include/asm/desc.h).
How is the divide_error function defined? As a macro in traps.c:
DO_ERROR_INFO(X86_TRAP_DE, SIGFPE, "divide error", divide_error, FPE_INTDIV,
              regs->ip)
And the macro DO_ERROR_INFO is defined a bit above in the same traps.c:
#define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr)        \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)    \
{                                                                      \
        siginfo_t info;                                                \
        enum ctx_state prev_state;                                     \
                                                                       \
        info.si_signo = signr;                                         \
        info.si_errno = 0;                                             \
        info.si_code = sicode;                                         \
        info.si_addr = (void __user *)siaddr;                          \
        prev_state = exception_enter();                                \
        if (notify_die(DIE_TRAP, str, regs, error_code,                \
                       trapnr, signr) == NOTIFY_STOP) {                \
                exception_exit(prev_state);                            \
                return;                                                \
        }                                                              \
        conditional_sti(regs);                                         \
        do_trap(trapnr, signr, str, regs, error_code, &info);          \
        exception_exit(prev_state);                                    \
}
(Actually it defines the do_divide_error function, which is called by a small asm-coded stub "entry point" with prepared arguments. The stub is declared in entry_32.S as ENTRY(divide_error) and in entry_64.S via the zeroentry macro: zeroentry divide_error do_divide_error.)
So, when a user divides by zero (and the operation reaches the retirement stage of the out-of-order pipeline), the hardware generates a trap and sets %eip to the divide_error stub, which sets up the frame and calls the C function do_divide_error. The function do_divide_error will create the siginfo_t struct describing the error (signo=SIGFPE, addr=address of the failed instruction, etc.), then it will try to inform all notifiers registered with register_die_notifier (actually a hook, sometimes used by the in-kernel debugger "kgdb"; kprobe's kprobe_exceptions_notify - only for int3 or gpf; uprobe's arch_uprobe_exception_notify - again only int3; etc.).
Because DIE_TRAP is usually not blocked by a notifier, the do_trap function will be called. do_trap itself is short:
static void __kprobes
do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
        long error_code, siginfo_t *info)
{
        struct task_struct *tsk = current;
        ...
        tsk->thread.error_code = error_code;
        tsk->thread.trap_nr = trapnr;
        ...
        if (info)
                force_sig_info(signr, info, tsk);
        ...
}
do_trap will send a signal to the current process with force_sig_info, which will "Force a signal that the process can't ignore". If there is an active debugger for the process (i.e. our current process is ptrace-d by gdb or strace), then send_signal will turn the SIGFPE sent to the current process by do_trap into a SIGTRAP delivered to the debugger. If there is no debugger, the SIGFPE should kill our process while saving a core file, because that is the default action for SIGFPE (check man 7 signal in the section "Standard signals"; search for SIGFPE in the table).
The process can't set SIGFPE to be ignored (I'm not sure here), but it can define its own signal handler to handle the signal (example of handling SIGFPE, another one). This handler may just print %eip from siginfo, run backtrace() and die; or it may even try to recover the situation and return to the failed instruction. This can be useful, for example, in JITs like qemu, java, or valgrind; or in high-level languages like Java or ghc, which can turn SIGFPE into a language exception, so that programs in these languages can handle the exception (for example, the spaghetti in openjdk is in hotspot/src/os/linux/vm/os_linux.cpp).
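To make that last point concrete, here is a minimal sketch of such a handler (my example, not from the kernel sources; note that fprintf is not strictly async-signal-safe and is used here only for illustration):
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void fpe_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;  // unused here
    // si_addr is the address of the faulting instruction;
    // si_code distinguishes FPE_INTDIV, FPE_FLTDIV, ...
    fprintf(stderr, "SIGFPE (si_code=%d) at %p\n", info->si_code, info->si_addr);
    // simply returning would re-execute the faulting divide and loop forever,
    // so either fix up the context or exit
    _exit(1);
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGFPE, &sa, NULL);

    volatile int zero = 0;
    return 1 / zero;   // INT 0 -> do_divide_error -> do_trap -> force_sig_info -> fpe_handler
}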
There is a list of SIGFPE handlers in Debian, via codesearch for sigaction SIGFPE or for signal SIGFPE.