CUDA shared memory efficiency at 50%?

CUDA shared memory efficiency at 50%? - cuda

I have the following code that performs a tiled matrix transpose using shared memory to improve performance. The shared memory is padded with 1 column to avoid bank conflict for a 32x32 thread block.
__global__ void transpose_tiled_padded(float *A, float *B, int n)
{
int i_in = blockDim.x*blockIdx.x + threadIdx.x;
int j_in = blockDim.y*blockIdx.y + threadIdx.y;
int i_out = blockDim.x*blockIdx.y + threadIdx.x;
int j_out = blockDim.y*blockIdx.x + threadIdx.y;
extern __shared__ float tile[];
// coalesced read of A rows to (padded) shared tile column (transpose)
tile[threadIdx.y + threadIdx.x*(blockDim.y+1)] = A[i_in + j_in*n];
__syncthreads();
// coalesced write from (padded) shared tile column to B rows
B[i_out + j_out*n] = tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
}
Running this code, I get 100% shared memory efficiency in the NVIDIA visual profiler, as I expect. But, when I run it with a 16x16 thread block, I only get 50% efficiency. Why is that? As far as I can tell, no thread in a warp reads from the same bank with this layout. Or am I mistaken?

Yes, you are mistaken.
Considering this (read) access for warp 0 in a 16x16 block:
tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"index"
Here are the relevant calculations for each thread in the warp:
warp lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 23 25 26 27 28 29 30 31
threadIdx.x: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
threadIdx.y: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
"index": 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
bank: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0
So we see that for this warp, the first and the last thread both read from bank 0. This results in a 2-way bank conflict, 2-way serialization, and 50% efficiency.

Related

Timed out script

I'm using the following code in order to avoid using an IMPORTRANGE() formula. The data I'm getting is over 50k rows so that's why I'm using App Script.
But now I'm getting the following error:
Exception: Service Spreadsheets timed out while accessing document with id
function Live_Data_importing() {
var raw_live = SpreadsheetApp.openByUrl("someURL").getSheetByName("Sheet1");
var selected_columns = raw_live.getRange("A:Q").getValues().map(([a,,,,e,f,g,,i,j,,l,,n,o,]) => [a,e,f,g,i,j,l,n,o] );
var FF_filtered = selected_columns.filter(row=>row[8]=="somedata");
var FF_Hourly_live_data = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("somesheet");
FF_Hourly_live_data.getRange(2, 1, FF_filtered.length, FF_filtered[0].length).setValues(FF_filtered);
}
How can I fix this error? I also made a copy of the file but script is still running the same error.

Tested this on smaller set of data
It works okay
function Live_Data_importing() {
const ss = SpreadsheetApp.getActive();
var raw_live = ss.getSheetByName("Sheet3");
var selected_columns = raw_live.getRange("A1:Q" + raw_live.getLastRow()).getValues().map(([a, , , , e, f, g, , i, j, , l, , n, o,]) => [a, e, f, g, i, j, l, n, o]);//fix this range
var FF_filtered = selected_columns.filter(row => row[8] == 0);//My data is all integers
var FF_Hourly_live_data = ss.getSheetByName("Sheet4");//output sheet
FF_Hourly_live_data.getRange(2, 1, FF_filtered.length, FF_filtered[0].length).setValues(FF_filtered);
}
Data:
COL1
COL2
COL3
COL4
COL5
COL6
COL7
COL8
COL9
COL10
COL11
COL12
COL13
COL14
COL15
COL16
COL17
COL18
COL19
COL20
0
1
18
13
0
7
5
11
15
2
1
17
0
8
11
18
0
16
1
8
0
16
16
9
5
0
16
8
14
0
7
4
3
15
8
19
18
4
18
10
9
14
7
16
17
13
17
4
8
6
19
18
9
18
12
1
8
19
12
10
13
17
4
2
10
3
7
16
14
13
1
5
16
12
2
3
14
3
0
9
4
17
14
5
9
13
18
0
4
1
14
1
12
7
10
1
15
4
14
5
1
17
13
17
3
1
17
8
19
7
14
18
2
17
1
16
2
19
13
15
7
4
9
0
11
14
11
7
12
14
3
19
14
13
18
12
16
19
19
6
7
6
10
5
15
8
0
8
2
16
4
14
18
14
2
16
16
5
15
16
4
5
9
6
6
2
1
15
14
8
19
8
4
10
12
12
4
6
10
12
11
15
3
4
1
3
17
19
10
4
11
2
10
16
12
1
6
3
0
3
11
0
14
8
13
13
4
2
16
18
10
14
5
3
7
4
7
9
5
15
0
6
5
1
2
18
1
1
11
13
7
13
5
15
5
13
17
18
14
19
0
17
10
6
10
16
16
0
18
19
12
8
15
1
11
4
19
4
17
14
4
14
14
6
16
8
15
4
5
2
4
5
14
14
16
9
0
16
0
4
8
3
8
5
12
15
5
19
14
0
1
6
12
2
19
10
10
19
13
19
0
5
14
7
2
17
3
10
2
14
2
5
0
18
5
1
13
13
3
13
10
18
18
4
18
11
0
7
13
9
18
13
9
2
0
16
15
3
9
14
12
15
15
7
16
13
14
1
4
12
18
9
6
14
18
11
16
2
18
15
15
0
2
4
6
16
5
18
2
6
14
18
11
1
9
15
13
8
8
19
11
19
14
0
7
15
1
10
8
9
14
14
15
3
11
13
4
10
5
16

Basic CUDA load and warp transpose

I want to implement a basic blocked load and warp transpose using CUDA 9.0's shuffle operations. I'm aware of the cub and trove implementations, but I'm restricted to compiling with nvrtc and the standard header includes make these libraries difficult to cater for. I'm not looking for anything fancy, just some integer, float and double shuffles on data with dimension a power of 2.
Visualising an example with warp size 8, I want to go from:
correlation
0 1 2 3
lane 0 0 8 16 24
lane 1 1 9 17 25
lane 2 2 10 18 26
lane 3 3 11 19 27
lane 4 4 12 20 28
lane 5 5 13 21 29
lane 6 6 14 22 30
lane 7 7 15 23 31
to this structure:
correlation
0 1 2 3
lane 0 0 1 2 3
lane 1 8 9 10 11
lane 2 16 17 18 19
lane 3 24 25 26 27
lane 4 4 5 6 7
lane 5 12 13 14 15
lane 6 20 21 22 23
lane 7 28 29 30 31
I feel this should be really simple but I can't figure out what I've done incorrectly. I think that the basic transposition loop should look like:
int loads[ncorrs];
int values[ncorrs];
int lane_id = threadIdx.x & (warp_size - 1);
// 0 0 0 0 4 4 4 4 8 8 8 8 ....
int base_idx = lane_id & (warp_size - ncorrs);
// 0 1 2 3 0 1 2 3 0 1 2 3
int src_corr = lane_id & (ncorrs - 1);
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx + corr;
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
So given the example data above, if we're in lane 5, I expect that the following indexing should occur:
base_idx == 4;
src_corr == 1;
corr == [0, 1, 2, 3]
src_lane == [4, 5, 6, 7]
values == [12, 13, 14 15]
But instead the following is happening (33's are from later in the data):
correlation
0 1 2 3
lane 0 0 0 0 0
lane 1 4 4 4 4
lane 2 12 12 12 12
lane 3 16 16 16 16
lane 4 20 20 20 20
lane 5 24 24 24 24
lane 6 28 28 28 28
lane 7 33 33 33 33
What am I doing incorrectly? Full implementation for a warp size of 32:
#include <cstdlib>
#include <cstdio>
#include "cuda.h"
#define ncorr 4
#define warp_size 32
template <int ncorrs>
__global__ void kernel(
int * input,
int * output,
int N)
{
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
if(n >= N)
{ return; }
// Input correlation handled by this thread
int src_corr = corr_idx(lane_id);
int mask = __activemask();
if(threadIdx.x == 0)
{ printf("mask %d\n", mask); }
int loads[ncorrs];
int values[ncorrs];
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ loads[corr] = input[n + corr*N]; }
__syncthreads();
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
loads[0], loads[1],
loads[2], loads[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx(lane_id) + corr;
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
values[0], values[1],
values[2], values[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ output[n + corr*N] = values[corr]; }
}
void print_data(int * data, int N)
{
for(int n=0; n < N; ++n)
{
printf("% -3d: ", n);
for(int c=0; c < ncorr; ++c)
{
printf("%d ", data[n*ncorr + c]);
}
printf("\n");
}
}
int main(void)
{
int * host_input;
int * host_output;
int * device_input;
int * device_output;
int N = 32;
host_input = (int *) malloc(sizeof(int)*N*ncorr);
host_output = (int *) malloc(sizeof(int)*N*ncorr);
printf("malloc done\n");
cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);
printf("cudaMalloc done\n");
for(int i=0; i < N*ncorr; ++i)
{ host_input[i] = i; }
print_data(host_input, N);
dim3 block(256, 1, 1);
dim3 grid((block.x + N - 1) / N, 1, 1);
cudaMemcpy(device_input, host_input,
sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);
printf("memcpy done\n");
kernel<4> <<<grid, block>>> (device_input, device_output, N);
cudaMemcpy(host_output, device_output,
sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);
print_data(host_output, N);
cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);
}
Edit 1: Clarified that the visual example has a warp size of 8 while the full code caters for a warp size of 32

What am I doing incorrectly?
TL;DR: In short, you are transmitting the same input value to multiple output values. Here is one example, in this line of code:
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
The quantity represented by loads[src_corr] is loop-invariant. Therefore you are transmitting that value to 4 warp lanes (over the 4 loop iterations) which means that value is occupying 4 output values (which is exactly what your printout data shows). That can't be right for a transpose.
Taking a somewhat longer view, with another example from your code:
I'm not sure I can read your mind, but possibly you may be confused about the warp shuffle operation. Possibly you have assumed that the destination lane can choose which value from the source lane loads[] array is desired. This is not the case. The destination lane only gets to select whatever is the value provided by the source lane. Let's take a look at your loop:
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
...
// Input correlation handled by this thread
int src_corr = corr_idx(lane_id);
int mask = __activemask();
...
int loads[ncorrs];
int values[ncorrs];
...
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx(lane_id) + corr;
values[corr] = __shfl_sync(mask, loads[src_corr], src_lane, warp_size);
}
On the first pass of the above loop, the src_lane for warp lanes 0, 1, 2, and 3 are all going to be 0. This is evident from the above excerpted code, or print it out if you're not sure. That means warp lanes 0-3 are going to be requesting whatever value is provided by warp lane 0. The value provided by warp lane 0 is loads[src_corr], but the interpretation of src_corr here is whatever value it has for warp lane 0. Therefore one and only one value will be distributed to warp lanes 0-3. This could not possibly be correct for a transpose; no input value shows up in 4 places in the output.
To fix this, we will need to modify the calculation both of src_lane and src_corr. We will also need to modify the storage location (index) per-warp-lane, at each pass the of the loop (I'm calling this new variable dest.) We can think of src_lane as defining the target value that my thread will receive. We can think of src_corr as defining which of my values I will publish to some other thread, on that loop iteration. dest is the location in my values[] array that I will store the currently received value. We can deduce the necessary pattern by carefully studying the relationship between the input value in loads[], the desired output location in values[], taking into account the appropriate warp lanes for source and destination. On the first pass of the loop, we desire this pattern:
warp lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
src_lane: 0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27 4 ... (where my data comes from)
src_corr: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 ... (which value I am transmitting)
dest: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 ... (where I store the received value)
On the second pass of the loop, we desire this pattern:
warp lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
src_lane: 8 16 24 0 9 17 25 1 10 18 26 2 11 19 27 3 19 ... (where my data comes from)
src_corr: 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 1 ... (which value I am transmitting)
dest: 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 ... (where I store the received value)
with corresponding changes for the 3rd and 4th pass of the loop. If we realize those patterns in code for your shuffle loop, it could look something like this:
$ cat t352.cu
#include <cstdlib>
#include <cstdio>
#include <assert.h>
#define ncorr 4
#define warp_size 32
template <int ncorrs>
__global__ void kernel(
int * input,
int * output,
int N)
{
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
if(n >= N)
{ return; }
// Input correlation handled by this thread
int mask = __activemask();
if(threadIdx.x == 0)
{ printf("mask %d\n", mask); }
int loads[ncorrs];
int values[ncorrs];
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ loads[corr] = input[n + corr*N]; }
__syncthreads();
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
loads[0], loads[1],
loads[2], loads[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = ((lane_id+corr)%ncorrs)*(warp_size/ncorrs) + (lane_id/ncorrs);
int src_corr = ((ncorrs-corr)+(lane_id/(warp_size/ncorrs)))%ncorrs;
int dest = (lane_id+corr)%ncorrs;
values[dest] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
values[0], values[1],
values[2], values[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ output[n + corr*N] = values[corr]; }
}
void print_data(int * data, int N)
{
for(int n=0; n < N; ++n)
{
printf("% -3d: ", n);
for(int c=0; c < ncorr; ++c)
{
printf("%d ", data[n*ncorr + c]);
}
printf("\n");
}
}
int main(void)
{
int * host_input;
int * host_output;
int * device_input;
int * device_output;
int N = 32;
host_input = (int *) malloc(sizeof(int)*N*ncorr);
host_output = (int *) malloc(sizeof(int)*N*ncorr);
printf("malloc done\n");
cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);
printf("cudaMalloc done\n");
for(int i=0; i < N*ncorr; ++i)
{ host_input[i] = i; }
print_data(host_input, N);
dim3 block(256, 1, 1);
dim3 grid((block.x + N - 1) / N, 1, 1);
cudaMemcpy(device_input, host_input,
sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);
printf("memcpy done\n");
kernel<4> <<<grid, block>>> (device_input, device_output, N);
cudaMemcpy(host_output, device_output,
sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);
print_data(host_output, N);
cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);
}
$ nvcc -o t352 t352.cu
$ cuda-memcheck ./t352
========= CUDA-MEMCHECK
malloc done
cudaMalloc done
0 : 0 1 2 3
1 : 4 5 6 7
2 : 8 9 10 11
3 : 12 13 14 15
4 : 16 17 18 19
5 : 20 21 22 23
6 : 24 25 26 27
7 : 28 29 30 31
8 : 32 33 34 35
9 : 36 37 38 39
10: 40 41 42 43
11: 44 45 46 47
12: 48 49 50 51
13: 52 53 54 55
14: 56 57 58 59
15: 60 61 62 63
16: 64 65 66 67
17: 68 69 70 71
18: 72 73 74 75
19: 76 77 78 79
20: 80 81 82 83
21: 84 85 86 87
22: 88 89 90 91
23: 92 93 94 95
24: 96 97 98 99
25: 100 101 102 103
26: 104 105 106 107
27: 108 109 110 111
28: 112 113 114 115
29: 116 117 118 119
30: 120 121 122 123
31: 124 125 126 127
memcpy done
mask -1
[0, 0] 0 32 64 96
[1, 0] 1 33 65 97
[2, 0] 2 34 66 98
[3, 0] 3 35 67 99
[4, 4] 4 36 68 100
[5, 4] 5 37 69 101
[6, 4] 6 38 70 102
[7, 4] 7 39 71 103
[8, 8] 8 40 72 104
[9, 8] 9 41 73 105
[10, 8] 10 42 74 106
[11, 8] 11 43 75 107
[12, 12] 12 44 76 108
[13, 12] 13 45 77 109
[14, 12] 14 46 78 110
[15, 12] 15 47 79 111
[16, 16] 16 48 80 112
[17, 16] 17 49 81 113
[18, 16] 18 50 82 114
[19, 16] 19 51 83 115
[20, 20] 20 52 84 116
[21, 20] 21 53 85 117
[22, 20] 22 54 86 118
[23, 20] 23 55 87 119
[24, 24] 24 56 88 120
[25, 24] 25 57 89 121
[26, 24] 26 58 90 122
[27, 24] 27 59 91 123
[28, 28] 28 60 92 124
[29, 28] 29 61 93 125
[30, 28] 30 62 94 126
[31, 28] 31 63 95 127
[0, 0] 0 8 16 24
[1, 0] 32 40 48 56
[2, 0] 64 72 80 88
[3, 0] 96 104 112 120
[4, 4] 1 9 17 25
[5, 4] 33 41 49 57
[6, 4] 65 73 81 89
[7, 4] 97 105 113 121
[8, 8] 2 10 18 26
[9, 8] 34 42 50 58
[10, 8] 66 74 82 90
[11, 8] 98 106 114 122
[12, 12] 3 11 19 27
[13, 12] 35 43 51 59
[14, 12] 67 75 83 91
[15, 12] 99 107 115 123
[16, 16] 4 12 20 28
[17, 16] 36 44 52 60
[18, 16] 68 76 84 92
[19, 16] 100 108 116 124
[20, 20] 5 13 21 29
[21, 20] 37 45 53 61
[22, 20] 69 77 85 93
[23, 20] 101 109 117 125
[24, 24] 6 14 22 30
[25, 24] 38 46 54 62
[26, 24] 70 78 86 94
[27, 24] 102 110 118 126
[28, 28] 7 15 23 31
[29, 28] 39 47 55 63
[30, 28] 71 79 87 95
[31, 28] 103 111 119 127
0 : 0 32 64 96
1 : 1 33 65 97
2 : 2 34 66 98
3 : 3 35 67 99
4 : 4 36 68 100
5 : 5 37 69 101
6 : 6 38 70 102
7 : 7 39 71 103
8 : 8 40 72 104
9 : 9 41 73 105
10: 10 42 74 106
11: 11 43 75 107
12: 12 44 76 108
13: 13 45 77 109
14: 14 46 78 110
15: 15 47 79 111
16: 16 48 80 112
17: 17 49 81 113
18: 18 50 82 114
19: 19 51 83 115
20: 20 52 84 116
21: 21 53 85 117
22: 22 54 86 118
23: 23 55 87 119
24: 24 56 88 120
25: 25 57 89 121
26: 26 58 90 122
27: 27 59 91 123
28: 28 60 92 124
29: 29 61 93 125
30: 30 62 94 126
31: 31 63 95 127
========= ERROR SUMMARY: 0 errors
$
I believe the above code fairly clearly demonstrates a 32x4 -> 4x32 transpose. I think it is "closest" to the code you presented. It does not do the set of 4x8 transposes you depicted in your diagrams.
I acknowledge that the calculations of src_corr, src_lane, and dest are not completely optimized. But they generate the correct indexing. I assume you can work out how to optimally generate those from the patterns you already have.
I think its entirely possible the above code has bugs for other dimensions. I've not tried it on anything except the 32x4 case. Nevertheless I think I have indicated what is fundamentally wrong with your code, and demonstrated a pathway to get to proper indexing.
A square matrix transpose up to 32x32 can be done at the warp level using a simpler method

why output tex3D data are different from initial data?

I make this program to practice cudaMemcpy3D() and Texture Memory.
Here comes the questions,when I print out tex3D data,it is not same as initial data.The value I get is ncrss times the initial value, and there are ncrss interval numbers which equal to 0 between each other. If I set nsubs to 2 or other bigger one, the time should be ncrss*nsubs and interval will be ncrss*nsubs.
Can you piont out where I made the mistakes. I think it probably is make_cudaPitchedPtr at line 61, or make_cudaExtent at line 56. And also may related with the way of array storaged.
So I come here for your help,appreciate for your comments and advices.
1 #include<stdio.h>
2 #include<stdlib.h>
3 #include<cuda_runtime.h>
4 #include<helper_functions.h>
5 #include<helper_cuda.h>
6 #ifndef MIN
7 #define MIN(A,B) ((A) < (B) ? (A) : (B))
8 #endif
9 #ifndef MAX
10 #define MAX(A,B) ((A) > (B) ? (A) : (B))
11 #endif
12
13 texture<float,cudaTextureType3D,cudaReadModeElementType> vel_tex;
14
15 __global__ void mckernel(int ntab)
16 {
17 const int biy=blockIdx.y;//sub
18 const int bix=blockIdx.x;//crs
19 const int tid=threadIdx.x;
20
21 float test;
22 test=tex3D(vel_tex,biy,bix,tid);
23 printf("test=%f,bix=%d,tid=%d\n",test,bix,tid);
24
25 }
26
27 int main()
28 {
29 int n=10;//208
30 int ntab=10;
31 int submin=1;
32 int crsmin=1;
33 int submax=1;
34 int crsmax=2;
35 int subinc=1;
36 int crsinc=1;
37
38 int ncrss,nsubs;
39 ncrss=(crsmax-crsmin)/crsinc + 1;
40 nsubs=(submax-submin)/subinc + 1;
41 dim3 BlockPerGrid(ncrss,nsubs,1);
42 dim3 ThreadPerBlock(n,1,1);
43
44 float vel[nsubs][ncrss][ntab];
45 int i,j,k;
46 for(i=0;i<nsubs;i++)
47 for(j=0;j<ncrss;j++)
48 for(k=0;k<ntab;k++)
49 vel[i][j][k]=k;
50 for(i=0;i<nsubs;i++)
51 for(j=0;j<ncrss;j++)
52 for(k=0;k<ntab;k++)
53 printf("vel[%d][%d][%d]=%f\n",i,j,k,vel[i][j][k]);
54
55 cudaChannelFormatDesc velchannelDesc=cudaCreateChannelDesc<float>();
56 cudaExtent velExtent=make_cudaExtent(nsubs,ncrss,ntab);
57 cudaArray *d_vel;
58 cudaMalloc3DArray(&d_vel,&velchannelDesc,velExtent);
59
60 cudaMemcpy3DParms velParms = {0};
61 velParms.srcPtr=make_cudaPitchedPtr((void*)vel,sizeof(float)*nsubs,nsubs,ncrss);
62 velParms.dstArray=d_vel;
63 velParms.extent=velExtent;
64 velParms.kind=cudaMemcpyHostToDevice;
65 cudaMemcpy3D(&velParms);
66
67 cudaBindTextureToArray(vel_tex,d_vel);
68
69 printf("kernel start\n");
70 cudaDeviceSynchronize();
71 mckernel<<<BlockPerGrid,ThreadPerBlock>>>(ntab);
72 printf("kernel end\n");
73
74 cudaUnbindTexture(vel_tex);
75 cudaFreeArray(d_vel);
76 cudaDeviceReset();
77 return 0 ;
78 }
Here comes the printf data,nsubs=1 and ncrss=2;
1 vel[0][0][0]=0.000000
2 vel[0][0][1]=1.000000
3 vel[0][0][2]=2.000000
4 vel[0][0][3]=3.000000
5 vel[0][0][4]=4.000000
6 vel[0][0][5]=5.000000
7 vel[0][0][6]=6.000000
8 vel[0][0][7]=7.000000
9 vel[0][0][8]=8.000000
10 vel[0][0][9]=9.000000
11 vel[0][1][0]=0.000000
12 vel[0][1][1]=1.000000
13 vel[0][1][2]=2.000000
14 vel[0][1][3]=3.000000
15 vel[0][1][4]=4.000000
16 vel[0][1][5]=5.000000
17 vel[0][1][6]=6.000000
18 vel[0][1][7]=7.000000
19 vel[0][1][8]=8.000000
20 vel[0][1][9]=9.000000
21 kernel start
22 kernel end
23 test=1.000000,bix=1,tid=0
24 test=3.000000,bix=1,tid=1
25 test=5.000000,bix=1,tid=2
26 test=7.000000,bix=1,tid=3
27 test=9.000000,bix=1,tid=4
28 test=1.000000,bix=1,tid=5
29 test=3.000000,bix=1,tid=6
30 test=5.000000,bix=1,tid=7
31 test=7.000000,bix=1,tid=8
32 test=9.000000,bix=1,tid=9
33 test=0.000000,bix=0,tid=0
34 test=2.000000,bix=0,tid=1
35 test=4.000000,bix=0,tid=2
36 test=6.000000,bix=0,tid=3
37 test=8.000000,bix=0,tid=4
38 test=0.000000,bix=0,tid=5
39 test=2.000000,bix=0,tid=6
40 test=4.000000,bix=0,tid=7
41 test=6.000000,bix=0,tid=8
42 test=8.000000,bix=0,tid=9

After a night thinking ,I find out the problem.
the cuda array load as M[fast][mid][low] while c array is M[low][mid][fast].
so dim3(),cudaExtent(),pitchedPtr()should be same to [low][mid][fast] or at least should be same as each other.

Puzzle: find number in a given row and column in 2-dimensional array where no number may occur twice

We have a two-dimensional array with the number 0 in the upper left corner. The rest of the array is then filled with numbers so that each index contains the smallest positive integer possible that already exists neither on the same row or column.
Example:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14 17 16
2 3 0 1 6 7 4 5 10 11 8 9 14 15 12 13 18 19
3 2 1 0 7 6 5 4 11 10 9 8 15 14 13 12 19 18
4 5 6 7 0 1 2 3 12 13 14 15 8 9 10 11 20 21
5 4 7 6 1 0 3 2 13 12 15 14 9 8 11 10 21 20
6 7 4 5 2 3 0 1 14 15 12 13 10 11 8 9 22 23
7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 23 22
8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 24 25
9 8 11 10 13 12 15 14 1 0 3 2 5 4 7 6 25 24
10 11 8 9 14 15 12 13 2 3 0 1 6 7 4 5 26 27
11 10 9 8 15 14 13 12 3 2 1 0 7 6 5 4 27 26
12 13 14 15 8 9 10 11 4 5 6 7 0 1 2 3 28 29
13 12 15 14 9 8 11 10 5 4 7 6 1 0 3 2 29 28
14 15 12 13 10 11 8 9 6 7 4 5 2 3 0 1 30 31
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 31 30
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1
17 16 19 18 21 20 23 22 25 24 27 26 29 28 31 30 1 0
Given the row and the column in such array, I need to be able to find the number in the specified index in less than one second on a relatively new desktop PC (for row and column less than a million). My brute-force attempts so far have been so futile that it's clearly not the way I want to go with this. Presumably there must be a way to find out the number in question, in linear time (?), that doesn't require computing all the preceding numbers in the array.

Observation shows that the operator is the bitwise XOR (represent each operand as a binary number, XOR together the corresponding bits, read as binary).
Now on to prove it is the XOR:
Since the XOR with one argument fixed is a bijection on the other argument, the "that exists neither on the same row or column" is satisfied.
Now it just suffices to prove the "smallest" part, namely that any lower value already occurs if we reduce either operand:
foreach A >= 0, B >= 0, F >= 0:
(A xor B > F) => (exists D: D xor B = F) or (exists E: A xor E = F)
or equivalently
foreach 0 <= A, 0 <= B, 0 <= F < (A XOR B)
(exists D: D xor B = F) or (exists E: A xor E = F)
Note that we are no longer concerned about our operator, we're proving the minimality of XOR.
Define C = A xor B
if A = 0, B = 0, then minimality is satisfied.
Now, if A and B have the same magnitude (the same bit length), then clearing the top bit of both will not change C. Clearing the top bit is a translation towards the origin in the matrix, so if a smaller value exists above or to the left after translation, it is at the same relative position before the translation.
A and B must have a different magnitude to be a counter-example. XOR (as well as the operator under consideration) are symmetric, so assume A > B.
If F is of greater magnitude than A, then it's not smaller, and thus it's not a counter-example.
If F has the same magnitude as A, then clear the highest bit in A and in F. This is a translation in the table. It changes the values, but not their ordering, so if a smaller value exists above or to the left after translation, it is at the same relative position before the translation.
If F has a smaller magnitude than A, then, by the pigeonhole principle and the properties of XOR, there exists a D with a smaller magnitude than A such that D xor B = F.
summary: The proof that XOR satisfies the conditions imposed onto the solution follows from the symmetries of XOR, its magnitude-preserving properties and its bijection properties. We can find each smaller element than A xor B by reducing A, B and the challenge until they're all zero or of different magnitude (at which point we apply the pigeonhole principle to prove the challenge can be countered without actually countering it).

Code Golf: Zigzag pattern scanning

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
The Challenge
The shortest code by character count that takes a single input integer N (N >= 3) and returns an array of indices that when iterated would traverse an NxN matrix according to the JPEG "zigzag" scan pattern. The following is an example traversal over an 8x8 matrixsrc:
Examples
(The middle matrix is not part of the input or output, just a representation of the NxN matrix the input represents.)
1 2 3
(Input) 3 --> 4 5 6 --> 1 2 4 7 5 3 6 8 9 (Output)
7 8 9
1 2 3 4
(Input) 4 --> 5 6 7 8 --> 1 2 5 9 6 3 4 7 10 13 14 11 8 12 15 16 (Output)
9 10 11 12
13 14 15 16
Notes
The resulting array's base should be appropriate for your language (e.g., Matlab arrays are 1-based, C++ arrays are 0-based).
This is related to this question.
Bonus
Extend your answer to take two inputs N and M (N, M >=3) and perform the same scan over an NxM matrix. (In this case N would be the number of columns and M the number of rows.)
Bonus Examples
1 2 3 4
(Input) 4 3 --> 5 6 7 8 --> 1 2 5 9 6 3 4 7 10 11 8 12 (Output)
9 10 11 12
1 2 3
(Input) 3 4 --> 4 5 6 --> 1 2 4 7 5 3 6 8 10 11 9 12 (Output)
7 8 9
10 11 12

J, 13 15 characters
;<#|.`</.i.2$
Usage:
;<#|.`</.i.2$ 3
0 1 3 6 4 2 5 7 8
;<#|.`</.i.2$ 4
0 1 4 8 5 2 3 6 9 12 13 10 7 11 14 15
Explanation
(NB. is J's comment indicator)
; NB. Link together...
<#|.`< NB. ... 'take the reverse of' and 'take normally'
/. NB. ... applied to alternating diagonals of...
i. NB. ... successive integers starting at 0 and counting up to fill an array with dimensions of...
2$ NB. ... the input extended cyclically to a list of length two.
J, bonus, 13 characters
;<#|.`</.i.|.
Usage:
;<#|.`</.i.|. 3 4
0 1 3 6 4 2 5 7 9 10 8 11
;<#|.`</.i.|. 9 6
0 1 9 18 10 2 3 11 19 27 36 28 20 12 4 5 13 21 29 37 45 46 38 30 22 14 6 7 15 23 31 39 47 48 40 32 24 16 8 17 25 33 41 49 50 42 34 26 35 43 51 52 44 53

Python, 92, 95, 110, 111, 114, 120, 122, 162, 164 chars
N=input()
for a in sorted((p%N+p/N,(p%N,p/N)[(p%N-p/N)%2],p)for p in range(N*N)):print a[2],
Testing:
$ echo 3 | python ./code-golf.py
0 1 3 6 4 2 5 7 8
$ echo 4 | python ./code-golf.py
0 1 4 8 5 2 3 6 9 12 13 10 7 11 14 15
This solution easily generalizes for NxM boards: tweak the input processing and replace N*N with N*M:
N,M=map(int,raw_input().split())
for a in sorted((p%N+p/N,(p%N,p/N)[(p%N-p/N)%2],p)for p in range(N*M)):print a[2],
I suspect there's some easier/shorter way to read two numbers.
Testing:
$ echo 4 3 | python ./code-golf.py
0 1 4 8 5 2 3 6 9 10 7 11

Ruby, 69 89 chars
n=gets.to_i
puts (0...n*n).sort_by{|p|[t=p%n+p/n,[p%n,p/n][t%2]]}*' '
89 chars
n=gets.to_i
puts (0...n*n).map{|p|[t=p%n+p/n,[p%n,p/n][t%2],p]}.sort.map{|i|i[2]}.join' '
Run
> zigzag.rb
3
0 1 3 6 4 2 5 7 8
> zigzag.rb
4
0 1 4 8 5 2 3 6 9 12 13 10 7 11 14 15
Credits to doublep for the sort method.

F#, 126 chars
let n=stdin.ReadLine()|>int
for i=0 to 2*n do for j in[id;List.rev].[i%2][0..i]do if i-j<n&&j<n then(i-j)*n+j|>printf"%d "
Examples:
$ echo 3 | fsi --exec Program.fsx
0 1 3 6 4 2 5 7 8
$ echo 4 | fsi --exec Program.fsx
0 1 4 8 5 2 3 6 9 12 13 10 7 11 14 15

Golfscript, 26/30 32/36 45 59 characters
Shortest non-J solution so far:
Updated sort (don't tell the others!) - 30 chars:
~:1.*,{..1/\1%+.2%.+(#*]}$' '* #solution 1
#~:\.*,{.\/1$\%+.1&#.~if]}$' '* #solution 2
#~\:1*,{..1/\1%+.2%.+(#*]}$' '* #(bonus)
#~\:\*,{.\/1$\%+.1&#.~if]}$' '* #(bonus)
Straight implementation - 36 chars:
~:#.*,{[.#%:|\#/:^+|^- 2%^|if]}$' '*
#~\:#*,{[.#%:|\#/:^+|^- 2%^|if]}$' '* #(bonus)
If you can provide output as "013642578" instead of "0 1 3 6 4 2 5 7 8", then you can remove the last 4 characters.
Credit to doublep for the sorting technique.
Explanation:
~\:#* #read input, store first number into #, multiply the two
, #put range(#^2) on the stack
{...}$ #sort array using the key in ...
" "* #join array w/ spaces
and for the key:
[ #put into an array whatever is left on the stack until ]
.#%:| #store #%n on the stack, also save it as |
\#/:^ #store #/n on the stack, also save it as ^
+ #add them together. this remains on the stack.
|^- 2%^|if #if (| - ^) % 2 == 1, then put ^ on stack, else put | on stack.
] #collect them into an array

MATLAB, 101/116 chars
Its basically a condensed version of the same answer given here, to be run directly on the command prompt:
N=input('');i=fliplr(spdiags(fliplr(reshape(1:N*N,N,N)')));i(:,1:2:end)=flipud(i(:,1:2:end));i(i~=0)'
and an extended one that read two values from the user:
S=str2num(input('','s'));i=fliplr(spdiags(fliplr(reshape(1:prod(S),S)')));i(:,1:2:end)=flipud(i(:,1:2:end));i(i~=0)'
Testing:
3
ans =
1 2 4 7 5 3 6 8 9
and
4 3
ans =
1 2 5 9 6 3 4 7 10 11 8 12

Ruby 137 130 138 characters
n=gets.to_i
def g(a,b,r,t,s);x=[s*r]*t;t==r ?[a,x,a]:[a,x,g(b,a,r,t+1,-s),x,a];end
q=0;puts ([1]+g(1,n,n-1,1,1)).flatten.map{|s|q+=s}*' '
$ zz.rb
3
1 2 4 7 5 3 6 8 9
$ zz.rb
4
1 2 5 9 6 3 4 7 10 13 14 11 8 12 15 16

C89 (280 bytes)
I guess this can still be optimized - I use four arrays to store the possible movement
vectors when hitting a wall. I guess it can be done, saving some chars at the definition, but I think it will cost more to implement the logic further down. Anyway, here you go:
t,l,b,r,i,v,n;main(int c,char**a){n=atoi(*++a);b=n%2;int T[]={n-1,1},L[]={1-n,n}
,B[]={1-n,1},R[]={n-1,n};for(c=n*n;c--;){printf("%d%c",i,c?32:10);if(i>=n*(n-1))
v=B[b=!b];else if(i%n>n-2){if(!(n%2)&&i<n)goto g;v=R[r=!r];}else if(i<n)g:v=T[t=
!t];else if(!(i%n))v=L[l=!l];i+=v;}}
compiles with a few warnings, but as far as I know it is portable C89. I'm actually not sure
whether my algorithm is clever at all, maybe you can get way shorter with a better one
(haven't taken the time to understand the other solutions yet).

Haskell 117 characters
i s=concatMap(\q->d q++(reverse.d$q+1))[0,2..s+s]
where d r=[x+s*(r-x)|x<-[0..r],x<s&&(r-x)<s]
main=readLn>>=print.i
Runs:
$ echo 3 | ./Diagonals
[0,1,3,6,4,2,5,7,8]
$ echo 4 | ./Diagonals
[0,1,4,8,5,2,3,6,9,12,13,10,7,11,14,15]
The rectangular variant is a little longer, at 120 characters:
j(w,h)=concatMap(\q->d q++(reverse.d$q+1))[0,2..w+h]
where d r=[x+w*(r-x)|x<-[0..r],x<w&&(r-x)<h]
main=readLn>>=print.j
Input here requires a tuple:
$ echo '(4,3)' | ./Diagonals
[0,1,4,8,5,2,3,6,9,10,7,11]
$ echo '(3,4)' | ./Diagonals
[0,1,3,6,4,2,5,7,9,10,8,11]
Answers are all 0-based, and returned as lists (natural forms for Haskell).

Perl 102 characters
<>=~/ /;print map{$_%1e6," "}sort map{($x=($%=$_%$`)+($/=int$_/$`))*1e9+($x%2?$/:$%)*1e6+$_}0..$'*$`-1
usage :
echo 3 4 | perl zigzag.pl
0 1 3 6 4 2 5 7 9 10 8 11

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

CUDA shared memory efficiency at 50%? - cuda

Related

Timed out script

Basic CUDA load and warp transpose

why output tex3D data are different from initial data?

Puzzle: find number in a given row and column in 2-dimensional array where no number may occur twice

Code Golf: Zigzag pattern scanning

Categories

Resources