Multiway stable partition - cuda

Is there a way to perform multiway (>2) stable partition in Thrust?
Either stable partition or stable partition copy would be equally interesting. Currently I can only use two-way stable partition copy for the purposes described above. It is clear how to use it to partition a sequence into three parts using two predicates and two calls of thrust::stable_partition_copy. But I am sure it is technically possible to implement a multiway stable partition.
I can imagine the following multiway stable partition copy (pseudocode):
using F = float;
thrust::device_vector< F > triangles{N * 3};
// fill triangles here
thrust::device_vector< F > A{N}, B{N}, C{N};
auto vertices_begin = thrust::make_tuple(A.begin(), B.begin(), C.begin());
using U = unsigned int;
auto selector = [] __host__ __device__ (U i) -> U { return i % 3; };
thrust::multiway_stable_partition_copy(p, triangles.cbegin(), triangles.cend(), selector, vertices_begin);
A.begin(), B.begin(), C.begin() should be incremented individually.
Also, I can imagine a hypothetical dispatch iterator, which would do the same (and would be more useful, I think).

From my knowledge of the thrust internals, there is no readily adaptable algorithm to do what you envisage.
A simple approach would be to extend your two-pass three-way partition to M-1 passes using a smart binary predicate, something like
template<typename T>
struct divider
{
    int pass;
    __host__ __device__ divider(int p) : pass(p) { };
    __host__ __device__ int classify(const T &val) { .... };
    __host__ __device__ bool operator()(const T &val) { return !(classify(val) > pass); };
};
which enumerates a given input into M possible subsets and returns true if the input is in the Nth or less subset, and then a loop
auto start = input.begin();
for(int i=0; i<(M-1); ++i) {
    divider<T> pred(i);
    result[i] = thrust::stable_partition(
        thrust::device,
        start,
        input.end(),
        pred);
    start = result[i];
}
[ note all code written in a browser on a tablet while floating on a boat in the Baltic. Obviously never compiled or run. ]
This will certainly be the most space-efficient approach, as at most len(input) temporary storage is required, whereas a hypothetical single-pass implementation would require M * len(input) storage, which would quickly become impractical for large M.
Edit to add that now I'm back on land with a compiler, this seems to work as expected:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/partition.h>

struct divider
{
    int pass;

    __host__ __device__
    divider(int p) : pass(p) { };

    __host__ __device__
    int classify(const int &val) { return (val % 12); };

    __host__ __device__
    bool operator()(const int &val) { return !(classify(val) > pass); };
};

int main()
{
    const int M = 12;
    const int N = 120;

    thrust::device_vector<int> input(N);
    thrust::counting_iterator<int> iter(0);
    thrust::copy(iter, iter+N, input.begin());

    thrust::device_vector<int>::iterator result[M];
    auto start = input.begin();
    for(int i=0; i<(M-1); ++i) {
        divider pred(i);
        result[i] = thrust::stable_partition(
            thrust::device,
            start,
            input.end(),
            pred);
        start = result[i];
    }

    int i = 0;
    for(auto j=input.begin(); j!=input.end(); ++j) {
        if (j == result[i]) {
            i++;
            std::cout << std::endl;
        }
        std::cout << *j << " ";
    }
    return 0;
}
$ nvcc -std=c++11 -arch=sm_52 -o partition partition.cu
$ ./partition
0 12 24 36 48 60 72 84 96 108
1 13 25 37 49 61 73 85 97 109
2 14 26 38 50 62 74 86 98 110
3 15 27 39 51 63 75 87 99 111
4 16 28 40 52 64 76 88 100 112
5 17 29 41 53 65 77 89 101 113
6 18 30 42 54 66 78 90 102 114
7 19 31 43 55 67 79 91 103 115
8 20 32 44 56 68 80 92 104 116
9 21 33 45 57 69 81 93 105 117
10 22 34 46 58 70 82 94 106 118
11 23 35 47 59 71 83 95 107 119


Using CUDA atomicInc to get unique indices

I have a CUDA kernel where basically each thread holds a value, and it needs to add that value to one or more lists in shared memory. For each of those lists, it needs to get an index value (unique for that list) at which to put the value.
The real code is different, but there are lists like:
typedef struct {
    unsigned int numItems;
    float items[MAX_NUM_ITEMS];
} List;
__shared__ List lists[NUM_LISTS];
The values numItems are initially all set to 0, and then a __syncthreads() is done.
To add its value to the lists, each thread does:
for(int list = 0; list < NUM_LISTS; ++list) {
    if(should_add_to_list(threadIdx, list)) {
        unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
        assert(index < MAX_NUM_ITEMS); // always true
        lists[list].items[index] = my_value;
    }
}
This works most of the time, but it seems that when making some unrelated changes in other parts of the kernel (such as not checking asserts that always succeed), sometimes two threads get the same index for one list, or indices are skipped.
The final value of numItems always ends up correct, however.
However, when using the following custom implementation for atomicInc_ instead, it seems to work correctly:
__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
    uint32_t value;
    do {
        value = *ptr;
    } while(atomicCAS(ptr, value, value + 1) != value);
    return value;
}
Are the two atomicInc functions equivalent, and is it valid to use atomicInc that way to get unique indices?
According to the CUDA programming guide, the atomic functions do not imply memory ordering constraints, and different threads can access the numItems of different lists at the same time: could this cause it to fail?
Edit:
The real kernel looks like this:
Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that are in a certain window (col/row difference) around it, and put them into a list in shared memory.
The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3). These are called the local spot blocks.
First it takes the spots from the block's 3 spot blocks, and copies them into shared memory (localSpots[]).
For this it uses one warp for each spot block, so that the spots can be read coalesced. Each thread in the warp is a spot in the local spot block.
The spot block indices are here hardcoded (blocks[]).
Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot blocks' indices are also hardcoded here (sblocks[]).
In this example it only uses the first warp for this, and traverses sblocks[] iteratively. Each thread in the warp is a spot in the surrounding spot block.
It also iterates through the list of all the local spots. If the thread's spot is close enough to a local spot, it inserts it into that local spot's list, using atomicInc to get an index.
When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.
The real code is more complex/optimized, but this code already has the problem. Here it also only runs one CUDA block.
#include <assert.h>
#include <stdio.h>
#define MAX_NUM_SPOTS_IN_WINDOW 80
__global__ void Kernel(
        const uint16_t* blockNumSpotsBuffer,
        XGPU_SpotProcessingBlockSpotDataBuffers blockSpotsBuffers,
        size_t blockSpotsBuffersElementPitch,
        int2 unused1,
        int2 unused2,
        int unused3 ) {
    typedef unsigned int uint;
    if(blockIdx.x!=30 || blockIdx.y!=1) return;
    int window = 5;
    ASSERT(blockDim.x % WARP_SIZE == 0);
    ASSERT(blockDim.y == 1);
    uint numWarps = blockDim.x / WARP_SIZE;
    uint idxWarp = threadIdx.x / WARP_SIZE;
    int idxThreadInWarp = threadIdx.x % WARP_SIZE;

    struct Spot {
        int16_t row;
        int16_t col;
        volatile unsigned int numSamples;
        float signalSamples[MAX_NUM_SPOTS_IN_WINDOW];
    };
    __shared__ uint numLocalSpots;
    __shared__ Spot localSpots[3 * 32];
    numLocalSpots = 0;
    __syncthreads();

    ASSERT(numWarps >= 3);
    int blocks[3] = {174, 222, 270};
    if(idxWarp < 3) {
        uint spotBlockIdx = blocks[idxWarp];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);
        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        ASSERT(numSpots < WARP_SIZE);
        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;
        uint outOffset;
        if(idxThreadInWarp == 0) outOffset = atomicAdd(&numLocalSpots, numSpots);
        outOffset = __shfl_sync(0xffffffff, outOffset, 0, 32);
        if(idxThreadInWarp < numSpots) {
            Spot* outSpot = &localSpots[outOffset + idxThreadInWarp];
            outSpot->numSamples = 0;
            uint32_t coord = blockSpotsBuffers.coord[inOffset];
            UnpackCoordinates(coord, &outSpot->row, &outSpot->col);
        }
    }
    __syncthreads();

    int sblocks[] = { 29,30,31,77,78,79,125,126,127,173,174,175,221,222,223,269,270,271,317,318,319,365,366,367,413,414,415 };
    if(idxWarp == 0) for(int block = 0; block < sizeof(sblocks)/sizeof(int); ++block) {
        uint spotBlockIdx = sblocks[block];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);
        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        uint idxThreadInWarp = threadIdx.x % WARP_SIZE;
        if(idxThreadInWarp >= numSpots) continue;
        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;
        uint32_t coord = blockSpotsBuffers.coord[inOffset];
        if(coord == 0) return; // invalid surrounding spot
        int16_t row, col;
        UnpackCoordinates(coord, &row, &col);
        for(int idxLocalSpot = 0; idxLocalSpot < numLocalSpots; ++idxLocalSpot) {
            Spot* localSpot = &localSpots[idxLocalSpot];
            if(localSpot->row == 0 && localSpot->col == 0) continue;
            if((abs(localSpot->row - row) >= window) && (abs(localSpot->col - col) >= window)) continue;
            int index = atomicInc_block((unsigned int*)&localSpot->numSamples, 0xffffffff);
            if(localSpot->row == 37 && localSpot->col == 977) printf("%02d ", index); // <-- sometimes indices are skipped or duplicated
            if(index >= MAX_NUM_SPOTS_IN_WINDOW) continue; // index out of bounds, discard value for median calculation
            localSpot->signalSamples[index] = blockSpotsBuffers.signal[inOffset];
        }
    }
}
Output looks like this:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 23
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 02 03 03 04 05 06 07 08 09 10 11 12 06 13 14 15 16 17 18 19 20 21
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 23
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Each line is the output of one execution (the kernel is run multiple times). It is expected that indices appear in different orders. But for example on the third-last line, index 23 is repeated.
Using atomicCAS seems to fix it. Also using __syncwarp() between executions on the outer for-loop seems to fix it. But it is not clear why, and if that always fixes it.
Edit 2:
This is a full program (main.cu) that shows the problem:
https://pastebin.com/cDqYmjGb
The CMakeLists.txt:
https://pastebin.com/iB9mbUJw
Must be compiled with -DCMAKE_BUILD_TYPE=Release.
It produces this output:
00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)
For example, the lines with 03 show that two threads (1 and 2) get the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).
According to my testing, this problem is fixed in CUDA 11.4.1 and driver 470.52.02. It may also be fixed in some earlier versions of CUDA 11.4 and 11.3, but the problem is present in CUDA 11.2.

thread work if previously thread finished work (cuda) in same block

Hello, I am a beginner in CUDA programming. I use the lock.lock() function to wait for the previous thread to finish its work. This is my code:
#include "book.h"
#include <cuda.h>
#include <conio.h>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <math.h>
#include <fstream>
#include <string>
#include <curand.h>
#include <curand_kernel.h>
#include "lock.h"
#define pop 10
#define gen 10
#define pg pop*gen
using namespace std;
__global__ void hold(Lock lock, float* a)
{
    __shared__ int cache[gen];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int cacheIndex = threadIdx.x;
    if(tid < gen)
    {
        a[tid] = 7; // this number is an example; in my case it is a random number
    }
    else
    {
        //cache[cacheIndex]=a[tid];
        int temp;
        if(tid % gen == 0)
        {
            a[tid] = tid + 4; // an example number; in my case a random number if tid==tid%gen
            temp = a[tid];
            tid += blockIdx.x * gridDim.x;
        }
        else
        {
            __syncthreads();
            a[tid]=temp+1;//this must a[tid]=a[tid-1]+1;
            temp = a[tid];
            tid += blockIdx.x * gridDim.x;
        }
        cache[cacheIndex] = temp;
        __syncthreads();
        for (int i = 0; i < gen; i++)
        {
            if(cacheIndex == i)
            {
                lock. lock();
                cache[cacheIndex] = temp;
                lock.unlock();
            }
        }
    }
}

int main()
{
    float time;
    float* a = new float[pg];
    float* dev_a;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, pg * sizeof(int) ) );
    Lock lock;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    hold<<<pop,gen>>>(lock, dev_a);
    HANDLE_ERROR( cudaMemcpy( a, dev_a, pg * sizeof(float), cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
    for(int i = 0; i < pop; i++)
    {
        for(int j = 0; j < gen; j++)
        {
            cout << a[(i*gen)+j] << " ";
        }
        cout << endl;
    }
    printf("hold: %3.1f ms \n", time);
    HANDLE_ERROR( cudaFree(dev_a) );
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    system("pause");
    return 0;
}
and this is the result:
7 7 7 7 7 7 7 7 7 7
14 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0
44 0 0 0 0 0 0 0 0 0
54 0 0 0 0 0 0 0 0 0
64 0 0 0 0 0 0 0 0 0
74 0 0 0 0 0 0 0 0 0
84 0 0 0 0 0 0 0 0 0
94 0 0 0 0 0 0 0 0 0
my expected result:
7 7 7 7 7 7 7 7 7 7
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43
44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63
64 65 66 67 68 69 70 71 72 73
74 75 76 77 78 79 80 81 82 83
84 85 86 87 88 89 90 91 92 93
94 95 96 97 98 99 100 101 102 103
Can anyone please help me correct my code? Thanks.
If you want help, it would be useful to point out that some of your code (e.g. lock.h and book.h) comes from the CUDA by Example book. This is not a standard part of CUDA, so if you don't indicate where it comes from, it may be confusing.
I see the following issues in your code:
You are using a __syncthreads() in a conditional block where not all threads will meet the __syncthreads() barrier:
if(tid%gen==0)
{
    ...
}
else
{
    __syncthreads(); // illegal
}
The usage of __syncthreads() in this way is illegal because not all threads will be able to reach the __syncthreads() barrier:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
You are using the temp local variable without initializing it first:
a[tid]=temp+1;//this must a[tid]=a[tid-1]+1;
note that temp is a thread-local variable. It is not shared amongst threads. Therefore the above line of code (for threads in the else block) is using an uninitialized value of temp.
The remainder of your kernel code:
cache[cacheIndex]=temp;
__syncthreads();
for (int i=0;i<gen;i++)
{
    if(cacheIndex==i)
    {
        lock. lock();
        cache[cacheIndex]=temp;
        lock.unlock();
    }
}
}
does nothing useful because it is updating shared memory locations (i.e. cache) which are never transferred back to the dev_a variable, i.e. global memory. Therefore none of this code could affect the results you print out.
It's difficult to follow what you are trying to accomplish in your code. However if you change this line (the uninitialized value):
int temp;
to this:
int temp=tid+3;
Your code will print out the data according to what you have shown.

Converting between SDP's sprop-parameter-sets and mkv's CodecPrivate

Is there some easy way to convert between h264 settings as stored in Matroska file:
+ CodecPrivate, length 36 (h.264 profile: Baseline @L2.0) hexdump
01 42 c0 14 ff e1 00 15 67 42 c0 14 da 05 07 e8
40 00 00 03 00 40 00 00 0c 03 c5 0a a8 01 00 04
68 ce 0f c8
and the same settings when streaming that Matroska file over RTSP:
a=fmtp:96 packetization-mode=1;profile-level-id=42C014;sprop-parameter-sets=Z0LAFNoFB+hAAAADAEAAAAwDxQqo,aM4PyA==
The Base64 strings decode to this:
00000000 67 42 c0 14 da 05 07 e8 40 00 00 03 00 40 00 00 |gB......@....@..|
00000010 0c 03 c5 0a a8
00000000 68 ce 0f c8 |h...|
which partially matches the data in mkv's CodecPrivate.
Extracted conversion from raw to CodecPrivate from ffmpeg:
/*
* AVC helper functions for muxers
* Copyright (c) 2006 Baptiste Coudurier <baptiste.coudurier@smartjog.com>
* Modified by _Vi: stand-alone version (without ffmpeg)
*
* This file is based on the code from FFmpeg.
*
* FFmpeg is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* FFmpeg is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with FFmpeg; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <string.h>
#include <stdio.h>

#define assert(x) if(!(x)) { fprintf(stderr, "Assertion failed...\n"); return -1; }

#ifndef AV_RB24
#   define AV_RB24(x)                       \
        ((((const uint8_t*)(x))[0] << 16) | \
         (((const uint8_t*)(x))[1] <<  8) | \
          ((const uint8_t*)(x))[2])
#endif
#ifndef AV_RB32
#   define AV_RB32(x)                                 \
        (((uint32_t)((const uint8_t*)(x))[0] << 24) | \
         (((const uint8_t*)(x))[1] << 16) |           \
         (((const uint8_t*)(x))[2] <<  8) |           \
          ((const uint8_t*)(x))[3])
#endif

#define avio_w8(pb, x)   *(*pb)++ = x;
#define avio_wb16(pb, x) *(*pb)++ = ((x)>>8); *(*pb)++ = x&0xFF;
#define avio_wb32(pb, x) *(*pb)++ = ((x)>>24);      \
                         *(*pb)++ = ((x)>>16)&0xFF; \
                         *(*pb)++ = ((x)>>8)&0xFF;  \
                         *(*pb)++ = ((x)>>0)&0xFF;
#define avio_write(pb, b, l) memcpy((*pb), b, l); (*pb)+=(l);

typedef unsigned char uint8_t;
typedef int intptr_t;
typedef unsigned long uint32_t;

static const uint8_t *ff_avc_find_startcode_internal(const uint8_t *p, const uint8_t *end)
{
    const uint8_t *a = p + 4 - ((intptr_t)p & 3);
    for (end -= 3; p < a && p < end; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
            return p;
    }
    for (end -= 3; p < end; p += 4) {
        uint32_t x = *(const uint32_t*)p;
        // if ((x - 0x01000100) & (~x) & 0x80008000) // little endian
        // if ((x - 0x00010001) & (~x) & 0x00800080) // big endian
        if ((x - 0x01010101) & (~x) & 0x80808080) { // generic
            if (p[1] == 0) {
                if (p[0] == 0 && p[2] == 1)
                    return p;
                if (p[2] == 0 && p[3] == 1)
                    return p+1;
            }
            if (p[3] == 0) {
                if (p[2] == 0 && p[4] == 1)
                    return p+2;
                if (p[4] == 0 && p[5] == 1)
                    return p+3;
            }
        }
    }
    for (end += 3; p < end; p++) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 1)
            return p;
    }
    return end + 3;
}

const uint8_t *ff_avc_find_startcode(const uint8_t *p, const uint8_t *end)
{
    const uint8_t *out = ff_avc_find_startcode_internal(p, end);
    if (p < out && out < end && !out[-1]) out--;
    return out;
}

int ff_avc_parse_nal_units(unsigned char **pb, const uint8_t *buf_in, int size)
{
    const uint8_t *p = buf_in;
    const uint8_t *end = p + size;
    const uint8_t *nal_start, *nal_end;
    size = 0;
    nal_start = ff_avc_find_startcode(p, end);
    while (nal_start < end) {
        while (!*(nal_start++));
        nal_end = ff_avc_find_startcode(nal_start, end);
        avio_wb32(pb, nal_end - nal_start);
        avio_write(pb, nal_start, nal_end - nal_start);
        size += 4 + nal_end - nal_start;
        nal_start = nal_end;
    }
    return size;
}

int ff_avc_parse_nal_units_buf(const unsigned char *buf_in, unsigned char **buf, int *size)
{
    unsigned char *pbptr = *buf;
    ff_avc_parse_nal_units(&pbptr, buf_in, *size);
    *size = pbptr - *buf;
    return 0;
}

int my_isom_write_avcc(unsigned char **pb, const uint8_t *data, int len)
{
    unsigned char tmpbuf[4000];
    if (len > 6) {
        /* check for h264 start code */
        if (AV_RB32(data) == 0x00000001 ||
            AV_RB24(data) == 0x000001) {
            uint8_t *buf = tmpbuf, *end, *start;
            uint32_t sps_size = 0, pps_size = 0;
            uint8_t *sps = 0, *pps = 0;
            int ret = ff_avc_parse_nal_units_buf(data, &buf, &len);
            if (ret < 0)
                return ret;
            start = buf;
            end = buf + len;
            /* look for sps and pps */
            while (buf < end) {
                unsigned int size;
                uint8_t nal_type;
                size = AV_RB32(buf);
                nal_type = buf[4] & 0x1f;
                if (nal_type == 7) { /* SPS */
                    sps = buf + 4;
                    sps_size = size;
                } else if (nal_type == 8) { /* PPS */
                    pps = buf + 4;
                    pps_size = size;
                }
                buf += size + 4;
            }
            assert(sps);
            assert(pps);
            avio_w8(pb, 1); /* version */
            avio_w8(pb, sps[1]); /* profile */
            avio_w8(pb, sps[2]); /* profile compat */
            avio_w8(pb, sps[3]); /* level */
            avio_w8(pb, 0xff); /* 6 bits reserved (111111) + 2 bits nal size length - 1 (11) */
            avio_w8(pb, 0xe1); /* 3 bits reserved (111) + 5 bits number of sps (00001) */
            avio_wb16(pb, sps_size);
            avio_write(pb, sps, sps_size);
            avio_w8(pb, 1); /* number of pps */
            avio_wb16(pb, pps_size);
            avio_write(pb, pps, pps_size);
        } else {
            avio_write(pb, data, len);
        }
    }
    return 0;
}

#define H264PRIVATE_MAIN
#ifdef H264PRIVATE_MAIN
int main() {
    unsigned char data[1000];
    int len = fread(data, 1, 1000, stdin);
    unsigned char output[1000];
    unsigned char *output_f = output;
    my_isom_write_avcc(&output_f, data, len);
    fwrite(output, 1, output_f - output, stdout);
    return 0;
}
#endif
Inserting "00 00 00 01" before each base-64-decoded block and feeding it into that program outputs CodecPrivate:
$ printf '\x00\x00\x00\x01'\
'\x67\x42\xc0\x14\xda\x05\x07\xe8\x40\x00\x00\x03\x00\x40\x00\x00\x0c\x03\xc5\x0a\xa8'\
'\x00\x00\x00\x01'\
'\x68\xce\x0f\xc8' | ./avc_to_mkvcodecpriv | hd
00000000 01 42 c0 14 ff e1 00 15 67 42 c0 14 da 05 07 e8 |.B......gB......|
00000010 40 00 00 03 00 40 00 00 0c 03 c5 0a a8 01 00 04 |@....@..........|
00000020 68 ce 0f c8 |h...|
00000024

Expand and increment data by map count

I'm quite new to thrust (cuda), and am finding something challenging.
(Edited question to be simplified) I have an input vector and a map:
vector = [8,23,46,500,2,7,91,91]
map = [1, 0, 4, 3,1,0, 5, 3]
I want to expand this and increment the values to become:
new_vec = [8,46,47,48,49,500,501,502,2,91,92,93,94,95,91,92,93]
I realise the thrust/examples/expand.cu example already mostly does this, but I don't know how to efficiently increment the data value by the map count.
It would be helpful if someone could explain how to modify this example to achieve this.
Adapt the Thrust expand example to use exclusive_scan_by_key to rank each output element within its subsequence and then increment by that rank:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/gather.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iterator>
#include <iostream>
#include <string>

template<typename Vector>
void print(const std::string& s, const Vector& v)
{
    typedef typename Vector::value_type T;
    std::cout << s;
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, " "));
    std::cout << std::endl;
}

template<typename InputIterator1,
         typename InputIterator2,
         typename OutputIterator>
void expand_and_increment(InputIterator1 first1,
                          InputIterator1 last1,
                          InputIterator2 first2,
                          OutputIterator output)
{
    typedef typename thrust::iterator_difference<InputIterator1>::type difference_type;
    difference_type input_size  = thrust::distance(first1, last1);
    difference_type output_size = thrust::reduce(first1, last1);

    // scan the counts to obtain output offsets for each input element
    thrust::device_vector<difference_type> output_offsets(input_size);
    thrust::exclusive_scan(first1, last1, output_offsets.begin());
    print("output_offsets ", output_offsets);

    // scatter the nonzero counts into their corresponding output positions
    thrust::device_vector<difference_type> output_indices(output_size);
    thrust::scatter_if
        (thrust::counting_iterator<difference_type>(0),
         thrust::counting_iterator<difference_type>(input_size),
         output_offsets.begin(),
         first1,
         output_indices.begin());

    // compute max-scan over the output indices, filling in the holes
    thrust::inclusive_scan
        (output_indices.begin(),
         output_indices.end(),
         output_indices.begin(),
         thrust::maximum<difference_type>());
    print("output_indices ", output_indices);

    // gather input values according to index array (output = first2[output_indices])
    OutputIterator output_end = output;
    thrust::advance(output_end, output_size);
    thrust::gather(output_indices.begin(),
                   output_indices.end(),
                   first2,
                   output);

    // rank each output element within its subsequence
    thrust::device_vector<difference_type> ranks(output_size);
    thrust::exclusive_scan_by_key(output_indices.begin(), output_indices.end(),
                                  thrust::make_constant_iterator<difference_type>(1),
                                  ranks.begin());
    print("ranks ", ranks);

    // increment output by ranks
    thrust::transform(output, output + output_size, ranks.begin(), output,
                      thrust::placeholders::_1 + thrust::placeholders::_2);
}

int main(void)
{
    int values[] = {8,23,46,500,2,7,91,91};
    int counts[] = {1, 0, 4, 3,1,0, 5, 3};
    size_t input_size  = sizeof(counts) / sizeof(int);
    size_t output_size = thrust::reduce(counts, counts + input_size);

    // copy inputs to device
    thrust::device_vector<int> d_counts(counts, counts + input_size);
    thrust::device_vector<int> d_values(values, values + input_size);
    thrust::device_vector<int> d_output(output_size);

    // expand values according to counts
    expand_and_increment(d_counts.begin(), d_counts.end(),
                         d_values.begin(),
                         d_output.begin());

    std::cout << "Expanding and incrementing values according to counts" << std::endl;
    print(" counts ", d_counts);
    print(" values ", d_values);
    print(" output ", d_output);
    return 0;
}
The output:
$ nvcc expand_and_increment.cu -run
output_offsets 0 1 1 5 8 9 9 14
output_indices 0 2 2 2 2 3 3 3 4 6 6 6 6 6 7 7 7
ranks 0 0 1 2 3 0 1 2 0 0 1 2 3 4 0 1 2
Expanding and incrementing values according to counts
counts 1 0 4 3 1 0 5 3
values 8 23 46 500 2 7 91 91
output 8 46 47 48 49 500 501 502 2 91 92 93 94 95 91 92 93

cudaMemset() - does it set bytes or integers?

From online documentation:
cudaError_t cudaMemset (void * devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Parameters:
devPtr - Pointer to device memory
value - Value to set for each byte of specified memory
count - Size in bytes to set
This description doesn't appear to be correct as:
int *dJunk;
cudaMalloc((void**)&dJunk, 32*sizeof(int));
cudaMemset(dJunk, 0x12, 32);
will set all 32 integers to 0x12, not 0x12121212. (Int vs Byte)
The description talks about setting bytes. Count and Value are described in terms of bytes. Notice count is of type size_t, and value is of type int. i.e. Set a byte-size to an int-value.
cudaMemset() is not mentioned in the prog guide.
I have to assume the behavior I am seeing is correct, and the documentation is bad.
Is there a better documentation source out there? (Where?)
Are other types supported? i.e. Would float *dJunk; work? Others?
The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz:
#include <cstdio>

int main(void)
{
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);
    cudaMemset(dJunk, 0x12, 32);

    int *Junk = new int[n];
    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }
    cudaDeviceReset();
    return 0;
}
produces
$ nvcc memset.cu
$ ./a.out
0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
i.e. all 128 bytes set to 0, then the first 32 bytes set to 0x12. Exactly as described by the documentation.