NPP BoxFilters and binary data - cuda

I'm trying to create an NPP example for box filtering, but instead of an 8-bit greyscale image I have RGBA binary data. My code looks like this:
#include "./common/ImagesCPU.h"
#include "./common/ImagesNPP.h"
#include "./common/ImageIO.h"
#include "./common/Exceptions.h"
#include <npp.h>
void boxfilter_transform( Npp8u *oHostSrc, int width, int height ){
size_t size = width * height * 4;
// declare a device image and copy construct from the host image,
// i.e. upload host to device
npp::ImageNPP_8u_C4 oDeviceSrc(oHostSrc);
// create struct with box-filter mask size
NppiSize oMaskSize = {5, 5};
// create struct with ROI size given the current mask
NppiSize oSizeROI = {oDeviceSrc.width() - oMaskSize.width + 1, oDeviceSrc.height() - oMaskSize.height + 1};
// allocate device image of appropriately reduced size
npp::ImageNPP_8u_C4 oDeviceDst(oSizeROI.width, oSizeROI.height);
// set anchor point inside the mask to (0, 0)
NppiPoint oAnchor = {0, 0};
// run box filter
NppStatus eStatusNPP;
eStatusNPP = nppiFilterBox_8u_C4R(oDeviceSrc.data(), oDeviceSrc.pitch(),
oDeviceDst.data(), oDeviceDst.pitch(),
oSizeROI, oMaskSize, oAnchor);
NPP_ASSERT(NPP_NO_ERROR == eStatusNPP);
// declare a host image for the result
npp::ImageCPU_8u_C4 oHostDst(oDeviceDst.size());
// and copy the device result data into it
oDeviceDst.copyTo(oHostDst.data(), oHostDst.pitch());
return 0;
}
I try to compile it and get:
npp_filters.cpp: In function ‘void boxfilter_transform(Npp8u*, int, int)’:
npp_filters.cpp:18:44: error: no matching function for call to ‘npp::ImageNPP<unsigned char, 4u>::ImageNPP(Npp8u*&)’
npp_filters.cpp:18:44: note: candidates are:
./common/ImagesNPP.h:52:13: note: template<class X> npp::ImageNPP::ImageNPP(const npp::ImageCPU<D, N, X>&, bool)
./common/ImagesNPP.h:45:13: note: npp::ImageNPP<D, N>::ImageNPP(const npp::ImageNPP<D, N>&) [with D = unsigned char, unsigned int N = 4u]
./common/ImagesNPP.h:45:13: note: no known conversion for argument 1 from ‘Npp8u* {aka unsigned char*}’ to ‘const npp::ImageNPP<unsigned char, 4u>&’
./common/ImagesNPP.h:40:13: note: npp::ImageNPP<D, N>::ImageNPP(const npp::Image::Size&) [with D = unsigned char, unsigned int N = 4u]
./common/ImagesNPP.h:40:13: note: no known conversion for argument 1 from ‘Npp8u* {aka unsigned char*}’ to ‘const npp::Image::Size&’
./common/ImagesNPP.h:35:13: note: npp::ImageNPP<D, N>::ImageNPP(unsigned int, unsigned int, bool) [with D = unsigned char, unsigned int N = 4u]
./common/ImagesNPP.h:35:13: note: candidate expects 3 arguments, 1 provided
./common/ImagesNPP.h:30:13: note: npp::ImageNPP<D, N>::ImageNPP() [with D = unsigned char, unsigned int N = 4u]
./common/ImagesNPP.h:30:13: note: candidate expects 0 arguments, 1 provided
npp_filters.cpp:39:12: error: return-statement with a value, in function returning 'void' [-fpermissive]
What is wrong? Can you give me the right way?
Now my piece of code looks like:
// declare a host image object for an 8-bit RGBA image
npp::ImageCPU_8u_C4 oHostSrc(width, height);
// copy data to oHostSrc.data()
Npp8u *nDstData = oHostSrc.data();
memcpy(nDstData, data, size * sizeof(Npp8u));
// declare a device image and copy construct from the host image,
// i.e. upload host to device
npp::ImageNPP_8u_C4 oDeviceSrc(oHostSrc);
but now the last line (declaring the device image and copying to it) fails with this error: undefined symbol: nppiMalloc_8u_C4. What could be causing it?

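npp::ImageNPP_8u_C4 has no constructor that takes a bare host pointer, as the candidate list in the error message shows. You first need to wrap the host array in an npp::ImageCPU_8u_C4, as is done, for example, in the ImageIO.h you included, and then construct the device image from that.
The point is that the NPP image needs dimension information (width, height, pitch), which a raw pointer cannot supply for oDeviceSrc.
The second compiler error is unrelated: boxfilter_transform is declared void but ends with return 0;.
The later undefined symbol: nppiMalloc_8u_C4 error is a linking problem rather than a problem in the code; most likely the NPP library is not being linked into the build.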
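One detail worth checking when filling the wrapped host image: the image classes expose a pitch() (bytes per row, as used in the copyTo call in the question), and that pitch may be larger than width * 4. In that case a single memcpy of the whole buffer would shift every row after the first. A minimal host-only sketch of a row-by-row copy follows; copyToPitched and its parameters are hypothetical names for illustration, not part of the NPP sample headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of NPP): copy a tightly packed image
// into a destination buffer whose rows start dstPitch bytes apart.
void copyToPitched(const std::uint8_t* src, int width, int height, int channels,
                   std::uint8_t* dst, std::size_t dstPitch)
{
    const std::size_t rowBytes = static_cast<std::size_t>(width) * channels;
    assert(dstPitch >= rowBytes);  // a destination row must hold a source row
    for (int y = 0; y < height; ++y) {
        // Source rows are packed back to back; destination rows are padded.
        std::memcpy(dst + y * dstPitch, src + y * rowBytes, rowBytes);
    }
}
```

With the names from the question this would be called as copyToPitched(data, width, height, 4, oHostSrc.data(), oHostSrc.pitch()), assuming pitch() reports bytes per row. If the pitch happens to equal width * 4, the result is the same as the single memcpy.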

Converting uint32_t to binary in C

The main problem I'm having is to read out values in binary in C. Python and C# had some really quick/easy functions to do this, I found topic about how to do it in C++, I found topic about how to convert int to binary in C, but not how to convert uint32_t to binary in C.
What I am trying to do is to read bit by bit the 32 bits of the DR_REG_RNG_BASE address of an ESP32 (this is the address where the random values of the Random Hardware Generator of the ESP are stored).
So for the moment I was doing that:
#define DR_REG_RNG_BASE 0x3ff75144
void printBitByBit( ){
// READ_PERI_REG is the ESP32 function to read DR_REG_RNG_BASE
uint32_t rndval = READ_PERI_REG(DR_REG_RNG_BASE);
int i;
for (i = 1; i <= 32; i++){
int mask = 1 << i;
int masked_n = rndval & mask;
int thebit = masked_n >> i;
Serial.printf("%i", thebit);
}
Serial.println("\n");
}
At first I thought it was working well, but in fact it prints binary representations that are completely wrong. Any ideas?
Your shown code has a number of errors/issues.
First, bit positions for a uint32_t (32-bit unsigned integer) are zero-based – so, they run from 0 thru 31, not from 1 thru 32, as your code assumes. Thus, in your code, you are (effectively) ignoring the lowest bit (bit #0); further, when you do the 1 << i on the last loop (when i == 32), your mask will (most likely) have a value of zero (although that shift is, technically, undefined behaviour for a signed integer, as your code uses), so you'll also drop the highest bit.
Second, your code prints (from left-to-right) the lowest bit first, but you want (presumably) to print the highest bit first, as is normal. So, you should run the loop with the i index starting at 31 and decrement it to zero.
Also, your code mixes and mingles unsigned and signed integer types. This sort of thing is best avoided – so it's better to use uint32_t for the intermediate values used in the loop.
Lastly (as mentioned by Eric in the comments), there is a far simpler way to extract "bit n" from an unsigned integer: just use value >> n & 1.
I don't have access to an Arduino platform but, to demonstrate the points made in the above discussion, here is a standard, console-mode C++ program that compares the output of your code to versions with the aforementioned corrections applied:
#include <cstdio>
#include <cstdint>
#include <cinttypes>
int main()
{
uint32_t test = 0x84FF0048uL;
int i;
// Your code ...
for (i = 1; i <= 32; i++) {
int mask = 1 << i;
int masked_n = test & mask;
int thebit = masked_n >> i;
printf("%i", thebit);
}
printf("\n");
// Corrected limits/order/types ...
for (i = 31; i >= 0; --i) {
uint32_t mask = (uint32_t)(1) << i;
uint32_t masked_n = test & mask;
uint32_t thebit = masked_n >> i;
printf("%" PRIu32, thebit);
}
printf("\n");
// Better ...
for (i = 31; i >= 0; --i) {
printf("%" PRIu32, test >> i & 1);
}
printf("\n");
return 0;
}
The three lines of output (first one wrong, as you know; last two correct) are:
001001000000000111111110010000-10
10000100111111110000000001001000
10000100111111110000000001001000
Notes:
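(1) On the use of the funny-looking "%" PRIu32 format specifier for printing the uint32_t types, see: printf format specifiers for uint32_t and size_t. In C++11 and later, keep the space before PRIu32; without it the compiler may try to parse the macro name as a user-defined literal suffix.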
(2) The cast on the (uint32_t)(1) constant will ensure that the bit-shift is safe, even when int and unsigned are 16-bit types; without that, you would get undefined behaviour in such a case.
When you print the binary string representation of a number, you print the most significant bit (MSB) first, whether the number is a uint32_t or a uint16_t. So you need a mask that tests the MSB; for a uint32_t that is 0x80000000, shifted right by one on each iteration.
#define DR_REG_RNG_BASE 0x3ff75144
void printBitByBit( ){
// READ_PERI_REG is the ESP32 function to read DR_REG_RNG_BASE
uint32_t rndval = READ_PERI_REG(DR_REG_RNG_BASE);
Serial.println(rndval, HEX); //print out the value in hex for verification purpose
uint32_t mask = 0x80000000;
for (int i = 0; i < 32; i++) { // 32 iterations, one per bit
Serial.print((rndval & mask) ? "1" : "0"); // print, not println, to keep all bits on one line
mask >>= 1;
}
Serial.println("\n");
}
For Arduino, there are actually a couple of built-in functions that can print the binary string representation of a number. Serial.print(x, BIN) takes the number base as its second argument.
Another function that achieves the same result is itoa(x, str, base), which is not part of standard ANSI C or C++ but is available in Arduino; it converts the number x to a string str in the given base.
char str[33];
itoa(rndval, str, 2);
Serial.println(str);
However, neither function pads with leading zeros; compare the results here:
36E68B6D // rndval in HEX
00110110111001101000101101101101 // print by our function
110110111001101000101101101101 // print by Serial.print(rndval, BIN)
110110111001101000101101101101 // print by itoa(rndval, str, 2)
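If the zero-padded form is needed as a string rather than printed bit by bit, the "value >> n & 1" extraction can be wrapped in a small helper. The following is a sketch in standard C++, assuming std::string is available on the target; bitAt and toBinary32 are hypothetical names, not Arduino or ESP32 functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bit n of value, where n = 0 is the least significant bit.
inline std::uint32_t bitAt(std::uint32_t value, unsigned n)
{
    assert(n < 32);  // shifting a 32-bit value by 32 or more is undefined
    return (value >> n) & 1u;
}

// Hypothetical helper: 32-character binary string, MSB first, zero padded.
std::string toBinary32(std::uint32_t value)
{
    std::string out(32, '0');
    for (unsigned n = 0; n < 32; ++n) {
        // Position n in the string corresponds to bit 31 - n.
        if (bitAt(value, 31 - n)) {
            out[n] = '1';
        }
    }
    return out;
}
```

On an Arduino-style platform the result could then be printed with Serial.println(toBinary32(rndval).c_str()). For the sample value 0x36E68B6D above, the helper yields the same 32 characters as the first output line.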
BTW, Arduino is C++, so don't use the c tag for your post. I changed it for you.

Cuda, two streams created by a NPP function

I'm working on an image processing project with CUDA 7.5 and a GeForce GTX 650 Ti. I decided to use two streams: one where I apply the algorithms responsible for enhancing the image, and another where I apply an algorithm that is independent of the rest of the processing.
I wrote an example to show my problem. In this example I create a stream and then call nppSetStream.
I invoke the function nppiThreshold_LTValGTVal_32f_C1R, but two streams are used when the function executes.
Here is a code example:
#include <npp.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>
int main(void) {
int srcWidth = 1344;
int srcHeight = 1344;
int paddStride = 0;
float* srcArrayDevice;
float* srcArrayDevice2;
unsigned char* dstArrayDevice;
int status = cudaMalloc((void**)&srcArrayDevice, srcWidth * srcHeight * 4);
status = cudaMalloc((void**)&srcArrayDevice2, srcWidth * srcHeight * 4);
status = cudaMalloc((void**)&dstArrayDevice, srcWidth * srcHeight );
cudaStream_t testStream;
cudaStreamCreateWithFlags(&testStream, cudaStreamNonBlocking);
nppSetStream(testStream);
NppiSize roiSize = { srcWidth,srcHeight };
//status = cudaMemcpyAsync(srcArrayDevice, &srcArrayHost, srcWidth*srcHeight*4, cudaMemcpyHostToDevice, testStream);
int yRect = 100;
int xRect = 60;
float thrL = 50;
float thrH = 1500;
NppiSize sz = { 200, 400 };
for (int i = 0; i < 10; i++) {
int status3 = nppiThreshold_LTValGTVal_32f_C1R(srcArrayDevice + (srcWidth*yRect + xRect)
, srcWidth * 4
, srcArrayDevice2 + (srcWidth*yRect + xRect)
, srcWidth * 4
, sz
, thrL
, thrL
, thrH
, thrH);
}
int length = (srcWidth + paddStride)*srcHeight;
int status6 = nppiScale_32f8u_C1R(srcArrayDevice, srcWidth * 4, dstArrayDevice + paddStride, srcWidth + paddStride, roiSize, 0, 65535);
//int status7 = cudaMemcpyAsync(dstPinPtr, dstTest, length, cudaMemcpyDeviceToHost, testStream);
cudaFree(srcArrayDevice);
cudaFree(srcArrayDevice2);
cudaFree(dstArrayDevice);
cudaStreamDestroy(testStream);
cudaProfilerStop();
return 0;
}
This is what I got from the NVIDIA Visual Profiler: [image: image_width1344]
Why are there two streams if I set only one stream? This causes errors in my original project so I'm thinking to switch to a single stream.
I noticed that this behaviour depends on the size of the image: if srcWidth and srcHeight are set to 1500, the result is this: [image: image_width1500]
Why changing the size of the image produces another stream?
Why are there two streams if I setted [sic] only one stream?
It appears that nppiThreshold_LTValGTVal_32f_C1R creates its own internal stream for executing one of the kernels it uses. The other is launched either into the default stream, or the stream you specified with nppSetStream.
I think this is really a documentation oversight/user expectation problem. nppSetStream is doing what it says, but nowhere is it stated that the library is limited to using one stream. It probably should be more explicit in the documentation about how many streams the library uses internally, and how nppSetStream interacts with the library. If this is a problem for your application, I suggest you raise a bug report with NVIDIA.
Why changing the size of the image produces another stream?
My guess would be that there are some performance heuristics at work, and whether the second stream is used depends on image size. The library is closed source, however, so I can't say for sure.

Does the leading dimension in cuBLAS allow for accessing any submatrix?

I'm trying to understand the idea of the leading dimension in cuBLAS. It's mentioned that lda must always be greater than or equal to the # of rows in a matrix.
If I have a 100x100 matrix A and I wanted to access A(90:99, 0:99), what would be the arguments of cublasSetMatrix? lda specifies the number of rows between the elements in the same column (100 in this case), but where would I specify the 90? The only way I can see is adjusting *A.
The function definition is:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
And I'm also guessing that I wouldn't be able to transfer the bottom right 3x3 portion of a 5x5 matrix given the length limits.
You have to "adjust *A", as you called it. The pointer that is given to this function must be the starting entry of the respective sub-matrix.
You did not say whether your matrix A is actually the input- or the output matrix, but this should not change much, conceptually.
Assuming you have the following code:
// The matrix in host memory
int rowsA = 100;
int colsA = 100;
float *A = new float[rowsA*colsA];
// Fill A with values
...
// The sub-matrix that should be copied to the device.
// The minimum index is INCLUSIVE
// The maximum index is EXCLUSIVE
int minRowA = 0;
int maxRowA = 100;
int minColA = 90;
int maxColA = 100;
int rowsB = maxRowA-minRowA;
int colsB = maxColA-minColA;
// Allocate the device matrix
float *dB = nullptr;
cudaMalloc(&dB, rowsB * colsB * sizeof(float));
Then, for the cublasSetMatrix call, you have to compute the starting element of the source matrix:
float *sourceA = A + (minRowA + minColA * rowsA);
cublasSetMatrix(rowsB, colsB, sizeof(float), sourceA, rowsA, dB, rowsB);
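And this is where the 90 that you asked for comes into play: it is the minColA in the computation of the source pointer. (If the 90 is meant as a row index, as in A(90:99, 0:99), set minRowA = 90, maxRowA = 100, minColA = 0 and maxColA = 100 instead; the pointer computation and lda = rowsA stay the same.)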
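The same pointer arithmetic answers the side question about the 5x5 matrix: its bottom-right 3x3 block is reachable, because lda stays 5 (the row count of the full matrix) while only the start pointer and the copied extent change. The sketch below is a host-only stand-in for the column-major addressing, not the cuBLAS implementation; copySubMatrix and bottomRight3x3 are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only stand-in for the addressing used by cublasSetMatrix: copy a
// rows x cols block between two column-major arrays with leading
// dimensions lda and ldb.
void copySubMatrix(int rows, int cols, const float* A, int lda, float* B, int ldb)
{
    assert(lda >= rows && ldb >= rows);
    for (int c = 0; c < cols; ++c) {
        for (int r = 0; r < rows; ++r) {
            B[r + static_cast<std::size_t>(c) * ldb] =
                A[r + static_cast<std::size_t>(c) * lda];
        }
    }
}

// Bottom-right 3x3 block of a column-major 5x5 matrix.
std::vector<float> bottomRight3x3(const std::vector<float>& A5x5)
{
    assert(A5x5.size() == 25);
    const int n = 5, minRow = 2, minCol = 2, sub = 3;
    std::vector<float> B(sub * sub);
    // Same start-pointer formula as above: minRow + minCol * rowsA.
    const float* source = A5x5.data() + (minRow + minCol * n);
    copySubMatrix(sub, sub, source, n, B.data(), sub);
    return B;
}
```

The block never reads past the end of the 5x5 array: the last element touched is at offset 12 + 2 + 2 * 5 = 24, the final element of A.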

proper thrust call for subtraction

Following from here.
Assuming that dev_X is a vector.
int * X = (int*) malloc( ThreadsPerBlockX * BlocksPerGridX * sizeof(*X) );
for ( int i = 0; i < ThreadsPerBlockX * BlocksPerGridX; i++ )
X[ i ] = i;
// create device vectors
thrust::device_vector<int> dev_X ( ThreadsPerBlockX * BlocksPerGridX );
//copy to device
thrust::copy( X , X + theThreadsPerBlockX * theBlocksPerGridX , dev_X.begin() );
The following is making a subtraction:
thrust::transform( dev_Kx.begin(), dev_Kx.end(), dev_X.begin() , distX.begin() , thrust::minus<float>() );
dev_Kx - dev_X.
I want to use the whole dev_Kx vector (which it does, since the range goes from .begin() to .end()) and the whole dev_X vector.
The above code passes only dev_X.begin().
Does that mean it will use the whole dev_X vector, starting from the beginning?
Or do I have to pass an extra argument pointing to dev_X.end()? (In the above function call I can't just add this extra argument.)
Also, for example:
If I want to use
thrust::transform( dev_Kx, dev_Kx + i , dev_X.begin() ,distX.begin() , thrust::minus<int>() );
Then dev_Kx would go from 0 to i, but what about dev_X.begin()? Will it use the same length (0 to i), or the length of dev_X?
Many thrust (and standard library) functions take a range as a first parameter and then assume all other iterators are backed by containers of the same size. A range is a pair of iterators indicating the beginning and end of a sequence.
For example:
thrust::copy(
X.begin(), // begin input iterator
X.end(), // end input iterator
dev_X.begin() // begin output iterator
);
This copies the entire contents of X into dev_X. Why is dev_X.end() not needed? Because thrust requires that you, the programmer, take care of sizing dev_X so that it can hold at least as many elements as there are in the input range. If you don't meet that requirement, the behavior is undefined.
When you do this:
thrust::transform(
dev_Kx.begin(), // begin input (1) iterator
dev_Kx.end(), // end input (1) iterator
dev_X.begin(), // begin input (2) iterator
distX.begin(), // output iterator
thrust::minus<float>()
);
What thrust sees is an input range from dev_Kx.begin() to dev_Kx.end(). It has an explicit size of dev_Kx.end() - dev_Kx.begin(). Why are dev_X.end() and distX.end() not needed? Because they have an implicit size of dev_Kx.end() - dev_Kx.begin() too. For example, if there are 10 elements in dev_Kx, then transform will:
Use the 10 elements of dev_Kx
Use 10 elements of dev_X (which must hold at least 10 elements)
Perform the subtraction and store the 10 results in distX, which must be able to hold at least 10 elements.
Maybe looking at the implementation would clear up any doubts. Here's some pseudo code:
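template <class InputIt1, class InputIt2, class OutputIt, class BinaryFunction>
void transform(InputIt1 input1_begin, InputIt1 input1_end,
InputIt2 input2_begin, OutputIt output,
BinaryFunction op) {
// the first input range alone decides how many elements are processed
while (input1_begin != input1_end) {
*output++ = op(*input1_begin++, *input2_begin++);
}
}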
Notice how only one end iterator is needed.
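This also covers the partial-range case from the question: if the first range is shortened to i elements, only the first i elements of the second input are read and only i results are written. A host-side sketch using std::transform, which follows the same range convention; subtractPrefix is a hypothetical helper name, and on the device the same call shape would use thrust::transform with device_vector iterators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: subtract the first n elements of b from the
// first n elements of a.
std::vector<int> subtractPrefix(const std::vector<int>& a,
                                const std::vector<int>& b, std::size_t n)
{
    // The caller's obligation, made explicit: every range must hold n elements.
    assert(n <= a.size() && n <= b.size());
    std::vector<int> out(n);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);
    std::transform(a.begin(), a.begin() + count,  // explicit first range
                   b.begin(),                     // implicit length: count
                   out.begin(),                   // implicit length: count
                   std::minus<int>());
    return out;
}
```

Note that b may be longer than a; the extra elements are simply never touched.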
On an unrelated note, the following:
int * X = (int*) malloc( ThreadsPerBlockX * BlocksPerGridX * sizeof(*X) );
for ( int i = 0; i < ThreadsPerBlockX * BlocksPerGridX; i++ )
X[ i ] = i;
could be rewritten as more idiomatic, less error-prone C++ (std::iota is declared in <numeric>):
std::vector<int> X(ThreadsPerBlockX * BlocksPerGridX);
std::iota(X.begin(), X.end(), 0);

thrust::min_element doesn't work on a float4 device_vector, while it does on a host_vector

I'm trying to find the minimum number in an array using Thrust and CUDA.
The following device example returns 0:
thrust::device_vector<float4>::iterator it = thrust::min_element(IntsOnDev.begin(),IntsOnDev.end(),equalOperator());
int pos = it - IntsOnDev.begin();
However, this host version works perfectly:
thrust::host_vector<float4>arr = IntsOnDev;
thrust::host_vector<float4>::iterator it2 = thrust::min_element(arr.begin(),arr.end(),equalOperator());
int pos2 = it2 - arr.begin();
The comparator type:
struct equalOperator
{
__host__ __device__
bool operator()(const float4 x,const float4 y) const
{
return ( x.w < y.w );
}
};
I just wanted to add that thrust::sort works with the same predicate.
Unfortunately, nvcc disagrees with some host compilers (some 64 bit versions of MSVC, if I recall correctly) about the size of certain aligned types. float4 is one of these. This often results in undefined behavior.
The work-around is to use types without alignment, for example my_float4:
struct my_float4
{
float x, y, z, w;
};
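To make the "no alignment" property checkable, compile-time assertions can pin down the size and alignment that both compilers have to agree on. The following host-only sketch assumes the plain struct above; lessByW and indexOfMinW are hypothetical names, and for device use the comparator would additionally need the __host__ __device__ qualifiers shown in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain struct without any alignment attribute, as suggested above.
struct my_float4
{
    float x, y, z, w;
};

// Layout facts that host and device code must agree on.
static_assert(sizeof(my_float4) == 4 * sizeof(float), "unexpected padding");
static_assert(alignof(my_float4) == alignof(float), "unexpected alignment");

// Same ordering as the predicate in the question: compare by w only.
struct lessByW
{
    bool operator()(const my_float4& a, const my_float4& b) const
    {
        return a.w < b.w;
    }
};

// Index of the element with the smallest w (first one on ties).
std::size_t indexOfMinW(const std::vector<my_float4>& v)
{
    assert(!v.empty());
    return static_cast<std::size_t>(
        std::min_element(v.begin(), v.end(), lessByW()) - v.begin());
}
```

The index is computed the same way as pos in the question, by subtracting the begin iterator from the iterator that min_element returns.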