Semantics of __ddiv_ru - cuda

From the documentation of __ddiv_ru, I expected the following code to produce ceil(8/32) = 1.0; instead I obtain 0.25.
#include <iostream>
using namespace std;
__managed__ double x;
__managed__ double y;
__managed__ double r;
__global__ void ceilDiv()
{
r = __ddiv_ru(x,y);
}
int main()
{
x = 8;
y = 32;
r = -1;
ceilDiv<<<1,1>>>();
cudaDeviceSynchronize();
cout << "The ceil of " << x << "/" << y << " is " << r << endl;
return 0;
}
What am I missing?

The result you are obtaining is correct.
The intrinsic you are using implements double precision division with a specific IEEE 754-2008 rounding mode. The rounding mode controls what happens when a result cannot be represented exactly in the selected format, and it only affects the unit in the last place (ULP) of the significand. In this case you have selected round up, which means the last digit of the significand of the quotient is rounded toward +∞. It does not round the quotient up to an integer; for that you would apply ceil() to the result of the division. In your case all rounding modes produce the same result, because 8/32 = 0.25 is a power of 2 (2^-2) and can be represented exactly in IEEE 754 binary64 format.
Please read everything here before writing any more floating point code.
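To make the distinction concrete, here is a small host-side sketch in plain C++ rather than CUDA. It assumes the host FPU honours the rounding modes from <cfenv>; the helper names div_with_rounding and ceil_div are made up for this illustration and are not part of any CUDA API. It only mimics the idea behind __ddiv_ru: the rounding mode changes the last bit of an inexact quotient, while ceil() is what rounds to an integer.

```cpp
#include <cfenv>
#include <cmath>

// Divide a by b with the host rounding mode temporarily set to `mode`
// (for example FE_UPWARD or FE_DOWNWARD). The volatile variables keep the
// compiler from evaluating the division at compile time.
double div_with_rounding(double a, double b, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile double va = a;
    volatile double vb = b;
    volatile double q = va / vb;
    std::fesetround(saved);  // restore the previous rounding mode
    return q;
}

// Rounding the quotient up to an integer is a separate operation.
double ceil_div(double a, double b) {
    return std::ceil(a / b);
}
```

With these helpers, div_with_rounding(8.0, 32.0, FE_UPWARD) and div_with_rounding(8.0, 32.0, FE_DOWNWARD) both give 0.25, because the quotient is exact. For 1.0 / 3.0 the two modes differ in the last bit. ceil_div(8.0, 32.0) gives 1.0.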

How to dynamically set the size of device_vectors in thrust set operations?

I have two sets A & B. The result(C) of my operation should have elements in A which are not there in B. I use set_difference to do it. However the size of result(C) has to be set before the operation. Else it has extra zeros at the end, like below:
A=
1 2 3 4 5 6 7 8 9 10
B=
1 2 8 11 7 4
C=
3 5 6 9 10 0 0 0 0 0
How can I set the size of result(C) dynamically so that the output is C= 3 5 6 9 10? In a real problem, I would not know the required size of the result device_vector a priori.
My code:
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/set_operations.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
void remove_common_elements(thrust::device_vector<int> A, thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
}
int main(int argc, char * argv[])
{
thrust::device_vector<int> A(10);
thrust::sequence(thrust::device, A.begin(), A.end(),1); // fill A with 1, 2, ..., 10
thrust::device_vector<int> B(6);
B[0]=1;B[1]=2;B[2]=8;B[3]=11;B[4]=7;B[5]=4;
thrust::device_vector<int> C(A.size());
std::cout << "A="<< std::endl;
thrust::copy(A.begin(), A.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "B="<< std::endl;
thrust::copy(B.begin(), B.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
remove_common_elements(A, B, C);
std::cout << "C="<< std::endl;
thrust::copy(C.begin(), C.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
return 0;
}
In the general case (i.e. across various thrust algorithms) there is often no way to know the output size in advance, only its upper bound. The usual approach is to pass a result vector whose size is the upper bound of the possible output size. As you stated already, in many cases the actual size of the output cannot be known a priori. Thrust has no particular magic to solve this. After the operation, you will know the size of the result, and it could be copied to a new vector if the "extra zeroes" were a problem for some reason (I can't think of a reason why they would be a problem generally, except that they use up allocated space).
If this is highly objectionable, one possibility (copying this information from a response by Jared Hoberock in another forum) is to run the algorithm twice, the first time using a discard_iterator (for the output data) and the second time with a real iterator, pointing to an actual vector allocation, of the requisite size. During the first pass, the discard_iterator is used to count the size of the actual result data, even though it is not stored anywhere. Quoting directly from Jared:
In the first phase, pass a discard_iterator as the output iterator. You can compare the discard_iterator returned as the result to compute the size of the output. In the second phase, call the algorithm "for real" and output into an array sized using the result of the first phase.
The technique is demonstrated in the set_operations.cu example [0,1]:
[0] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L25
[1] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L127
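The two-phase idea can be sketched on the host with the standard library. This is only an analogy, not the Thrust code from the linked example: the CountingOutput type below is a hand-written stand-in for thrust::discard_iterator, and both it and set_difference_size are hypothetical names for this illustration. The inputs are assumed to be sorted already.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Output iterator that stores nothing and only counts assignments.
struct CountingOutput {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t* count;  // points at the counter owned by the caller

    CountingOutput& operator*() { return *this; }
    CountingOutput& operator++() { return *this; }
    CountingOutput operator++(int) { return *this; }

    // Every "write" through the iterator just bumps the counter.
    template <typename T>
    CountingOutput& operator=(const T&) {
        ++*count;
        return *this;
    }
};

// First phase: run the algorithm only to learn the output size.
// A and B must be sorted.
std::size_t set_difference_size(const std::vector<int>& A,
                                const std::vector<int>& B) {
    std::size_t n = 0;
    std::set_difference(A.begin(), A.end(), B.begin(), B.end(),
                        CountingOutput{&n});
    return n;
}
```

The second phase would then allocate a vector of exactly set_difference_size(A, B) elements and call std::set_difference again with its begin() as the output. The cost is that the algorithm runs twice.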
thrust::set_difference returns an iterator to the end of the resulting range.
If you just want to change the logical size of C to the number of resulting elements, you could simply erase the range "behind" the result range.
void remove_common_elements(thrust::device_vector<int> A,
thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
auto C_end = thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
C.erase(C_end, C.end());
}
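For comparison, the same erase idiom works with std::vector and std::set_difference on the host. The sketch below is a host-side analogy, not Thrust code; it reuses the name remove_common_elements from the question but returns the result by value, which is a choice made for this example.

```cpp
#include <algorithm>
#include <vector>

// Returns the elements of A that are not in B.
// A and B are taken by value so they can be sorted locally.
std::vector<int> remove_common_elements(std::vector<int> A, std::vector<int> B) {
    std::sort(A.begin(), A.end());
    std::sort(B.begin(), B.end());

    // Allocate for the upper bound: the result can never exceed A.size().
    std::vector<int> C(A.size());
    auto C_end = std::set_difference(A.begin(), A.end(),
                                     B.begin(), B.end(), C.begin());

    // Trim the unused tail so C holds only the actual result.
    C.erase(C_end, C.end());
    return C;
}
```

For the data in the question (A = 1..10, B = 1 2 8 11 7 4) this returns 3 5 6 9 10.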

First year CS student trying to understand functions?

I'm a first year CS student trying to understand functions, but I'm stuck on a problem where I have to use a function within another function. I have to create a program that checks all numbers from 0 to 100 and finds all the numbers that are evenly divisible by the divisor. I'm only allowed to have three functions, named getDivisor, findNumbers and calcSquare. The output is supposed to be each number that is found (from 0 to 100) and the square of that number. I wrote a program (as seen below) that runs and asks the first question, what the divisor is, but it stays open for only a few seconds and then closes when trying to compute which numbers are divisible by the divisor. I'm not sure exactly what I did wrong, but I would like to know so I can learn from my mistake! Please disregard the style, it's very sloppy; I usually go back and clean it up after I finish the program.
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower < upper)
{
if (((lower / divisor) % 2) == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8)<< lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
If the user enters 15, the output should be the following. It should be in a list format with the number on the left and the number squared to the right of it, but I don't know how to format that properly on here, sorry:
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
I appreciate any assistance!
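Are you getting any error? When I run your code I get an exception:
Floating point exception (core dumped)
Despite its name, this is raised by an integer division by zero. The only division in your if statement is lower / divisor, so it fails when divisor is 0, for example if 0 is entered at the prompt. Starting lower at 1 would not prevent it, because lower is the dividend, not the divisor. Check the divisor before entering the loop instead:
if (divisor == 0) return;
Also check the logic of the if statement, because as it stands it won't give the result you want: (lower / divisor) % 2 == 0 tests whether the integer quotient is even, not whether lower is divisible by divisor. Use lower % divisor == 0 instead.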
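As a minimal sketch of the divisibility test on its own, the function below collects the matching numbers instead of printing them. The name multiples_in_range and the choice to return a vector are made up for this illustration and are not part of the assignment. It uses the remainder operator and returns an empty list for a zero divisor rather than dividing by zero.

```cpp
#include <vector>

// Collects every n in [lower, upper] that is evenly divisible by divisor.
std::vector<int> multiples_in_range(int divisor, int lower, int upper) {
    std::vector<int> found;
    if (divisor == 0) {
        return found;  // n % 0 is undefined, so there is nothing to report
    }
    for (int n = lower; n <= upper; ++n) {
        if (n % divisor == 0) {  // remainder 0 means evenly divisible
            found.push_back(n);
        }
    }
    return found;
}
```

Calling multiples_in_range(15, 0, 100) yields 0, 15, 30, 45, 60, 75 and 90, which matches the expected output in the question.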
/*Description:
This program is a homework assignment to practice what I
learned from lecture #7a. It illustrates how to use
functions properly, specifically how to use functions
within other functions. The user is prompted to input
a divisor, which is then passed to a function that finds
every number from 0-100 that is evenly divisible by it.*/
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
//====================== main ===========================
//
//=======================================================
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
//Gets the divisor and assigns it to this variable.
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
//Finds the numbers that are divisible by divisor,
//displays and shows their squares.
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
/*===================== getDivisor ==========================
This function gets the divisor from the user so it can
assign it to the divisor variable to use in a later
function to check and see if it is divisible from 0-100.
Input:
Divisor
Output:
Divisor being assigned to divisor variable.*/
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
/*===================== findNumbers ==========================
This function runs a loop from 0 to 100 to check whether
each number is evenly divisible by the divisor the user
entered. It also displays the numbers that are evenly
divisible and their squares with the help of the
calcSquare function.
Input:
There is no user input, other than the divisor from
the getDivisor function.
Output:
Numbers between 0 and 100 that are divisible by the
divisor and their squares.*/
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower <= upper)
{
if (lower % divisor == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8) <<
lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
/*===================== calcSquare ==========================
This function squares the number from 0 to 100 (whatever
number that might be in the loop) that is divisible by the
user entered divisor, so that it may assign it to the
lowersquared variable in the findNumbers function to be
used in the output.
Input:
Number from 0 to 100 that is divisible by user entered
divisor
Output:
Number from 0 to 100 squared.*/
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
//==========================================================
/*OUTPUT:
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their
squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
Press any key to continue . . .
*/
//==========================================================

CUDA: method to calculate all partial sums during a sum reduction

I run into this issue over and over in CUDA. I have, for a set of elements, done some GPU calculation. This results in some value that has linear meaning (for instance, in terms of memory):
element_sizes = [ 10, 100, 23, 45 ]
And now, for the next stage of GPU calculation, I need the following values:
memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]
I can calculate memory_size at 80 gbps on my GPU using the reduction code available from NVIDIA. However, I can't use this code for the offsets, as it uses a branching technique that does not produce the memory offsets array. I have tried many things, but what I have found is that simply copying element_sizes over to the host and calculating the offsets with a simd for loop is the simplest and fastest way to go:
// in pseudo code
host_element_sizes = copy_to_host(element_sizes);
host_offsets = (... *) malloc(...);
int total_size = 0;
for(int i = 0; i < ...; ...){
host_offsets[i] = total_size;
total_size += host_element_sizes[i];
}
device_offsets = (... *) device_malloc(...);
device_offsets = copy_to_device(host_offsets,...);
However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.
What is the expected way for a CUDA programmer to solve this problem?
I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector which contains the cumulative sum values of the input vector. A prefix sum exists in at least two variants - an exclusive scan or an inclusive scan. Conceptually these are similar.
If your element_sizes vector has been deposited in GPU global memory (it appears to be the case based on your pseudocode), then there exist library functions that run on the GPU that you could call at that point, to produce the memory_offsets data (vector), and the memory_size value could be trivially obtained from the last value in the vector, with a slight variation based on whether you are doing an inclusive scan or exclusive scan.
Here's a trivial worked example using thrust:
$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
int main(){
const int element_sizes[] = { 10, 100, 23, 45 };
const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
thrust::device_vector<int> dv_mo(ds);
thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
std::cout << "element_sizes:" << std::endl;
thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_offsets:" << std::endl;
thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_size:" << std::endl << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$
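The difference between the two scan variants is easy to see in a host-side sketch. This is plain C++ using std::partial_sum, not the Thrust code above, and the names Layout, compute_layout and inclusive_offsets are hypothetical. An inclusive scan includes the current element in each running total; an exclusive scan does not, which is what gives offsets that start at 0.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = sizes[0] + ... + sizes[i].
std::vector<int> inclusive_offsets(const std::vector<int>& sizes) {
    std::vector<int> out(sizes.size());
    std::partial_sum(sizes.begin(), sizes.end(), out.begin());
    return out;
}

struct Layout {
    std::vector<int> offsets;  // exclusive scan of the sizes
    int total;                 // sum of all sizes
};

// Exclusive scan: offsets[i] = sizes[0] + ... + sizes[i-1], offsets[0] = 0.
// It is derived here by subtracting each element from the inclusive result.
Layout compute_layout(const std::vector<int>& sizes) {
    Layout layout{inclusive_offsets(sizes), 0};
    if (!sizes.empty()) {
        layout.total = layout.offsets.back();  // last inclusive value is the sum
    }
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        layout.offsets[i] -= sizes[i];
    }
    return layout;
}
```

For element_sizes = { 10, 100, 23, 45 } the inclusive scan is 10, 110, 133, 178, the exclusive scan is 0, 10, 110, 133, and the total is 178.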

Thrust Histogram with weights

I want to compute the density of particles over a grid. Therefore, I have a vector that contains the cellID of each particle, as well as a vector with the given mass which does not have to be uniform.
I have taken the non-sparse example from Thrust to compute a histogram of my particles.
However, to compute the density, I need to include the weight of each particle instead of simply counting the number of particles per cell, i.e. I'm interested in rho[i] = sum W[j] for all j that satisfy cellID[j]=i (probably unnecessary to explain, since everybody knows that).
Implementing this with Thrust has not worked for me. I also tried to use a CUDA kernel and thrust::raw_pointer_cast, but I did not succeed with that either.
EDIT:
Here is a minimal working example which should compile via nvcc file.cu under CUDA 6.5 and with Thrust installed.
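#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>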
// Predicate
struct is_out_of_bounds {
__host__ __device__ bool operator()(int i) {
return (i < 0); // out of bounds elements have negative id;
}
};
// cf.: https://code.google.com/p/thrust/source/browse/examples/histogram.cu, but modified
template<typename T1, typename T2>
void computeHistogram(const T1& input, T2& histogram) {
typedef typename T1::value_type ValueType; // input value type
typedef typename T2::value_type IndexType; // histogram index type
// copy input data (could be skipped if input is allowed to be modified)
thrust::device_vector<ValueType> data(input);
// sort data to bring equal elements together
thrust::sort(data.begin(), data.end());
// there are elements that we don't want to count, those have ID -1;
data.erase(thrust::remove_if(data.begin(), data.end(), is_out_of_bounds()),data.end());
// number of histogram bins is given by the size of the output vector
IndexType num_bins = histogram.size();
// find the end of each bin of values
thrust::counting_iterator<IndexType> search_begin(0);
thrust::upper_bound(data.begin(), data.end(), search_begin,
search_begin + num_bins, histogram.begin());
// compute the histogram by taking differences of the cumulative histogram
thrust::adjacent_difference(histogram.begin(), histogram.end(),
histogram.begin());
}
int main(void) {
thrust::device_vector<int> cellID(5);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(5);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
thrust::device_vector<int> histogram(3);
thrust::device_vector<float> density(3);
computeHistogram(cellID,histogram);
std::cout<<"\nHistogram:\n";
thrust::copy(histogram.begin(), histogram.end(),
std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
// this will print: " Histogram 1 2 1 "
// meaning one element with ID 0, two elements with ID 1
// and one element with ID 2
/* here is what I am unable to implement:
*
*
* computeDensity(cellID,mass,density);
*
* print(density): 2.0 5.0 3.0
*
*
*/
}
I hope the comment at the end of the file also makes clear what I mean by computing the density. If there is any question open, please feel free to ask. Thanks!
There still seems to be a problem in understanding my problem, which I am sorry for! Therefore I added some pictures.
Consider the first picture. For my understanding, a histogram would simply be the count of particles per grid cell. In this case a histogram would be an array of size 36, since there are 36 cells. Also, there would be a lot of zero entries in the vector, since for example in the upper left corner almost no cell contains a particle. This is what I already have in my code.
Now consider the slightly more complicated case. Here each particle has a different mass, indicated by the different size in the plot. To compute the density I can't just add the number of particles per cell, but I have to add the mass of all particles per cell. This is what I'm unable to implement.
What you described in your example does not look like a histogram but rather like a segmented reduction.
The following example code uses thrust::reduce_by_key to sum up the masses of particles within the same cell:
density.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>
#include <iterator>
#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl << std::endl;
}
int main()
{
const int particle_count = 5;
const int cell_count = 10;
thrust::device_vector<int> cellID(particle_count);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(particle_count);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
std::cout << "input data" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::sort_by_key(cellID.begin(), cellID.end(), mass.begin());
std::cout << "after sort_by_key" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::device_vector<int> reduced_cellID(particle_count);
thrust::device_vector<float> density(particle_count);
int new_size = thrust::reduce_by_key(cellID.begin(), cellID.end(),
mass.begin(),
reduced_cellID.begin(),
density.begin()
).second - density.begin();
if (reduced_cellID[0] == -1)
{
density.erase(density.begin());
reduced_cellID.erase(reduced_cellID.begin());
new_size--;
}
density.resize(new_size);
reduced_cellID.resize(new_size);
std::cout << "after reduce_by_key" << std::endl;
PRINTER(density);
PRINTER(reduced_cellID);
thrust::device_vector<float> final_density(cell_count);
thrust::scatter(density.begin(), density.end(), reduced_cellID.begin(), final_density.begin());
PRINTER(final_density);
}
compile using
nvcc -std=c++11 density.cu -o density
output
input data
cellID: -1 1 0 2 1
mass: 0.5 1 2 3 4
after sort_by_key
cellID: -1 0 1 1 2
mass: 0.5 2 1 4 3
after reduce_by_key
density: 2 5 3
reduced_cellID: 0 1 2
final_density: 2 5 3 0 0 0 0 0 0 0
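To check the GPU result, a serial host-side reference can be handy. The sketch below is plain C++, not the Thrust approach from the answer, and compute_density is a hypothetical helper. It evaluates rho[i] = sum W[j] over all j with cellID[j] = i directly and skips ids outside [0, cell_count), which covers the -1 entries.

```cpp
#include <cstddef>
#include <vector>

// Serial reference: adds the mass of each particle to the bin of its cell.
std::vector<float> compute_density(const std::vector<int>& cellID,
                                   const std::vector<float>& mass,
                                   std::size_t cell_count) {
    std::vector<float> density(cell_count, 0.0f);
    for (std::size_t j = 0; j < cellID.size() && j < mass.size(); ++j) {
        const int c = cellID[j];
        // Ignore particles that are out of bounds (negative or too large id).
        if (c < 0 || static_cast<std::size_t>(c) >= cell_count) {
            continue;
        }
        density[static_cast<std::size_t>(c)] += mass[j];
    }
    return density;
}
```

With the data from the question and three cells this returns 2, 5, 3, the same values as the first three entries of final_density above.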

std::find with type T** vs T*[N]

I prefer to work with std::string but I like to figure out what is going wrong here.
I am unable to understand why std::find isn't working properly for type T** even though pointer arithmetic works on it correctly. For example:
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
But it works fine, for the types T*[N].
#include <iostream>
#include <algorithm>
int main( int argc, const char ** argv )
{
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
const char ** cmdPtr = std::find(argv+1, argv+argc, "Hello") ;
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
if( cmdPtr == argv+argc )
std::cout << "String not found" << std::endl;
if( testPtr != testAr+2 )
std::cout << "String found: " << *testPtr << std::endl;
return 0;
}
Arguments passed: Hello World
Output:
Hello World
String not found
String found: Hello
Thanks.
Comparing values of type char const* amounts to comparing addresses. The address of the string literal "Hello" is guaranteed to differ from the address of any other object, so the comparison can only succeed against another string literal "Hello" (in which case the pointers may compare equal). Your compare() function compares the characters being pointed to.
In the first case, you're comparing the pointer values themselves and not what they're pointing to. And the constant "Hello" doesn't have the same address as the first element of argv.
Try using:
const char ** cmdPtr = std::find(argv+1, argv+argc, std::string("Hello")) ;
std::string knows to compare contents and not addresses.
For the array version, the compiler can fold all literals into a single one, so every time "Hello" is seen throughout the code it's really the same pointer. Thus, comparing for equality in
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
yields the expected result, although the language does not guarantee that identical literals share one address.
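If you would rather not construct a std::string, another option is to compare the characters explicitly with std::find_if and std::strcmp. The sketch below shows this under one assumption: find_cstring is a hypothetical helper name for this illustration, not a standard function.

```cpp
#include <algorithm>
#include <cstring>

// Returns a pointer to the first C string in [first, last) whose characters
// equal needle, or last if there is none. Compares contents, not addresses.
const char* const* find_cstring(const char* const* first,
                                const char* const* last,
                                const char* needle) {
    return std::find_if(first, last, [needle](const char* s) {
        return std::strcmp(s, needle) == 0;
    });
}
```

In main this could be called as find_cstring(argv + 1, argv + argc, "Hello"), and the result compared against argv + argc as in the question.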