First year CS student trying to understand functions?

I'm a first-year CS student trying to understand functions, but I'm stuck on a problem where I have to call a function from within another function. The program must check every number from 0 to 100 and find those that are evenly divisible by a divisor the user enters. I'm only allowed three functions, named getDivisor, findNumbers and calcSquare. The output should list each number found (from 0 to 100) along with its square. The program below runs and prompts for the divisor, but when it tries to compute which numbers are divisible, the window stays open for only a few seconds and then closes. I'm not sure what I did wrong, but I would like to know so I can learn from my mistake! Please disregard the style; it's sloppy, and I usually clean it up after I finish the program.
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower < upper)
{
if (((lower / divisor) % 2) == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8)<< lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
If the user enters 15, the output should be as follows, as a list with each number on the left and its square to the right (I don't know how to format it properly on here, sorry):
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
I appreciate any assistance!

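Are you getting an error? When I run your code I get an exception:
Floating point exception (core dumped)
Despite its name, this signal is raised for integer division by zero. In your program the only division is lower / divisor in the if statement, so it occurs whenever divisor is 0 (for example, if the user enters 0; since C++11 a failed read also stores 0).
Changing the start of the count to
int lower = 1;
would not avoid it, and 0 is supposed to appear in the output anyway, so check for a zero divisor before entering the loop instead.
Also check the logic in the if statement, because as it stands it won't give the result you want: (lower / divisor) % 2 == 0 tests whether the integer quotient is even, whereas lower % divisor == 0 tests whether lower is evenly divisible by divisor.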
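As a minimal sketch of the divisibility test, the remainder check and the zero-divisor guard might look like the following. The helper name multiplesWithSquares is hypothetical and is not one of the three functions the assignment requires; it returns the results instead of printing them so the logic can be checked in isolation.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not one of the three assignment functions): returns
// every n in [0, upper] that divisor divides evenly, paired with n squared.
std::vector<std::pair<int, int>> multiplesWithSquares(int divisor, int upper) {
    std::vector<std::pair<int, int>> found;
    if (divisor == 0) {
        return found;  // n % 0 is undefined behaviour, so refuse a zero divisor
    }
    for (int n = 0; n <= upper; ++n) {
        if (n % divisor == 0) {  // remainder test, not (n / divisor) % 2
            found.emplace_back(n, n * n);
        }
    }
    return found;
}
```

Calling multiplesWithSquares(15, 100) yields the seven pairs from (0, 0) to (90, 8100). Squaring with n * n also avoids the round trip through double that pow introduces.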

/*Description:
This program is a homework assignment to practice what I
learned from lecture #7a. It illustrates how to use
functions properly, specifically how to call functions
from within other functions. The user is prompted to input
a divisor, which is then passed to a function that finds
every number from 0 to 100 that is evenly divisible by it.*/
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
//====================== main ===========================
//
//=======================================================
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
//Gets the divisor and assigns it to this variable.
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
//Finds the numbers that are divisible by divisor,
//displays and shows their squares.
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
/*===================== getDivisor ==========================
This function gets the divisor from the user so it can
be assigned to the divisor variable and used in a later
function to find the numbers from 0-100 it divides evenly.
Input:
Divisor
Output:
Divisor being assigned to divisor variable.*/
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
/*===================== findNumbers ==========================
This function runs a loop from 0 to 100 to check which
numbers are evenly divisible by the divisor the user
entered. It also displays those numbers and their squares
with the help of the calcSquare function.
Input:
There is no user input, other than the divisor from
the getDivisor function.
Output:
Numbers between 0 and 100 that are divisible by the
divisor and their squares.*/
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower <= upper)
{
if (lower % divisor == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8) <<
lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
/*===================== calcSquare ==========================
This function squares the number from 0 to 100 (whatever
number that might be in the loop) that is divisible by the
user entered divisor, so that it may assign it to the
lowersquared variable in the findNumbers function to be
used in the output.
Input:
Number from 0 to 100 that is divisible by user entered
divisor
Output:
Number from 0 to 100 squared.*/
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
//==========================================================
/*OUTPUT:
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their
squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
Press any key to continue . . .
*/
//==========================================================


What's the alternative for __match_any_sync on compute capability 6?

In the CUDA examples, e.g. here, __match_any_sync is used (the example originally used __match_all_sync).
Here is an example where a warp is split into multiple (one or more) groups that each keep track of their own atomic counter.
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
// __match_any_sync takes only (mask, value) and has no predicate output
// (that is __match_all_sync); the pointer is cast to a supported integer type
const auto mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
const auto leader = __ffs(mask) - 1; // select a leader
int res;
const auto lane_id = threadIdx.x % warpSize; // lane index, assuming a 1D thread block
if (lane_id == leader) { // leader does the update
res = atomicAdd(ptr, __popc(mask));
}
res = __shfl_sync(mask, res, leader); // get leader’s old value
return res + __popc(mask & ((1 << lane_id) - 1)); //compute old value
}
The __match_any_sync here splits up the threads in the warp into groups that have the same ptr value, so that each group can update its own ptr atomically without getting in the way of other threads.
I know the nvcc compiler (since CUDA 9) does this sort of optimization under the hood automatically, but this is just about the mechanics of __match_any_sync.
Is there a way to do this before compute capability 7.0?
EDIT: The blog article has now been modified to reflect __match_any_sync() rather than __match_all_sync(), so any commentary to that effect below should be disregarded. The answer below is edited to reflect this.
Based on your statement:
this is just about the mechanics of __match_any_sync
we will focus on a replacement for __match_any_sync itself, not any other form of rewriting the atomicAggInc function. Therefore, we must provide a mask that has the same value as would be returned by __match_any_sync() on cc7.0 or higher architectures.
I believe this will require a loop, which broadcasts the ptr value, in the worst case one iteration for each thread in the warp (since each thread could have a unique ptr value) and testing which threads have the same value. There are various ways we could "optimize" this loop for this function, so as to possibly reduce the trip count from 32 to some lesser value, based on the actual ptr values in each thread, but such optimization in my view introduces considerable complexity, which makes the worst-case processing time longer (as is typical of early-exit optimizations). So I will demonstrate a fairly simple method without this optimization.
The other consideration is what to do when the warp is not converged. For that, we can employ __activemask() to identify which threads are active.
Here is a worked example:
$ cat t1646.cu
#include <iostream>
#include <stdio.h>
// increment the value at ptr by 1 and return the old value
__device__ int atomicAggInc(int* ptr) {
int mask;
#if __CUDA_ARCH__ >= 700
mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
#else
unsigned tmask = __activemask();
for (int i = 0; i < warpSize; i++){
#ifdef USE_OPT
if ((1U<<i) & tmask){
#endif
unsigned long long tptr = __shfl_sync(tmask, (unsigned long long)ptr, i);
unsigned my_mask = __ballot_sync(tmask, (tptr == (unsigned long long)ptr));
if (i == (threadIdx.x & (warpSize-1))) mask = my_mask;}
#ifdef USE_OPT
}
#endif
#endif
int leader = __ffs(mask) - 1; // select a leader
int res;
unsigned lane_id = threadIdx.x % warpSize;
if (lane_id == leader) { // leader does the update
res = atomicAdd(ptr, __popc(mask));
}
res = __shfl_sync(mask, res, leader); // get leader’s old value
return res + __popc(mask & ((1 << lane_id) - 1)); //compute old value
}
__global__ void k(int *d){
int *ptr = d + threadIdx.x/4;
if ((threadIdx.x >= 16) && (threadIdx.x < 32))
atomicAggInc(ptr);
}
const int ds = 32;
int main(){
int *d_d, *h_d;
h_d = new int[ds];
cudaMalloc(&d_d, ds*sizeof(d_d[0]));
cudaMemset(d_d, 0, ds*sizeof(d_d[0]));
k<<<1,ds>>>(d_d);
cudaMemcpy(h_d, d_d, ds*sizeof(d_d[0]), cudaMemcpyDeviceToHost);
for (int i = 0; i < ds; i++)
std::cout << h_d[i] << " ";
std::cout << std::endl;
}
$ nvcc -o t1646 t1646.cu -DUSE_OPT
$ cuda-memcheck ./t1646
========= CUDA-MEMCHECK
0 0 0 0 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
========= ERROR SUMMARY: 0 errors
$
(CentOS 7, CUDA 10.1.243, with device 0 being Tesla V100, device 1 being a cc3.5 device).
I've added an optional optimization for the case where the warp is diverged (i.e. tmask is not 0xFFFFFFFF). This can be selected by defining USE_OPT.
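To make the broadcast-and-ballot loop easier to follow, here is a host-side model of it in plain C++. It is only a sketch: the function name emulateMatchAny is hypothetical, the warp width of 32 is assumed, and the inner loop over lanes stands in for what __ballot_sync computes in a single step on the device.

```cpp
#include <cstdint>
#include <vector>

// Host-side model of the fallback loop. values[i] is the value held by lane i
// (32 entries expected) and active is the bitmask of participating lanes,
// i.e. the role played by __activemask(). Returns one mask per lane;
// inactive lanes get 0.
std::vector<std::uint32_t> emulateMatchAny(const std::vector<std::uint64_t>& values,
                                           std::uint32_t active) {
    const int warpSize = 32;  // assumed warp width
    std::vector<std::uint32_t> masks(warpSize, 0);
    for (int i = 0; i < warpSize; ++i) {
        if (!((1u << i) & active)) continue;        // skip inactive source lanes
        const std::uint64_t broadcast = values[i];  // models __shfl_sync from lane i
        std::uint32_t ballot = 0;                   // models __ballot_sync
        for (int lane = 0; lane < warpSize; ++lane) {
            if (((1u << lane) & active) && values[lane] == broadcast) {
                ballot |= (1u << lane);
            }
        }
        masks[i] = ballot;  // only lane i keeps the ballot from iteration i
    }
    return masks;
}
```

With the test kernel's configuration (lanes 16 to 31 active, value equal to lane / 4), lanes 16 to 19 all receive the mask 0x000F0000, which is why each of the four counters ends up incremented by 4.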

Semantics of __ddiv_ru

From the documentation of __ddiv_ru I expected the following code to produce ceil(8/32) = 1.0; instead I obtain 0.25.
#include <iostream>
using namespace std;
__managed__ double x;
__managed__ double y;
__managed__ double r;
__global__ void ceilDiv()
{
r = __ddiv_ru(x,y);
}
int main()
{
x = 8;
y = 32;
r = -1;
ceilDiv<<<1,1>>>();
cudaDeviceSynchronize();
cout << "The ceil of " << x << "/" << y << " is " << r << endl;
return 0;
}
What am I missing?
The result you are obtaining is correct.
The intrinsic you are using implements double-precision division with a specific IEEE 754-2008 rounding mode for the unit in the last place (ULP) of the significand. The rounding mode controls what happens when a result cannot be exactly represented in the selected format. Here you have selected round up, which means that an inexact result is rounded toward +∞ in the last place of the significand; it does not round the quotient to an integer. In your case all rounding modes produce the same result, because the result can be exactly represented in IEEE 754 binary64 format (0.25 is an exact power of 2).
Please read everything here before writing any more floating point code.
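The same distinction can be observed on the host with the standard <cfenv> rounding modes. The sketch below is not CUDA code and its helper names are hypothetical; it assumes the platform defines FE_UPWARD and FE_DOWNWARD and honours them for double division. It illustrates that directed rounding only affects inexact quotients, while an integer ceiling needs an explicit ceil.

```cpp
#include <cfenv>
#include <cmath>

// Divides x by y with the given <cfenv> rounding mode (e.g. FE_UPWARD) active.
// The volatile variables keep the compiler from evaluating the division at
// compile time under the default rounding mode.
double divideWithRounding(double x, double y, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile double vx = x;
    volatile double vy = y;
    volatile double q = vx / vy;
    std::fesetround(saved);  // restore the caller's rounding mode
    return q;
}

// Integer-valued ceiling of the quotient, which is what the question wanted.
double ceilDivide(double x, double y) {
    return std::ceil(x / y);
}
```

Here divideWithRounding(8.0, 32.0, mode) gives 0.25 for every mode because the quotient is exact, whereas for 1.0 / 3.0 the FE_UPWARD result is one ULP above the FE_DOWNWARD result. ceilDivide(8.0, 32.0) returns 1.0.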

How to dynamically set the size of device_vectors in thrust set operations?

I have two sets, A and B. The result C of my operation should contain the elements of A that are not in B. I use set_difference to do it. However, the size of C has to be set before the operation; otherwise it has extra zeros at the end, like below:
A=
1 2 3 4 5 6 7 8 9 10
B=
1 2 8 11 7 4
C=
3 5 6 9 10 0 0 0 0 0
How can I set the size of C dynamically so that the output is C = 3 5 6 9 10? In a real problem, I would not know the required size of the result device_vector a priori.
My code:
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/set_operations.h>
#include <thrust/sort.h>
#include <iostream>
#include <iterator>
void remove_common_elements(thrust::device_vector<int> A, thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
}
int main(int argc, char * argv[])
{
thrust::device_vector<int> A(10);
thrust::sequence(thrust::device, A.begin(), A.end(),1); // A = 1, 2, ..., 10
thrust::device_vector<int> B(6);
B[0]=1;B[1]=2;B[2]=8;B[3]=11;B[4]=7;B[5]=4;
thrust::device_vector<int> C(A.size());
std::cout << "A="<< std::endl;
thrust::copy(A.begin(), A.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "B="<< std::endl;
thrust::copy(B.begin(), B.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
remove_common_elements(A, B, C);
std::cout << "C="<< std::endl;
thrust::copy(C.begin(), C.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
return 0;
}
In the general case (i.e. across various Thrust algorithms) there is often no way to know the output size in advance, only its upper bound. The usual approach is to pass a result vector whose size is the upper bound of the possible output size. As you stated already, in many cases the actual size of the output cannot be known a priori. Thrust has no particular magic to solve this. After the operation, you will know the size of the result, and it could be copied to a new vector if the "extra zeroes" were a problem for some reason (I can't think of a reason why they would be a problem generally, except that they use up allocated space).
If this is highly objectionable, one possibility (copying this information from a response by Jared Hoberock in another forum) is to run the algorithm twice, the first time using a discard_iterator (for the output data) and the second time with a real iterator, pointing to an actual vector allocation, of the requisite size. During the first pass, the discard_iterator is used to count the size of the actual result data, even though it is not stored anywhere. Quoting directly from Jared:
In the first phase, pass a discard_iterator as the output iterator. You can compare the discard_iterator returned as the result to compute the size of the output. In the second phase, call the algorithm "for real" and output into an array sized using the result of the first phase.
The technique is demonstrated in the set_operations.cu example [0,1]:
[0] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L25
[1] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L127
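The two-phase idea can be illustrated on the host with the standard library. This is a sketch of the pattern only, not the Thrust example itself: CountingOutput is a hypothetical stand-in for a discarding iterator, and it counts writes through a pointer rather than by comparing the returned iterator as the quoted description does.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a discarding output iterator: it stores nothing
// and only counts how many elements are written through it.
struct CountingOutput {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t* count;

    CountingOutput& operator*() { return *this; }
    CountingOutput& operator++() { return *this; }
    CountingOutput operator++(int) { return *this; }
    CountingOutput& operator=(int) { ++*count; return *this; }
};

// Inputs are taken by value and sorted, since set_difference needs sorted ranges.
std::vector<int> differenceTwoPass(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // Phase 1: count the output without storing it.
    std::size_t n = 0;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), CountingOutput{&n});

    // Phase 2: run the algorithm for real into an exactly sized vector.
    std::vector<int> c(n);
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    return c;
}
```

The cost of this pattern is that the set operation runs twice, which is the trade-off against allocating the upper bound once.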
thrust::set_difference returns an iterator to the end of the resulting range.
If you just want to change the logical size of C to the number of resulting elements, you could simply erase the range "behind" the result range.
void remove_common_elements(thrust::device_vector<int> A,
thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
auto C_end = thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
C.erase(C_end, C.end());
}
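For comparison, the same erase-after-the-returned-iterator pattern works with std::set_difference and std::vector on the host. This sketch uses a hypothetical function name and is only meant to show the mechanics without a GPU.

```cpp
#include <algorithm>
#include <vector>

// Host-side analogue: size the output to the upper bound, then trim it to the
// iterator that set_difference returns.
std::vector<int> differenceWithErase(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<int> c(a.size());  // upper bound on the result size
    auto c_end = std::set_difference(a.begin(), a.end(), b.begin(), b.end(), c.begin());
    c.erase(c_end, c.end());       // drop the unused tail
    return c;
}
```

For A = 1..10 and B = {1, 2, 8, 11, 7, 4} this returns 3 5 6 9 10 with no trailing zeros.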

Thrust Histogram with weights

I want to compute the density of particles over a grid. Therefore, I have a vector that contains the cellID of each particle, as well as a vector with the given mass which does not have to be uniform.
I have taken the non-sparse example from Thrust to compute a histogram of my particles.
However, to compute the density, I need to include the weight of each particle instead of simply counting the particles per cell, i.e. I'm interested in rho[i] = sum of W[j] over all j that satisfy cellID[j] = i (probably unnecessary to explain).
Implementing this with Thrust has not worked for me. I also tried to use a CUDA kernel and thrust::raw_pointer_cast, but I did not succeed with that either.
EDIT:
Here is a minimal working example which should compile via nvcc file.cu under CUDA 6.5 and with Thrust installed.
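#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>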
// Predicate
struct is_out_of_bounds {
__host__ __device__ bool operator()(int i) {
return (i < 0); // out of bounds elements have negative id;
}
};
// cf.: https://code.google.com/p/thrust/source/browse/examples/histogram.cu, but modified
template<typename T1, typename T2>
void computeHistogram(const T1& input, T2& histogram) {
typedef typename T1::value_type ValueType; // input value type
typedef typename T2::value_type IndexType; // histogram index type
// copy input data (could be skipped if input is allowed to be modified)
thrust::device_vector<ValueType> data(input);
// sort data to bring equal elements together
thrust::sort(data.begin(), data.end());
// there are elements that we don't want to count, those have ID -1;
data.erase(thrust::remove_if(data.begin(), data.end(), is_out_of_bounds()),data.end());
// number of histogram bins is given by the size of the output vector
IndexType num_bins = histogram.size();
// find the end of each bin of values
thrust::counting_iterator<IndexType> search_begin(0);
thrust::upper_bound(data.begin(), data.end(), search_begin,
search_begin + num_bins, histogram.begin());
// compute the histogram by taking differences of the cumulative histogram
thrust::adjacent_difference(histogram.begin(), histogram.end(),
histogram.begin());
}
int main(void) {
thrust::device_vector<int> cellID(5);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(5);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
thrust::device_vector<int> histogram(3);
thrust::device_vector<float> density(3);
computeHistogram(cellID,histogram);
std::cout<<"\nHistogram:\n";
thrust::copy(histogram.begin(), histogram.end(),
std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
// this will print: " Histogram 1 2 1 "
// meaning one element with ID 0, two elements with ID 1
// and one element with ID 2
/* here is what I am unable to implement:
*
*
* computeDensity(cellID,mass,density);
*
* print(density): 2.0 5.0 3.0
*
*
*/
}
I hope the comment at the end of the file also makes clear what I mean by computing the density. If there is any question open, please feel free to ask. Thanks!
There still seems to be some confusion about my problem, which I am sorry for! Therefore I added some pictures.
Consider the first picture. As I understand it, a histogram would simply be the count of particles per grid cell. In this case the histogram would be an array of size 36, since there are 36 cells. There would also be many zero entries in the vector, since, for example, almost no cell in the upper left corner contains a particle. This is what I already have in my code.
Now consider the slightly more complicated case. Here each particle has a different mass, indicated by its size in the plot. To compute the density I can't just count the particles per cell; I have to add up the masses of all particles in each cell. This is what I'm unable to implement.
What you described in your example does not look like a histogram but rather like a segmented reduction.
The following example code uses thrust::reduce_by_key to sum up the masses of particles within the same cell:
density.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>
#include <iterator>
#define PRINTER(name) print(#name, (name))
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl << std::endl;
}
int main()
{
const int particle_count = 5;
const int cell_count = 10;
thrust::device_vector<int> cellID(particle_count);
cellID[0] = -1; cellID[1] = 1; cellID[2] = 0; cellID[3] = 2; cellID[4]=1;
thrust::device_vector<float> mass(particle_count);
mass[0] = .5; mass[1] = 1.0; mass[2] = 2.0; mass[3] = 3.0; mass[4] = 4.0;
std::cout << "input data" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::sort_by_key(cellID.begin(), cellID.end(), mass.begin());
std::cout << "after sort_by_key" << std::endl;
PRINTER(cellID);
PRINTER(mass);
thrust::device_vector<int> reduced_cellID(particle_count);
thrust::device_vector<float> density(particle_count);
int new_size = thrust::reduce_by_key(cellID.begin(), cellID.end(),
mass.begin(),
reduced_cellID.begin(),
density.begin()
).second - density.begin();
if (reduced_cellID[0] == -1)
{
density.erase(density.begin());
reduced_cellID.erase(reduced_cellID.begin());
new_size--;
}
density.resize(new_size);
reduced_cellID.resize(new_size);
std::cout << "after reduce_by_key" << std::endl;
PRINTER(density);
PRINTER(reduced_cellID);
thrust::device_vector<float> final_density(cell_count);
thrust::scatter(density.begin(), density.end(), reduced_cellID.begin(), final_density.begin());
PRINTER(final_density);
}
compile using
nvcc -std=c++11 density.cu -o density
output
input data
cellID: -1 1 0 2 1
mass: 0.5 1 2 3 4
after sort_by_key
cellID: -1 0 1 1 2
mass: 0.5 2 1 4 3
after reduce_by_key
density: 2 5 3
reduced_cellID: 0 1 2
final_density: 2 5 3 0 0 0 0 0 0 0
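When validating a GPU implementation like this, it can help to have a sequential reference for the quantity being computed, rho[i] = sum of W[j] over all j with cellID[j] = i. The following host-side sketch is one way to write it; the function name referenceDensity is hypothetical, and IDs outside [0, cell_count) are skipped on the assumption that negative IDs mark out-of-bounds particles, as in the question.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: accumulates each particle's mass into its cell.
// Particles whose ID is negative or not below cell_count are ignored.
std::vector<float> referenceDensity(const std::vector<int>& cellID,
                                    const std::vector<float>& mass,
                                    std::size_t cell_count) {
    std::vector<float> density(cell_count, 0.0f);
    for (std::size_t j = 0; j < cellID.size(); ++j) {
        const int id = cellID[j];
        if (id < 0 || static_cast<std::size_t>(id) >= cell_count) continue;
        density[static_cast<std::size_t>(id)] += mass[j];
    }
    return density;
}
```

For the five particles above and three cells it returns 2 5 3, matching the reduce_by_key result. On the device, the same accumulation could also be done with atomicAdd per particle, at the cost of contention in densely populated cells.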

cublasDgemm getting slower

I have a problem when using cublasDgemm (a cuBLAS function; here the result is A*B, where A is 750x600 and B is 600x1000).
for (i=0; i < N; ++i) {
cublasDgemm();
}
N=10, total time is 0.000473s, average call is 0.0000473
N=100, total time is 0.00243s, average call is 0.0000243
N=1000, total time is 0.715072s, average call is 0.000715
N=10000, total time is 10.4998s, average call is 0.00104998
Why is the average time increasing so much?
#include <cuda_runtime.h>
#include <string.h>
#include <cublas_v2.h> // v2 API only; it should not be mixed with the legacy cublas.h
#include <time.h>
#include <sys/time.h>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
using namespace std;
#define IDX2C(i,j,leading) (((j)*(leading))+(i))
#define CHECK_EQ(a,b) do { \
if ((a) != (b)) { \
cout <<__FILE__<<" : "<< __LINE__<<" : check failed because "<<a<<"!="<<b<<endl;\
exit(1);\
}\
} while(0)
#define CUBLAS_CHECK(condition) \
do {\
cublasStatus_t status = condition; \
CHECK_EQ(status, CUBLAS_STATUS_SUCCESS); \
} while(0)
#define CUDA_CHECK(condition)\
do {\
cudaError_t error = condition;\
CHECK_EQ(error, cudaSuccess);\
} while(0)
//check after kernel function
#define CUDA_POST_KERNEL_CHECK CUDA_CHECK(cudaPeekAtLastError())
template <class T>
void randMtx(T *mat, int n, double range) {
srand((unsigned int)time(NULL));
for (int i = 0; i < n; ++i) {
//mat[i] = 1.0;
double flag = 1.0;
if (rand() % 2 == 0) flag = -1.0;
mat[i] = flag * rand()/RAND_MAX * range;
}
}
int main(int argc, char *argv[]) {
if (argc != 9) {
cout << "m1_row m1_col m2_row m2_col m1 m2 count range\n";
return -1;
}
int row1 = atoi(argv[1]);
int col1 = atoi(argv[2]);
int row2 = atoi(argv[3]);
int col2 = atoi(argv[4]);
int count = atoi(argv[7]);
double range = atof(argv[8]);
cublasOperation_t opt1 = CUBLAS_OP_N;
cublasOperation_t opt2 = CUBLAS_OP_N;
int row3 = row1;
int col3 = col2;
int k = col1;
if (argv[5][0] == 't') {
opt1 = CUBLAS_OP_T;
row3 = col1;
k = row1;
}
if (argv[6][0] == 't') {
opt2 = CUBLAS_OP_T;
col3 = row2;
}
double *mat1_c = (double*)malloc(sizeof(double)*row1*col1);
double *mat2_c = (double*)malloc(sizeof(double)*row2*col2);
double *mat3_c = (double*)malloc(sizeof(double)*row3*col3);
srand((unsigned int)time(NULL));
randMtx(mat1_c, row1*col1, range);
randMtx(mat2_c, row2*col2, range);
double *mat1_g;
double *mat2_g;
double *mat3_g;
double alpha = 1.0;
double beta = 0.0;
CUDA_CHECK(cudaMalloc((void **)&(mat1_g), sizeof(double)*row1*col1));
CUDA_CHECK(cudaMalloc((void **)&(mat2_g), sizeof(double)*row2*col2));
CUDA_CHECK(cudaMalloc((void **)&(mat3_g), sizeof(double)*row3*col3));
CUDA_CHECK(cudaMemcpy(mat1_g, mat1_c, sizeof(double)*row1*col1, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(mat2_g, mat2_c, sizeof(double)*row2*col2, cudaMemcpyHostToDevice));
cublasHandle_t handle;
CUBLAS_CHECK(cublasCreate(&handle));
struct timeval beg, end, b1, e1;
gettimeofday(&beg, NULL);
for (int i = 0; i < count ;++i) {
CUBLAS_CHECK(cublasDgemm(handle, opt1, opt2, row3, col3, k, &alpha, mat1_g, row1, mat2_g, row2, &beta, mat3_g, row3));
}
cudaDeviceSynchronize();//
gettimeofday(&end, NULL);
cout << "real time used: " << end.tv_sec-beg.tv_sec + (double)(end.tv_usec-beg.tv_usec)/1000000 <<endl;
free(mat1_c);
free(mat2_c);
free(mat3_c);
cudaFree(mat1_g);
cudaFree(mat2_g);
cudaFree(mat3_g);
CUBLAS_CHECK(cublasDestroy(handle));
return 0;
}
This is the code. I added cudaDeviceSynchronize after the loop block, and now, no matter the value of count, the average call time is about 0.001s.
As pointed out by @talonmies, this behavior is probably exactly what would be expected.
When you call cublasDgemm, the call (usually) returns control to the host (CPU) thread before the operation is complete. Each such call places its operation into a queue, and your host code continues.
Furthermore, CUDA and CUBLAS usually have some one-time overhead that is associated with using the API. For example, the call to create a CUBLAS handle usually incurs some measurable time, in order to initialize the library.
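The effect of such a queue on the host-side timing can be sketched with a toy model. Everything in it is an assumption made for illustration: a fixed queue capacity, calls that block once the queue is full, one request executing at a time, and arbitrary time units. It does not describe the actual internals or queue depth of CUDA or cuBLAS, and it ignores the one-time startup cost.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model, not cuBLAS: the host pays enqueue_cost per call, the device runs
// one request at a time in exec_cost, and at most capacity requests may be
// pending. Returns the average host-side time per call, before any final sync.
double averageHostTimePerCall(std::size_t calls, std::size_t capacity,
                              double enqueue_cost, double exec_cost) {
    std::vector<double> finish(calls, 0.0);  // completion time of each request
    double host = 0.0;                       // current host-thread time
    for (std::size_t i = 0; i < calls; ++i) {
        if (i >= capacity) {
            // Queue full: block until request i - capacity has completed.
            host = std::max(host, finish[i - capacity]);
        }
        host += enqueue_cost;
        const double device_free = (i == 0) ? 0.0 : finish[i - 1];
        finish[i] = std::max(host, device_free) + exec_cost;
    }
    return calls == 0 ? 0.0 : host / static_cast<double>(calls);
}
```

With a capacity of 1000, an enqueue cost of 1 and an execution cost of 100, the model gives an average of 1 per call for 10 calls but about 99 per call for 100000 calls: once the queue fills, the per-call cost approaches the execution time of a request.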
So your measurements can be broken into 3 groups:
1. "Small" iteration counts (e.g. 10). In this case, each call pays the cost to put a Dgemm request into the queue, plus the amortization of the startup costs over a relatively small number of iterations. This corresponds to your measurements like this: "average call is 0.0000473"
2. "Medium" iteration counts (e.g. 100-1000). In this case, the amortization of the startup costs becomes very small per call, and so most of the measurement is just the time to add a Dgemm request to the queue. This corresponds to your measurements like this: "average call is 0.0000243"
3. "Large" iteration counts (e.g. 10000). At some point, the internal request queue becomes full and can no longer accept new requests until some requests have been completed and removed from the queue. At this point the Dgemm call switches from non-blocking to blocking: it holds up the host/CPU thread until a queue slot becomes available. New requests must then effectively wait for a previous request to finish, so the cost of a new Dgemm request approximately equals the time to execute and complete a (previous) Dgemm request. The per-call cost therefore jumps dramatically, from the cost to add an item to the queue to the cost to complete a request. This corresponds to your measurements like this: "average call is 0.00104998"