I was given the following code at university and asked to explain briefly what it does and what the value of x is at the end of the run, as a function of n. I hope someone can help me.
x = 0;
for (int i = n; i > 1; i--) {
    for (int j = 1; j < i; j--) {
        x += 5;
    }
}
Thanks
(I assume you meant to write "j++" instead of "j--", and not end up in an infinite loop?)
If so, just execute it by hand.
The outer loop iterates with i over the integers, from n down to 2 (inclusive).
At each iteration of that loop, the inner loop iterates with j over the integers from 1 up to i - 1 (inclusive).
Thus, x is incremented by 5 for each of:
j = 1, 2, ..., n - 1
then for each of:
j = 1, 2, ..., n - 2
and so on,
...
down to:
j = 1
If I'm not mistaken, that's (n - 1) + (n - 2) + ... + 1 = n * (n - 1) / 2 iterations in total (cf. the arithmetic series), to give eventually:
x == 5 * n * (n - 1) / 2
E.g., for n = 3:
x == 15
HTH
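If you want to sanity-check the closed form, here is a quick brute-force comparison (a sketch; `simulate` and `closed_form` are names I made up, and it assumes the intended `j++`):

```cpp
#include <cassert>

// Run the corrected loops (with j++ instead of j--) and return x.
long long simulate(int n) {
    long long x = 0;
    for (int i = n; i > 1; i--)
        for (int j = 1; j < i; j++)
            x += 5;
    return x;
}

// The closed form derived above: 5 * n * (n - 1) / 2.
long long closed_form(int n) {
    return 5LL * n * (n - 1) / 2;
}
```

For n = 3 both give 15, matching the example above.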
for (int i = n; i > 1; i--) {
    for (int j = 1; j < i; j--) {
Since i > 1 and the inner loop is j = 1; j < i; j--, j only decreases, so j will always stay less than i and the inner loop never terminates: it becomes an infinite loop.
I am working on a sudoku solver using backtracking. For some reason unknown to me, my code can't use recursion: even when the program reaches the line where I wrote the recursive call, the function won't call itself. The program just continues as if nothing were there.
#include <bits/stdc++.h>
using namespace std;

ifstream in("data.in");
ofstream out("data.out");

int sudoku[10][10];
int f[10];
vector< pair<int, int> > v;

bool continuare(int pas){
    int x = v[pas].first;
    int y = v[pas].second;
    for(int i = x; i <= 9; i++)
        f[ sudoku[i][y] ]++;
    for(int i = x - 1; i >= 1; i--)
        f[ sudoku[i][y] ]++;
    for(int j = x + 1; j <= 9; j++)
        f[ sudoku[x][j] ]++;
    for(int j = x - 1; j >= 1; j--)
        f[ sudoku[x][j] ]++;
    for( int i = x - 3 + x%3, c1 = 0; c1 < 3; c1++, i++ )
        for( int j = y - 3 + y%3, c2 = 0; c2 < 3; c2++, j++ )
            f[ sudoku[i][j] ]++;
    for(int i = 1; i <= 9; i++){
        if( f[i] > 3 )
            return false;
        f[i] = 0;
    }
    return true;
}

void afisare(){
    for(int i = 1; i <= 9; i++){
        for(int j = 1; j <= 9; j++)
            out<<sudoku[i][j]<<" ";
        out<<"\n";
    }
}

void backtracking( int pas ){
    if( pas > v.size() )
        afisare();
    else
        for(int i = 1; i <= 9; i++){
            sudoku[ v[pas].first ][ v[pas].second ] = i;
            if( continuare(pas) )
                backtracking( pas + 1 );
        }
}

int main()
{
    for(int i = 1; i <= 9; i++)
        for(int j = 1; j <= 9; j++){
            in>>sudoku[i][j];
            if(sudoku[i][j] == 0)
                v.push_back( make_pair(i, j) );
        }
    backtracking(1);
    return 0;
}
As you may have noticed, the problem is when backtracking() calls itself and, as I said, nothing happens there.
Copied from a comment, which seems to have solved your question:
Compile with the -g flag and run your executable under gdb. I just did that and saw that it segfaults at f[ sudoku[i][j] ]++; in the continuare function.
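My reading of why that line can crash (an assumption, but easy to verify by hand): the 3x3 box scan in continuare starts at i = x - 3 + x%3, which goes negative for x = 1, so sudoku[i][j] reads before the start of the array. A minimal sketch of just that index computation (boxStart is my own name for it):

```cpp
#include <cassert>

// The starting row index of the 3x3 box scan, exactly as computed
// in the question's continuare(): i = x - 3 + x % 3.
int boxStart(int x) {
    return x - 3 + x % 3;
}
```

boxStart(1) is -1, so f[ sudoku[-1][j] ]++ indexes out of bounds; for a 1-indexed board, a common fix is x - (x - 1) % 3.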
What's the time complexity of the following code?
a = 2;
while (a <= n)
{
    for (k = 1; k <= n; k++)
    {
        b = n;
        while (b > 1)
            b = b / 2;
    }
    a = a * a * a;
}
I'm struggling with the outer while loop, which is log log n; I can't understand why. How would the time complexity change if the last line were a = a * a * a * a;?
The for loop is O(n) and the innermost while loop is O(log n).
So in total, O(n * log n * log log n).
The values of a would be:
a = 2, 2^3, 2^9, 2^27, 2^81, ...
and so on. Now let's say a has the value 2^(3^k) after k iterations of the outer while loop.
For simplicity, let's assume the loop stops once a = n^3, so 2^(3^k) = n^3.
So 3^k = 3 * log_2(n) => k = log_3(3 * log_2(n)) = Θ(log log n).
If the last line were a = a * a * a * a, the time complexity would remain Θ(log log n): then a = 2^(4^k), and assuming the loop stops at a = n^4 gives k = log_4(4 * log_2(n)) = Θ(log log n).
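You can also check the outer-loop count empirically. A quick sketch (outer_iterations is my own helper; it uses unsigned __int128, a GCC/Clang extension, because a reaches 2^81 before stopping):

```cpp
#include <cassert>

// Count iterations of: a = 2; while (a <= n) { ...; a = a * a * a; }
// a takes the values 2^(3^0), 2^(3^1), 2^(3^2), ..., so the count
// is the number of k >= 0 with 3^k <= log2(n).
int outer_iterations(unsigned long long n) {
    int count = 0;
    unsigned __int128 a = 2;  // 128 bits so a = 2^81 doesn't overflow
    while (a <= n) {
        ++count;
        a = a * a * a;        // a = a^3
    }
    return count;
}
```

For example, outer_iterations(100) is 2 (a = 2, 8) while outer_iterations(10^18) is only 4 (a = 2, 2^3, 2^9, 2^27), which matches the very slow Θ(log log n) growth.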
The for loop runs n times and the innermost while loop is O(log n), so each iteration of the outer while loop costs O(n log n); combined with the O(log log n) iterations of the outer loop, the total is O(n log n log log n).
What is the time complexity for the nested loops shown below:
1)
for (int i = 1; i <= n; i += 2) {
    for (int j = 1; j <= n; j += 2) {
        // some O(1) expressions
    }
}
2)
for (int i = 1; i <= n; i += 3) {
    for (int j = 1; j <= n; j += 3) {
        // some O(1) expressions
    }
}
In general:
for (int i = 1; i <= n; i += c) {
    for (int j = 1; j <= n; j += c) {
        // some O(1) expressions
    }
}
Is it really the following: O(nc)?
Your algorithm will execute (n / c) * (n / c) iterations. We divide by c because the counter skips c values on each iteration. See that:
for (var i = 0; i <= n; i = i + 1)
will have roughly n / 1 iterations, and
for (var i = 0; i <= n; i = i + 2)
will have roughly n / 2 iterations.
*Note that the count gets floored. For instance, with n = 3 and c = 2 the loop runs for i = 0 and i = 2, i.e. floor(3 / 2) + 1 = 2 times; the off-by-one constant does not affect the asymptotics.
So, we can generalize it to be:
(n / c)^2
= n^2 / c^2
= (1 / c^2) * n^2
Remember that Big O is only interested in the rate of growth. Since c is a constant, it is dropped from the calculation.
So, the result is:
O((1 / c^2) * n^2) = O(n^2)
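A quick empirical check of this (count_iterations is a made-up helper) showing that c only changes the constant factor, not the quadratic growth:

```cpp
#include <cassert>

// Count iterations of the nested loops from the question for a given step c.
long long count_iterations(int n, int c) {
    long long count = 0;
    for (int i = 1; i <= n; i += c)
        for (int j = 1; j <= n; j += c)
            ++count;
    return count;
}
```

count_iterations(100, 2) is 2500 and count_iterations(200, 2) is 10000: doubling n quadruples the count for any fixed c, i.e. Θ(n^2).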
For the general case, the inner loop is O(n) and the outer loop is O(n); c does not matter for the order of complexity and can be treated as if it were 1. For each iteration of the outer loop, the inner loop iterates n times, so if the outer loop iterates n times, the total number of inner-loop iterations is n * n, or O(n^2).
Imagine there are 10 chairs (n here).
In one for loop you iterate over all the chairs: say you sit on every chair, so you need to sit 10 times to cover all the chairs in a single loop.
Now imagine you sit on the first chair and ask your friend to sit on each of the chairs one by one, including yours; in total your friend has to sit on 10 chairs.
Now you choose the second chair and again ask your friend to sit on each chair; again he has to sit on 10 chairs.
Similarly you can choose the 3rd, 4th, ... chair, and so on; your friend has to sit on 10 chairs for each chair you choose.
10 + 10 + ... = 100 times, which is equivalent to 10^2 = 100.
So the complexity is O(n^2), where n is the number of chairs.
I have a problem with loop unrolling in CUDA.
In normal serial code:
// serial basic:
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// serial loop unroll (factor 4, assuming n is a multiple of 4):
for (int i = 0; i < n; i += 4) {
    c[i]     = a[i]     + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
}
So I think the CUDA loop unrolling looks like this:
int i = 2 * (threadIdx.x + blockIdx.x * blockDim.x);
a[i]     = b[i]     + c[i];
a[i + 1] = b[i + 1] + c[i + 1];
But I can't understand the unrolling example in the CUDA Handbook.
This is a normal GlobalWrite kernel:
__global__ void GlobalWrites( T *out, T value, size_t N )
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += blockDim.x * gridDim.x) {
        out[i] = value;
    }
}
unrolling kernel:
template<class T, const int n>
__global__ void Global_write(T* out, T value, size_t N)
{
    size_t i;
    for (i = n * blockDim.x * blockIdx.x + threadIdx.x;
         i < N - n * blockDim.x * blockIdx.x;
         i += n * gridDim.x * blockDim.x)
        for (int j = 0; j < n; j++) {
            size_t index = i + j * blockDim.x;
            out[index] = value;
        }
    for (int j = 0; j < n; j++) {
        size_t index = i + j * blockDim.x;
        if (index < N) out[index] = value;
    }
}
I know this kernel uses fewer blocks, but can someone explain why it works better (n = 4 gives a 10% speed-up)?
If it wasn't obvious, because n is a template parameter, it is constant at compile time. This means that the compiler is free to optimize the constant trip count loop away by unrolling. It is, therefore, instructive to remove the template magic and unroll the loop by hand for the n=4 case you mentioned:
template<class T>
__global__ void Global_write(T* out, T value, size_t N)
{
    size_t i;
    for (i = 4 * blockDim.x * blockIdx.x + threadIdx.x;
         i < N - 4 * blockDim.x * blockIdx.x;
         i += 4 * gridDim.x * blockDim.x) {
        out[i + 0 * blockDim.x] = value;
        out[i + 1 * blockDim.x] = value;
        out[i + 2 * blockDim.x] = value;
        out[i + 3 * blockDim.x] = value;
    }
    if (i + 0 * blockDim.x < N) out[i + 0 * blockDim.x] = value;
    if (i + 1 * blockDim.x < N) out[i + 1 * blockDim.x] = value;
    if (i + 2 * blockDim.x < N) out[i + 2 * blockDim.x] = value;
    if (i + 3 * blockDim.x < N) out[i + 3 * blockDim.x] = value;
}
The unrolled inner loop yields four completely independent writes which are coalesced. It is this instruction-level parallelism which gives the code higher instruction throughput and improved performance. I highly recommend Vasily Volkov's talk Unrolling Parallel Loops from a GTC conference a few years ago, if you haven't already seen it. His presentation lays out the theoretical background for why this type of loop unrolling is an optimisation in CUDA.
In the templated kernel, const int n is known at compile time, allowing the compiler to fully unroll the for(int j = 0; j < n; j++) loop, removing the conditional checks on that loop. If the loop size is not known at compile time, the compiler cannot unroll the loop. Simple as that.
I got a task to do: I need to run a flood fill algorithm on CUDA. On the CPU I have a non-recursive method with a queue, but I don't have any idea how to move this code to the GPU so that it runs faster. Can anybody help?
Edit:
This is my CPU code, just a normal flood fill with a few small modifications of mine:
void cpuFloodFill(std::vector<std::vector<int>> *colorVector, int node)
{
    std::queue<int> q;
    q.push(node);
    int i, j;
    while (!q.empty())
    {
        int k = q.front();
        q.pop();
        k2ij(k, &i, &j);
        if ((*colorVector)[i][j] == COLOR_TARGET)
        {
            (*colorVector)[i][j] = COLOR_REPLACEMENT;
            if (i - 1 >= 0 && i - 1 < X && j >= 0 && j < Y)
                q.push(ij2k(i - 1, j));
            if (i + 1 >= 0 && i + 1 < X && j >= 0 && j < Y)
                q.push(ij2k(i + 1, j));
            if (i >= 0 && i < X && j - 1 >= 0 && j - 1 < Y)
                q.push(ij2k(i, j - 1));
            if (i >= 0 && i < X && j + 1 >= 0 && j + 1 < Y)
                q.push(ij2k(i, j + 1));
        }
    }
}
There's a GPU flood fill implementation in an image skeletonization toolkit named CUDA Skel. The link to its source code is on the website. Please note the license of the code: the source and toolkit are free for research purposes with due citation.
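As a starting point for the data-parallel restructuring: one common way to map flood fill onto a GPU is to drop the queue and instead do repeated whole-grid sweeps, where every cell checks whether a neighbour was filled in a previous pass; each sweep is then one kernel launch with one thread per cell. A plain C++ sketch of that structure (my own illustration, not the CUDA Skel code):

```cpp
#include <cassert>
#include <vector>

// Frontier-sweep flood fill: instead of a queue, repeatedly sweep the
// whole grid, filling any target-colored cell adjacent to an already
// filled cell. The per-cell body is independent of scan order for
// correctness, so on a GPU each sweep maps to one kernel launch.
void sweepFloodFill(std::vector<std::vector<int>>& grid,
                    int si, int sj, int target, int replacement) {
    int X = grid.size(), Y = grid[0].size();
    if (grid[si][sj] != target) return;
    grid[si][sj] = replacement;          // seed
    bool changed = true;
    while (changed) {                    // one "kernel launch" per pass
        changed = false;
        for (int i = 0; i < X; ++i)
            for (int j = 0; j < Y; ++j) {
                if (grid[i][j] != target) continue;
                bool nb = (i > 0     && grid[i - 1][j] == replacement) ||
                          (i + 1 < X && grid[i + 1][j] == replacement) ||
                          (j > 0     && grid[i][j - 1] == replacement) ||
                          (j + 1 < Y && grid[i][j + 1] == replacement);
                if (nb) { grid[i][j] = replacement; changed = true; }
            }
    }
}
```

On a real GPU you would typically double-buffer the grid (or use atomics) so each pass reads the previous pass's state, and use a device-side flag for `changed`; the in-place CPU version above converges to the same result.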