Quicksort help: not sure why partitioning returns an index and not an array

I was wondering if anyone could help me with quicksort. I understand the general idea of partitioning, but I'm not sure why it returns an index.
int partition(int arr[], int left, int right)
{
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];
    while (i <= j) {
        while (arr[i] < pivot)
            i++;
        while (arr[j] > pivot)
            j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    }
    return i;
}
void quickSort(int arr[], int left, int right) {
    int index = partition(arr, left, right);
    if (left < index - 1)
        quickSort(arr, left, index - 1);
    if (index < right)
        quickSort(arr, index, right);
}
I understand the whole rearranging part; it makes sense to me. But I'm not sure why partition returns just an index. I thought it was supposed to return an array? Like, if the problem was to sort {1, 12, 5, 26, 7, 14, 3, 7, 2}, I thought it would return...
1, 2, 5, 7, 3, 14, 7, 26, 12
I guess that's why I'm not understanding the actual quickSort function. If someone could explain it clearly and in an easy-to-understand way, it would be much appreciated. Thanks a lot!

The index that is returned is there to identify where the recursion within the quickSort function stops. It is essentially the index of the pivot position, which separates the smaller numbers from the bigger ones.
AND: you are referring to an enhanced QuickSort algorithm. In the basic version of QuickSort the returned index wouldn't be needed.
It would also work (but a little slower, since the checks before the recursive calls save some calls) written like this:
void quickSort(int arr[], int left, int right)
{
    if (left < right)
    {
        int index = partition(arr, left, right);
        quickSort(arr, left, index - 1);
        quickSort(arr, index, right);
    }
}
Note that with this partition scheme the element at index is not necessarily in its final position, so the second recursive call must include it; a quickSort(arr, index + 1, right) variant is only valid for partition schemes (such as Lomuto's) that place the pivot at its final position.

Your partition function is modifying the array in-place. The integer it is returning is the index before which values are smaller than the pivot, and starting at which the values are bigger than the pivot. The two recursive calls are sorting the smaller elements and the larger elements; the condition tested before the recursive calls serves as the base case.

Related

Concurrent Writing CUDA

I am new to CUDA and I am facing a problem with a basic projection kernel. What I am trying to do is project a 3D point cloud onto a 2D image. In case multiple points project to the same pixel, only the point with the smallest depth (the closest one) should be written to the matrix.
Suppose two 3D points fall in image pixel (0, 0). The way I am implementing the depth check here does not work: if (depth > entry.depth) fails because the two threads (from two different blocks) execute this "in parallel". In the printf statement, in fact, both entry.depth values give the numeric limit (the initialization value).
To solve this problem I thought of using a tensor-like structure: each image pixel corresponds to an array of values. Afterwards the array is reduced and only the point with the smallest depth is kept. Are there any smarter and more efficient ways of solving this problem?
__global__ void kernel_project(CUDAWorkspace* workspace_, const CUDAMatrix* matrix_) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= matrix_->size())
        return;
    const Point3& full_point = matrix_->at(tid);
    float depth = 0.f;
    Point2 image_point;
    // full point as input, depth and image point as output
    const bool is_good = project(image_point, depth, full_point); // dst, dst, src
    if (!is_good)
        return;
    const int irow = (int) image_point.y();
    const int icol = (int) image_point.x();
    if (!workspace_->inside(irow, icol)) {
        return;
    }
    // get pointer to entry
    WorkspaceEntry& entry = (*workspace_)(irow, icol);
    // entry.depth is set initially to a numeric limit
    if (depth > entry.depth) // PROBLEM HERE
        return;
    printf("entry depth %f\n", entry.depth); // BOTH PRINT THE NUMERIC LIMIT
    entry.point = full_point; // was: point (undeclared)
    entry.depth = depth;
}

Count no. of ones by position (place) in an array of 32-bit binary numbers

An array (of size 10^5) of 32-bit binary numbers is given; we're required to count the number of ones at every bit position across those numbers.
For example:
Array : {10101,1011,1010,1}
Counts : {1's place: 3, 2's place: 2, 3's place: 1, 4's place: 2, 5's place: 1}
No bit manipulation technique seems to satisfy the constraints to me.
Well, this should be solvable with two loops: one going over the array, the other masking the right bits. Running time should not be too bad for your constraints.
Here is a Rust implementation (out of my head, not thoroughly tested):
fn main() {
    let mut v = vec![];
    for i in 1..50 * 1000 {
        v.push(i);
    }
    let r = bitcount_arr(v);
    r.iter().enumerate().for_each(|(i, x)| print!("index {}:{} ", i + 1, x));
}

fn bitcount_arr(input: Vec<u32>) -> [u32; 32] {
    let mut res = [0; 32];
    for num in input {
        for i in 0..32 { // 0..31 would miss the top bit
            let mask = 1 << i;
            if num & mask != 0 {
                res[i] += 1;
            }
        }
    }
    res
}
This can be done with transposed addition, though the array is a bit long for it.
To transpose addition, use an array of counters, but instead of using one counter for every position we'll use one counter for every bit of the count. So a counter that tracks for each position whether the count is even/odd, a counter that tracks for each position whether the count has a 2 in it, etc.
To add an element of the array into this, only half-add operations (& to find the new carry, ^ to update) are needed, since it's only a conditional increment: (not tested)
uint32_t counters[17] = {0};
for (uint32_t elem : array) {
    uint32_t c = elem;
    for (int i = 0; i < 17; i++) {
        uint32_t nextcarry = counters[i] & c;
        counters[i] ^= c;
        c = nextcarry;
    }
}
I chose 17 counters because log2(10^5) is just less than 17. So even if all bits are 1, the counters won't wrap.
To read off the result for bit k, take the k'th bit of every counter.
There are slightly more efficient ways that can add several elements of the array into the counters at once using some full-adds and duplicated counters.

igraph: calculating a minimum spanning tree with weights (C interface)

I have been trying to calculate a minimum spanning tree using the Prim method, but I have got rather confused about the way weights are used in this context. The suggested example program in the source documentation does not appear to be correct; I don't understand why the edge betweenness needs to be calculated.
Please see the following program; it's designed to make a simple undirected graph.
#include <igraph.h>

int main()
{
    igraph_vector_t eb, edges;
    igraph_vector_t weights;
    long int i;
    igraph_t theGraph, tree;

    struct arg {
        int index;
        int source;
        int target;
        float weight;
    };

    struct arg data[] = {
        {0, 0, 1, 2.0},
        {1, 1, 2, 3.0},
        {2, 2, 3, 44.0},
        {3, 3, 4, 3.0},
        {4, 4, 1, 2.0},
        {5, 4, 5, 9.0},
        {6, 4, 6, 3.0},
        {7, 6, 5, 7.0}
    };
    int nargs = sizeof(data) / sizeof(struct arg);

    igraph_empty(&theGraph, nargs, IGRAPH_UNDIRECTED);
    igraph_vector_init(&weights, nargs);

    // create graph
    for (i = 0; i < nargs; i++) {
        igraph_add_edge(&theGraph, data[i].source, data[i].target);
        // add a weight per entry
        igraph_vector_set(&weights, i, data[i].weight);
    }

    igraph_vector_init(&eb, igraph_ecount(&theGraph));
    igraph_edge_betweenness(&theGraph, &eb, IGRAPH_UNDIRECTED, &weights);
    for (i = 0; i < igraph_vector_size(&eb); i++) {
        VECTOR(eb)[i] = -VECTOR(eb)[i];
    }
    igraph_minimum_spanning_tree_prim(&theGraph, &tree, &eb);
    igraph_write_graph_edgelist(&tree, stdout);

    igraph_vector_init(&edges, 0);
    igraph_minimum_spanning_tree(&theGraph, &edges, &eb);
    igraph_vector_print(&edges);
    igraph_vector_destroy(&edges);

    igraph_destroy(&tree);
    igraph_destroy(&theGraph);
    igraph_vector_destroy(&eb);

    return 0;
}
Can anybody see anything wrong with this program? It's designed to build a simple graph with what I hope is the correct way to use a weight argument: one value per edge between a source and a target.
The section about edge betweenness comes from the original code example for the use of Prim. It just needs to be removed (passing the weights vector to the spanning-tree functions instead) for the program to work correctly with a user-supplied weight per edge.

CUDA binary search implementation

I am trying to speed up the CPU binary search on the GPU. Unfortunately, the GPU version is always much slower than the CPU version. Perhaps the problem is not suitable for the GPU, or am I doing something wrong?
CPU version (approx. 0.6 ms):
using a sorted array of length 2000 and doing a binary search for a specific value
...
Lookup ( search[j], search_array, array_length, m );
...
int Lookup ( int search, int* arr, int length, int& m )
{
    int l(0), r(length-1);
    while ( l <= r )
    {
        m = (l+r)/2;
        if ( search < arr[m] )
            r = m-1;
        else if ( search > arr[m] )
            l = m+1;
        else
            return m; // was: return index[m]; -- index[] is undefined
    }
    if ( arr[m] >= search )
        return m;
    return (m+1);
}
GPU version (approx. 20 ms):
using a sorted array of length 2000 and doing a binary search for a specific value
....
p_ary_search<<<16, 64>>>(search[j], array_length, dev_arr, dev_ret_val);
....
__global__ void p_ary_search(int search, int array_length, int *arr, int *ret_val )
{
    const int num_threads = blockDim.x * gridDim.x;
    const int thread = blockIdx.x * blockDim.x + threadIdx.x;
    int set_size = array_length;

    ret_val[0] = -1; // return value
    ret_val[1] = 0;  // offset

    while(set_size != 0)
    {
        // Get the offset of the array, initially set to 0
        int offset = ret_val[1];
        // I think this is necessary in case a thread gets ahead, and resets offset before it's read
        // This isn't necessary for the unit tests to pass, but I still like it here
        __syncthreads();
        // Get the next index to check
        int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
        // If the index is outside the bounds of the array then lets not check it
        if (index_to_check < array_length)
        {
            // If the next index is outside the bounds of the array, then set it to maximum array size
            int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
            if (next_index_to_check >= array_length)
            {
                next_index_to_check = array_length - 1;
            }
            // If we're at the mid section of the array reset the offset to this index
            if (search > arr[index_to_check] && (search < arr[next_index_to_check]))
            {
                ret_val[1] = index_to_check;
            }
            else if (search == arr[index_to_check])
            {
                // Set the return var if we hit it
                ret_val[0] = index_to_check;
            }
        }
        // Since this is a p-ary search divide by our total threads to get the next set size
        set_size = set_size / num_threads;
        // Sync up so no threads jump ahead and get a bad offset
        __syncthreads();
    }
}
Even if I try bigger arrays, the time ratio is not any better.
You have way too many divergent branches in your code, so you're essentially serializing the entire process on the GPU. You want to break up the work so that all the threads in the same warp take the same path in the branch. See page 47 of the CUDA Best Practices Guide.
I must admit I'm not entirely sure what your kernel does, but am I right in assuming that you are looking for just one index that satisfies your search criteria? If so, then have a look at the reduction sample that comes with CUDA for some pointers on how to structure and optimize such a query. (What you are doing is essentially trying to reduce the closest index to your query.)
Some quick pointers though:
You are performing an awful lot of reads and writes to global memory, which is incredibly slow. Try using shared memory instead.
Secondly, remember that __syncthreads() only syncs threads in the same block, so your reads/writes to global memory won't necessarily get synced across all threads (though the latency from your global memory writes may actually make it appear as if they do).

How to store a symmetric matrix?

Which is the best way to store a symmetric matrix in memory?
It would be good to save half of the space without compromising the speed and complexity of the structure too much. This is a language-agnostic question, but if you need to make some assumptions, just assume it's a good old plain programming language like C or C++.
It seems like something that only makes sense when there is a way to keep things simple, or when the matrix itself is really big; am I right?
Just for the sake of formality, I mean that this assertion is always true for the data I want to store:
matrix[x][y] == matrix[y][x]
Here is a good method to store a symmetric matrix; it requires only N(N+1)/2 memory:
int fromMatrixToVector(int i, int j, int N)
{
    if (i <= j)
        return i * N - (i - 1) * i / 2 + j - i;
    else
        return j * N - (j - 1) * j / 2 + i - j;
}
For some triangular matrix
0 1 2 3
4 5 6
7 8
9
the 1D representation (stored in a std::vector, for example) looks as follows:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
And the call fromMatrixToVector(1, 2, 4) returns 5, so the matrix element is vector[5] -> 5.
For more information see http://www.codeguru.com/cpp/cpp/algorithms/general/article.php/c11211/TIP-Half-Size-Triangular-Matrix.htm
I find that many high-performance packages just store the whole matrix, but then only read the upper triangle or the lower triangle. They might then use the additional space for storing temporary data during the computation.
However, if storage is really an issue, then just store the n(n+1)/2 elements making up the upper triangle in a one-dimensional array. If that makes access complicated for you, just define a set of helper functions.
In C, to access the packed matrix matA you could define a macro (using the same index mapping as above):
#define A(i, j, n) ((i) <= (j) ? matA[(i)*(n) - ((i)-1)*(i)/2 + (j) - (i)] : matA[(j)*(n) - ((j)-1)*(j)/2 + (i) - (j)])
then you can access your array nearly normally.
Well, I would try a triangular matrix, like this:
int[][] sym = new int[rows][];
for (int i = 0; i < rows; ++i) {
    sym[i] = new int[i + 1];
}
But then you will have to face the problem when someone wants to access the "other side". E.g. he wants to access [0][10], but in your case this value is stored in [10][0] (assuming a 10x10 matrix).
The probably "best" way is the lazy one: don't do anything until the user requests it. So you could load the specific row if the user types something like print(matrix[4]).
If you want to use a one-dimensional array, the code would look something like this:
int[] matrix = new int[(rows * (rows + 1)) >> 1];
int z;
matrix[((z = (x < y ? y : x)) * (z + 1) >> 1) + (y < x ? y : x)] = yourValue;
You can get rid of the multiplications if you create an additional look-up table:
int[] matrix = new int[(rows * (rows + 1)) >> 1];
int[] lookup = new int[rows];
for (int i = 0; i < rows; i++)
{
    lookup[i] = (i * (i + 1)) >> 1;
}
matrix[lookup[x < y ? y : x] + (x < y ? x : y)] = yourValue;
If you're using something that supports operator overloading (e.g. C++), it's pretty easy to handle this transparently. Just create a matrix class that checks the two subscripts, and if the second is greater than the first, swap them:
template <class T>
class sym_matrix {
    std::vector<std::vector<T> > data;
public:
    T operator()(int x, int y) {
        if (y > x)
            return data[y][x];
        else
            return data[x][y];
    }
};
For the moment I've skipped over everything else and just covered the subscripting. In reality, to handle use as both an lvalue and an rvalue correctly, you'll typically want to return a proxy instead of a T directly. You'll want a ctor that creates data as a triangle (i.e., for an NxN matrix, the first row will have N elements, the second N-1, and so on -- or, equivalently, 1, 2, ...N). You might also consider creating data as a single vector -- you have to compute the correct offset into it, but that's not terribly difficult, and it will use a bit less memory, run a bit faster, etc. I'd use the simple code for the first version, and optimize later if necessary.
You could use a staggered array (or whatever they're called) if your language supports it, and when x < y, swap the positions of x and y. So...
Pseudocode (somewhat Python-style, but not really) for an n x n matrix:
matrix[n][]
for i from 0 to n-1:
    matrix[i] = some_value_type[i + 1]
[next, assign values to the elements of the half-matrix]
And then when referring to values....
if x < y:
    return matrix[y][x]
else:
    return matrix[x][y]