Multiply two matrix in cuda c [closed] - cuda

Closed 7 years ago.
I have two matrices, A (32×32) and B (32×n), where n comes from the input and is between 2,000 and 2,000,000.
I have two kinds of input: one is integers between 0 and 255, and the other is 0/1 values. This multiplication is in a loop that iterates 3000 times. B (32×n) comes from the input and is constant across all iterations, but A (32×32) can change in each iteration.
// read B from file
// read A from file
double D[3000];
for (int i = 0; i < 3000; i++)
{
    C = multiply(A, B);
    // D[i] = mean of all elements in C
    // build A from B using D[i] (this part is a really complicated sequential process that contains lots of ifs and switches)
}
What is the fastest way to do this?
thank you.

Nobody here is going to write code for you; that is not what Stack Overflow is intended for. However, it would appear that there are a number of characteristics of the problem which you should be looking to exploit to improve the performance of your code:
Recognise that because one of the matrices only contains 0 or 1 and you are performing this in integer, what you are describing as matrix multiplication is really a large number of independent sparse sums
Recognise that because the next operation is to compute an average, you don't actually have to store the intermediate dot products and could directly perform a reduction on partial results of the matrix row summation
There are probably parallel primitives in the thrust library which you could use for prototyping, and an optimal hand written kernel would be aiming to fuse both the first and most of the second part of the operation into a single kernel.
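As a rough illustration of the second point, here is a minimal pure-Python sketch (not CUDA, and not the poster's code). It assumes D[i] is the only thing needed from C, and it relies on the identity that the sum of all elements of A·B equals the dot product of the column sums of A with the row sums of B. The function names `mean_of_product` and `mean_of_product_naive` are made up for this example.

```python
def mean_of_product(A, B):
    """Mean of all elements of C = A*B, computed without forming C."""
    rows, inner, cols = len(A), len(B), len(B[0])
    # Column sums of A: one value per inner index k.
    col_sums_a = [sum(A[i][k] for i in range(rows)) for k in range(inner)]
    # Row sums of B: B is constant in the question, so in a real loop
    # these could be computed once, outside the 3000 iterations.
    row_sums_b = [sum(B[k]) for k in range(inner)]
    total = sum(a * b for a, b in zip(col_sums_a, row_sums_b))
    return total / (rows * cols)


def mean_of_product_naive(A, B):
    """Reference version: build C explicitly, then average it."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    return sum(sum(row) for row in C) / (rows * cols)


A = [[1, 0], [1, 1]]
B = [[1, 2, 3], [4, 5, 6]]
print(mean_of_product(A, B), mean_of_product_naive(A, B))
```

If this assumption holds for your problem, the per-iteration work depends only on the 32×32 matrix A rather than on n.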


Bouncing off inclined surface [closed]

Closed 9 years ago.
I can calculate distance between inclined line and my ball (with normal vector), But how can I calculate new velocity?
Anders' answer was a good one, but I realise that you may not have a strong mathematical background, so I will elaborate. The problem you have at the moment is poorly stated. However, see the following figure
This will allow us to derive the equation you require. Now, the scalar product of two vectors a and b, a.b, gives the magnitude of a multiplied by the projection of b onto a. Basically, if we take n as a unit vector (a vector of length 1), then a.n gives the magnitude of the component of a which acts in the direction of n.
So, to get the velocity V after the bounce, we first split the incoming velocity U into components parallel and perpendicular to the plane.
Perpendicular to the plane, in direction n, we have a vector velocity w = (U.n) n. This means that we can write U = (U.n) n + [U - (U.n) n]. This is saying that U is made up of the perpendicular component of itself plus the parallel component of itself. Now, -V is very similar to U, but the parallel component acts in the reverse direction, so we can write -V = (U.n) n - [U - (U.n) n].
Combining the above gives the result Anders stated, i.e. V = U - 2[(U.n) n]. The dot/scalar product is defined as a.b = |a||b|cos(A), where A is the angle between the vectors laid together tail-to-tail. This should enable you to solve your problem.
I hope this helps
If the vector v=(vx,vy) is the initial velocity and the plane has normal n=(nx,ny), then the new reflected velocity vector r will be
r=v−2(v⋅n)*n
The product (v⋅n) is the dot product of v and n, defined as vx*nx + vy*ny. Note that the plane normal must be normalized (length 1.0). A related question with the same answer: https://math.stackexchange.com/questions/13261/how-to-get-a-reflection-vector
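A minimal Python sketch of the reflection formula above. The function name `reflect` is made up for this example, and the normal is normalized inside the function so that callers may pass a normal of any non-zero length.

```python
import math


def reflect(v, n):
    """Reflect velocity v off a surface with normal n: r = v - 2 (v.n) n."""
    length = math.sqrt(sum(c * c for c in n))
    n_hat = tuple(c / length for c in n)          # unit normal
    dot = sum(vc * nc for vc, nc in zip(v, n_hat))
    return tuple(vc - 2.0 * dot * nc for vc, nc in zip(v, n_hat))


# A ball moving right and down hits a horizontal floor (normal pointing up):
# the horizontal component is kept and the vertical component is flipped.
print(reflect((1.0, -1.0), (0.0, 2.0)))
```

The same function works unchanged for 3D vectors, since it only uses the dot product.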

What is out of bag error in Random Forests? [closed]

Closed 4 years ago.
What is out of bag error in Random Forests?
Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will take an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
summary of RF:
Random Forests algorithm is a classifier based on primarily two methods -
Bagging
Random subspace method.
Suppose we decide to have S number of trees in our forest then we first create S datasets of "same size as original" created from random resampling of data in T with-replacement (n times for each dataset). This will result in {T1, T2, ... TS} datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti can have duplicate data records and Ti can be missing several data records from original datasets. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
Bagging is the process of taking bootstraps & then aggregating the models learned on each bootstrap.
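Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of the M possible features when building each tree (in Breiman's formulation, a fresh random subset of m features is drawn at every split). This is called the random subspace method.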
So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree and produce S outputs (one for each tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote on this set.
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi,yi) in the original training set T, select all Tk which do not include (Xi,yi). This subset, pay attention, is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over the trees whose Tk does not contain (Xi,yi).
Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare it with known yi's).
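The procedure above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a random forest: a 1-nearest-neighbour rule on one-dimensional inputs stands in for each tree, and the names `train_1nn` and `oob_error` are made up for this example.

```python
import random
from collections import Counter


def train_1nn(X, y, idx):
    """Stand-in for a tree: 1-nearest-neighbour on one bootstrap sample."""
    sample = [(X[i], y[i]) for i in idx]
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]


def oob_error(X, y, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # One bootstrap dataset per model: n draws with replacement.
    bags = [[rng.randrange(n) for _ in range(n)] for _ in range(n_models)]
    models = [train_1nn(X, y, idx) for idx in bags]
    in_bag = [set(idx) for idx in bags]
    wrong = counted = 0
    for i in range(n):
        # Vote only with models whose bootstrap did NOT contain record i.
        votes = Counter(m(X[i]) for m, bag in zip(models, in_bag) if i not in bag)
        if not votes:          # record i was in every bag: no OOB vote
            continue
        counted += 1
        if votes.most_common(1)[0][0] != y[i]:
            wrong += 1
    return wrong / counted if counted else float("nan")


# Two well-separated clusters, so the OOB error should be very small.
X = list(range(10)) + list(range(100, 110))
y = [0] * 10 + [1] * 10
print(oob_error(X, y))
```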
Why is it important?
The study of error estimates for bagged classifiers in Breiman
[1996b], gives empirical evidence to show that the out-of-bag estimate
is as accurate as using a test set of the same size as the training
set. Therefore, using the out-of-bag error estimate removes the need
for a set aside test set.1
(Thanks @Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data. As the forest is built, each tree can thus be tested (similar to leave one out cross validation) on the samples not used in building that tree. This is the out of bag error estimate - an internal error estimate of a random forest as it is being constructed.
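The "about 2/3" figure follows from the bootstrap itself. As a short worked calculation, assuming each bootstrap dataset is built from n draws with replacement out of n records (as described earlier), the chance that a given record is never drawn is:

```latex
P(\text{record } i \notin T_k) = \left(1 - \frac{1}{n}\right)^{n}
\;\xrightarrow[n \to \infty]{}\; e^{-1} \approx 0.368
```

So each tree sees roughly 63.2% of the distinct records, and the remaining ~36.8% are out-of-bag for that tree.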

Vectorization or sum as matrix operations [closed]

Closed 9 years ago.
Let there be the following definition of gradient descent cost function
with the hypothesis function defined as
what I've come up with for multivariate linear regression is
theta = theta - alpha * 1/m * ([theta', -1]*[X';y']*X)';
h_theta = 1/(2*m)* (X*theta - y)'*(X*theta-y);
(octave notation, ' means matrix transpose, [A, n] means adding a new column to matrix A with scalar value n, [A; B] means appending matrix B to matrix A row-wise)
It's doing its job correctly as far as I can tell (the plots look ok); however, I have a strong feeling that it's unnecessarily complicated.
How can I write it with as few matrix operations as possible (and no element-wise operations, of course)?
I don't think that is unnecessarily complicated, and instead this is what you want. Matrix operations are good because you don't have to loop over elements yourself or do element-wise operations. I remember taking a course online and my solution seems pretty similar.
The way you have it is an efficient way of doing it, as it is fully vectorized. It could also be done with a for loop over the summation, but that is very inefficient in terms of processing power.
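For what it's worth, the update in the question can be written a little more compactly. The following is a sketch of the algebra, assuming X is the m-row design matrix, y is an m×1 column and theta is a column vector (the shapes implied by the Octave code in the question):

```latex
\begin{aligned}
\begin{bmatrix} \theta^{T} & -1 \end{bmatrix}
\begin{bmatrix} X^{T} \\ y^{T} \end{bmatrix}
  &= \theta^{T} X^{T} - y^{T} = (X\theta - y)^{T} \\
\left( (X\theta - y)^{T} X \right)^{T}
  &= X^{T} (X\theta - y) \\
\theta &:= \theta - \frac{\alpha}{m}\, X^{T} (X\theta - y)
\end{aligned}
```

In Octave notation this would read theta = theta - alpha/m * X'*(X*theta - y), which avoids building the two concatenated matrices. Note also that the second line of the question's code computes the cost J(theta), even though the variable is named h_theta.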

Defining a Total Order [closed]

Closed 8 years ago.
How would you define a total order? For example, if you needed to define a total ordering on shapes, etc., how would you go about doing so?
Edit: Specifically, how would you define a total order based on an object with the coordinates (x,y,z). I don't understand how you could structure an ordering whereby each object is unique and sortable.
There is no "natural" ordering on 2D or 3D objects. However, if you want to induce an ordering, you can compare them by their coordinates, for example this way:
// returns -1 if o1<o2, 1 if o1>o2, 0 if o1==o2
int Compare(MyObject o1, MyObject o2)
{
    if (o1.x > o2.x) return 1;
    if (o1.x < o2.x) return -1;
    if (o1.y > o2.y) return 1;
    if (o1.y < o2.y) return -1;
    if (o1.z > o2.z) return 1;
    if (o1.z < o2.z) return -1;
    return 0;
}
This assumes objects are uniquely identified by their coordinates, of course.
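The comparison above is a lexicographic order on (x, y, z). As a small sketch in Python, where tuples already compare lexicographically, the same order can be expressed as a sort key. The class name `Point` and the helper `order_key` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    z: float


def order_key(p):
    """Lexicographic key: compare x first, then y, then z."""
    return (p.x, p.y, p.z)


points = [Point(1, 2, 3), Point(0, 9, 9), Point(1, 2, 0), Point(1, 0, 5)]
for p in sorted(points, key=order_key):
    print(p)
```

Two points get the same key only if all three coordinates are equal, which matches the assumption that objects are uniquely identified by their coordinates.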
This ordering will let you sort and compare such objects. The question you have to answer yourself is if it helps you for any of the problems you want to solve with that. An ordering on a 1D-set is typically used to make lookups faster, especially when you want not only a specific element from your set, but all elements from a given range.
For 2D or 3D sets, a similar question is to find all elements within a given rectangle or cube. For that purpose, the order above does not support you very well. There are data structures like a 2D quadtree or a 3D octree that support this task much better.

Of Ways to Count the Limitless Primes [closed]

Closed 7 years ago.
Alright, so maybe I shouldn't have shrunk this question sooo much... I have seen the post on the most efficient way to find the first 10000 primes. I'm looking for all possible ways. The goal is to have a one stop shop for primality tests. Any and all tests people know for finding prime numbers are welcome.
And so:
What are all the different ways of finding primes?
Some prime tests only work with certain numbers, for instance, the Lucas–Lehmer test only works for Mersenne numbers.
Most prime tests used for big numbers can only tell you that a certain number is "probably prime" (or, if the number fails the test, it is definitely not prime). Usually you can continue the algorithm until you have a very high probability of a number being prime.
Have a look at this page and especially its "See Also" section.
The Miller-Rabin test is, I think, one of the best tests. In its standard form it gives you probable primes - though it has been shown that if you apply the test to a number beneath 3.4*10^14, and it passes the test for each parameter 2, 3, 5, 7, 11, 13 and 17, it is definitely prime.
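A minimal Python sketch of the Miller-Rabin test with the fixed bases mentioned above. It is deterministic only under the bound quoted in this answer (n below roughly 3.4*10^14); for larger n it is merely a probable-prime test. The function name `is_prime` is made up for this example.

```python
def is_prime(n):
    """Miller-Rabin with bases 2..17; exact for n below about 3.4e14."""
    bases = (2, 3, 5, 7, 11, 13, 17)
    if n < 2:
        return False
    for p in bases:                 # handle small primes and their multiples
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue                # n passes for this base
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break               # n passes for this base
        else:
            return False            # a is a witness: n is composite
    return True


print([k for k in range(2, 60) if is_prime(k)])
```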
The AKS test was the first deterministic, proven, general, polynomial-time test. However, to the best of my knowledge, its best implementation turns out to be slower than other tests unless the input is ridiculously large.
For a given integer, the fastest primality check I know is:
Take a list of 2 to the square root of the integer.
Loop through the list, taking the remainder of the integer / current number
If the remainder is zero for any number in the list, then the integer is not prime.
If the remainder was non-zero for all numbers in the list, then the integer is prime.
It uses significantly less memory than The Sieve of Eratosthenes and is generally faster for individual numbers.
The Sieve of Eratosthenes is a decent algorithm:
Take the list of positive integers 2 to any given Ceiling.
Take the next item in the list (2 in the first iteration) and remove all multiples of it (beyond the first) from the list.
Repeat step two until you reach the given Ceiling.
Your list is now composed purely of primes.
There is a functional limit to this algorithm in that it exchanges speed for memory. When generating very large lists of primes the memory capacity needed skyrockets.
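The sieve steps above can be sketched in Python as follows. The function name `sieve` is made up for this example, and it uses the common shortcut of starting each crossing-out at p*p and stopping once p*p exceeds the ceiling, which gives the same result as the steps as written.

```python
def sieve(ceiling):
    """Return all primes <= ceiling using the Sieve of Eratosthenes."""
    if ceiling < 2:
        return []
    is_candidate = [True] * (ceiling + 1)   # one flag per integer: this is the memory cost
    is_candidate[0] = is_candidate[1] = False
    p = 2
    while p * p <= ceiling:
        if is_candidate[p]:
            # Remove all multiples of p beyond p itself.
            for multiple in range(p * p, ceiling + 1, p):
                is_candidate[multiple] = False
        p += 1
    return [k for k, flag in enumerate(is_candidate) if flag]


print(sieve(30))
```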
@akdom's question to me:
Looping would work fine on my previous suggestion, and you don't need to do any calculations to determine if a number is even; in your loop, simply skip every even number, as shown below:
// Assuming theInteger is the number to be tested for primality.
// First check whether theInteger is divisible by 2. If not, run this loop.
// This loop skips all even numbers.
for (int i = 3; i <= sqrt(theInteger); i += 2)
{
    if (theInteger % i == 0)
    {
        // getting here denotes that theInteger is not prime
        // somehow indicate that some number, i, divides it and break
        break;
    }
}
A Rutgers grad student recently found a recurrence relation that generates primes. The difference of its successive numbers will generate either primes or 1's.
a(1) = 7
a(n) = a(n-1) + gcd(n,a(n-1)).
It produces a lot of 1's that need to be filtered out. Benoit Cloitre also has this recurrence that does a similar task:
b(1) = 1
b(n) = b(n-1) + lcm(n,b(n-1))
then the ratio of successive numbers, minus one [b(n)/b(n-1)-1], is either 1 or prime. A full account of all this can be read at Recursivity.
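Both recurrences are easy to try out. A small Python sketch follows; the function names are made up for this example, and the lcm is computed from the gcd to avoid depending on a recent Python version.

```python
from math import gcd


def rowland_differences(count):
    """First `count` differences a(n) - a(n-1) for a(1) = 7, n = 2, 3, ..."""
    a, out = 7, []
    for n in range(2, count + 2):
        step = gcd(n, a)
        out.append(step)            # a(n) - a(n-1) is exactly gcd(n, a(n-1))
        a += step
    return out


def cloitre_ratios(count):
    """First `count` values of b(n)/b(n-1) - 1 for b(1) = 1, n = 2, 3, ..."""
    b, out = 1, []
    for n in range(2, count + 2):
        lcm = n * b // gcd(n, b)
        nxt = b + lcm
        out.append(nxt // b - 1)    # the ratio is always an integer
        b = nxt
    return out


print(rowland_differences(11))
print(cloitre_ratios(6))
```

Filtering the 1's out of either list leaves only primes, with repeats.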
For the sieve, you can do better by using a wheel instead of adding one each time; check out the Improved Incremental Prime Number Sieves. Here is an example of a wheel. Suppose we want to skip the multiples of 2 and 5. Their wheel is [2,4,2,2].
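To make the wheel concrete, here is a small Python sketch. The function name `wheel_candidates` is made up for this example, and the starting value 11 is chosen because the increments [2,4,2,2] line up with the numbers 1, 3, 7, 9 modulo 10.

```python
from itertools import cycle


def wheel_candidates(start, wheel, limit):
    """Step from `start` using the wheel increments, up to `limit` inclusive."""
    out = []
    value = start
    for step in cycle(wheel):
        if value > limit:
            break
        out.append(value)
        value += step
    return out


# None of these candidates is divisible by 2 or 5.
print(wheel_candidates(11, [2, 4, 2, 2], 30))
```

The candidates still include composites such as 21 and 27; the wheel only avoids ever visiting multiples of the primes it was built from.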
In your algorithm using the list from 2 to the root of the integer, you can improve performance by only testing odd numbers after 2. That is, your list only needs to contain 2 and all odd numbers from 3 to the square root of the integer. This cuts the number of times you loop in half without introducing any more complexity.
@theprise
If I were wanting to use an incrementing loop instead of an instantiated list (problems with memory for massive numbers...), what would be a good way to do that without building the list?
It doesn't seem like it would be cheaper to do a divisibility check for the given integer (X % 3) than just the check for the normal number (N % X).
If you want to find a way of generating prime numbers, this has been covered in a previous question.