The most efficient way to calculate an integral in a dataset range

I have an array of 10 rows by 20 columns. Each column corresponds to a data set that cannot be fitted with any continuous mathematical function (it's a series of numbers derived experimentally). I would like to calculate the integral of each column between row 4 and row 8, then store the results in a new array (20 rows x 1 column).
I have tried using different scipy.integrate routines (e.g. quad, trapz, ...).
The problem is that, from what I understand, scipy.integrate must be applied to functions, and I am not sure how to convert each column of my initial array into a function. As an alternative, I thought of calculating the average of each column between row 4 and row 8, multiplying this number by 4 (i.e. 8 - 4 = 4, the x-interval), and then storing the result in my final 20x1 array. The problem is... ehm... that I don't know how to calculate the average over a given range. The questions I am asking are:
Which method is more efficient/straightforward?
Can integrals be calculated over a data set like the one that I have described?
How do I calculate the average over a range of rows?

Since you know only the data points, the best choice is to use trapz (the trapezoidal approximation to the integral, based on the data points you know).
You most likely don't want to convert your data sets to functions, and with trapz you don't need to.
So if I understand correctly, you want to do something like this:
import numpy as np

# x-coordinates for data points
x = np.array([0, 0.4, 1.6, 1.9, 2, 4, 5, 9, 10])

# some random data: 3 data sets (sharing the same x-coordinates)
y = np.zeros([len(x), 3])
y[:, 0] = 123
y[:, 1] = 1 + x
y[:, 2] = np.cos(x / 5.)
print(y)

# compute approximations for integral(dataset, x=0..10) for datasets i=0,1,2
yi = np.trapz(y, x[:, np.newaxis], axis=0)

# what happens here: x must be an array broadcastable to the shape of y;
# newaxis tells numpy to add a new "virtual" axis to x, in effect saying that
# the x-coordinates are the same for each data set

# approximations of the integrals based on the datasets
# (here we also know the exact values, so print them too)
print(yi[0], 123 * 10)
print(yi[1], 10 + 10 * 10 / 2.)
print(yi[2], np.sin(10. / 5.) * 5.)

To get the sum of the entries 4 to 8 (including both ends) in each column, use
import numpy
a = numpy.arange(200).reshape(10, 20)
a[4:9].sum(axis=0)
(The first two lines just create an example array of the desired shape.)
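To tie this back to the original question (the integral of each column between rows 4 and 8), here is a minimal sketch, assuming the rows are spaced one x-unit apart so that the x-interval is 4:

import numpy as np

a = np.arange(200.).reshape(10, 20)  # stand-in for the real 10x20 data

# averaging approach: mean of rows 4..8 (inclusive) times the x-interval of 4
avg_integrals = a[4:9].mean(axis=0) * 4

# trapezoidal approach over the same rows; one value per column, shape (20,)
trapz_integrals = np.trapz(a[4:9], dx=1.0, axis=0)

Both give a 20-element result, one integral per column.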

What algorithm should be used to filter candidates from higher-dimensional data points?

I have 4-dimensional data points stored in my MySQL database on a server: one time dimension plus three spatial GPS dimensions (lat, lon, alt). GPS data are sampled at 1-minute intervals for thousands of users and are being added to my server 24x7.
A sample REST POST JSON looks like:
{
  "id": "1005",
  "location": {
    "lat": -87.8788,
    "lon": 37.909090,
    "alt": 0.0
  },
  "datetime": 11882784
}
Now, I need to filter out all the candidates (userIDs) whose positions were within k meters of a given userID's position during a given time period.
Sample REST GET query params for filtering look like:
{
  "id": "1001",       // user for whom we need to filter out candidate IDs
  "maxDistance": 3,   // max distance in meters (Euclidean distance from the user's location to a candidate's location)
  "maxDuration": 14   // duration offset (in days) back from the current datetime
}
As we see, thousands of entries are inserted into my database per minute, which results in a huge number of total entries. Thus, I am afraid a trivial naive approach that iterates over all the entries for filtering won't be feasible for my current requirements. So, what algorithm should I implement on the server? I have tried to implement a naive algorithm like this:
params ($uid, $mDis, $mDay)
1. Init $candidates = []
2. For all the locations $Li of user with $uid
3.     For all locations $Di in database within $mDay
4.         $dif = EuclidianDis($Li, $Di)
5.         If $dif < $mDis
6.             $candidates += userId for $Di
7. Return $candidates
However, this approach is very slow in practice, and precomputation might not be feasible because it would cost a huge amount of space for all userIDs. What other algorithm could improve efficiency?
You could implement a spatial hashing algorithm to efficiently query your database for candidates within a given area/time.
Divide the 3D space into a 3D grid of cubes with width k, and when inserting a data point into your database, compute which cube the point lies in and compute a hash value based on the cube coordinates.
When querying for all data points within k of another data point d, compute the cube that d sits in and find the adjacent cubes (+/- 1 in each dimension, which in 3D means 26 neighbours). Compute the hash values of these 27 cubes and query your database for all entries with these hash values within the given time period. This gives you a small candidate set which you can then iterate over to find all data points actually within k of d.
If your value of k can range between 2 and 5 meters, give your cubes a width of 5.
Timestamps can be stored as a separate field, or alternatively you can make your cubes 4-dimensional and include the timestamp in the hash, searching 81 cubes instead of 27.
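A minimal sketch of this scheme, using Python with an in-memory dict standing in for the database index; the names CUBE, cube_key, insert and query_candidates are mine (not from the original post), and the coordinates are assumed to already be projected to meters:

from collections import defaultdict
from itertools import product

CUBE = 5.0  # cube width in meters (the upper end of maxDistance)

index = defaultdict(list)  # cube key -> list of (user_id, point) entries

def cube_key(x, y, z):
    # integer grid coordinates of the cube containing the point
    return (int(x // CUBE), int(y // CUBE), int(z // CUBE))

def insert(user_id, point):
    # store the entry under the key of its cube
    index[cube_key(*point)].append((user_id, point))

def query_candidates(point):
    # gather entries from the 27 cubes around the query point;
    # an exact distance check on these candidates is still required
    cx, cy, cz = cube_key(*point)
    hits = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        hits.extend(index.get((cx + dx, cy + dy, cz + dz), []))
    return hits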

Stata regression with conditions on dummies and variable values

I'm trying to create a regression that would include a polynomial (let's say 2nd order) in year over a certain interval of year (say 1 to 70), plus a number of dummies for certain values of year (say, for every year between 45 and 60).
If I didn't have the restriction for dummies, I believe the commands would be:
gen year2=year^2
regress y year year2 i.year if inrange(year,1,70)
I can't make the dummies manually (there will be more than 15 of them in the end). Could anybody help me, please?
If I then want to plot the estimated function without the dummies, why do these two commands produce different things?
twoway function _b[_cons] +_b[year]*x + _b[year2]*x^2, range(1 70)
twoway function _b[_cons] +_b[year]*year + _b[year2]*year^2, range(1 70)
The way I understood it, _b[_cons], _b[year] and _b[year2] retrieve the previously calculated coefficients for the corresponding independent variables and multiply them in. Why do the two commands give different results, then, if x should be the same thing as year in this case?
I am not sure why Pearly is giving you such a hard time; I think this may be what you're looking for, but let me know if it is something different:
One thing to note: I am using a dataset that comes preloaded with Stata, which is usually a nice way to make an MVCE, like Nick was saying in your other post.
clear
sysuse gnp96
/* variables: gnp, date (quarterly) */
gen year = year(dofq(date)) // get yearly variable
gen year2=year^2 // get the square of the yearly variable
tab year if inrange(year,1970,1975), gen(yr) // generate dummy variables
// the dummy variables generated have missing values for years not
// in the specified range, so we're going to fill those in
foreach v of varlist yr* {
    replace `v' = 0 if `v' == .
}
// here's your regression
regress gnp year year2 yr* if inrange(year,1967,1990)
Now, the yr* are your dummy variables, and the * is a wildcard matching all variables with names like yr[something].
This gives you the range for the dummy variables and the range for the year variables.
As to your question on using x vs. year, I am only hypothesizing, but I think that when you use x it is continuous, since Stata isn't looking at your variables but just at the x axis, whereas your year variable is discrete (a bunch of integers), so it looks more like a step function. More information can be found with the command help twoway function.

Finding the smallest distance in a set of points from the origin

I am to find the smallest distance between a given set of points and the origin. I have a matrix with 2 columns and 10 rows, where each row represents the coordinates of one point. I would like to calculate the distance between each point and the origin, find the smallest such distance, and also determine which point gave this smallest distance.
In Octave, I calculate this distance using norm, so each point in my set has a distance associated with it, and the smallest distance is obviously the one I'm looking for. However, the code I wrote below isn't working the way it should.
function [dist,koor] = bonus4(S)
S= [-6.8667, -44.7967;
-38.0136, -35.5284;
14.4552, -27.1413;
8.4996, 31.7294;
-17.2183, 28.4815;
-37.5100, 14.1941;
-4.2664, -24.4428;
-18.6655, 26.9427;
-15.8828, 18.0170;
17.8440, -22.9164];
for i=1:size(S)
    L=norm(S(i, :))
    dist=norm(S(9, :));
    koor=S(9, :);
end
i = 9 is the correct answer, but I need Octave to put that number in. How do I tell Octave that this is the number I want? Specifically:
dist=norm(S(9, :));
koor=S(9, :);
I cannot use any packages. I found the geometry package online but I am to solve the task without additional packages.
I'll work off of your original code. Firstly, you want to compute the norm of all of the points and store them as individual elements in an array. Your current code isn't doing that; it overwrites the variable L, which holds just a single value, at each iteration of the loop.
You'll want to make L an array and store the norms at each iteration of the loop. Once you do this, you'll want to find the location as well as the minimum distance itself. That can be done with one call to min where the first output gives you the minimum distance and the second output gives you the location of the minimum. You can use the second output to slice into your S array to retrieve the actual point.
Last but not least, you need to define S first before calling this function. You are defining S inside the function and that will probably give you unintended results if you want to change the input into this function at each invocation. Therefore, define S first, then call the function:
S= [-6.8667, -44.7967;
-38.0136, -35.5284;
14.4552, -27.1413;
8.4996, 31.7294;
-17.2183, 28.4815;
-37.5100, 14.1941;
-4.2664, -24.4428;
-18.6655, 26.9427;
-15.8828, 18.0170;
17.8440, -22.9164];
function [dist,koor] = bonus4(S)
    %// New - Create an array to store the distances
    L = zeros(size(S,1), 1);
    %// Change to iterate over number of rows
    for i=1:size(S,1)
        L(i) = norm(S(i, :)); %// Change
    end
    [dist,ind] = min(L); %// Find the minimum distance
    koor = S(ind,:);     %// Get the actual point
end
Or, make sure you save the above function in a file called bonus4.m, then do this in the Octave command prompt:
octave:1> S= [-6.8667, -44.7967;
> -38.0136, -35.5284;
> 14.4552, -27.1413;
> 8.4996, 31.7294;
> -17.2183, 28.4815;
> -37.5100, 14.1941;
> -4.2664, -24.4428;
> -18.6655, 26.9427;
> -15.8828, 18.0170;
> 17.8440, -22.9164];
octave:2> [dist,koor] = bonus4(S);
Though this code works, I'd argue that it's slow because you're using a for loop. A faster way would be to do this completely vectorized. Because norm behaves differently for matrices than for vectors, you'll have to compute the distance yourself. Because you are measuring the distance from the origin, you can simply square each column individually, then sum along the columns of each row.
Therefore, you can just do this:
S= [-6.8667, -44.7967;
-38.0136, -35.5284;
14.4552, -27.1413;
8.4996, 31.7294;
-17.2183, 28.4815;
-37.5100, 14.1941;
-4.2664, -24.4428;
-18.6655, 26.9427;
-15.8828, 18.0170;
17.8440, -22.9164];
function [dist,koor] = bonus4(S)
    %// New - Computes the norm of each point
    L = sqrt(sum(S.^2, 2));
    [dist,ind] = min(L); %// Find the minimum distance
    koor = S(ind,:);     %// Get the actual point
end
The function sum can be used to sum over a dimension independently. By doing S.^2, you square each term in the points matrix; then, by calling sum with the second parameter set to 2, you sum over all of the columns for each row. Taking the square root of this result computes the distance of each point to the origin, exactly as the for loop does. However, this (at least to me) is more readable, and I daresay faster for larger numbers of points.

Plot power series gnuplot

I'd like to know how to plot a power series (whose variable is x), but I don't even know where to start.
I know it might not be possible to plot an infinite series, but plotting the sum of the first n terms would do just as well.
Gnuplot has a sum function, which can be used inside the using statement to sum up several columns or terms. Together with the special file name + you can implement power series.
Consider the exponential function, which has the power series
exp(x) = \sum_{n=0}^{\infty} x^n / n!
So, we define a term as
term(x, n) = x**n/n!
Now we can plot the power series up to the n=5 term with
set xrange [0:4]
term(x, n) = x**n/n!
set samples 20
plot '+' using 1:(sum [n=0:5] term($1, n))
To plot the results when using 2 to 7 terms and compare them with the actual exp function, use
term(x, n) = x**n/n!
set xrange [-2:2]
set samples 41
set key left
plot exp(x), for [i=1:6] '+' using 1:(sum[t=0:i] term($1, t)) title sprintf('%d terms', i)
The easiest way that I can think of is to generate a file that has a column of x-values and a column of f(x) values, then just plot the table like you would any other data. A power series is continuous, so you can just connect the dots and have a fairly accurate representation (provided your dots are close enough together). Also, when evaluating f(x), you just sum the first N terms, where N is big enough. Big enough means that the sum of the remaining terms is smaller than whatever error you allow. (*If you want 3 good digits, then N needs to be large enough that the remaining sum is smaller than 0.001.)
You can pull out a Calc II textbook to learn how to bound the error on the tail of the sum. A lot of calc classes cover it briefly, but students tend to feel like the error estimates are pointless (I know because I've taught the course a few times). As an example, if you have an alternating series (whose terms are decreasing in absolute value), then the absolute value of the first term you omit (don't sum) is an upper bound on your error.
*This statement is not 100% true; it is slightly oversimplified, but it is correct for most practical purposes.
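As a sketch of this tabulation approach (using Python here to generate the data file; the file name series.dat, the sampling range, and the choice N = 10 are all arbitrary assumptions), one might write:

import math

N = 10  # number of terms; pick N large enough that the tail is below your error budget

def partial_sum(x, n_terms):
    # sum of the first n_terms of the power series for exp(x)
    return sum(x**n / math.factorial(n) for n in range(n_terms))

with open("series.dat", "w") as f:
    steps = 200
    for i in range(steps + 1):
        x = -2.0 + 4.0 * i / steps  # sample x in [-2, 2]
        f.write(f"{x} {partial_sum(x, N)}\n")

Then plot the table in gnuplot with: plot 'series.dat' using 1:2 with lines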

How to convert a QuadTree Cell's Spatial Index (Binary Index) to Position and Dimension values?

Sorry in advance for misusing any terminology in this question, but basically I'm looking into creating a QuadTree that makes use of binary indexing, like this:
As you can see in the two illustrations above, if each cell is given a binary ID (e.g. 1010, 1011), then every odd binary index controls the X offset and every even binary index controls the Y offset.
For example, in the case of the Level 2 grid (16 cells), 1010 (cell #10) has 1s at its 4th and 2nd indices, so those perform two Y offsets. The first '1###' (on the leftmost side) would indicate an offset of one cell-height, and the second '##1#' would additionally offset it by twice the cell height.
As in:
// If Cell Height = 64 pixels
  1### =  64 pixels
+ ##1# = 128 pixels
___________________
  1#1# = 192 pixels
The same can be applied to the X axis, only it uses the odd numbers instead (ex: #1#1).
Now, when I initialize my QuadTree, I begin by calculating the maximum number of nodes it may contain if all cells at all depths are used. I calculate this as the sum of 4 raised to the power of each depth:
_totalNodes = 0;
var t:int = 0, tLen:int = _maxLevels;
for (; t < tLen; t++) {
    _totalNodes += Math.pow(4, t); // Adds 1, 4, 16, 64, 256, etc...
}
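For reference, this geometric series has the closed form \sum_{t=0}^{L-1} 4^t = (4^L - 1)/3 with L = _maxLevels, so the loop could also be replaced by a single expression.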
Then I create another loop (iterating from 0 to _totalNodes) which instantiates the nodes and stores them in a long array. It passes the current iteration integer to the Node constructor, which stores it as its index.
So far I've been able to determine which depth (aka level) a Node is stored at by finding the most significant bit of its index:
public static function MSB( pValue:uint ):int {
    var bits:int = 0;
    while ( pValue >>= 1 ) {
        bits++;
    }
    return bits;
}
But now I'm stuck trying to figure out how to convert the index from binary form into actual cell X and Y positions. As I said above, the dimensions of each cell are known; it's just a matter of doing some logical operations on the whole index (or "bit-code", as I refer to it in my code).
If you know of a good example that uses logical operations (at the binary level) to convert the binary index values to X and Y positions, could you please post a link or an explanation here?
Thanks!
Here's a reference where I got this idea from (note: different programming language):
L. Spiro Engine - http://lspiroengine.com/?p=530
I'm not familiar with the language used in that article though, so I can't really follow it and convert it easily to ActionScript 3.0.
Your task is described by Hanan Samet.
This works by first building the quadtree and then assigning to each quad cell the corresponding Morton code (bit-interleaving code).
Once you have the code, you assign it to the objects in the quad; then you can delete the quadtree. You can then search by converting a coordinate to the corresponding Morton code and doing a binary search on the Morton index. Instead of Morton codes (also called Z-order), you can also use Hilbert or Gray codes.
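As an illustration of the decoding step itself, here is a minimal sketch of Morton de-interleaving (in Python for brevity; the function name deinterleave is mine). It follows the convention from the question, odd bit positions feeding X and even positions feeding Y, with the more significant interleaved bits mapping to the more significant coordinate bits:

def deinterleave(code):
    # split an interleaved Morton code into (x, y) cell coordinates
    x = y = 0
    bit = 0
    while code:
        x |= (code & 1) << bit  # odd positions (1st, 3rd, ...) -> x
        code >>= 1
        y |= (code & 1) << bit  # even positions (2nd, 4th, ...) -> y
        code >>= 1
        bit += 1
    return x, y

# example: cell 0b1010 in a grid of 64-pixel cells
cell_x, cell_y = deinterleave(0b1010)
print(cell_x * 64, cell_y * 64)  # prints 0 192, matching the Y offset worked out above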