I have been stuck for days because I cannot find a way to fit a traditional regression model such as ARIMA with multiple time series as input.
I have a thousand trajectories of positions for different vehicles (xy coordinates for each position). Let's say each sample (trajectory) is composed of 10 positions (the trajectories are not necessarily all the same length). It means I have got 10*N different time series (N is the total number of samples). I want to fit a model on all samples for the x coordinates and then predict the future position of any new trajectory (test sample) that I give as input. Then I plan to do the same with another model for the y coordinates. I do not claim the method will work, but I need to implement it to compare it with others (neural networks, ...).
The hypothesis: a number of time series can be modeled with a single ARIMAX (or other) model (i.e. the same parameters work for all the time series). What is wanted: to fit them all simultaneously.
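One way to read "fit them all simultaneously" is to stack the lagged rows from every trajectory into a single regression and estimate one shared set of coefficients. The sketch below does that for a plain AR(p) model with an intercept, using ordinary least squares. It is not a full ARIMA/ARIMAX fit (no differencing, no MA terms, no exogenous inputs), and the function names are made up for illustration.

```python
def lagged_rows(series, p):
    """Yield ([1, s[t], ..., s[t+p-1]], s[t+p]) pairs for one series."""
    for t in range(len(series) - p):
        yield [1.0] + [float(v) for v in series[t:t + p]], float(series[t + p])

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_pooled_ar(trajectories, p=2):
    """Least-squares AR(p) with intercept, pooling rows from all series.

    Series of different lengths are fine; series with p or fewer points
    simply contribute no rows.
    """
    rows = [pair for traj in trajectories for pair in lagged_rows(traj, p)]
    k = p + 1
    # Normal equations (X^T X) coef = X^T y, built from the pooled rows.
    XtX = [[sum(x[i] * x[j] for x, _ in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in rows) for i in range(k)]
    return solve(XtX, Xty)

def predict_next(coef, recent):
    """One-step-ahead forecast from the last p observed values."""
    p = len(coef) - 1
    return coef[0] + sum(c * float(v) for c, v in zip(coef[1:], recent[-p:]))
```

If the training trajectories all move at constant (but different) velocities, the pooled fit recovers x[t] = 2*x[t-1] - x[t-2], so a new trajectory is extrapolated linearly. The same two functions can then be applied to the y coordinates.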
Can someone help me please?
Thank you in advance for your support!
Best regards,
Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable for the floor and ceiling. If the current price breaks through the floor or ceiling, or changes more than the reversal threshold, I can take the appropriate action.
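For reference, here is a minimal sequential sketch of the reversal detection, assuming a box size of 1 and the 3-unit reversal threshold. It is one reading of the rules above rather than the exact code in use: the function name, the choice to let the first non-zero move set the initial direction, and the decision to report the index of the column's extreme are all assumptions.

```python
def reversal_points(prices, threshold=3):
    """Return indices of the column extremes at which the trend reversed."""
    reversals = []
    direction = 0   # +1 rising (X column), -1 falling (O column), 0 unknown
    ext = 0         # index of the current column's ceiling or floor
    for i in range(1, len(prices)):
        move = prices[i] - prices[ext]
        if direction == 0:
            if move != 0:                  # first move sets the direction
                direction = 1 if move > 0 else -1
                ext = i
        elif move * direction > 0:         # broke through the ceiling / floor
            ext = i
        elif abs(move) > threshold:        # moved against the trend far enough
            reversals.append(ext)
            direction = -direction
            ext = i
    return reversals
```

Each step depends on the running extreme and direction from the previous step, which is the loop-carried dependency that makes a direct parallel translation awkward.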
My question is, is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local max / min but not the reversal points. Small fluctuations produce numerous relative max / min, but most of them are trivial because the change is not greater than the reversal size.
My question is, is there a way to find these reversal points in parallel?
One possible approach:
Use thrust::unique to remove periods where the price is numerically constant.
Use thrust::adjacent_difference to produce the 1st difference data.
Use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
Use these points of change in sign of slope to identify separate regions of data, and build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
Use thrust::exclusive_scan_by_key on the 1st difference data to produce the net change of the run.
Wherever the net change of the run exceeds a threshold, flag it as a "reversal".
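The data flow of these steps can be sketched on the CPU first. The version below is plain Python, not CUDA: each step is written as a whole-array operation, and the comments name the Thrust primitive it stands in for. The net change per run is computed as a per-key sum, which is what a by-key reduction such as thrust::reduce_by_key would produce, instead of reading it off a scan. The function name and the return format are illustrative assumptions.

```python
from itertools import accumulate, groupby

def flag_runs(prices, threshold=3):
    """Return (start, end, net_change) for each monotone run whose net
    change exceeds the threshold. Indices refer to the de-duplicated prices."""
    # Step 1: collapse consecutive equal prices (role of thrust::unique).
    p = [k for k, _ in groupby(prices)]
    # Step 2: first differences (role of thrust::adjacent_difference).
    d = [b - a for a, b in zip(p, p[1:])]
    if not d:
        return []
    # Step 3: flag positions where the sign of the slope changes.
    # After step 1 no difference is zero, so the sign is well defined.
    change = [0] + [int((a > 0) != (b > 0)) for a, b in zip(d, d[1:])]
    # Step 4: a prefix sum turns the flags into one key per run.
    keys = list(accumulate(change))
    # Step 5: sum the differences within each key (a segmented reduction).
    runs = []
    for i, (k, step) in enumerate(zip(keys, d)):
        if k == len(runs):
            runs.append([i, i + 1, 0])   # new run starting at p[i]
        runs[k][1] = i + 1               # run currently ends at p[i + 1]
        runs[k][2] += step
    # Step 6: keep the runs whose net change exceeds the threshold.
    return [tuple(r) for r in runs if abs(r[2]) > threshold]
```

As the answer itself notes, this only looks at single monotone runs, so a reversal made up of several smaller runs in the same overall direction is not flagged.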
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.
Existing implementation:
In my implementation of Tic-Tac-Toe with minimax, I look for all boxes where I can get the best result and choose one of them randomly, so that the same solution isn't displayed each time.
For example, if the returned list is [1, 0, 1, -1], at some point I will randomly choose between the two highest values.
Question about Alpha-Beta Pruning:
Based on what I understood, when the algorithm finds that it is winning from one path, it no longer needs to look at other paths that might or might not lead to a winning case.
So will this, like I feel, cause the earliest possible box that leads to the best solution to be displayed as the result and seem the same each time? For example at the time of first move, all moves lead to a draw. So will the 1st box be selected each time?
How can I bring randomness to the solution like with the minimax solution? One way that I thought about now could be to randomly pass the indices to the alpha-beta algorithm. So the result will be the first best solution in that randomly sorted list of positions.
Thanks in advance. If there is some literature on this, I'd be glad to read it.
If someone could post a good reference for alpha-beta pruning, that would be excellent, as I had a hard time understanding how to apply it.
To randomly pick among multiple best solutions (all equal) in alpha-beta pruning, you can modify your evaluation function to add a very small random number whenever you evaluate a game state. You should just make sure that the magnitude of that random number is never greater than the true difference between the evaluations of two states.
For example, if the true evaluation function for your game state can only return values -1, 0, and 1, you could add a randomly generated number in the range [0.0, 0.01] to the evaluation of every game state.
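A minimal sketch of this idea for Tic-Tac-Toe, assuming the board is a list of nine characters ("X", "O" or "." for empty) and scores are from X's point of view. The function names and the board encoding are made up for illustration; the noise range [0.0, 0.01] is the one from the example above.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def evaluate(board, rng):
    """True score (-1, 0 or 1) plus a tiny random tie-breaker."""
    w = winner(board)
    score = 1 if w == "X" else -1 if w == "O" else 0
    return score + rng.uniform(0.0, 0.01)

def alphabeta(board, player, alpha, beta, rng):
    empties = [i for i, c in enumerate(board) if c == "."]
    if winner(board) or not empties:
        return evaluate(board, rng)
    best = -2.0 if player == "X" else 2.0
    for i in empties:
        board[i] = player
        value = alphabeta(board, "O" if player == "X" else "X", alpha, beta, rng)
        board[i] = "."
        if player == "X":
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:   # the other side already has a better option
            break
    return best

def best_move(board, player, rng):
    """Root search: X maximises, O minimises the noisy value."""
    other = "O" if player == "X" else "X"
    alpha, beta, best_i = -2.0, 2.0, None
    for i in [i for i, c in enumerate(board) if c == "."]:
        board[i] = player
        value = alphabeta(board, other, alpha, beta, rng)
        board[i] = "."
        if player == "X" and value > alpha:
            alpha, best_i = value, i
        elif player == "O" and value < beta:
            beta, best_i = value, i
    return best_i

board = list("XX.XOO..O")   # X to move; squares 2 and 6 both win at once
print(best_move(board, "X", random.Random()))   # 2 or 6, varying between runs
```

Because the noise is far smaller than the gap between true scores, it never changes which outcome is preferred; it only breaks ties between equally good moves.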
Without this, alpha-beta pruning doesn't necessarily find only one solution. Consider this example from Wikipedia. In the middle, you see that two solutions with an evaluation of 6 were found, so it can find more than one. I do actually think it will still find all moves leading to optimal solutions at the root node, but not actually find all solutions deep down in the tree. Suppose, in the example image, that the pruned node with a score of 9 in the middle actually had a score of 6. It would still get pruned there, so that particular solution wouldn't be found, but the move from the root node leading to it (the middle move at the root) would still be found. So, eventually, you would be able to reach it.
Some interesting notes:
This implementation would also work in minimax, and avoids the need to store a list of multiple (equally good) solutions.
In more complex games than Tic-Tac-Toe, where you cannot search the complete state space, adding a small random number for the max player and deducting a small random number for the min player like this may actually slightly improve your heuristic evaluation function. The reason for this is as follows. Suppose in state A you have 5 moves available, and in state B you have 10 moves available, which all result in the same heuristic evaluation score. Intuitively, the successors of state B may be slightly better, because you had more moves available; in many games, having more moves available means that you are in a better position. Because you generated 10 random numbers for the 10 successors of state B, it is also a bit more likely that the highest generated random number is among those 10 (instead of the 5 numbers generated for the successors of A).
I've found various questions with solutions similar to this problem but nothing quite on the money so far. Very grateful for any help.
I have a mysql (v.5.6.10) database with a single table called POSTS that stores millions upon millions of rows of lat/long points of interest on a map. Each point is classified as one of several different types. Each row is structured as id, type, coords:
id: an unsigned bigint and the primary key. This is auto-incremented for each new row that is inserted.
type: an unsigned tinyint used to encode the type of the point of interest.
coords: a mysql geospatial POINT datatype representing the lat/long of the point of interest.
There is a SPATIAL index on 'coords'.
I need to find an efficient way to query the table and return up to X of the most recently-inserted points within a radius ("R") of a specific lat/long position ("Position"). The database is very dynamic so please assume that the data is radically different each time the table is queried.
If X is infinite, the problem is trivial. I just need to execute a query something like:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords)
Where 'BoundingBox' is a mysql POLYGON datatype that perfectly encloses a circle of radius R from Position. Using a bounding box is, of course, not a perfect solution but this is not important for the particular problem that I'm trying to solve. I can order the results using "ORDER BY ID DESC" to retrieve and process the most-recently-inserted points first.
If X is less than infinite then I just need to modify the above to:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords) ORDER BY id DESC LIMIT X
The problem that I am trying to solve is how do I obtain a good representative set of results from a given region on the map when the points in that region are heavily clustered (for example, within cities on the map search region). For example:
In the example above, I am standing at X and searching for the 5 most-recently-inserted points of type black within the black-framed bounding box. If these points were all inserted in the cluster in the bottom right hand corner (let's assume that cluster is London) then my set of results will not include the black point that is near the top right of the search region. This is a problem for my application as I do not want users to be given the impression that there are no points of interest outside any areas where points are clustered.
I have considered a few potential solutions but I can't find one that works efficiently when the number of rows is huge (10s of millions). Approaches that I have tried so far include:
Dividing the search region into S number of squares (i.e., turning it into a grid) and searching for up to x/S points within each square - i.e., executing a separate mysql query for each square in the grid. This works OK for a small number of rows but becomes inefficient when the number of rows is massive as you need to divide the region into a large number of squares for the approach to work effectively. With only a small number of squares, you cannot guarantee that each square won't contain a densely populated cluster. A large number of squares means a large number of mysql searches which causes things to chug.
Adding a column to each row in the table that stores the distance to the nearest neighbour for each point. The nearest neighbour distance for a given point is calculated when the point is inserted into the table. With this structure, I can then order the search results by the nearest neighbour distance column so that any points that are in clusters are returned last. This solution only works when I'm searching for ALL points within the search region. For example, consider the situation in the diagram shown above. If I want to find the 5 most-recently-inserted points of type green, the nearest neighbour distance that is recorded for each point will not be correct. Recalculating these distances for each and every query is going to be far too expensive, even using efficient algorithms like KD trees.
In fact, I can't see any approach that requires pre-processing of data in table rows (or, put another way, 'touching' every point in the relevant search region dataset) to be viable when the number of rows gets large. I have considered algorithms like k-means / DBSCAN, etc. and I can't find anything that will work with sufficient efficiency given the use case explained above.
Any pearls? My intuition tells me this CAN be solved but I'm stumped so far.
Post-processing seems more effective in that case. Fetch the last X points of a given type. Check whether there is some clustering, for example too many points too close together relative to the distance from your point of view. Drop the oldest of them (or those which are very close together; maybe your data is referencing the same POI). How many to drop is up to you. Then fetch the next X points and see whether some of them are outside the cluster. Alternatively, you can calculate a value for each point based on remoteness and recentness, and discard points according to that value.
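As a rough sketch of that post-processing, assuming the rows arrive newest first (e.g. from the ORDER BY id DESC query) and that a fixed minimum separation is an acceptable stand-in for "too close together". The function name, the tuple layout and min_sep are illustrative choices, not something from the question.

```python
import math

def thin_clusters(points, min_sep, limit):
    """Greedy newest-first filter.

    points:  iterable of (id, x, y), most recent first
    min_sep: minimum allowed distance between any two kept points
    limit:   maximum number of points to return
    """
    kept = []
    for pid, x, y in points:
        # Keep a point only if it is not crowding one we already kept.
        if all(math.hypot(x - kx, y - ky) >= min_sep for _, kx, ky in kept):
            kept.append((pid, x, y))
            if len(kept) == limit:
                break
    return kept
```

In practice the input would be fed in batches of X rows, fetching the next batch only while fewer than `limit` points have been kept, and min_sep would be scaled with the size of the search region.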
Imagine there's a 2D space and in this space there are circles that grow at different constant rates. What's an efficient data structure for storing these circles, such that I can query "Which circles intersect point p at time t?".
EDIT: I do realize that I could store the initial state of the circles in a spatial data structure and do a query where I intersect a circle at point p with a radius of fastest_growth * t, but this isn't efficient when there are a few circles that grow extremely quickly whereas most grow slowly.
Additional Edit: I could further augment the above approach by splitting up the circles and grouping them by their growth rate, then applying the above approach to each group, but this requires a bounded time to be efficient.
Represent the circles as cones in 3d, where the third dimension is time. Then use a BSP tree to partition them the best you can.
In general, I think the worst case for testing for intersection is always O(n), where n is the number of circles. Most spatial data structures work by partitioning the space cleverly so that a fraction of the objects (hopefully close to half) are in each half. However, if the objects overlap then the partitioning cannot be perfect; there will always be cases where more than one object is in a partition. If you just think about the case of two circles overlapping, there is no way to draw a line such that one circle is entirely on one side and the other circle is entirely on the other side. Taken to the logical extreme, assuming arbitrary positioning of the circles and arbitrary radii, there is no way to partition them such that testing for intersection takes O(log(n)).
This doesn't mean that, in practice, you won't get a big advantage from using a tree, but the advantage you get will depend on the configuration of the circles and the distribution of the queries.
This is a simplified version of another problem I have posted about a week ago:
How to find first intersection of a ray with moving circles
I still haven't had the time to describe the solution that was expected there, but I will try to outline it here (for this simpler case).
The approach to solve this problem is to use a kinetic KD-tree. If you are not familiar with KD trees it is better to first read about them. You also need to add time as an additional coordinate (you make the space 3d instead of 2d). I have not implemented this idea yet, but I believe this is the correct approach.
I'm sorry this is not completely thought through, but it seems like you might look into multiplicatively-weighted Voronoi Diagrams (MWVDs). It seems like an adversary could force you into computing one with a series of well-placed queries, so I have a feeling they provide a lower-bound to your problem.
Suppose you compute the MWVD on your input data. Then for a query, you would be returned the circle that is "closest" to your query point. You can then determine whether this circle actually contains the query point at the query time. If it doesn't, then you are done: no circle contains your point. If it does, then you should compute the MWVD without that generator and run the same query. You might be able to compute the new MWVD from the old one: the cell containing the generator that was removed must be filled in, and it seems (though I have not proved it) that the only generators that can fill it in are its neighbors.
Some sort of spatial index, such as a quadtree or BSP, will give you O(log(n)) access time.
For example, each node in the quadtree could contain a linked list of pointers to all those circles which intersect it.
How many circles, by the way? For small n, you may as well just iterate over them. If you constantly have to update your spatial index and jump all over cache lines, it may end up being faster to brute-force it.
How are the centres of your circles distributed? If they cover the plane fairly evenly you can discretise space and time, then do the following as a preprocessing step:
for (int t = 0; t < max_t; t++)
    foreach circle c, with centre and radius (x, y, r) at time t
        for (int X = x - r; X <= x + r; X++)
            for (int Y = y - r; Y <= y + r; Y++)
                circles_at[X][Y][t].push_back(&c)
(assuming you discretise space and time along integer boundaries, scale and offset however you like of course, and you can add circles later on or amortise the cost by deferring calculation for distant values of t)
Then your query for point (x,y) at time (t) could do a brute-force linear check over circles_at[x][y][ceil(t)]
The trade-off is obvious, increasing the resolution of any of the three dimensions will increase preprocessing time but give you a smaller bucket in circles_at[x][y][t] to test.
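A small runnable sketch of this bucket idea, assuming each circle is a tuple (cx, cy, r0, rate) with radius r0 + rate * t, a growth rate that is never negative, unit-sized cells, and queries with 0 <= t <= max_t. The dictionary stands in for the circles_at array, and the names are illustrative.

```python
import math
from collections import defaultdict

def build_buckets(circles, max_t):
    """Map (cell_x, cell_y, T) to the circles whose bounding box at the
    integer time T overlaps that unit cell."""
    buckets = defaultdict(list)
    for T in range(max_t + 1):
        for idx, (cx, cy, r0, rate) in enumerate(circles):
            r = r0 + rate * T
            for X in range(math.floor(cx - r), math.floor(cx + r) + 1):
                for Y in range(math.floor(cy - r), math.floor(cy + r) + 1):
                    buckets[X, Y, T].append(idx)
    return buckets

def query(buckets, circles, x, y, t):
    """Indices of circles containing (x, y) at time t."""
    # Circles only grow, so the bucket for ceil(t) holds a superset of the
    # true answer; an exact distance test removes the false positives.
    key = (math.floor(x), math.floor(y), math.ceil(t))
    hits = []
    for idx in buckets.get(key, []):
        cx, cy, r0, rate = circles[idx]
        if math.hypot(x - cx, y - cy) <= r0 + rate * t:
            hits.append(idx)
    return hits
```

Memory grows with the area each circle covers at every time step, so a few very fast circles inflate every bucket they reach, which is the same weakness the question points out for the simpler approach.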
People are going to make a lot of recommendations about types of spatial indices to use, but I would like to offer a bit of orthogonal advice.
I think you are best off building a few indices based on time, i.e. t_0 < t_1 < t_2 ...
If a point intersects a circle at t_i, it will also intersect it at t_{i+1}. If you know the point in advance, you can eliminate all circles that intersect the point at t_i for all computation at t_{i+1} and later.
If you don't know the point in advance, then you can keep these time-point trees (built based on how big each circle would be at a given time). At query time (e.g. t_query), find i such that t_{i-1} < t_query <= t_i. If you check all the possible circles at t_i, you will not have any false negatives.
This is sort of a hack for a data structure that is "time dynamics aware", but I don't know of any. If you have a threaded environment, then you only need to maintain one spatial index and be working on the next one in the background. It will cost you a lot of computation for the benefit of being able to respond to queries with low latency. This solution should be compared at the very least to the O(n) solution (go through each circle and check if dist(point, circle.center) < circle.radius).
Instead of considering the circles, you can test on their bounding boxes to filter out the ones which do not contain the point. If your bounding box sides are all sorted, this is essentially four binary searches.
The tricky part is reconstructing the sorted sides for any given time, t. To do that, you can start off with the original points: two lists for the left and right sides with the x coordinates, and two lists for the top and bottom with the y coordinates. For any time greater than 0, all the left-side points will move to the left, etc. You only need to compare each element with the one next to it to obtain the time points at which the two are swapped. This should give you a list of time points at which to modify your ordered lists. If you now sort these modification records by time, for any given starting time and ending time you can extract all the modification records between the two, and apply them to your four lists in order. I haven't completely figured out the algorithm, but I think there will be edge cases where three or more successive elements cross over at exactly the same time point, so you may need to modify the algorithm to handle those edge cases as well. Perhaps a list modification record that contains the position in the list and the number of records to reorder would suffice.
I think it's possible to create a binary tree that solves this problem.
Each branch should contain a growing circle, a static circle for partitioning and the latest time at which the partitioning circle cleanly partitions. Furthermore, the growing circle that is contained within a node should always have a faster growth rate than either of its child nodes' growing circles.
To do a query, take the root node. First check its growing circle; if it contains the query point at the query time, add it to the answer set. Then, if the time that you're querying is greater than the time at which the partition line is broken, query both children; otherwise, if the point falls within the partitioning circle, query the left node, else query the right node.
I haven't quite completed the details of performing insertions, (the difficult part is updating the partition circle so that the number of nodes on the inside and outside is approximately equal and the time when the partition is broken is maximized).
To combat the case of a few circles that grow quickly, you could sort the circles in descending order by rate of growth and check each of the k fastest growers. To find the proper k given t, I think you can perform a binary search to find the index k such that k*m = (t * growth rate of k)^2, where m is a constant factor you'll need to find by experimentation. This will balance the part that grows linearly with k against the part that falls quadratically with the growth rate.
If you, as already suggested, represent growing circles by vertical cones in 3d, then you can partition the space as a regular (maybe hexagonal) grid of packed vertical cylinders. For each cylinder, calculate the minimal and maximal heights (times) of intersection with all cones. If a circle center (vertex of a cone) is placed inside the cylinder, then the minimal time is zero. Then sort the cones by minimal intersection time. As a result of such indexing, for each cylinder you’ll have an ordered sequence of records with 3 values: minimal time, maximal time and circle number.
When checking some point in 3d space, you take the cylinder it belongs to and iterate its sequence until the stored minimal time exceeds the time of the given point. All cones obtained this way whose maximal time is also less than the given time are guaranteed to contain the given point. Only cones where the given time lies between the minimal and maximal intersection times need to be recalculated.
There is a classical tradeoff between indexing and runtime costs: the smaller the cylinder diameter, the smaller the range of intersection times, so fewer cones need recalculation at each point, but more cylinders have to be indexed. If circle centers are distributed non-evenly, then it may be worth searching for a better cylinder placement than a regular grid.
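Here is a sketch of that per-cell index, with two simplifications that are assumptions rather than part of the description above: square cells replace the cylinders (the nearest and farthest distances to a square are easy to compute), and each circle is a tuple (cx, cy, r0, rate) with rate > 0 and radius r0 + rate * t. The function names are made up.

```python
import math

def cell_index(circles, cell, size=1.0):
    """Records (t_min, t_max, idx) for one square cell, sorted by t_min.

    t_min: first time the circle touches the cell
    t_max: first time the circle covers the whole cell
    """
    x0, y0 = cell[0] * size, cell[1] * size
    x1, y1 = x0 + size, y0 + size
    records = []
    for idx, (cx, cy, r0, rate) in enumerate(circles):
        # Distance from the centre to the nearest point of the square.
        d_near = math.hypot(max(x0 - cx, 0.0, cx - x1), max(y0 - cy, 0.0, cy - y1))
        # Distance from the centre to the farthest corner of the square.
        d_far = math.hypot(max(abs(cx - x0), abs(cx - x1)),
                           max(abs(cy - y0), abs(cy - y1)))
        records.append((max(0.0, (d_near - r0) / rate),
                        max(0.0, (d_far - r0) / rate), idx))
    return sorted(records)

def query_cell(records, circles, x, y, t):
    """Circles containing (x, y) at time t; (x, y) must lie in the cell."""
    hits = []
    for t_min, t_max, idx in records:
        if t_min > t:          # this and all later circles cannot reach the cell yet
            break
        cx, cy, r0, rate = circles[idx]
        # Fully covered cells need no distance test; the rest are rechecked.
        if t_max <= t or math.hypot(x - cx, y - cy) <= r0 + rate * t:
            hits.append(idx)
    return sorted(hits)
```

Only circles whose [t_min, t_max] interval straddles the query time pay for an exact distance test, so shrinking the cells narrows those intervals at the cost of indexing more cells.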
P.S. My first answer here - just registered to post it. Hope it isn’t late.