High-dimensional spatial-temporal clustering - GIS

My question is: how can I do a cluster analysis of spatio-temporal, high-dimensional data? My purpose is to find subspace clusters that reveal patterns in both space and time. Here, space means geographic position, so I should take spatial autocorrelation into account (also known as Tobler's law, or the first law of geography).
Is this right? First I transform each variable from time to frequency using the wavelet transform (because every variable is tied to both time and geographic position), then I take those coefficients and apply a subspace clustering algorithm for temporal, high-dimensional clustering. Once I have the temporal clusters, I try to find spatial clusters through regionalization across the temporal clusters.
Thanks in advance for any light you can shed.
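Here is a minimal sketch of the pipeline I have in mind, assuming PyWavelets and scikit-learn are available; plain k-means stands in for a proper subspace-clustering algorithm, and the data array is made up:

    # Minimal sketch: wavelet-decompose each time series, then cluster the
    # coefficient vectors. Assumes PyWavelets (pywt) and scikit-learn.
    import numpy as np
    import pywt
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical data: one time series of length 256 per geographic location.
    series_by_location = rng.standard_normal((50, 256))

    def wavelet_features(series, wavelet="db4", level=4):
        """Concatenate the approximation and detail coefficients into one vector."""
        coeffs = pywt.wavedec(series, wavelet, level=level)
        return np.concatenate(coeffs)

    features = np.vstack([wavelet_features(s) for s in series_by_location])

    # Plain k-means stands in for a true subspace-clustering algorithm here.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
    print(labels)  # temporal cluster label per location; regionalize these spatially next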

I understand that you use Tobler's law as an interpretation of the spatial correlation (regionalization). It is not clear what the final application would be, but a few verification steps I would do in such circumstances are: check whether all 150 variables correspond to the same scale in space and time and are affected by the same kind of autocorrelation (stationarity), which can simplify the problem in a few cases. Finally, you also have to understand which features or patterns are to be extracted and how they are characterized. Check this out: http://www.geokernels.org/pages/modern_indexpag.html
Hope it helps!
Cheers
Ravi

It is not clear what you would like to achieve here. In general, for spatio-temporal clustering one could use a distribution-based model such as a multivariate Gaussian mixture model for a given patch of the dataset, and update the covariance matrix parameters (http://en.wikipedia.org/wiki/Multivariate_normal_distribution). In the case of wavelet-transform coefficient clustering, we ignore any spatial correlation that may exist.
I am not sure what you mean here by "regionalization".
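As a rough illustration of the distribution-based idea, here is a sketch using scikit-learn's GaussianMixture with a full covariance matrix; the (x, y, t, value) feature layout and the data are made up:

    # Minimal sketch of distribution-based spatio-temporal clustering with a
    # multivariate Gaussian mixture (full covariance), using scikit-learn.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Hypothetical patch of the dataset: rows are (x, y, t, value) observations.
    X = rng.standard_normal((1000, 4))

    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(X)

    # The fitted covariance matrices capture how space, time and value co-vary
    # within each component.
    print(gmm.covariances_.shape)  # (3, 4, 4)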

You could treat time as just another dimension, depending on your application.

What about constructing the temporal cluster data with a correlation coefficient against each cluster, scaled so the variance equals 1? A spatial cluster would then be a scatter plot, which might well derive from lognormal, skewed or regression plots.


Deep Learning Data Normalization

I’m working with different types of financial data inputs for my models and I would like to know more about normalization of them.
In particular, working with some technical indicators, I’ve normalized them to have a range between 0 and 1.
Others were normalized to have a range between -1 and 1.
What is your experience with mixed normalized data?
Could it be acceptable to have these two ranges, or is it always better to have the training dataset in a single range, i.e. [0, 1]?
It is important to note that when we discuss data normalization, we are usually referring to the normalization of continuous data. Categorical data (usually) doesn't require the former.
Furthermore, not all ML methods need you to normalize data for them to function well. Examples of such methods include Random Forests and Gradient Boosting Machines. Others, however, do. For instance, Support Vector Machines and Neural Networks.
The reasons for input data normalization are dependent on the methods themselves. For SVMs, data normalization is done to ensure that input features are given equal importance in influencing the model's decisions. For neural networks, we normalize data to allow the gradient descent process to converge smoothly.
Finally, to answer your question, if you are working with continuous data and using a neural network to model your data, just make sure that the normalized data's values are close to each other (even if they are not the same range) because that is what determines the ease with which the gradient descent process converges. If you are working with an SVM, it would be better to normalize your data to a single range, so that all features may be given equal importance by the similarity/distance function that your SVM uses. In other cases, the need for data normalization, whatever the ranges, may be removed entirely. Ultimately, it depends on the modeling technique you are using!
Credit to #user3666197 for the helpful feedback in the comments.
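As a small illustration, here is a sketch that scales two hypothetical feature groups to [0, 1] and [-1, 1] with scikit-learn's MinMaxScaler before combining them; the feature split and values are made up:

    # Minimal sketch: scaling different feature groups to [0, 1] vs [-1, 1].
    # Assumes scikit-learn; the feature split below is hypothetical.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(0)
    prices = rng.uniform(10, 500, size=(1000, 3))       # e.g. raw price-based indicators
    oscillators = rng.uniform(-80, 80, size=(1000, 2))  # e.g. indicators that swing around 0

    X_prices = MinMaxScaler(feature_range=(0, 1)).fit_transform(prices)
    X_osc = MinMaxScaler(feature_range=(-1, 1)).fit_transform(oscillators)

    # Mixed ranges are usually fine for a neural network as long as the scales
    # are comparable; for an SVM you may prefer a single range for every feature.
    X = np.hstack([X_prices, X_osc])
    print(X.min(axis=0), X.max(axis=0))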

Normalization before and after Albumentations augmentations?

I use Albumentations augmentations in my computer vision tasks, but I don't fully understand when to apply normalization to my images (I use min-max normalization). Should I normalize before the augmentation functions (in which case the values would not be between 0 and 1), only after the augmentations (so that the values end up between 0 and 1), or both before and after the augmentations?
For example, when I use Sharpen, the values are not in the 0-1 range (they vary in the -0.5 to 1.5 range). Does that affect model performance? If so, how?
Thanks in advance.
The basic idea is that the input of your neural network should be centered around 0 with a variance of 1. There is a mathematical reason why this helps the learning process of a neural network. This is not the case for other algorithms like tree boosting.
If you train from scratch, the type of normalization (min-max or other) should not impact model performance (except if, for example, your max/min value is really extreme compared to your other data points).
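For instance, here is a minimal sketch of the augment-then-normalize order, assuming a recent Albumentations release that provides Compose and Sharpen; the image is random dummy data and the min-max step is done by hand after the augmentations:

    # Minimal sketch: augment first, then min-max normalize so the network
    # input ends up in [0, 1]. Assumes a recent Albumentations release that
    # provides A.Compose and A.Sharpen; the image here is random dummy data.
    import numpy as np
    import albumentations as A

    augment = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.Sharpen(p=1.0),  # may push values outside the original range
    ])

    image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
    augmented = augment(image=image)["image"].astype(np.float32)

    # Normalize after augmentation, so whatever range Sharpen produced is mapped back.
    lo, hi = augmented.min(), augmented.max()
    normalized = (augmented - lo) / (hi - lo + 1e-8)
    print(normalized.min(), normalized.max())  # ~0.0, ~1.0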

GIS: partition area based on equal population

I want to partition a US state into 20 parts of approximately equal population. I can do this using, say, tracts, ZIP codes or another smaller geography. I'm looking for an algorithm to do the partitioning. It can be in any language or software (ArcGIS, QGIS, python, PostGIS, R, node).
I've looked at grouping or clustering algorithms like k-means, ArcGIS Grouping Analysis, etc. These do not seem to do what's needed, since they group based on the similarity of a variable rather than partitioning into equal-sized groups based on a variable. My quick look at ESRI's districting tool suggests that it might be a possibility.
Any other suggestions?
You should consider the Shortest splitline algorithm, recommended for creating optimally compact voting districts. Here is a description of its results in solving gerrymandering.
You can try centroidal weighted Voronoi diagrams, i.e. Lloyd's algorithm: compute the Voronoi diagram, move each site to the center of gravity of its cell, and rinse and repeat: http://www-cs-students.stanford.edu/~amitp/game-programming/polygon-map-generation/
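Here is a rough Python sketch in the same spirit, assuming scikit-learn: population-weighted k-means over tract centroids (weighted k-means is closely related to Lloyd's iteration). It gives compact groups but does not by itself enforce exactly equal population, so you would still need to rebalance or use a districting/split-line tool for that constraint; the tract data here is made up:

    # Minimal sketch of the weighted-centroid idea: k-means over tract
    # centroids, weighted by population (scikit-learn). Compact regions only;
    # equal population is NOT guaranteed by this step.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    tract_xy = rng.uniform(0, 100, size=(3000, 2))    # hypothetical tract centroids
    population = rng.integers(500, 5000, size=3000)   # hypothetical tract populations

    km = KMeans(n_clusters=20, n_init=10, random_state=0)
    labels = km.fit_predict(tract_xy, sample_weight=population)

    # Check how balanced the result is.
    totals = np.bincount(labels, weights=population)
    print(totals.min(), totals.max())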

MySQL Postgresql / PostGIS

I have lat/lon coordinates in a 400 million rows partitioned mysql table.
The table grows at about 2,000 records a minute, and old data is flushed every few weeks.
I am exploring ways to do spatial analysis of this data as it comes in.
Most of the analysis requires finding whether a point is in a particular lat/lon polygon or which polygons contain that point.
I see the following ways of tackling the point in polygon (PIP) problem:
1. Create a MySQL function that takes a point and a Geometry and returns a boolean.
Simple, but I'm not sure how Geometry can be used to perform operations on lat/lon coordinates, since Geometry assumes flat surfaces rather than spheres.
2. Create a MySQL function that takes a point and an identifier of a custom data structure and returns a boolean.
The polygon vertices can be stored in a table and a function can compute PIP using spherical math. A large number of polygon points may lead to a huge table and slow queries.
3. Leave the point data in MySQL and store the polygon data in PostGIS, and use the app server to run the PIP query in PostGIS by providing the point as a parameter.
4. Port the application from MySQL to PostgreSQL/PostGIS.
This will require a lot of effort in rewriting queries and procedures.
I can still do it, but how good is PostgreSQL at handling 400 million rows?
A quick search on Google for "mysql 1 billion rows" returns many results; the same query for Postgres returns no relevant results.
Would like to hear some thoughts & suggestions.
A few thoughts.
First, PostgreSQL and MySQL are completely different beasts when it comes to performance tuning. So if you go the porting route, be prepared to rethink your indexing strategies. Not only does PostgreSQL have far more flexible indexing than MySQL, but the table approaches are very different too, which means the appropriate indexing strategies differ as much as the tactics do. Unfortunately this means you can expect to struggle a bit. If I could give advice, I would suggest dropping all non-key indexes at first and then adding them back sparingly as needed.
The second point is that nobody here can likely give you a huge amount of practical advice at this point because we don't know the internals of your program. In PostgreSQL, you are best off indexing only what you need, but you can index functions' outputs (which is really helpful in cases like this) and you can index only part of a table.
I am more a PostgreSQL guy than a MySQL guy so of course I think you should go with PostgreSQL. However rather than tell you why etc. and have you struggle at this scale, I will tell you a few things that I would look at using if I were trying to do this.
Functional indexes
Write my own functions for indexes for related analysis
PostGIS is pretty amazing and very flexible
In the end, switching db's at this volume is going to be a learning curve, and you need to be prepared for that. However, PostgreSQL can handle the volume just fine.
The number of rows is quite irrelevant here.
The question is how much of the point in polygon work that can be done by the index.
The answer to that depends on how big the polygons are.
PostGIS is very fast to find all points in the bounding box of a polygon. Then it takes more effort to find out if the point actually is inside the polygon.
If your polygons are small (small bounding boxes) the query will be efficient. If your polygons are big, or have a shape that makes the bounding box big, then it will be less efficient.
If your polygons are more or less static there are workarounds. You can divide your polygons into smaller polygons and recreate the index. Then the index will be more efficient.
If your polygons are actually multipolygons, the first step is to split the multipolygons into polygons with ST_Dump and build an index on the result.
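If you do the check on the application side (option 3 above), here is a minimal Python sketch of that same two-step test using Shapely: a cheap bounding-box filter first, then the exact check. The polygon and points are made up, and lat/lon is treated as planar, which is only an approximation:

    # Minimal sketch of the two-step test: a cheap bounding-box filter, then
    # the exact point-in-polygon check, using Shapely on the app server.
    from shapely.geometry import Point, Polygon
    from shapely.prepared import prep

    polygon = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])  # hypothetical polygon
    minx, miny, maxx, maxy = polygon.bounds
    prepared = prep(polygon)  # prepared geometry speeds up repeated contains() calls

    points = [(5, 5), (20, 20), (9.5, 0.5)]
    inside = []
    for x, y in points:
        if not (minx <= x <= maxx and miny <= y <= maxy):
            continue                      # rejected by the bounding box alone
        if prepared.contains(Point(x, y)):
            inside.append((x, y))

    print(inside)  # [(5, 5), (9.5, 0.5)]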
HTH
Nicklas

What are the lesser known but useful data structures?

There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?
Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.
PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.
EDIT:
Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.
Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.
Bloom filter: Bit array of m bits, initially all set to 0.
To add an item you run it through k hash functions that will give you k indices in the array which you then set to 1.
To check if an item is in the set, compute the k indices and check if they are all set to 1.
Of course, this gives some probability of false-positives (according to wikipedia it's about 0.61^(m/n) where n is the number of inserted items). False-negatives are not possible.
Removing an item is impossible, but you can implement a counting Bloom filter, represented by an array of ints with increment/decrement.
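A minimal Python sketch of the add/check logic above; the array size, the number of hash functions, and the way indices are derived from hashlib are arbitrary choices:

    # Minimal Bloom filter sketch: k hash functions set/check k bits in an
    # m-bit array. Parameters are arbitrary.
    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _indices(self, item):
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for idx in self._indices(item):
                self.bits[idx] = 1

        def might_contain(self, item):
            # False positives are possible, false negatives are not.
            return all(self.bits[idx] for idx in self._indices(item))

    bf = BloomFilter()
    bf.add("hello")
    print(bf.might_contain("hello"), bf.might_contain("world"))  # True, (almost surely) False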
Rope: it's a string that allows for cheap prepends, substrings, middle insertions and appends. I've really only had use for it once, but no other structure would have sufficed. Prepends on regular strings and arrays were just far too expensive for what we needed to do, and reversing everything was out of the question.
Skip lists are pretty neat.
Wikipedia
A skip list is a probabilistic data structure, based on multiple parallel, sorted linked lists, with efficiency comparable to a binary search tree (order log n average time for most operations).
They can be used as an alternative to balanced trees (using probabilistic balancing rather than strict enforcement of balancing). They are easy to implement and faster than, say, a red-black tree. I think they should be in every good programmer's toolchest.
If you want to get an in-depth introduction to skip-lists here is a link to a video of MIT's Introduction to Algorithms lecture on them.
Also, here is a Java applet demonstrating Skip Lists visually.
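A minimal Python sketch of insert and search with probabilistic level assignment; the maximum level and the promotion probability are arbitrary:

    # Minimal skip list sketch: multiple parallel sorted linked lists, with
    # each node's height chosen at random (insert and search only).
    import random

    class Node:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * (level + 1)

    class SkipList:
        MAX_LEVEL = 16
        P = 0.5  # probability of promoting a node one level higher

        def __init__(self):
            self.head = Node(None, self.MAX_LEVEL)
            self.level = 0

        def _random_level(self):
            lvl = 0
            while random.random() < self.P and lvl < self.MAX_LEVEL:
                lvl += 1
            return lvl

        def insert(self, key):
            update = [None] * (self.MAX_LEVEL + 1)
            x = self.head
            for i in range(self.level, -1, -1):     # walk down the levels
                while x.forward[i] and x.forward[i].key < key:
                    x = x.forward[i]
                update[i] = x                        # last node before key on level i
            lvl = self._random_level()
            if lvl > self.level:
                for i in range(self.level + 1, lvl + 1):
                    update[i] = self.head
                self.level = lvl
            node = Node(key, lvl)
            for i in range(lvl + 1):                 # splice the node in on each level
                node.forward[i] = update[i].forward[i]
                update[i].forward[i] = node

        def contains(self, key):
            x = self.head
            for i in range(self.level, -1, -1):
                while x.forward[i] and x.forward[i].key < key:
                    x = x.forward[i]
            x = x.forward[0]
            return x is not None and x.key == key

    sl = SkipList()
    for k in [5, 1, 9, 3]:
        sl.insert(k)
    print(sl.contains(3), sl.contains(4))  # True False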
Spatial Indices, in particular R-trees and KD-trees, store spatial data efficiently. They are good for geographical map coordinate data and VLSI place and route algorithms, and sometimes for nearest-neighbor search.
Bit Arrays store individual bits compactly and allow fast bit operations.
Zippers - derivatives of data structures that modify the structure to have a natural notion of 'cursor' -- current location. They are really useful as they guarantee indices cannot be out of bounds -- used, e.g., in the xmonad window manager to track which window has focus.
Amazingly, you can derive them by applying techniques from calculus to the type of the original data structure!
Here are a few:
Suffix tries. Useful for almost all kinds of string searching (http://en.wikipedia.org/wiki/Suffix_trie#Functionality). See also suffix arrays; they're not quite as fast as suffix trees, but a whole lot smaller.
Splay trees (as mentioned above). The reason they are cool is threefold:
They are small: you only need the left and right pointers like you do in any binary tree (no node-color or size information needs to be stored)
They are (comparatively) very easy to implement
They offer optimal amortized complexity for a whole host of "measurement criteria" (log n lookup time being the one everybody knows). See http://en.wikipedia.org/wiki/Splay_tree#Performance_theorems
Heap-ordered search trees: you store a bunch of (key, prio) pairs in a tree, such that it's a search tree with respect to the keys, and heap-ordered with respect to the priorities. One can show that such a tree has a unique shape (and it's not always fully packed up-and-to-the-left). With random priorities, it gives you expected O(log n) search time, IIRC.
A niche one is adjacency lists for undirected planar graphs with O(1) neighbour queries. This is not so much a data structure as a particular way to organize an existing data structure. Here's how you do it: every planar graph has a node with degree at most 6. Pick such a node, put its neighbors in its neighbor list, remove it from the graph, and recurse until the graph is empty. When given a pair (u, v), look for u in v's neighbor list and for v in u's neighbor list. Both have size at most 6, so this is O(1).
By the above algorithm, if u and v are neighbors, you won't have both u in v's list and v in u's list. If you need this, just add each node's missing neighbors to that node's neighbor list, but store how much of the neighbor list you need to look through for fast lookup.
I think lock-free alternatives to standard data structures, i.e. lock-free queues, stacks and lists, are much overlooked.
They are increasingly relevant as concurrency becomes a higher priority, and are a much more admirable goal than using mutexes or locks to handle concurrent reads/writes.
Here's some links
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
http://www.research.ibm.com/people/m/michael/podc-1996.pdf [Links to PDF]
http://www.boyet.com/Articles/LockfreeStack.html
Mike Acton's (often provocative) blog has some excellent articles on lock-free design and approaches
I think Disjoint Set is pretty nifty for cases when you need to divide a bunch of items into distinct sets and query membership. A good implementation of the Union and Find operations results in amortized costs that are effectively constant (the inverse of Ackermann's function, if I recall my data structures class correctly).
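A minimal Python sketch with union by rank and path halving (a simple form of path compression):

    # Minimal disjoint-set (union-find) sketch giving near-constant amortized
    # operations, as described above.
    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            # Path halving: shortcut every other node toward the root.
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.rank[ra] < self.rank[rb]:       # union by rank
                ra, rb = rb, ra
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1

    ds = DisjointSet(5)
    ds.union(0, 1)
    ds.union(3, 4)
    print(ds.find(1) == ds.find(0), ds.find(0) == ds.find(3))  # True False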
Fibonacci heaps
They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.
Anyone with experience in 3D rendering should be familiar with BSP trees. Generally, it's a method of structuring a 3D scene so it is manageable for rendering, given the camera coordinates and bearing.
Binary space partitioning (BSP) is a method for recursively subdividing a space into convex sets by hyperplanes. This subdivision gives rise to a representation of the scene by means of a tree data structure known as a BSP tree.
In other words, it is a method of breaking up intricately shaped polygons into convex sets, or smaller polygons consisting entirely of non-reflex angles (angles smaller than 180°). For a more general description of space partitioning, see space partitioning.
Originally, this approach was proposed in 3D computer graphics to increase the rendering efficiency. Some other applications include performing geometrical operations with shapes (constructive solid geometry) in CAD, collision detection in robotics and 3D computer games, and other computer applications that involve handling of complex spatial scenes.
Huffman trees - used for compression.
Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.
As per the original article:
Our functional 2-3 finger trees are an instance of a general design technique introduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.
A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.
Circular or ring buffer - used for streaming, among other things.
I'm surprised no one has mentioned Merkle trees (ie. Hash Trees).
Used in many cases (P2P programs, digital signatures) where you want to verify the hash of a whole file when you only have part of the file available to you.
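A minimal Python sketch of computing a Merkle root over file chunks with hashlib; the chunk contents and the duplicate-the-last-hash rule for odd levels are just illustrative choices:

    # Minimal Merkle tree sketch: hash the leaves, then hash pairs of hashes
    # up to a single root, as used to verify file chunks in P2P systems.
    import hashlib

    def sha256(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(chunks):
        level = [sha256(c) for c in chunks]
        while len(level) > 1:
            if len(level) % 2 == 1:          # duplicate the last hash on odd levels
                level.append(level[-1])
            level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    chunks = [b"part-0", b"part-1", b"part-2", b"part-3"]
    root = merkle_root(chunks)
    # A peer that knows `root` can verify any single chunk given the sibling
    # hashes along its path, without downloading the whole file.
    print(root.hex())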
<zvrba> Van Emde-Boas trees
I think it'd be useful to know why they're cool. In general, the question "why" is the most important to ask ;)
My answer is that they give you O(log log n) dictionaries with {1..n} keys, independent of how many of the keys are in use. Just like repeated halving gives you O(log n), repeated sqrting gives you O(log log n), which is what happens in the vEB tree.
How about splay trees?
Also, Chris Okasaki's purely functional data structures come to mind.
An interesting variant of the hash table is called Cuckoo Hashing. It uses multiple hash functions instead of just 1 in order to deal with hash collisions. Collisions are resolved by removing the old object from the location specified by the primary hash, and moving it to a location specified by an alternate hash function. Cuckoo Hashing allows for more efficient use of memory space because you can increase your load factor up to 91% with only 3 hash functions and still have good access time.
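A minimal Python sketch of the basic two-table variant (the 91% load factor mentioned above needs three hash functions): an insertion that finds its slot taken evicts the occupant, which is bounced to its other table, and an overly long eviction chain triggers a rehash:

    # Minimal cuckoo hashing sketch with two tables / two hash functions.
    class CuckooHash:
        def __init__(self, size=11):
            self.size = size
            self.tables = [[None] * size, [None] * size]

        def _hash(self, which, key):
            return hash((which, key)) % self.size

        def insert(self, key):
            which = 0
            for _ in range(2 * self.size):        # bounded eviction chain
                idx = self._hash(which, key)
                if self.tables[which][idx] is None:
                    self.tables[which][idx] = key
                    return
                # Slot taken: evict the occupant (the "cuckoo" step) and try to
                # re-place it in the other table.
                key, self.tables[which][idx] = self.tables[which][idx], key
                which = 1 - which
            self._rehash()
            self.insert(key)

        def contains(self, key):
            return any(self.tables[w][self._hash(w, key)] == key for w in (0, 1))

        def _rehash(self):
            old = [k for t in self.tables for k in t if k is not None]
            self.size = self.size * 2 + 1
            self.tables = [[None] * self.size, [None] * self.size]
            for k in old:
                self.insert(k)

    ch = CuckooHash()
    for k in ["a", "b", "c", "d"]:
        ch.insert(k)
    print(ch.contains("c"), ch.contains("z"))  # True False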
A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this by a simple change to the heap property: a tree is said to be min-max ordered if every element on even (odd) levels is less (greater) than all of its children and grandchildren. The levels are numbered starting from 1.
http://internet512.chonbuk.ac.kr/datastructure/heap/img/heap8.jpg
I like cache-oblivious data structures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that conveniently fit in them. This leads to efficient use of caching at everything from the L1 cache to RAM to big chunks of data read off the disk, without needing to know the specifics of the sizes of any of those caching layers.
Left Leaning Red-Black Trees. A significantly simplified implementation of red-black trees by Robert Sedgewick published in 2008 (~half the lines of code to implement). If you've ever had trouble wrapping your head around the implementation of a Red-Black tree, read about this variant.
Very similar (if not identical) to Andersson Trees.
Work Stealing Queue
Lock-free data structure for dividing the work equally among multiple threads
Implementation of a work stealing queue in C/C++?
Bootstrapped skew-binomial heaps by Gerth Stølting Brodal and Chris Okasaki:
Despite their long name, they provide asymptotically optimal heap operations, even in a functional setting.
O(1) size, union, insert, minimum
O(log n) deleteMin
Note that union takes O(1) rather than O(log n) time unlike the more well-known heaps that are commonly covered in data structure textbooks, such as leftist heaps. And unlike Fibonacci heaps, those asymptotics are worst-case, rather than amortized, even if used persistently!
There are multiple implementations in Haskell.
They were jointly derived by Brodal and Okasaki, after Brodal came up with an imperative heap with the same asymptotics.
Kd-trees, a spatial data structure used (amongst others) in real-time ray tracing, have the downside that triangles crossing between the different spaces need to be clipped. Generally BVHs are faster because they are more lightweight.
MX-CIF Quadtrees, store bounding boxes instead of arbitrary point sets by combining a regular quadtree with a binary tree on the edges of the quads.
HAMT, a hierarchical hash map with access times that generally exceed those of plain O(1) hash maps, due to the constants involved.
Inverted Index, quite well known in search-engine circles, because it's used for fast retrieval of documents associated with different search terms (a minimal sketch follows this answer).
Most, if not all, of these are documented on the NIST Dictionary of Algorithms and Data Structures
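The inverted-index sketch mentioned above: a minimal in-memory version that maps each term to the set of documents containing it and answers a query by intersecting posting sets (the documents are made up):

    # Minimal inverted index sketch: term -> set of document ids.
    from collections import defaultdict

    docs = {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "the quick dog",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(*terms):
        """Documents containing every query term."""
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(search("quick", "dog"))  # {3}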
Ball Trees. Just because they make people giggle.
A ball tree is a data structure that indexes points in a metric space. Here's an article on building them. They are often used for finding nearest neighbors to a point or accelerating k-means.
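A minimal usage sketch, assuming scikit-learn's sklearn.neighbors.BallTree is available; the points are random:

    # Minimal ball tree usage sketch for nearest-neighbour queries.
    import numpy as np
    from sklearn.neighbors import BallTree

    rng = np.random.default_rng(0)
    points = rng.standard_normal((1000, 3))   # points in a 3-D metric space

    tree = BallTree(points, leaf_size=40)
    dist, ind = tree.query(points[:1], k=5)   # 5 nearest neighbours of the first point
    print(ind[0], dist[0])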
Not really a data structure; more of a way to optimize dynamically allocated arrays, but the gap buffers used in Emacs are kind of cool.
Fenwick Tree. It's a data structure for keeping count of the sum of all elements in a vector between two given subindexes i and j. The trivial solution, precalculating the prefix sums from the beginning, doesn't allow you to update an item (you have to do O(n) work to keep it up to date).
Fenwick Trees allow you to update and query in O(log n), and how it works is really cool and simple. It's really well explained in Fenwick's original paper, freely available here:
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol24/issue3/spe884.pdf
Its father, the RMQ tree, is also very cool: it allows you to keep info about the minimum element between two indexes of the vector, and it also works in O(log n) for updates and queries. I like to teach the RMQ tree first and then the Fenwick tree.
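A minimal Python sketch of a Fenwick tree with point update and prefix-sum query, both O(log n):

    # Minimal Fenwick (binary indexed) tree sketch: point update and prefix-sum
    # query in O(log n), so sum(i, j) = prefix(j) - prefix(i - 1).
    class FenwickTree:
        def __init__(self, n):
            self.n = n
            self.tree = [0] * (n + 1)          # 1-based internally

        def update(self, i, delta):
            """Add delta to element i (1-based)."""
            while i <= self.n:
                self.tree[i] += delta
                i += i & (-i)                  # jump to the next responsible node

        def prefix(self, i):
            """Sum of elements 1..i."""
            s = 0
            while i > 0:
                s += self.tree[i]
                i -= i & (-i)                  # strip the lowest set bit
            return s

        def range_sum(self, i, j):
            return self.prefix(j) - self.prefix(i - 1)

    ft = FenwickTree(8)
    for idx, value in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
        ft.update(idx, value)
    print(ft.range_sum(3, 6))  # 4 + 1 + 5 + 9 = 19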
Van Emde-Boas trees. I have even a C++ implementation of it, for up to 2^20 integers.
Nested sets are nice for representing trees in the relational databases and running queries on them. For instance, ActiveRecord (Ruby on Rails' default ORM) comes with a very simple nested set plugin, which makes working with trees trivial.
It's pretty domain-specific, but half-edge data structure is pretty neat. It provides a way to iterate over polygon meshes (faces and edges) which is very useful in computer graphics and computational geometry.