I am trying to build a deep neural networks that takes in a set of documents and predicts the category it belongs.
Since number of documents in each collection is not fixed, my first attempt was to get a mapping of documents from doc2vec and use the average.
The accuracy on training is high as 90% but the testing accuracy is low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help - as the "noise" in more numerous varied examples tends of cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.
I got to solve a pretty standard problem on the GPU, but I'm quite new to practical GPGPU, so I'm looking for ideas to approach this problem.
I have many points in 3-space which are assigned to a very small number of groups (each point belongs to one group), specifically 15 in this case (doesn't ever change). Now I want to compute the mean and covariance matrix of all the groups. So on the CPU it's roughly the same as:
for each point p
{
mean[p.group] += p.pos;
covariance[p.group] += p.pos * p.pos;
++count[p.group];
}
for each group g
{
mean[g] /= count[g];
covariance[g] = covariance[g]/count[g] - mean[g]*mean[g];
}
Since the number of groups is extremely small, the last step can be done on the CPU (I need those values on the CPU, anyway). The first step is actually just a segmented reduction, but with the segments scattered around.
So the first idea I came up with, was to first sort the points by their groups. I thought about a simple bucket sort using atomic_inc to compute bucket sizes and per-point relocation indices (got a better idea for sorting?, atomics may not be the best idea). After that they're sorted by groups and I could possibly come up with an adaption of the segmented scan algorithms presented here.
But in this special case, I got a very large amount of data per point (9-10 floats, maybe even doubles if the need arises), so the standard algorithms using a shared memory element per thread and a thread per point might make problems regarding per-multiprocessor resources as shared memory or registers (Ok, much more on compute capability 1.x than 2.x, but still).
Due to the very small and constant number of groups I thought there might be better approaches. Maybe there are already existing ideas suited for these specific properties of such a standard problem. Or maybe my general approach isn't that bad and you got ideas for improving the individual steps, like a good sorting algorithm suited for a very small number of keys or some segmented reduction algorithm minimizing shared memory/register usage.
I'm looking for general approaches and don't want to use external libraries. FWIW I'm using OpenCL, but it shouldn't really matter as the general concepts of GPU computing don't really differ over the major frameworks.
Even though there are few groups, I don't think you will be able to avoid the initial sorting into groups while still keeping the reduction step efficient. You will probably also want to perform the full sort, not just sorting indexes, because that will help keep memory access efficient in the reduction step.
For sorting, read about general strategies here:
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html
For reduction (old but still good):
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
For an example implementation of parallel reduction:
http://developer.nvidia.com/cuda-cc-sdk-code-samples#reduction
I recently read that you can predict the outcomes of a PRNG if you:
Know what algorithm is being used.
Have consecutive data points.
Is it possible to figure out the seed used for a PRNG from only data points?
I managed to find a paper by Kelsey et al which details the different types of attack and also summarises some real-world examples. It seems most attacks rely on similar techniques to those against cryptosystems, and in most cases actually taking advantage of the fact that the PRNG is used in a cryptosystem.
With "enough" data points that are the absolute first data points generated by the PRNG with no gaps, sure. Most PRNG functions are invertible, so just work backwards and you should get the seed.
For example, the typical return seed=(seed*A+B)%N has an inverse of return seed=((seed-B)/A)%N.
It's always theoretically possible, if you're "allowed" to brute force all possible values for the seed, and if you have enough data points that there's only one seed that could have produced that output. If the PRNG was seeded with the time, and you know roughly when that happened, then this might be very fast since there aren't many plausible values to try. If the PRNG was seeded with data from a truly random source having 64 bits of entropy, then this approach is computationally infeasible.
Whether there are other techniques depends on the algorithm. For example doing this for Blum Blum Shub is equivalent to integer factorization, which is generally believed to be a hard computational problem. Other, faster PRNGs might be less "secure" in this sense. Any PRNG used for crypto purposes, for example in a stream cipher, pretty much needs there to be no known feasible way of doing it.
What is the fastest algorithm that exists up with to solve a particular NP-Complete problem? For example, a naive implementation of travelling salesman is O(n!), but with dynamic programming it can be done in O(n^2 * 2^n). Is there any perhaps "easier" NP-Complete problem that has a better running time?
I'm curious about exact solutions, not approximations.
[...] with dynamic programming it can be done in O(n^2 * 2^n). Is there any perhaps "easier" NP-Complete problem that has a better running time?
Sort of. You can get rid of any polynomial factor by creating an artificial problem that encodes the same solution in a polynomially larger input. As long as the input is only polynomially larger, the resulting problem is still NP-complete. Since the complexity is by definition the function that maps input size to running time, if the input size grows the function gets into lower O classes.
So, the same algorithm running on TSP with the input padded with n^2 useless bits, will have complexity O(1 * 2^sqrt(n)).
A characteristic of the NP-Complete problems is that any of the problems in NP can be mechanically transformed into any of the NP-Complete problems in, at most, polynomial time.
Therefore, whatever the best solution for any given NP-Complete problem is, it is automatically a similarly-good solution for any other NP problem.
Given that dynamic programming can solve Traveling Salesman Problem in 2^n time and 2^n space, the same must be true of all other NP problems [well, plus the time to apply the transformation, I guess - so it could be 2^(n+1)].
Generally you cannot find the best solution for the generic Travelling Salesman problem without trying all combinations (there might be negative distances, etc).
By adding additional restrictions and loosening the requirement of getting the best solution, you can speed up things quite a bit.
For instance you can get polynomial executable time if the distances in the problem obey the "it is not longer to go directly from A to B than going from A to C to B" (i.e. a shortcut is never longer), and you can live with the result maximally being 1.5 times the optimal value. See http://en.wikipedia.org/wiki/Travelling_salesman_problem#Metric_TSP
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?
Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.
PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.
EDIT:
Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.
Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.
Bloom filter: Bit array of m bits, initially all set to 0.
To add an item you run it through k hash functions that will give you k indices in the array which you then set to 1.
To check if an item is in the set, compute the k indices and check if they are all set to 1.
Of course, this gives some probability of false-positives (according to wikipedia it's about 0.61^(m/n) where n is the number of inserted items). False-negatives are not possible.
Removing an item is impossible, but you can implement counting bloom filter, represented by array of ints and increment/decrement.
Rope: It's a string that allows for cheap prepends, substrings, middle insertions and appends. I've really only had use for it once, but no other structure would have sufficed. Regular strings and arrays prepends were just far too expensive for what we needed to do, and reversing everthing was out of the question.
Skip lists are pretty neat.
Wikipedia
A skip list is a probabilistic data structure, based on multiple parallel, sorted linked lists, with efficiency comparable to a binary search tree (order log n average time for most operations).
They can be used as an alternative to balanced trees (using probalistic balancing rather than strict enforcement of balancing). They are easy to implement and faster than say, a red-black tree. I think they should be in every good programmers toolchest.
If you want to get an in-depth introduction to skip-lists here is a link to a video of MIT's Introduction to Algorithms lecture on them.
Also, here is a Java applet demonstrating Skip Lists visually.
Spatial Indices, in particular R-trees and KD-trees, store spatial data efficiently. They are good for geographical map coordinate data and VLSI place and route algorithms, and sometimes for nearest-neighbor search.
Bit Arrays store individual bits compactly and allow fast bit operations.
Zippers - derivatives of data structures that modify the structure to have a natural notion of 'cursor' -- current location. These are really useful as they guarantee indicies cannot be out of bound -- used, e.g. in the xmonad window manager to track which window has focused.
Amazingly, you can derive them by applying techniques from calculus to the type of the original data structure!
Here are a few:
Suffix tries. Useful for almost all kinds of string searching (http://en.wikipedia.org/wiki/Suffix_trie#Functionality). See also suffix arrays; they're not quite as fast as suffix trees, but a whole lot smaller.
Splay trees (as mentioned above). The reason they are cool is threefold:
They are small: you only need the left and right pointers like you do in any binary tree (no node-color or size information needs to be stored)
They are (comparatively) very easy to implement
They offer optimal amortized complexity for a whole host of "measurement criteria" (log n lookup time being the one everybody knows). See http://en.wikipedia.org/wiki/Splay_tree#Performance_theorems
Heap-ordered search trees: you store a bunch of (key, prio) pairs in a tree, such that it's a search tree with respect to the keys, and heap-ordered with respect to the priorities. One can show that such a tree has a unique shape (and it's not always fully packed up-and-to-the-left). With random priorities, it gives you expected O(log n) search time, IIRC.
A niche one is adjacency lists for undirected planar graphs with O(1) neighbour queries. This is not so much a data structure as a particular way to organize an existing data structure. Here's how you do it: every planar graph has a node with degree at most 6. Pick such a node, put its neighbors in its neighbor list, remove it from the graph, and recurse until the graph is empty. When given a pair (u, v), look for u in v's neighbor list and for v in u's neighbor list. Both have size at most 6, so this is O(1).
By the above algorithm, if u and v are neighbors, you won't have both u in v's list and v in u's list. If you need this, just add each node's missing neighbors to that node's neighbor list, but store how much of the neighbor list you need to look through for fast lookup.
I think lock-free alternatives to standard data structures i.e lock-free queue, stack and list are much overlooked.
They are increasingly relevant as concurrency becomes a higher priority and are much more admirable goal than using Mutexes or locks to handle concurrent read/writes.
Here's some links
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
http://www.research.ibm.com/people/m/michael/podc-1996.pdf [Links to PDF]
http://www.boyet.com/Articles/LockfreeStack.html
Mike Acton's (often provocative) blog has some excellent articles on lock-free design and approaches
I think Disjoint Set is pretty nifty for cases when you need to divide a bunch of items into distinct sets and query membership. Good implementation of the Union and Find operations result in amortized costs that are effectively constant (inverse of Ackermnan's Function, if I recall my data structures class correctly).
Fibonacci heaps
They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.
Anyone with experience in 3D rendering should be familiar with BSP trees. Generally, it's the method by structuring a 3D scene to be manageable for rendering knowing the camera coordinates and bearing.
Binary space partitioning (BSP) is a
method for recursively subdividing a
space into convex sets by hyperplanes.
This subdivision gives rise to a
representation of the scene by means
of a tree data structure known as a
BSP tree.
In other words, it is a method of
breaking up intricately shaped
polygons into convex sets, or smaller
polygons consisting entirely of
non-reflex angles (angles smaller than
180°). For a more general description
of space partitioning, see space
partitioning.
Originally, this approach was proposed
in 3D computer graphics to increase
the rendering efficiency. Some other
applications include performing
geometrical operations with shapes
(constructive solid geometry) in CAD,
collision detection in robotics and 3D
computer games, and other computer
applications that involve handling of
complex spatial scenes.
Huffman trees - used for compression.
Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.
As per the original article:
Our functional 2-3 finger trees are an instance of a general design technique in- troduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.
A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.
Circular or ring buffer - used for streaming, among other things.
I'm surprised no one has mentioned Merkle trees (ie. Hash Trees).
Used in many cases (P2P programs, digital signatures) where you want to verify the hash of a whole file when you only have part of the file available to you.
<zvrba> Van Emde-Boas trees
I think it'd be useful to know why they're cool. In general, the question "why" is the most important to ask ;)
My answer is that they give you O(log log n) dictionaries with {1..n} keys, independent of how many of the keys are in use. Just like repeated halving gives you O(log n), repeated sqrting gives you O(log log n), which is what happens in the vEB tree.
How about splay trees?
Also, Chris Okasaki's purely functional data structures come to mind.
An interesting variant of the hash table is called Cuckoo Hashing. It uses multiple hash functions instead of just 1 in order to deal with hash collisions. Collisions are resolved by removing the old object from the location specified by the primary hash, and moving it to a location specified by an alternate hash function. Cuckoo Hashing allows for more efficient use of memory space because you can increase your load factor up to 91% with only 3 hash functions and still have good access time.
A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this by by a simple change to the heap property: A tree is said to be min-max ordered if every element on even (odd) levels are less (greater) than all childrens and grand children. The levels are numbered starting from 1.
http://internet512.chonbuk.ac.kr/datastructure/heap/img/heap8.jpg
I like Cache Oblivious datastructures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that convenient fit in them. This leads to efficient use of caching at everything from L1 cache in RAM to big chunks of data read off of the disk without needing to know the specifics of the sizes of any of those caching layers.
Left Leaning Red-Black Trees. A significantly simplified implementation of red-black trees by Robert Sedgewick published in 2008 (~half the lines of code to implement). If you've ever had trouble wrapping your head around the implementation of a Red-Black tree, read about this variant.
Very similar (if not identical) to Andersson Trees.
Work Stealing Queue
Lock-free data structure for dividing the work equaly among multiple threads
Implementation of a work stealing queue in C/C++?
Bootstrapped skew-binomial heaps by Gerth Stølting Brodal and Chris Okasaki:
Despite their long name, they provide asymptotically optimal heap operations, even in a function setting.
O(1) size, union, insert, minimum
O(log n) deleteMin
Note that union takes O(1) rather than O(log n) time unlike the more well-known heaps that are commonly covered in data structure textbooks, such as leftist heaps. And unlike Fibonacci heaps, those asymptotics are worst-case, rather than amortized, even if used persistently!
There are multiple implementations in Haskell.
They were jointly derived by Brodal and Okasaki, after Brodal came up with an imperative heap with the same asymptotics.
Kd-Trees, spatial data structure used (amongst others) in Real-Time Raytracing, has the downside that triangles that cross intersect the different spaces need to be clipped. Generally BVH's are faster because they are more lightweight.
MX-CIF Quadtrees, store bounding boxes instead of arbitrary point sets by combining a regular quadtree with a binary tree on the edges of the quads.
HAMT, hierarchical hash map with access times that generally exceed O(1) hash-maps due to the constants involved.
Inverted Index, quite well known in the search-engine circles, because it's used for fast retrieval of documents associated with different search-terms.
Most, if not all, of these are documented on the NIST Dictionary of Algorithms and Data Structures
Ball Trees. Just because they make people giggle.
A ball tree is a data structure that indexes points in a metric space. Here's an article on building them. They are often used for finding nearest neighbors to a point or accelerating k-means.
Not really a data structure; more of a way to optimize dynamically allocated arrays, but the gap buffers used in Emacs are kind of cool.
Fenwick Tree. It's a data structure to keep count of the sum of all elements in a vector, between two given subindexes i and j. The trivial solution, precalculating the sum since the begining doesn't allow to update a item (you have to do O(n) work to keep up).
Fenwick Trees allow you to update and query in O(log n), and how it works is really cool and simple. It's really well explained in Fenwick's original paper, freely available here:
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol24/issue3/spe884.pdf
Its father, the RQM tree is also very cool: It allows you to keep info about the minimum element between two indexes of the vector, and it also works in O(log n) update and query. I like to teach first the RQM and then the Fenwick Tree.
Van Emde-Boas trees. I have even a C++ implementation of it, for up to 2^20 integers.
Nested sets are nice for representing trees in the relational databases and running queries on them. For instance, ActiveRecord (Ruby on Rails' default ORM) comes with a very simple nested set plugin, which makes working with trees trivial.
It's pretty domain-specific, but half-edge data structure is pretty neat. It provides a way to iterate over polygon meshes (faces and edges) which is very useful in computer graphics and computational geometry.