What is the best way to represent a collection of documents in a fixed length vector? - deep-learning

I am trying to build a deep neural networks that takes in a set of documents and predicts the category it belongs.
Since number of documents in each collection is not fixed, my first attempt was to get a mapping of documents from doc2vec and use the average.
The accuracy on training is high as 90% but the testing accuracy is low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?

The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help - as the "noise" in more numerous varied examples tends of cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable length input, but what if the number of words might correlate with the target? Let's say we want to perform a sentiment analysis for some reviews where the longer reviews are more probable to be bad. How can the model harness this knowledge? Of course a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted-like approach wouldn't reveal more complex relations, for example if there is a high number of word X, it correlates with target 1, except if there is also high number of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations in the topic are also well appreciated.

Saving Random Forest Classifiers (sklearn) with picke/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is, that the saved files get huge. This seems to be correlated to the amount of data that I use for training (which is several 10-millions of samples per forest, leading to dumped files in the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
X_train, y_train = ... some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
pickle.dump(classifier, open(path+'classifier.pickle', 'wb'))
or
joblib.dump(classifier, path+'classifier.joblib', compress=True)
Since the saved files got quite big (5GB to nearly 20GB, compressed aprox. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the train set, I found different sizes for the saved classifier, making me believe that information about the training is pickled/joblibed as well. This seems unintuitive to me, as for predictions, I only need the information of all the trained weak predictors (decision trees) which should be steady and since the number of trees and the max depth is not too high, they should also not take up that much space. And certainly not more due to a larger training set.
All in all, I suspect that the structure is containing more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the necessary information for my future predictions.
I ran into a similar issue and I also thought in the beginning that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out in fact that decision trees are indeed memory hungry structures that consists of multiple arrays of length given by the total number of nodes. Nodes in general grow with the size of data (and parameters like max_depth cannot effectively used to limit growth since the reasonable values still have room to generate huge number of nodes). See details in this answer but the gist is:
a single decision tree can easy grow to a few MBs (example above has a 5MB decision tree for 100K data and a 50MB decision tree for 1M data)
a random forest commonly contains at least 100 such decision tree and for the example above you would have models in the range of 0.5/5GB
compression is usually not enough to reduce to reasonable sizes (1/2, 1/3 are usual ranges)
Other notes:
using a different algorithm models might remain of a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by decision trees if you only plan it to reuse it for prediction. In particular I imagine the array of impurity and possible those on n_samples might not be needed but I have not checked.
with respect to you hypothesis that the random forest is saving the data on which it is trained: not it is not and the data itself would likely be one or more order of magnitude smaller than the final model
so in principle another strategy if you have a reproducible training pipeline could be to save the data instead of the model and retrain on purpose, but this is only possible if you can spare the time to retrain (for example if in a use case where you have a long running service which has the model in memory and you serialize the model in order to have a backup for when the model goes down)
there are probably also other options to limit growth of random forest, the best one I have found until now is in this answer, where the suggestion is to work with min_samples_leaf to set it as a percentage of data

Mixing text and numeric features for text classification using deep learning

I have a problem about classification of text into several categories (topics). Apart from text, I have some numeric features that I believe may be useful (there are also missing values among those features). But the most important information is, of course, presented in the text. Therefore, I think deep learning approach (with a common pipeline: embedding layer + CNN or RNN with dropout + Dense layer) would be the best choice. What is the best practice to mix the current model that works only on text input with numeric features? Are there any tricks, best common practices, state-of-the-art research going on in this field? Are there any papers/experiments (on GitHub, maybe) on this topic?
It'd be great if we could think of the problem in general, but for the sake of having an idea of what sort of problem we may solve, I will give a specific example. Let's suppose we have reviews from users in which they describe a problem they faced while receiving a service or purchasing an item. The target feature is multi-label: the set of tags (categories/topics) associated with the complaint that a user had (we should choose relevant ones among a few hundreds of possible topics).
Then apart from the user's comment itself (which is the most important feature), we may want also to take into account some numerical features like price, waiting time, rating (customer satisfaction score), etc. This can potentially be useful for predicting some particular categories.
The idea is to mix all these features somehow in a deep learning model to produce the final model. Not sure if I know much about the best ways how to do it. What are the best practices / useful tricks for this kinds of problems?
For each numeric feature, statistically have a representation (you can use pandas.DataFrame.describe), also plotting the distribution would visually make you stronger.
After having the values of mean, std, max, min etc. You should get rid ofoutliers which can harm your training model. For example, if your features have its 90% of its numeric values from 18 to 72 but has also values like 1.1 or 1200 etc. you should get rid of those by equalizing them to 18 or 72 depending on the side. You can use np.clip()
After having a reasonable distribution, you should convert those numeric features to categorical features. For instance, numeric distribution from 18 to 72 can be grouped as 18, 27, 36, ......, 72, taking the intervals. You can increase the resolution or decrease it, depending on your understanding and the performance of the algorithm. You can use np.digitize() or do manually by a simple function that you can write.
In the end you have a categorical feature just like the texts. CNN or RNN can work fine with categorical representations of the numeric values as well as you get the better advantage to have feature crosses to increase your performance.
But if you ask for something of more complex, I might not have understood your question or I may not know it. Still, if you want to ask more or differently, I will be happy to try to help.

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual code example in Keras I looked at for LSTMs, they padded the sentences to equal lengths before all. By doing so, doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction of the parameters that the model has to learn. This is the whole purpose of using a RNN.
If you would learn a different network for each time step and feed the output of the first model to the second etc. you would end up with a regular feed-forward network. For a number of 20 time steps, you would have 20 models to learn. In Convolutional Nets, parameters are shared by the Convolutional Filters because when we can assume that there are similar interesting patterns in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that there are similar patterns at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs on different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step, where it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So, it has to remember, what word came in which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, this can be done by using parameter sharing. Padding is used for efficient training - specifically to keep the computational graph during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory. But in practice, this is inefficient, because TensorFlow's computational graphs don't allow recurrency, they are feedforward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths this means 30 different graphs during training, so for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100 for example), you can use bucketing and padding. This means, you pad the sequences to different bucket lengths, for example [5, 20, 50, 100]. Then, you create a computational graph for each bucket. The advantage of this is, that you don't have to pad a sequence of length 5 to 100, as you would waste a lot of time on "learning" the 95 padding tokens in there.

What are the lesser known but useful data structures?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?
Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.
PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.
EDIT:
Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.
Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.
Bloom filter: Bit array of m bits, initially all set to 0.
To add an item you run it through k hash functions that will give you k indices in the array which you then set to 1.
To check if an item is in the set, compute the k indices and check if they are all set to 1.
Of course, this gives some probability of false-positives (according to wikipedia it's about 0.61^(m/n) where n is the number of inserted items). False-negatives are not possible.
Removing an item is impossible, but you can implement counting bloom filter, represented by array of ints and increment/decrement.
Rope: It's a string that allows for cheap prepends, substrings, middle insertions and appends. I've really only had use for it once, but no other structure would have sufficed. Regular strings and arrays prepends were just far too expensive for what we needed to do, and reversing everthing was out of the question.
Skip lists are pretty neat.
Wikipedia
A skip list is a probabilistic data structure, based on multiple parallel, sorted linked lists, with efficiency comparable to a binary search tree (order log n average time for most operations).
They can be used as an alternative to balanced trees (using probalistic balancing rather than strict enforcement of balancing). They are easy to implement and faster than say, a red-black tree. I think they should be in every good programmers toolchest.
If you want to get an in-depth introduction to skip-lists here is a link to a video of MIT's Introduction to Algorithms lecture on them.
Also, here is a Java applet demonstrating Skip Lists visually.
Spatial Indices, in particular R-trees and KD-trees, store spatial data efficiently. They are good for geographical map coordinate data and VLSI place and route algorithms, and sometimes for nearest-neighbor search.
Bit Arrays store individual bits compactly and allow fast bit operations.
Zippers - derivatives of data structures that modify the structure to have a natural notion of 'cursor' -- current location. These are really useful as they guarantee indicies cannot be out of bound -- used, e.g. in the xmonad window manager to track which window has focused.
Amazingly, you can derive them by applying techniques from calculus to the type of the original data structure!
Here are a few:
Suffix tries. Useful for almost all kinds of string searching (http://en.wikipedia.org/wiki/Suffix_trie#Functionality). See also suffix arrays; they're not quite as fast as suffix trees, but a whole lot smaller.
Splay trees (as mentioned above). The reason they are cool is threefold:
They are small: you only need the left and right pointers like you do in any binary tree (no node-color or size information needs to be stored)
They are (comparatively) very easy to implement
They offer optimal amortized complexity for a whole host of "measurement criteria" (log n lookup time being the one everybody knows). See http://en.wikipedia.org/wiki/Splay_tree#Performance_theorems
Heap-ordered search trees: you store a bunch of (key, prio) pairs in a tree, such that it's a search tree with respect to the keys, and heap-ordered with respect to the priorities. One can show that such a tree has a unique shape (and it's not always fully packed up-and-to-the-left). With random priorities, it gives you expected O(log n) search time, IIRC.
A niche one is adjacency lists for undirected planar graphs with O(1) neighbour queries. This is not so much a data structure as a particular way to organize an existing data structure. Here's how you do it: every planar graph has a node with degree at most 6. Pick such a node, put its neighbors in its neighbor list, remove it from the graph, and recurse until the graph is empty. When given a pair (u, v), look for u in v's neighbor list and for v in u's neighbor list. Both have size at most 6, so this is O(1).
By the above algorithm, if u and v are neighbors, you won't have both u in v's list and v in u's list. If you need this, just add each node's missing neighbors to that node's neighbor list, but store how much of the neighbor list you need to look through for fast lookup.
I think lock-free alternatives to standard data structures i.e lock-free queue, stack and list are much overlooked.
They are increasingly relevant as concurrency becomes a higher priority and are much more admirable goal than using Mutexes or locks to handle concurrent read/writes.
Here's some links
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
http://www.research.ibm.com/people/m/michael/podc-1996.pdf [Links to PDF]
http://www.boyet.com/Articles/LockfreeStack.html
Mike Acton's (often provocative) blog has some excellent articles on lock-free design and approaches
I think Disjoint Set is pretty nifty for cases when you need to divide a bunch of items into distinct sets and query membership. Good implementation of the Union and Find operations result in amortized costs that are effectively constant (inverse of Ackermnan's Function, if I recall my data structures class correctly).
Fibonacci heaps
They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.
Anyone with experience in 3D rendering should be familiar with BSP trees. Generally, it's the method by structuring a 3D scene to be manageable for rendering knowing the camera coordinates and bearing.
Binary space partitioning (BSP) is a
method for recursively subdividing a
space into convex sets by hyperplanes.
This subdivision gives rise to a
representation of the scene by means
of a tree data structure known as a
BSP tree.
In other words, it is a method of
breaking up intricately shaped
polygons into convex sets, or smaller
polygons consisting entirely of
non-reflex angles (angles smaller than
180°). For a more general description
of space partitioning, see space
partitioning.
Originally, this approach was proposed
in 3D computer graphics to increase
the rendering efficiency. Some other
applications include performing
geometrical operations with shapes
(constructive solid geometry) in CAD,
collision detection in robotics and 3D
computer games, and other computer
applications that involve handling of
complex spatial scenes.
Huffman trees - used for compression.
Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.
As per the original article:
Our functional 2-3 finger trees are an instance of a general design technique in- troduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.
A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.
Circular or ring buffer - used for streaming, among other things.
I'm surprised no one has mentioned Merkle trees (ie. Hash Trees).
Used in many cases (P2P programs, digital signatures) where you want to verify the hash of a whole file when you only have part of the file available to you.
<zvrba> Van Emde-Boas trees
I think it'd be useful to know why they're cool. In general, the question "why" is the most important to ask ;)
My answer is that they give you O(log log n) dictionaries with {1..n} keys, independent of how many of the keys are in use. Just like repeated halving gives you O(log n), repeated sqrting gives you O(log log n), which is what happens in the vEB tree.
How about splay trees?
Also, Chris Okasaki's purely functional data structures come to mind.
An interesting variant of the hash table is called Cuckoo Hashing. It uses multiple hash functions instead of just 1 in order to deal with hash collisions. Collisions are resolved by removing the old object from the location specified by the primary hash, and moving it to a location specified by an alternate hash function. Cuckoo Hashing allows for more efficient use of memory space because you can increase your load factor up to 91% with only 3 hash functions and still have good access time.
A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this by by a simple change to the heap property: A tree is said to be min-max ordered if every element on even (odd) levels are less (greater) than all childrens and grand children. The levels are numbered starting from 1.
http://internet512.chonbuk.ac.kr/datastructure/heap/img/heap8.jpg
I like Cache Oblivious datastructures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that convenient fit in them. This leads to efficient use of caching at everything from L1 cache in RAM to big chunks of data read off of the disk without needing to know the specifics of the sizes of any of those caching layers.
Left Leaning Red-Black Trees. A significantly simplified implementation of red-black trees by Robert Sedgewick published in 2008 (~half the lines of code to implement). If you've ever had trouble wrapping your head around the implementation of a Red-Black tree, read about this variant.
Very similar (if not identical) to Andersson Trees.
Work Stealing Queue
Lock-free data structure for dividing the work equaly among multiple threads
Implementation of a work stealing queue in C/C++?
Bootstrapped skew-binomial heaps by Gerth Stølting Brodal and Chris Okasaki:
Despite their long name, they provide asymptotically optimal heap operations, even in a function setting.
O(1) size, union, insert, minimum
O(log n) deleteMin
Note that union takes O(1) rather than O(log n) time unlike the more well-known heaps that are commonly covered in data structure textbooks, such as leftist heaps. And unlike Fibonacci heaps, those asymptotics are worst-case, rather than amortized, even if used persistently!
There are multiple implementations in Haskell.
They were jointly derived by Brodal and Okasaki, after Brodal came up with an imperative heap with the same asymptotics.
Kd-Trees, spatial data structure used (amongst others) in Real-Time Raytracing, has the downside that triangles that cross intersect the different spaces need to be clipped. Generally BVH's are faster because they are more lightweight.
MX-CIF Quadtrees, store bounding boxes instead of arbitrary point sets by combining a regular quadtree with a binary tree on the edges of the quads.
HAMT, hierarchical hash map with access times that generally exceed O(1) hash-maps due to the constants involved.
Inverted Index, quite well known in the search-engine circles, because it's used for fast retrieval of documents associated with different search-terms.
Most, if not all, of these are documented on the NIST Dictionary of Algorithms and Data Structures
Ball Trees. Just because they make people giggle.
A ball tree is a data structure that indexes points in a metric space. Here's an article on building them. They are often used for finding nearest neighbors to a point or accelerating k-means.
Not really a data structure; more of a way to optimize dynamically allocated arrays, but the gap buffers used in Emacs are kind of cool.
Fenwick Tree. It's a data structure to keep count of the sum of all elements in a vector, between two given subindexes i and j. The trivial solution, precalculating the sum since the begining doesn't allow to update a item (you have to do O(n) work to keep up).
Fenwick Trees allow you to update and query in O(log n), and how it works is really cool and simple. It's really well explained in Fenwick's original paper, freely available here:
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol24/issue3/spe884.pdf
Its father, the RQM tree is also very cool: It allows you to keep info about the minimum element between two indexes of the vector, and it also works in O(log n) update and query. I like to teach first the RQM and then the Fenwick Tree.
Van Emde-Boas trees. I have even a C++ implementation of it, for up to 2^20 integers.
Nested sets are nice for representing trees in the relational databases and running queries on them. For instance, ActiveRecord (Ruby on Rails' default ORM) comes with a very simple nested set plugin, which makes working with trees trivial.
It's pretty domain-specific, but half-edge data structure is pretty neat. It provides a way to iterate over polygon meshes (faces and edges) which is very useful in computer graphics and computational geometry.