I’m working with different types of financial data inputs for my models and I would like to know more about normalization of them.
In particular, working with some technical indicators, I’ve normalized them to have a range between 0 and 1.
Others were normalized to have a range between -1 and 1.
What is your experience with mixed normalized data?
Could it be acceptable to have these two ranges or is it always better to have the training dataset with a single range i.e. [0 1]?
It is important to note that when we discuss data normalization, we are usually referring to the normalization of continuous data. Categorical data (usually) doesn't require the former.
Furthermore, not all ML methods need you to normalize data for them to function well. Examples of such methods include Random Forests and Gradient Boosting Machines. Others, however, do. For instance, Support Vector Machines and Neural Networks.
The reasons for input data normalization are dependent on the methods themselves. For SVMs, data normalization is done to ensure that input features are given equal importance in influencing the model's decisions. For neural networks, we normalize data to allow the gradient descent process to converge smoothly.
Finally, to answer your question, if you are working with continuous data and using a neural network to model your data, just make sure that the normalized data's values are close to each other (even if they are not the same range) because that is what determines the ease with which the gradient descent process converges. If you are working with an SVM, it would be better to normalize your data to a single range, so that all features may be given equal importance by the similarity/ distance function that your SVM uses. In other cases, the need for data normalization, whatever the ranges, may be removed entirely. Ultimately, it depends on the modeling technique you are using!
Credit to #user3666197 for the helpful feedback in the comments.
Related
I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is, that the saved files get huge. This seems to be correlated to the amount of data that I use for training (which is several 10-millions of samples per forest, leading to dumped files in the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
X_train, y_train = ... some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
pickle.dump(classifier, open(path+'classifier.pickle', 'wb'))
or
joblib.dump(classifier, path+'classifier.joblib', compress=True)
Since the saved files got quite big (5GB to nearly 20GB, compressed aprox. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the train set, I found different sizes for the saved classifier, making me believe that information about the training is pickled/joblibed as well. This seems unintuitive to me, as for predictions, I only need the information of all the trained weak predictors (decision trees) which should be steady and since the number of trees and the max depth is not too high, they should also not take up that much space. And certainly not more due to a larger training set.
All in all, I suspect that the structure is containing more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the necessary information for my future predictions.
I ran into a similar issue and I also thought in the beginning that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out in fact that decision trees are indeed memory hungry structures that consists of multiple arrays of length given by the total number of nodes. Nodes in general grow with the size of data (and parameters like max_depth cannot effectively used to limit growth since the reasonable values still have room to generate huge number of nodes). See details in this answer but the gist is:
a single decision tree can easy grow to a few MBs (example above has a 5MB decision tree for 100K data and a 50MB decision tree for 1M data)
a random forest commonly contains at least 100 such decision tree and for the example above you would have models in the range of 0.5/5GB
compression is usually not enough to reduce to reasonable sizes (1/2, 1/3 are usual ranges)
Other notes:
using a different algorithm models might remain of a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by decision trees if you only plan it to reuse it for prediction. In particular I imagine the array of impurity and possible those on n_samples might not be needed but I have not checked.
with respect to you hypothesis that the random forest is saving the data on which it is trained: not it is not and the data itself would likely be one or more order of magnitude smaller than the final model
so in principle another strategy if you have a reproducible training pipeline could be to save the data instead of the model and retrain on purpose, but this is only possible if you can spare the time to retrain (for example if in a use case where you have a long running service which has the model in memory and you serialize the model in order to have a backup for when the model goes down)
there are probably also other options to limit growth of random forest, the best one I have found until now is in this answer, where the suggestion is to work with min_samples_leaf to set it as a percentage of data
I use Albumentations augmentations in my computer vision tasks. However, I don't fully understand when to use normalization on my images (I use min-max normalization). Do I need to use normalization before augmentation functions, but values would not be between 0-1, or do I use normalization just after augmentations, so that values are between 0-1, or I use normalization in both cases - before and after augmentations?
For example, when I use Sharpen, values are not in 0-1 range (they vary in -0.5-1.5 range). Does that affect model performance? If yes, how?
Thanks in advance.
The basic idea is that you should have the input of your neural network around 0 and with a variance of 1. There is a mathematical reason why it helps the learning process of neural network. This is not the case for other algorithms like tree boosting.
If you train from scratch the type of normalization (min max or other) should not impact the model performance (except if, for exemple your max/min value is really extrem compare to your other data point).
I have a very complex LSTM based neural network model which I'm training on Quora Duplicate Question pairs. There are approximately 400 000 sentence pairs in the original dataset. It would take a lot of processing power and computation time to train on the entire (or 80%) dataset. Would it be unwise if I choose a random subset of the dataset (say 8000 pairs only) for training and 2000 for testing? Would it have a severe impact on the performance? Is always "more the data, better the model" true?
As a Rule of Thumb, Deep Neural Networks usually benefit from more data.
If you have a well described model and properly engineered your inputs, you will lose if you chose a smaller subset of your dataset.
However, you could always evaluate this by using metrics. Check how your loss decreases at every sample size, starting from your 8000 pairs.
For big problems, you always have to keep in mind that computation time is usually also big.
I am trying to build a deep neural networks that takes in a set of documents and predicts the category it belongs.
Since number of documents in each collection is not fixed, my first attempt was to get a mapping of documents from doc2vec and use the average.
The accuracy on training is high as 90% but the testing accuracy is low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help - as the "noise" in more numerous varied examples tends of cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.
I have a dataset with both binary data (0,1) and numeric data with different units. If I want to apply some machine learning techniques to classify my data (potentially autoencoder or hierarchy clustering), should I standardize or normalize the data?
Thank you!
It depends.
For neural networks you may want to standardize continuous variables for numerical reasons. But it depends on your platform. Consider Googles TPUs: they work with 1 byte precision, so you want the relevant input domain to use this limited range optimally.
For distance based methods like clustering, preprocessing the data is crucial, but difficult. It is false that standardizing is always the right thing to do. But it is fairly common to apply some normalization. But you need a domain expert to find the best normalization.