Formal definition of a Plane versus a Layer - terminology

For instance, what is the difference between a Data Plane and a Data Layer? What is the formal definition of those terms in general, plane vs layer?

Doesn't introduction of polynomial features lead to increased collinearity?

I was going through linear and logistic regression in ISLR, and in both cases I found that one of the approaches adopted to increase the flexibility of the model was to use polynomial features: using both X and X^2 as features and then applying the regression models as usual while treating X and X^2 as independent features (in sklearn, not the polynomial fit of statsmodels). Doesn't that increase the collinearity amongst the features, though? How does it affect the model performance?
To summarize my thoughts regarding this -
First, X and X^2 undoubtedly have substantial correlation.
Second, I wrote a blog demonstrating that, at least in Linear regression, collinearity amongst features does not affect the model fit score though it makes the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multicollinearity isn't always a hindrance; it varies from dataset to dataset. If your model isn't giving you the best results (high accuracy or low loss), you then remove the outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't bother about them.
The same goes for polynomial regression. Yes, it adds multicollinearity to your model by introducing x^2, x^3, ... features.
To overcome that, you can use orthogonal polynomial regression, which introduces polynomials that are orthogonal to each other.
But it will still introduce higher-degree polynomials, which can become unstable at the boundaries of your data space.
To overcome this issue, you can use regression splines, which divide the range of the data into separate portions and fit linear or low-degree polynomial functions on each of these portions. The points where the divisions occur are called knots, and the functions used to model each piece/bin are known as piecewise functions. These piecewise functions come with smoothness constraints: for example, if cubic (degree-3) pieces are used, the fitted function should be continuous and twice differentiable at the knots.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a Spline.
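A minimal sketch of these points, assuming numpy and scikit-learn (version 1.0+ for SplineTransformer): raw x and x^2 are strongly correlated, orthogonal (here Legendre) polynomial features are nearly uncorrelated, and a spline basis builds low-degree pieces joined at knots.

```python
import numpy as np
from numpy.polynomial import legendre
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)

# Raw polynomial features: heavy collinearity between x and x^2
print(np.corrcoef(x, x**2)[0, 1])           # roughly 0.97 on this interval

# Orthogonal (Legendre) polynomial features: map x to [-1, 1], then the
# degree-1 and degree-2 basis columns are nearly uncorrelated
u = 2 * (x - x.min()) / (x.max() - x.min()) - 1
V = legendre.legvander(u, 2)                # columns: P0(u), P1(u), P2(u)
print(np.corrcoef(V[:, 1], V[:, 2])[0, 1])  # close to 0

# Spline basis: piecewise cubic pieces joined at the knots
basis = SplineTransformer(degree=3, n_knots=5).fit_transform(x.reshape(-1, 1))
print(basis.shape)                          # (1000, n_knots + degree - 1) columns
```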

How is feature importance calculated in regression trees?

In the case of classification using the decision tree algorithm or random forest, we use Gini impurity or information gain as a measure to decide which feature to select first for splitting a parent/intermediate node. But if we are doing regression with a decision tree or random forest, how is feature importance calculated, or how are the features selected?
For regression (feature selection), the goal of splitting is to get two children with the lowest variance among the target values.
You can compare the 'criterion' parameter for regression vs. classification in the sklearn library to get a better idea.
You can also check this video: https://www.youtube.com/watch?v=nSaOuPCNvlk
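A small sketch of this, assuming scikit-learn (where "squared_error" is the current name of the MSE/variance-reduction criterion): the regression tree splits to reduce the variance of the targets in each child, and the resulting importances are exposed as feature_importances_.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# y depends strongly on feature 0, weakly on feature 1, and not at all on feature 2
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(criterion="squared_error", max_depth=4).fit(X, y)

# Importance = total impurity (variance) reduction contributed by each feature,
# normalised to sum to 1; feature 0 dominates here
print(tree.feature_importances_)
```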

Predicting rare events and their strength with LSTM autoencoder

I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first using an LSTM autoencoder to extract features and then feeding the embeddings to a second LSTM that makes the actual prediction. According to them, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've created the model, but instead of adding one LSTM from embeddings to predictions, I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong it will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
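For reference, a minimal sketch of that three-loss setup, assuming Keras/TensorFlow; the layer sizes, names, and loss weights below are illustrative assumptions, not values from the question or the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features, latent_dim = 30, 1, 16

inputs = keras.Input(shape=(timesteps, n_features))
# Encoder: compress the input window into a fixed-size embedding
embedding = layers.LSTM(latent_dim, name="encoder")(inputs)

# Decoder: reconstruct the input window (reconstruction loss = MSE)
decoded = layers.LSTM(latent_dim, return_sequences=True)(layers.RepeatVector(timesteps)(embedding))
reconstruction = layers.TimeDistributed(layers.Dense(n_features), name="reconstruction")(decoded)

# Head 1: binary "event / no event" prediction, ending in a sigmoid (binary cross-entropy loss)
h_bin = layers.LSTM(8)(layers.RepeatVector(timesteps)(embedding))
event = layers.Dense(1, activation="sigmoid", name="event")(h_bin)

# Head 2: event-strength regression (MSE loss)
h_reg = layers.LSTM(8)(layers.RepeatVector(timesteps)(embedding))
strength = layers.Dense(1, name="strength")(h_reg)

model = keras.Model(inputs, [reconstruction, event, strength])
model.compile(
    optimizer="adam",
    loss={"reconstruction": "mse", "event": "binary_crossentropy", "strength": "mse"},
    # The relative loss weights are an assumption and usually need tuning
    loss_weights={"reconstruction": 1.0, "event": 1.0, "strength": 0.5},
)
model.summary()
```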
The thing is that I'm not sure it is learning anything: the binary loss stays at 0.5, and even the reconstruction loss is not really good. And of course, the bad thing is that the time series is mostly zeros, with some values from 1 to 10, so MSE is definitely not a good metric.
What do you think about this approach?
Is this the best architecture for predicting rare events? If not, which one would be better?
Should I add some CNN or FC layers between the embeddings and the two prediction LSTMs, to extract 1D patterns from the embedding, or feed the embeddings directly into the prediction layers?
Should there be just one prediction LSTM, using only an MSE loss?
Would it be a good idea to multiply the two predictions together, to force the days predicted to have no event to coincide in both cases?
Thanks,

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document, and then we try to optimize the topic assignments by observing the data. Where is the part related to posterior inference in the equation above?
If you look at the inference derivation on Wiki, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution uniquely determined by them. The main reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is to make the math tractable by exploiting the nice form of the conjugate prior (here, Dirichlet and categorical distributions; the categorical distribution is a special case of the multinomial distribution where n is set to one, i.e. only one trial). Also, the Dirichlet distribution lets us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words per document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it would work.
Posterior inference is replaced with joint-probability inference: in Gibbs sampling, at least, you need the joint probability while picking one dimension at a time to "transform the state", as the Metropolis-Hastings paradigm does. The formula you put here is essentially derived from the joint probability P(w, z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why the inference works.
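For concreteness, the standard collapsed Gibbs sampling update for LDA with symmetric priors (this may not be exactly the equation in the question, but the structure is the same) makes the role of the hyperparameters explicit:

```latex
% Collapsed Gibbs update for the topic assignment z_i of token i (word w_i in document d_i).
% Counts n^{-i} exclude the current token i; V is the vocabulary size.
P(z_i = k \mid z_{-i}, w) \;\propto\;
    \left( n^{-i}_{d_i,\, k} + \alpha \right)
    \cdot
    \frac{ n^{-i}_{k,\, w_i} + \beta }{ n^{-i}_{k,\, \cdot} + V \beta }
```

Without the + alpha and + beta smoothing terms, any topic whose current document count or word count happens to be zero would get probability exactly zero and could never be sampled for that token again, so the priors are part of what keeps the sampler (and the posterior) well defined.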

Does ResNet have fully connected layers?

In my understanding, a fully connected layer (FC for short) is used for prediction.
For example, VGGNet uses 2 FC layers, which are both of dimension 4096. The last layer, for the softmax, has a dimension equal to the number of classes: 1000.
But ResNet uses global average pooling, and uses the pooled result of the last convolutional layer as the input.
But it still has an FC layer! Is this layer really an FC layer? Or is it only there to turn the input into a vector of features whose length is the number of classes? Does this layer play a role in producing the prediction?
In short, how many FC layers do ResNet and VGGNet have? Do VGGNet's 1st, 2nd, and 3rd FC layers have different functions?
VGG has three FC layers, two with 4096 neurons and one with 1000 neurons, which outputs the class probabilities.
ResNet has only one FC layer, with 1000 neurons, which again outputs the class probabilities. In an NN classifier the standard choice is a softmax output; some authors make this explicit in the diagram while others do not.
In essence, the team at Microsoft (ResNet) favours more convolutional layers instead of fully connected ones and therefore omits most of the fully connected layers. Global average pooling also decreases the feature size dramatically and therefore reduces the number of parameters going from the convolutional part to the fully connected part.
I would argue that the performance difference is quite slim, but one of the main accomplishments of introducing ResNets is the dramatic reduction of parameters, and those two points helped them accomplish that.
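A quick way to check this, assuming PyTorch and torchvision are available: count the nn.Linear (fully connected) modules in each architecture.

```python
import torch.nn as nn
from torchvision import models

def linear_layers(model):
    # Collect every fully connected (nn.Linear) module in the network
    return [m for m in model.modules() if isinstance(m, nn.Linear)]

vgg = models.vgg16()        # classifier: 25088 -> 4096 -> 4096 -> 1000
resnet = models.resnet50()  # single fc layer after global average pooling

print(len(linear_layers(vgg)))     # 3
print(len(linear_layers(resnet)))  # 1
print(resnet.fc)                   # Linear(in_features=2048, out_features=1000, bias=True)
```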