How is feature importance calculated in regression trees? - regression

For classification with a decision tree or Random Forest, we use Gini impurity or information gain to decide which feature to split on at each parent/intermediate node. If we run regression with a decision tree or random forest instead, how are the features selected, and how is feature importance calculated?

For regression (feature selection), the goal of each split is to produce two child nodes with the lowest variance among their target values.
You can compare the 'criterion' parameter of the regression and classification estimators in the sklearn library to get a better idea.
You can also check this video: https://www.youtube.com/watch?v=nSaOuPCNvlk
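The variance-based split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not sklearn's implementation: the helper names (variance_reduction, best_split) and the toy data are made up for the example. As far as I know, tree libraries such as sklearn build feature importance by summing this kind of weighted reduction over every node where a feature is used and then normalizing.

```python
from statistics import pvariance


def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(y_parent)
    weighted_children = (
        len(y_left) * pvariance(y_left) + len(y_right) * pvariance(y_right)
    ) / n
    return pvariance(y_parent) - weighted_children


def best_split(rows, targets):
    """Try every midpoint of every feature; keep the largest variance reduction."""
    best = (None, None, 0.0)  # (feature index, threshold, reduction)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            gain = variance_reduction(targets, left, right)
            if gain > best[2]:
                best = (j, threshold, gain)
    return best


# Toy data (hypothetical): feature 0 drives the target, feature 1 is noise.
rows = [(1.0, 7.0), (2.0, 3.0), (3.0, 8.0), (4.0, 2.0)]
targets = [10.0, 11.0, 20.0, 21.0]

feature, threshold, gain = best_split(rows, targets)
print(feature, threshold, gain)
```

Here the split on feature 0 at 2.5 separates the low targets from the high ones, so it removes almost all of the variance, while any split on feature 1 removes much less.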

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable length input, but what if the number of words might correlate with the target? Let's say we want to perform a sentiment analysis for some reviews where the longer reviews are more probable to be bad. How can the model harness this knowledge? Of course a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted-like approach wouldn't reveal more complex relations, for example if there is a high number of word X, it correlates with target 1, except if there is also high number of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also much appreciated.

Deep Learning Data Normalization

I’m working with different types of financial data inputs for my models and I would like to know more about normalization of them.
In particular, working with some technical indicators, I’ve normalized them to have a range between 0 and 1.
Others were normalized to have a range between -1 and 1.
What is your experience with mixed normalized data?
Could it be acceptable to have these two ranges or is it always better to have the training dataset with a single range i.e. [0 1]?
It is important to note that when we discuss data normalization, we are usually referring to the normalization of continuous data. Categorical data (usually) doesn't require it.
Furthermore, not all ML methods need you to normalize data for them to function well. Examples of such methods include Random Forests and Gradient Boosting Machines. Others, however, do. For instance, Support Vector Machines and Neural Networks.
The reasons for input data normalization are dependent on the methods themselves. For SVMs, data normalization is done to ensure that input features are given equal importance in influencing the model's decisions. For neural networks, we normalize data to allow the gradient descent process to converge smoothly.
Finally, to answer your question: if you are working with continuous data and using a neural network to model it, just make sure that the normalized features are on similar scales (even if their ranges are not identical), because that is what determines how easily the gradient descent process converges. If you are working with an SVM, it would be better to normalize your data to a single range, so that all features are given equal importance by the similarity/distance function that your SVM uses. In other cases, the need for data normalization, whatever the ranges, may be removed entirely. Ultimately, it depends on the modeling technique you are using!
Credit to #user3666197 for the helpful feedback in the comments.
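To make the two ranges from the question concrete, here is a minimal sketch of min-max scaling to an arbitrary target range. The function name min_max_scale and the sample indicator values are assumptions made for illustration; they are not from the question.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum lands on `low` and the maximum on `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot scale a constant feature")
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


# Hypothetical raw indicator values.
indicator = [10.0, 20.0, 30.0]

scaled_unit = min_max_scale(indicator)               # range [0, 1]
scaled_signed = min_max_scale(indicator, -1.0, 1.0)  # range [-1, 1]

print(scaled_unit)
print(scaled_signed)
```

Both outputs stay within a width of at most 2 around zero, which is the "similar scale" property the answer asks for with neural networks. For an SVM, calling the function with the same low and high for every feature gives the single common range.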

Using OLS regression on binary outcome variable

I have previously been told that -- for reasons that make complete sense -- one shouldn't run OLS regressions when the outcome variable is binary (i.e. yes/no, true/false, win/loss, etc). However, I often read papers in economics/other social sciences in which researchers run OLS regressions on binary variables and interpret the coefficients just like they would for a continuous outcome variable. A few questions about this:
Why do they not run a logistic regression? Is there any disadvantage/limitation to using logit models? In economics, for example, I very often see papers using OLS regression for binary variable and not logit. Can logit only be used in certain situations?
In general, when can one run an OLS regression on ordinal data? If I have a variable that captures "number of times in a week survey respondent does X", can I - in any circumstance - use it as a dependent variable in a linear regression? I often see this being done in literature as well, even though we're always told in introductory statistics/econometrics that outcome variables in an OLS regression should be continuous.
Applying OLS to a binary outcome is called the Linear Probability Model (LPM). Compared to a logistic model, the LPM has advantages in terms of implementation and interpretation that make it an appealing option for researchers conducting impact analysis. In the LPM, the parameters represent mean marginal effects, whereas in logistic regression they represent log odds ratios. To calculate the mean marginal effects in logistic regression, we need to calculate the derivative of the predicted probability for every data point and then calculate the mean of those derivatives. While logistic regression and the LPM usually yield the same expected average impact estimate[1], many researchers prefer the LPM for estimating treatment impacts.
In general, yes, we can definitely apply OLS to an ordinal outcome. As in the previous case, applying OLS to a binary or ordinal outcome results in violations of the assumptions of OLS. However, many econometricians believe the practical effect of violating these assumptions is minor and that the simplicity of interpreting an OLS model outweighs the technical correctness of an ordered logit or probit model, especially when the ordinal outcome looks quasi-normal.
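The comparison between the LPM slope and the mean marginal effect of a logit can be sketched on simulated data. Everything below is an assumption made for illustration: the data-generating process, the coefficients, the sample size and the helper names (fit_lpm, fit_logit). The logit is fitted with a few Newton steps written by hand so that the example needs only the standard library; in practice a statistics package would do this for you.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_lpm(x, y):
    """OLS with one regressor: returns (intercept, slope)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logit(x, y, iterations=25):
    """Logistic regression with one regressor, fitted by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * a) for a in x]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * a for yi, pi, a in zip(y, p, x))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * a for wi, a in zip(w, x))
        h11 = sum(wi * a * a for wi, a in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated data (assumed process): P(y = 1 | x) = sigmoid(-0.5 + 1.0 * x).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [1.0 if rng.random() < sigmoid(-0.5 + a) else 0.0 for a in x]

_, lpm_slope = fit_lpm(x, y)
logit_b0, logit_b1 = fit_logit(x, y)

# Mean marginal effect: average the derivative b1 * p * (1 - p) over all points.
logit_ame = sum(
    logit_b1 * sigmoid(logit_b0 + logit_b1 * a) * (1.0 - sigmoid(logit_b0 + logit_b1 * a))
    for a in x
) / len(x)

print("LPM slope:", lpm_slope)
print("logit coefficient (log odds):", logit_b1)
print("logit mean marginal effect:", logit_ame)
```

On data like this, the LPM slope and the logit mean marginal effect should land close to each other, while the raw logit coefficient is on the log-odds scale and is not directly comparable.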
Reference:
[1] Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials. Mathematica Policy Research.

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on Wiki, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution that is uniquely determined by alpha and beta, respectively. The main reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is to make the math tractable by exploiting the nice form of a conjugate prior (here the Dirichlet and the categorical distribution; the categorical distribution is a special case of the multinomial distribution where n is set to one, i.e. only one trial). Also, the Dirichlet distribution can help us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words for a document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it will work.
The posterior inference is replaced with joint probability inference. At least in Gibbs sampling, you need the joint probability while picking one dimension at a time to "transform the state", as the Metropolis-Hastings paradigm does. The formula you put here is essentially derived from the joint probability P(w,z). I would like to refer you to the book Monte Carlo Statistical Methods (by Robert) to fully understand why inference works.
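The equation from the question is not reproduced here, so the sketch below assumes it is the usual collapsed Gibbs update with symmetric priors, where the weight of topic k for the current word is (doc-topic count + alpha) * (topic-word count + beta) / (topic total + V * beta), with all counts excluding the current word. The function name topic_probabilities and the counts are hypothetical. Under that assumption, it shows one practical effect of dropping the hyperparameters: a topic with a zero count gets probability zero and can never be sampled for that word.

```python
def topic_probabilities(doc_topic_counts, topic_word_counts, topic_totals,
                        vocab_size, alpha, beta):
    """Normalized sampling weights for one word under the assumed Gibbs update.

    All counts are assumed to exclude the word currently being resampled.
    """
    weights = []
    for k in range(len(doc_topic_counts)):
        doc_part = doc_topic_counts[k] + alpha
        word_part = (topic_word_counts[k] + beta) / (topic_totals[k] + vocab_size * beta)
        weights.append(doc_part * word_part)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical counts for two topics: topic 1 does not yet occur in this document.
doc_topic_counts = [3, 0]
topic_word_counts = [2, 1]
topic_totals = [10, 8]
vocab_size = 5

with_priors = topic_probabilities(
    doc_topic_counts, topic_word_counts, topic_totals, vocab_size, alpha=0.1, beta=0.01
)
without_priors = topic_probabilities(
    doc_topic_counts, topic_word_counts, topic_totals, vocab_size, alpha=0.0, beta=0.0
)

print(with_priors)
print(without_priors)
```

With alpha and beta set to zero, topic 1 receives weight zero, so the sampler stays locked into whatever topics the random initialization happened to place in the document.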

How is unsupervised deep learning used in sentiment analysis?

Typically, text classification, including sentiment analysis, can be performed in one of 2 ways: 1. Supervised learning, if there is enough labeled training data, and 2. Unsupervised learning, when there is not enough training data or it is not prelabeled.
I have only a collection of tweets that contains only the text (reviews), and there is no polarity label for each tweet.
My question is: is there any method to do sentiment analysis on this data using unsupervised learning?
Thank you for your help.
(Based on your comment, I've concentrated on the "unsupervised" part of your question, and ignored deep learning.)
If you use something like SentiWordNet you can assign a positive or negative score to each word in a tweet, and then (as the simplest approach) sum them to get a single sentiment number for each tweet.
At this point it doesn't really matter if you are doing supervised or unsupervised learning, as either way you will have a score for each tweet, and can divide the tweets into, say, positive, neutral and negative sentiment. What supervised data (the class labels) does allow is getting an error estimate of how well the approach has classified them.
If you want an error estimate when your training data has no classes, you could evaluate some percentage of the tweets yourself. Even just doing 30 of them will start to give you an idea of where your grouping algorithm is on the scale from random to perfect, and won't take long.
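The approach above can be sketched as follows. The tiny lexicon, its scores, the neutral margin and the example tweets are all made up for illustration; they stand in for a real resource such as SentiWordNet, whose actual scores and API are not used here.

```python
# Toy word scores (hypothetical values, NOT taken from SentiWordNet).
TOY_LEXICON = {
    "good": 1.0, "great": 1.5, "love": 1.5,
    "bad": -1.0, "awful": -1.5, "hate": -1.5,
}


def tweet_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in a tweet; unknown words count as 0."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return sum(lexicon.get(w, 0.0) for w in words)


def to_label(score, margin=0.5):
    """Bucket a score into positive / neutral / negative using an assumed margin."""
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"


tweets = [
    "I love this phone, great battery!",
    "Awful service, I hate waiting.",
    "The parcel arrived on Tuesday.",
]
predicted = [to_label(tweet_score(t)) for t in tweets]

# Labels you assign yourself to a small sample, used only to estimate the error.
hand_labels = ["positive", "negative", "neutral"]
accuracy = sum(p == h for p, h in zip(predicted, hand_labels)) / len(hand_labels)

print(predicted, accuracy)
```

With a real collection you would hand-label a random sample (the 30 tweets suggested above, for example) and compute the same accuracy on it.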