Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample from the equation below? What if we dropped alpha and gamma from the equation? Would it still be possible to get a result?
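The equation in question (an image in the original post) is presumably the standard collapsed Gibbs sampling update for LDA, with the question's gamma playing the role usually written beta:

$$P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)\,\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}$$

where $n_{d,k}^{-i}$ is the number of words in document $d$ assigned to topic $k$, $n_{k,w_i}^{-i}$ is the number of times word $w_i$ is assigned to topic $k$ across the corpus, $V$ is the vocabulary size, and all counts exclude the current word $i$.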
Second: In LDA, we randomly assign a topic to every word in the document, and then we try to optimize the topic assignments by observing the data. Which part of the equation above corresponds to posterior inference?

If you look at the inference derivation on the Wikipedia article, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution uniquely determined by those hyperparameters. The reason for choosing the Dirichlet distribution as the prior (e.g. P(phi|beta)) is mainly to keep the math tractable by exploiting conjugacy: the Dirichlet is the conjugate prior of the categorical distribution (a special case of the multinomial distribution with n = 1, i.e. a single trial). The Dirichlet prior also lets us "inject" the belief that each document concentrates on a few topics and each topic concentrates on a few words (if we set the hyperparameters to small values). If you remove alpha and beta, I am not sure how it would work.
Posterior inference is replaced with inference based on the joint probability: at least in Gibbs sampling, you work with the joint probability while picking one dimension at a time to "transition the state", as in the Metropolis–Hastings paradigm. The formula you posted is essentially derived from the joint probability P(w, z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why the inference works.
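To make this concrete, here is a minimal numpy sketch of one collapsed Gibbs sweep (the setup and variable names are illustrative, not taken from the question); it shows where alpha and beta enter the sampling probabilities and why dropping them is problematic:

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, n_topics, vocab_size):
    """One sweep of collapsed Gibbs sampling for plain LDA (illustrative sketch).

    docs[d] is a list of word ids, z[d][i] is the current topic of word i in
    document d, and n_dk, n_kw, n_k are the usual count matrices/vector.
    """
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # Remove the current word's assignment from the counts.
            n_dk[d, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1

            # The smoothed conditional. alpha and beta keep every topic
            # reachable even when a count is zero; with alpha = beta = 0 a
            # topic whose counts hit zero gets probability 0 forever (and the
            # denominator can vanish), so the sampler can get stuck or break.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
            p /= p.sum()

            # Resample the topic and restore the counts.
            k_new = rng.choice(n_topics, p=p)
            z[d][i] = k_new
            n_dk[d, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```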

Doesn't introduction of polynomial features lead to increased collinearity?

I was going through linear and logistic regression in ISLR, and in both cases one of the approaches used to increase the flexibility of the model was polynomial features: using both X and X^2 as features and then applying the regression models as usual, treating X and X^2 as independent features (as in sklearn, not the polynomial fit of statsmodels). Doesn't that increase the collinearity amongst the features, though? How does it affect model performance?
To summarize my thoughts regarding this -
First, X and X^2 have substantial correlation no doubt.
Second, I wrote a blog demonstrating that, at least in Linear regression, collinearity amongst features does not affect the model fit score though it makes the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multicollinearity isn't always a hindrance; it varies from dataset to dataset. If your model isn't giving you the best results (high accuracy or low loss), you remove the outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't bother about them.
The same goes for polynomial regression: yes, it adds multicollinearity by introducing x^2, x^3, ... features into your model.
To overcome that, you can use orthogonal polynomial regression, which uses polynomial terms that are orthogonal to each other.
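To make both points concrete, here is a small numpy sketch (with made-up data, not from the question) showing how strongly x and x^2 are correlated and how orthogonalising the polynomial columns removes that correlation without changing the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)

# Raw polynomial features x and x^2 are strongly correlated.
X_raw = np.column_stack([x, x**2])
print(np.corrcoef(X_raw, rowvar=False)[0, 1])   # typically around 0.97

# One way to get orthogonal polynomial features: QR-decompose the Vandermonde
# matrix [1, x, x^2]. The new columns span the same space (so OLS fitted
# values are unchanged) but are uncorrelated with each other.
V = np.column_stack([np.ones_like(x), x, x**2])
Q, _ = np.linalg.qr(V)
X_orth = Q[:, 1:]                               # drop the constant column
print(np.corrcoef(X_orth, rowvar=False)[0, 1])  # essentially 0
```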
But it will still introduce higher degree polynomials which can become unstable at the boundaries of your data space.
To overcome this issue, you can use regression splines, which divide the range of the data into separate portions and fit linear or low-degree polynomial functions on each portion. The points where the divisions occur are called knots, and the functions used to model each piece/bin are known as piecewise functions. Constraints are imposed at the knots: for example, if cubic pieces are fitted, the overall function is required to be twice continuously differentiable.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a spline.
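Here is a short sketch of a regression spline fit, assuming scikit-learn >= 1.0 (which provides SplineTransformer); the data and knot count are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # made-up nonlinear data

# Cubic regression spline: piecewise cubic polynomials joined smoothly at the
# knots (continuous first and second derivatives), fitted by OLS on the
# spline basis.
model = make_pipeline(
    SplineTransformer(n_knots=6, degree=3),
    LinearRegression(),
)
model.fit(x.reshape(-1, 1), y)
print(model.predict(np.array([[2.5], [7.5]])))
```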

Using OLS regression on binary outcome variable

I have previously been told that -- for reasons that make complete sense -- one shouldn't run OLS regressions when the outcome variable is binary (i.e. yes/no, true/false, win/loss, etc). However, I often read papers in economics/other social sciences in which researchers run OLS regressions on binary variables and interpret the coefficients just like they would for a continuous outcome variable. A few questions about this:
Why do they not run a logistic regression? Is there any disadvantage or limitation to using logit models? In economics, for example, I very often see papers using OLS regression for a binary variable rather than logit. Can logit only be used in certain situations?
In general, when can one run an OLS regression on ordinal data? If I have a variable that captures "number of times in a week survey respondent does X", can I - in any circumstance - use it as a dependent variable in a linear regression? I often see this being done in literature as well, even though we're always told in introductory statistics/econometrics that outcome variables in an OLS regression should be continuous.
Applying OLS to a binary outcome is known as the linear probability model (LPM). Compared to a logistic model, the LPM has advantages in terms of implementation and interpretation that make it an appealing option for researchers conducting impact analysis. In the LPM, the parameters directly represent mean marginal effects, whereas in logistic regression the parameters represent log odds ratios; to obtain the mean marginal effects from a logistic regression, we need to calculate the derivative for every data point and then average those derivatives. Since logistic regression and the LPM usually yield the same expected average impact estimate [1], researchers often prefer the LPM for estimating treatment impacts.
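A quick illustration of this equivalence, using simulated data and statsmodels (the data-generating model below is made up purely for illustration): the LPM slope and the logistic regression's average marginal effect come out very close.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p = 1 / (1 + np.exp(-(0.5 + 0.8 * x)))   # made-up true model
y = rng.binomial(1, p)
X = sm.add_constant(x)

# Linear probability model: the slope is directly the mean marginal effect.
lpm = sm.OLS(y, X).fit()
print("LPM slope:", lpm.params[1])

# Logistic regression: coefficients are log-odds ratios; average the
# per-observation derivatives dP/dx to get the mean marginal effect.
logit = sm.Logit(y, X).fit(disp=0)
print(logit.get_margeff(at="overall").summary_frame())
```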
In general, yes, we can definitely apply OLS to an ordinal outcome. As in the binary case, applying OLS to a binary or ordinal outcome results in violations of the OLS assumptions. However, within econometrics, many believe the practical effect of violating these assumptions is minor and that the simplicity of interpreting OLS outweighs the technical correctness of an ordered logit or probit model, especially when the ordinal outcome looks quasi-normal.
Reference:
[1] Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials. Mathematica Policy Research.

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns are in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you'd made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.
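As a sketch of that last point (in Python/statsmodels rather than SPSS, with simulated data): refitting the same logistic regression with two different reference categories gives identical fitted probabilities, and any comparison between two non-reference levels can be recovered from either fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
group = rng.choice(["A", "B", "C"], size=600)
p = np.where(group == "A", 0.2, np.where(group == "B", 0.5, 0.7))
y = rng.binomial(1, p)

def fit_with_reference(ref):
    # Full-rank parameterization: k-1 indicators, with `ref` as the omitted category.
    levels = [g for g in ["A", "B", "C"] if g != ref]
    X = sm.add_constant(np.column_stack([(group == g).astype(float) for g in levels]))
    return sm.Logit(y, X).fit(disp=0), levels

fit_A, levels_A = fit_with_reference("A")   # reference category A
fit_C, levels_C = fit_with_reference("C")   # reference category C

# Fitted probabilities do not depend on the choice of reference category.
print(np.max(np.abs(fit_A.predict() - fit_C.predict())))   # ~0

# B vs C comparison: coef(B) - coef(C) under reference A equals coef(B) under
# reference C, so any contrast can be derived from either parameterization.
b_vs_c_from_A = fit_A.params[1 + levels_A.index("B")] - fit_A.params[1 + levels_A.index("C")]
b_vs_c_from_C = fit_C.params[1 + levels_C.index("B")]
print(b_vs_c_from_A, b_vs_c_from_C)
```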

What does it mean to normalize based on mean and standard deviation of images in the imagenet training dataset?

In implementation of densenet model as in CheXNet paper, in section 3.1 it is mentioned that:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard deviation of images in the ImageNet training set.
Why would we want to normalize new set of images with mean and std of different dataset?
How do we get the mean and std of ImageNet dataset? Is it provided somewhere?
Subtracting the mean centers the input at 0, and dividing by the standard deviation expresses each scaled feature value as the number of standard deviations it lies from the mean.
Consider how a neural network learns its weights. (C)NNs learn by continually adding gradient error vectors (multiplied by a learning rate), computed via backpropagation, to their various weight matrices as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of the feature value distributions would likely differ from feature to feature, and so the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating a correction in one weight dimension while undercompensating in another.
This is non-ideal: we might find ourselves in an oscillating state (unable to settle into a better minimum of the cost surface over the weights) or in a slow-moving state (traveling too slowly to reach a better minimum).
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).
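For reference, the per-channel statistics commonly used for this (the values distributed with torchvision's ImageNet-pretrained models, for RGB images scaled to [0, 1]) can be applied like this; the exact transform pipeline below is a typical example, not necessarily the one from the CheXNet paper:

```python
import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # per-channel mean of the ImageNet training set
IMAGENET_STD = [0.229, 0.224, 0.225]    # per-channel standard deviation

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),                  # downscale/crop to 224x224
    T.ToTensor(),                       # HWC uint8 image -> CHW float tensor in [0, 1]
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```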

Generative model and inference

I was looking at the hLDA model here:
https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
I have questions about how the generative model works. What is the output of the generative model, and how is it used in the inference (Gibbs sampling) stage? I am getting mixed up between the generative model and the inference part and am not able to distinguish between them.
I am new to this area, and any reference articles or papers that could help clear up the concept would be much appreciated.
To get a handle on how this type of Bayesian model works, I'd recommend David Blei's original 2003 LDA paper (google scholar "Latent Dirichlet Allocation" and it'll appear near the top). They used variational inference (as opposed to Gibbs sampling) to estimate the "posterior" (which you could call the "best fit solution"), but the principles behind using a generative model are well explained.
In a nutshell, Bayesian topic models work like this: you presume that your data is created by some "generative model". This model describes a probabilistic process for generating data, and has a few unspecified "latent" variables. In a topic model, these variables are the "topics" that you're trying to find. The idea is to find the most probable values for the "topics" given the data at hand.
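For the simpler, flat LDA model from that paper, the generative side can be sketched in a few lines of numpy (the sizes and hyperparameters below are made up for illustration); inference then runs in the opposite direction, recovering plausible values of the latent variables from the observed documents alone:

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, n_topics, vocab_size, doc_len = 5, 3, 20, 50
alpha, beta = 0.1, 0.01   # small values -> sparse mixtures

# Latent variables: topic-word distributions and per-document topic mixtures.
phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)   # topics
theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)    # doc mixtures

docs = []
for d in range(n_docs):
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta[d])   # pick a topic for this word
        w = rng.choice(vocab_size, p=phi[z])   # pick a word from that topic
        words.append(w)
    docs.append(words)

# Inference (Gibbs sampling or variational inference) goes the other way:
# given only `docs`, estimate the posterior over phi, theta and the z's.
```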
In Bayesian inference these most probable values for latent variables are known as the "posterior". Strictly speaking, the posterior is actually a probability distribution over possible values for the latent variables, but a common approach is to use the most probable set of values, called "maximum a posteriori" or MAP estimation.
Note that for topic models, what you get is an estimate for the true MAP values. Many of the latent values, perhaps especially those close to zero, are essentially noise, and cannot be taken seriously (except for being close to zero). It's the larger values that are more meaningful.