I was going through linear and logistic regression in ISLR, and in both cases I found that one of the approaches adopted to increase the flexibility of the model was to use polynomial features: including both X and X^2 as features and then applying the regression models as usual, treating X and X^2 as independent features (as in sklearn, not the polynomial fit of statsmodels). Doesn't that increase the collinearity amongst the features, though? How does it affect the model performance?
To summarize my thoughts regarding this -
First, X and X^2 have substantial correlation no doubt.
Second, I wrote a blog demonstrating that, at least in Linear regression, collinearity amongst features does not affect the model fit score though it makes the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multicollinearity isn't always a hindrance; it depends on the data. If your model isn't giving you the best results (high accuracy or low loss), you remove the outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't bother about them.
The same goes for polynomial regression. Yes, it adds multicollinearity by introducing x^2, x^3, ... as features in your model.
To overcome that, you can use orthogonal polynomial regression which introduces polynomials that are orthogonal to each other.
But it will still introduce higher-degree polynomials, which can become unstable at the boundaries of your data space.
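As an illustration, here is a minimal numpy sketch of what orthogonalization buys you (it mimics what R's poly() does via a QR decomposition; the variable names and data are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)

# Raw polynomial features are highly correlated
raw = np.column_stack([x, x**2, x**3])
print(np.corrcoef(raw, rowvar=False).round(2))    # off-diagonals near 1

# Orthogonalize via QR of the Vandermonde matrix (roughly R's poly())
V = np.column_stack([x**p for p in range(4)])     # includes the constant column
Q, _ = np.linalg.qr(V)
ortho = Q[:, 1:]                                  # drop the constant column
print(np.corrcoef(ortho, rowvar=False).round(2))  # identity: zero correlation
```

The fitted values of a regression are identical with either feature set; only the coefficients and their variances change.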
To overcome this issue, you can use regression splines, which divide the range of the data into separate portions and fit linear or low-degree polynomial functions on each of these portions. The points where the divisions occur are called knots, and the functions used to model each piece/bin are known as piecewise functions. These functions come with smoothness constraints: for example, if the pieces are cubic (degree-3) polynomials, the fitted function is required to be twice continuously differentiable at the knots.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a Spline.
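For example, a cubic regression spline with a couple of interior knots can be fit by least squares with scipy (a sketch; the knot locations and data here are arbitrary):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(0, 0.3, 300)

knots = [3.3, 6.6]                              # interior knots (arbitrary)
spline = LSQUnivariateSpline(x, y, knots, k=3)  # cubic pieces, C^2 at the knots
print(spline(np.array([2.0, 5.0, 8.0])))        # evaluate the fitted spline
```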
I am doing actor-critic reinforcement learning for an environment that is best represented as a "bag of words". For this reason, I have opted to use a single-body, multi-head approach for the net architecture. I use a linear pre-processing layer to generate n word embeddings of dimension d, then run the (batch, n, d) words through a stack of (2) nn.TransformerEncoder layers for the body; each head is another encoder layer followed by a linear logit layer.
Since this is RL and I have limited compute, it is also difficult to evaluate training as it's happening. I decided to try looking at the mean cosine similarity of the latent words after the encoder body. My intuition tells me that if the net is learning a proper latent representation of the environment, then dissimilar words should have low cosine similarity.
However, even though the net is clearly improving somewhat, the mean cosine similarity remains very high, > 0.99.
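For reference, this is roughly how I compute the diagnostic (a sketch; the tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(latent: torch.Tensor) -> torch.Tensor:
    """latent: (batch, n, d) output of the encoder body."""
    z = F.normalize(latent, dim=-1)    # unit-normalize each latent word
    sim = z @ z.transpose(1, 2)        # (batch, n, n) pairwise cosine sims
    n = sim.size(-1)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    return sim[:, off_diag].mean()     # average over distinct word pairs
```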
Thinking about it more, I don't think there's any reason to believe my first intuition, especially since I am not even normalizing the words after the encoder body stack. But even if I did normalize, I'm not sure that would encourage lower cosine similarity, as I am using 256 dimensions per word. All normalizing does is reduce the dimension of the output space by 1, which should hardly matter here.
Does this make sense? Also, any general advice about my net is welcome.
I have previously been told that -- for reasons that make complete sense -- one shouldn't run OLS regressions when the outcome variable is binary (i.e. yes/no, true/false, win/loss, etc). However, I often read papers in economics/other social sciences in which researchers run OLS regressions on binary variables and interpret the coefficients just like they would for a continuous outcome variable. A few questions about this:
Why do they not run a logistic regression? Is there any disadvantage/limitation to using logit models? In economics, for example, I very often see papers using OLS regression for binary variables rather than logit. Can logit only be used in certain situations?
In general, when can one run an OLS regression on ordinal data? If I have a variable that captures "number of times in a week survey respondent does X", can I - in any circumstance - use it as a dependent variable in a linear regression? I often see this being done in literature as well, even though we're always told in introductory statistics/econometrics that outcome variables in an OLS regression should be continuous.
Applying OLS to a binary outcome is known as the linear probability model (LPM). Compared to a logistic model, the LPM has advantages in implementation and interpretation that make it an appealing option for researchers conducting impact analysis. In the LPM, the parameters directly represent mean marginal effects, while in logistic regression the parameters represent log odds ratios. To calculate the mean marginal effects in logistic regression, we need to calculate the derivative of the predicted probability for every data point and then take the mean of those derivatives. Since logistic regression and the LPM usually yield the same expected average impact estimate [1], researchers often prefer the LPM for estimating treatment impacts.
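To make the marginal-effects comparison concrete, here is a small simulated sketch using statsmodels (the data-generating numbers are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
treat = rng.integers(0, 2, n)              # binary treatment indicator
y = rng.binomial(1, 0.3 + 0.15 * treat)    # true mean marginal effect = 0.15
X = sm.add_constant(treat.astype(float))

# LPM: the slope is directly the mean marginal effect
lpm = sm.OLS(y, X).fit(cov_type="HC1")     # robust SEs are standard for LPM
print(lpm.params[1])

# Logistic regression: average the per-observation derivatives to get the AME
logit = sm.Logit(y, X).fit(disp=0)
print(logit.get_margeff(at="overall").summary())
```

Both estimates should land close to 0.15 here, which illustrates why the two approaches usually agree on the average impact.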
In general, yes, we can definitely apply OLS to an ordinal outcome. As in the binary case, applying OLS to a binary or ordinal outcome results in violations of the assumptions of OLS. However, the prevailing view within econometrics is that the practical effect of violating these assumptions is minor, and that the simplicity of interpreting OLS estimates outweighs the technical correctness of an ordered logit or probit model, especially when the ordinal outcome looks quasi-normal.
Reference:
[1] Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials. Mathematica Policy Research.
I understand that the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms (the convolution theorem). What I wonder is: are there known cases where a convolution can be meaningfully applied to a Fourier-transformed signal (e.g. a time series, or an image) in the frequency domain to act as a filter, instead of multiplication by a square matrix? Also, are there known applications of filters that increase the size of the time domain, i.e. where the matrix in the frequency domain is rectangular, after which an inverse FT is applied to go back to the time domain? In particular, I'm interested in known examples of such methods in deep learning.
As you say, the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. This is true in both directions: the convolution of two Fourier-transformed signals equals the Fourier transform of the pointwise product of the two time series.
You do have to define "convolution" suitably - for discrete Fourier transforms, the convolution is a circular convolution.
There are definitely meaningful uses for doing a pointwise block multiply in the time domain (for example, applying a data window to a signal before converting to the frequency domain, or modulating a carrier), so you can say that it is meaningful to do the convolution in the frequency domain. But it is unlikely to be efficient, compared to just doing the operation in the time domain.
Note that a LOT of effort has been spent over the years on optimizing Fourier transforms, precisely because it is more efficient to do a block multiply in the frequency domain (which is O(n)) than to do a convolution in the time domain (which is O(n^2)). Since the Fourier transform is O(n log n), the combination forwardTransform-blockMultiply-inverseTransform is usually faster than doing a convolution. This is still true in the other direction, so if you start with frequency data, an inverseTransform-blockMultiply-forwardTransform will usually be faster than doing a convolution in the frequency domain. And, of course, usually you already have the original time data somewhere, so the block multiply in the time domain would then be even faster.
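To make the transform-multiply-invert pattern concrete, here is a minimal numpy sketch (using circular convolution, per the discrete-transform caveat above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)    # signal
h = rng.standard_normal(n)    # filter

# FFT route: forward transform, pointwise (block) multiply, inverse transform
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Direct circular convolution in the time domain, O(n^2)
direct = np.array([sum(x[k] * h[(i - k) % n] for k in range(n))
                   for i in range(n)])

print(np.allclose(via_fft, direct))  # True, up to floating-point error
```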
Unfortunately, I don't know of applications that increase the size of the time domain off the top of my head. And I don't know anything about deep learning, so I can't help you there.
I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on Wiki, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution uniquely determined by those hyperparameters. The main reason for choosing the Dirichlet as the prior distribution (e.g. P(phi|beta)) is to make the math tractable by exploiting the nice form of the conjugate prior: here the Dirichlet is conjugate to the categorical distribution, which is a special case of the multinomial distribution with n set to one, i.e. only one trial. The Dirichlet also lets us "inject" our belief that the doc-topic and topic-word distributions are each concentrated on a few topics and words (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it would work.
Posterior inference is replaced with inference via the joint probability: in Gibbs sampling, at least, you need the joint probability while picking one dimension at a time to "transform the state", as in the Metropolis-Hastings paradigm. The formula you posted is essentially derived from the joint probability P(w,z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why this inference works.
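For reference, the sampling equation the question appears to refer to is presumably the standard collapsed Gibbs update for LDA (with V the vocabulary size and the counts n excluding the current assignment i):

$$P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right)\frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}$$

The alpha and beta terms appear because theta and phi have been integrated out under their Dirichlet priors, which is why they cannot simply be dropped from the update.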
I am working on a CFD project and I am using the new CUDA 5 library "cusparse" to solve a system of linear equations. I tested the sample code "conjugateGradientPrecond". The results show that preconditioned conjugate gradient using ILU took more time to reach the final answer than conjugate gradient without preconditioning. The former algorithm does need fewer iterations, but it spends too much time in "cusparseScsrsv_solve", so the overall time is longer.
Here is my question: is there any other preconditioner for conjugate gradient that can greatly decrease the iteration count while avoiding time-consuming functions like "cusparseScsrsv_solve"?
Preconditioning techniques such as ILU0/ILUT and IC0/ICT require solving a triangular system twice at each iteration of CG (once each for the lower and upper factors of the preconditioning matrix). By nature, solving triangular systems is a sequential problem, but for sparse matrices an analysis phase can be performed to find some degree of parallelism (refer to this post). In general, for sparse systems, no single preconditioning technique can be called the best, but simple diagonal (a.k.a. Jacobi) preconditioning imposes negligible overhead and offers a high level of parallelism for GPU implementation.
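For illustration, here is a minimal numpy sketch of Jacobi-preconditioned CG; a CUDA version would map each operation to a kernel or cuSPARSE/cuBLAS call, and the key point is that applying this preconditioner is a plain elementwise product, with no triangular solve:

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=1000):
    """CG with Jacobi (diagonal) preconditioning; A must be symmetric
    positive definite. Applying M^-1 is an elementwise product, which
    parallelizes trivially on a GPU."""
    Minv = 1.0 / A.diagonal()           # the entire preconditioner
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    z = Minv * r                        # preconditioner apply: O(n)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Illustrative usage on a random SPD system
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
A = B @ B.T + 100 * np.eye(100)         # well-conditioned SPD matrix
b = rng.standard_normal(100)
print(np.linalg.norm(A @ jacobi_pcg(A, b) - b))  # ~0
```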