What is the information form of the Kalman filter? How is it better or different from the standard form? - kalman-filter

I don't understand what is unique about information matrix or information form of Kalman filter. What is so unique or different about it from the standard form?

The information matrix is the inverse of the covariance matrix. A nonsingular covariance matrix is symmetric positive definite and therefore has a unique inverse, called the information matrix. Kalman filters can be implemented using either form.
One reason to select an information matrix implementation in preference to a covariance matrix implementation is that an information matrix initialized to zero implies no information (infinite variance) on each of the states. One cannot initialize a covariance matrix that way - some finite variance must be selected.
The information form decreases the complexity of the measurement update and increases the complexity of the time propagation (prediction) step - but the two forms are mathematically equivalent for a linear system. Some discussion on Wikipedia:
https://en.wikipedia.org/wiki/Kalman_filter#Information_filter
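To make the equivalence concrete, here is a minimal NumPy sketch of a single measurement update in both forms. The function names, the two-state example, and all numbers are made up for illustration; only the measurement update is shown, not the prediction step.

```python
import numpy as np

def covariance_update(x, P, z, H, R):
    """Standard (covariance-form) Kalman measurement update."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

def information_update(y, Y, z, H, R):
    """Information-form update, with Y = inv(P) and y = inv(P) @ x."""
    R_inv = np.linalg.inv(R)
    Y_new = Y + H.T @ R_inv @ H              # information simply adds up
    y_new = y + H.T @ R_inv @ z
    return y_new, Y_new

# Illustrative prior and a scalar measurement of the first state.
x0 = np.array([1.0, -0.5])
P0 = np.array([[2.0, 0.3], [0.3, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
z = np.array([1.4])

x1, P1 = covariance_update(x0, P0, z, H, R)

Y0 = np.linalg.inv(P0)
y1, Y1 = information_update(Y0 @ x0, Y0, z, H, R)
x1_info = np.linalg.solve(Y1, y1)            # recover the state estimate
P1_info = np.linalg.inv(Y1)                  # recover the covariance

# "No information" start: Y = 0, which has no covariance-form counterpart.
H_full = np.eye(2)
R_full = np.diag([0.5, 0.2])
z_full = np.array([1.4, -0.2])
y_z, Y_z = information_update(np.zeros(2), np.zeros((2, 2)), z_full, H_full, R_full)
x_from_zero = np.linalg.solve(Y_z, y_z)      # equals z_full here
```

In this sketch the information update is just two additions, which is the reduced measurement-processing cost mentioned above. The zero-initialized case only yields a state estimate once enough measurements have arrived to make Y invertible.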

Related

Convolutional filter applied to Fourier-transformed signal

I understand that the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms (convolution theorem). What I wonder is whether there are known cases where a convolution can be meaningfully applied to a Fourier-transformed signal (e.g. time series, or image) in the frequency domain to act as a filter, instead of the multiplication by a square matrix. Also, are there known applications of filters that increase the size of the time domain, i.e. where the matrix in the frequency domain is rectangular, and then an inverse FT is applied to get back to the time domain? In particular, I'm interested in known examples of such a method for deep learning.
As you say, the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. This is true in both directions - the convolution of two Fourier-transformed signals is equal (up to a scale factor that depends on the transform convention) to the Fourier transform of the pointwise product of the two time series.
You do have to define "convolution" suitably - for discrete Fourier transforms, the convolution is a circular convolution.
There are definitely meaningful uses for doing a pointwise block multiply in the time domain (for example, applying a data window to a signal before converting to the frequency domain, or modulating a carrier), so you can say that it is meaningful to do the convolution in the frequency domain. But it is unlikely to be efficient, compared to just doing the operation in the time domain.
Note that a LOT of effort has been spent over the years on optimizing Fourier transforms, precisely because it is more efficient to do a block multiply in the frequency domain (it is O(n)) compared to doing a convolution in the time domain (which is O(n^2)). Since the Fourier transform is O(n log(n)), the combination of forwardTransform-blockMultiply-inverseTransform is usually faster than doing a convolution. This is still true in the other direction, so if you start with frequency data, an inverseTransform-blockMultiply-forwardTransform will usually be faster than doing a convolution in the frequency domain. And, of course, usually you already have the original time data somewhere, so the block multiply in the time domain would then be even faster.
Unfortunately, I don't know of applications that increase the size of the time domain off the top of my head. And I don't know anything about deep learning, so I can't help you there.
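The two directions of the convolution theorem can be checked numerically. This is a small sketch assuming NumPy's DFT convention (no scaling on the forward transform, 1/n on the inverse), which is where the 1/n factor below comes from. The helper circular_convolve and the test signals are made up for illustration.

```python
import numpy as np

def circular_convolve(a, b):
    """Direct O(n^2) circular convolution of two equal-length sequences."""
    n = len(a)
    return np.array([sum(a[m] * b[(k - m) % n] for m in range(n))
                     for k in range(n)])

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)      # illustrative time series
w = np.hanning(n)               # a data window applied in the time domain

# Direction 1: pointwise product in time <-> circular convolution in frequency.
lhs = np.fft.fft(x * w)
rhs = circular_convolve(np.fft.fft(x), np.fft.fft(w)) / n
product_matches = np.allclose(lhs, rhs)

# Direction 2: circular convolution in time <-> pointwise product in frequency.
h = rng.standard_normal(n)
conv_direct = circular_convolve(x, h)
conv_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
convolution_matches = np.allclose(conv_direct, conv_fft)
```

The windowing case shows the efficiency point: x * w costs n multiplies, while the equivalent frequency-domain convolution costs on the order of n^2.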

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns are in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you've made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.
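The rank deficiency and the equivalence of parameterizations can be illustrated outside SPSS. The sketch below uses NumPy and an ordinary least-squares fit as a stand-in for logistic regression (an assumption made to keep it short; the same argument applies to other linear and generalized linear models). The data and variable names are invented. Note that np.linalg.lstsq returns a minimum-norm solution for the overparameterized matrix, which is a different choice from aliasing redundant parameters to 0, but the fitted values are the same.

```python
import numpy as np

levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])      # one predictor with k = 3 levels
y = np.array([1.0, 2.5, 4.0, 1.2, 2.3, 4.1, 0.9, 2.6])
k = 3
n = len(levels)

indicators = np.eye(k)[levels]                    # n x k matrix of 0-1 indicators
intercept = np.ones((n, 1))

# Overparameterized design: intercept plus all k indicators (k + 1 columns).
X_over = np.hstack([intercept, indicators])
rank_over = np.linalg.matrix_rank(X_over)         # only k columns are independent

# Two full-rank reparameterizations with different reference categories.
X_ref_last = np.hstack([intercept, indicators[:, :k - 1]])   # last level as reference
X_ref_first = np.hstack([intercept, indicators[:, 1:]])      # first level as reference

def fit(X, y):
    """Least-squares fit; returns fitted values and coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta, beta

fit_over, _ = fit(X_over, y)
fit_last, beta_last = fit(X_ref_last, y)
fit_first, beta_first = fit(X_ref_first, y)

# Coefficients differ, but one set can be derived from the other, e.g. the
# intercept with the first level as reference is the level-0 group mean:
derived_intercept_first = beta_last[0] + beta_last[1]
```

The coefficients in beta_last and beta_first look nothing alike, yet all three fits give identical predicted values, and derived_intercept_first matches beta_first[0].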

regression with given upper and lower bounds for the target value

I am using several regressors like xgboost, gradient boosting, random forest or decision tree to predict a continuous target value.
I have some complementary information: I know that my prediction (the target value), based on all the features I have, should fall within a given range.
Is there a way to take these bounds into account more effectively, for example as a feature for any of these algorithms, instead of checking the range of the predicted values afterwards and only doing some post-processing?
Note that simply supplying the lower and upper bounds for my target value does not guarantee that these algorithms will learn to produce predictions within the given range. I am looking for a more effective way to take these bounds into account as given data.
Thanks

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on Wikipedia, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution that is uniquely determined by alpha and beta, respectively. The main reason for choosing a Dirichlet distribution as the prior (e.g. P(phi|beta)) is to make the math tractable by exploiting conjugacy: the Dirichlet is the conjugate prior of the categorical distribution, which is the special case of the multinomial distribution with n set to one, i.e. only one trial. The Dirichlet prior also lets us "inject" our belief that the document-topic and topic-word distributions are concentrated on a few topics per document and a few words per topic (if we set the hyperparameters low). If you remove alpha and beta, I am not sure how it will work.
The posterior inference is replaced with inference on the joint probability. In Gibbs sampling, at least, you need the joint probability and then pick one dimension at a time to "transform the state", as in the Metropolis-Hastings paradigm. The formula you posted is essentially derived from the joint probability P(w,z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why the inference works.
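To see what the hyperparameters do in practice, here is a small sketch. The equation from the question is not shown in this thread, so this assumes the commonly used collapsed Gibbs update with symmetric priors, where the probability of assigning a token to topic k is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta). The function name and the counts are made up for illustration.

```python
import numpy as np

def topic_conditional(n_dk, n_kw, n_k, alpha, beta, vocab_size):
    """Topic probabilities for one word token in collapsed Gibbs sampling.

    n_dk: topic counts in the current document, shape (K,)
    n_kw: counts of this word type in each topic, shape (K,)
    n_k:  total token counts per topic, shape (K,)
    All counts exclude the token being resampled.
    """
    weights = (n_dk + alpha) * (n_kw + beta) / (n_k + vocab_size * beta)
    return weights / weights.sum()

# Illustrative counts for K = 3 topics.
n_dk = np.array([3.0, 0.0, 1.0])
n_kw = np.array([0.0, 2.0, 5.0])
n_k = np.array([10.0, 8.0, 20.0])

with_priors = topic_conditional(n_dk, n_kw, n_k, alpha=0.1, beta=0.01, vocab_size=50)
no_priors = topic_conditional(n_dk, n_kw, n_k, alpha=0.0, beta=0.0, vocab_size=50)
```

With alpha and beta set to zero in this sketch, any topic that has a zero count in either factor gets probability exactly zero, so the sampler could never move a token into it. The pseudo-counts keep every topic reachable.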

Conceptual confusion about numerical methods

I have a conceptual question about numerical methods. What is the difference among finite element, continuous finite element, discontinuous finite element, continuous Galerkin and discontinuous Galerkin methods? Are some of them just the same thing?
Thanks in advance
Finite element methods are a subset of numerical methods (which also include finite volume, finite difference, Monte Carlo and a lot more).
Simply put, in finite element methods, one tries to approximate the solution to a problem with a linear combination of pre-defined basis functions. These basis functions can be chosen to be either continuous or discontinuous. The resulting numerical methods are called CG/DG (continuous/discontinuous Galerkin) methods. In a DG method, the basis functions are only piecewise continuous: each basis function is zero everywhere in the domain, except in one element. See also this excellent Wikipedia article, which features some very nice figures.
Discontinuous Galerkin methods were originally popularised in the field of particle transport long ago, but they have recently gained ground in other fields as well. (This is mostly because it wasn't clear at first how discontinuous basis functions would work well in equations that involve diffusion, but that problem has been solved now.)
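As a rough illustration of "a linear combination of pre-defined basis functions", the sketch below builds a continuous approximation from piecewise-linear hat functions and a discontinuous one from piecewise-constant functions that are nonzero on a single element. The mesh, the target function sin(pi x), and the way the coefficients are chosen (nodal values and midpoint values) are assumptions for illustration only; a real Galerkin method would obtain the coefficients by solving the weak form.

```python
import numpy as np

def hat(x, nodes, i):
    """Continuous piecewise-linear basis function: 1 at nodes[i], 0 at other nodes."""
    unit = np.zeros(len(nodes))
    unit[i] = 1.0
    return np.interp(x, nodes, unit)

def element_indicator(x, nodes, e):
    """Discontinuous basis function: 1 inside element e, 0 everywhere else."""
    return ((x >= nodes[e]) & (x < nodes[e + 1])).astype(float)

nodes = np.linspace(0.0, 1.0, 5)                  # 4 elements on [0, 1]
midpoints = 0.5 * (nodes[:-1] + nodes[1:])

cg_coeffs = np.sin(np.pi * nodes)                 # one coefficient per node
dg_coeffs = np.sin(np.pi * midpoints)             # one coefficient per element

def u_cg(x):
    return sum(c * hat(x, nodes, i) for i, c in enumerate(cg_coeffs))

def u_dg(x):
    return sum(c * element_indicator(x, nodes, e) for e, c in enumerate(dg_coeffs))

# Evaluate just left and right of the interior node at x = 0.25.
probe = np.array([0.25 - 1e-9, 0.25 + 1e-9])
cg_values = u_cg(probe)     # nearly identical: continuous across the node
dg_values = u_dg(probe)     # clearly different: a jump at the element boundary
```

The jump in dg_values is the discontinuity that gives the method its name.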
Just a small pedantic correction to @A. Hennink's answer - In DG methods, the state variable is not piece-wise continuous between basis functions. There is a non-physical jump in the state variable between basis functions, hence the discontinuous part in the name. This can be visualized in the following figure displaying (dis)continuous basis functions of CG and DG:
In CG methods, the basis functions are piece-wise continuous, meaning continuous in the state variable itself, but the derivative may not be continuous (i.e. there may be a discontinuity in the derivative of the state variable between basis functions). This means that the solution to whatever problem you're solving can have non-physical kinks. Note the 'kinks' in the following solution:
Some basis functions are not always discontinuous in the derivative though (see Hermite basis functions). See how the following Hermite basis function has a derivative of zero at both the end points. If all basis functions are composed using Hermite polynomials, then the derivative is continuous between basis functions because it's zero at the boundaries. Below, Psi_1 is the derivative of Psi_0, and Psi_3 is the derivative of Psi_2:
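For a numerical check of the endpoint behaviour, here is a minimal sketch using the standard cubic Hermite basis on the reference interval [0, 1]. The names h00, h10, h01 and h11 are this sketch's own labels and need not match the Psi numbering used above.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Standard cubic Hermite basis on [0, 1]; coefficients are in increasing degree.
h00 = Polynomial([1.0, 0.0, -3.0, 2.0])    # value 1 at t=0, value 0 at t=1
h10 = Polynomial([0.0, 1.0, -2.0, 1.0])    # slope 1 at t=0, zero values at both ends
h01 = Polynomial([0.0, 0.0, 3.0, -2.0])    # value 0 at t=0, value 1 at t=1
h11 = Polynomial([0.0, 0.0, -1.0, 1.0])    # slope 1 at t=1, zero values at both ends

basis = {"h00": h00, "h10": h10, "h01": h01, "h11": h11}

# Values and first derivatives at the two end points of the element.
endpoint_values = {name: (p(0.0), p(1.0)) for name, p in basis.items()}
endpoint_slopes = {name: (p.deriv()(0.0), p.deriv()(1.0)) for name, p in basis.items()}
```

The two value functions h00 and h01 have zero derivative at both ends, as described above. The two slope functions h10 and h11 have unit derivative at one end, so in this standard basis the derivative is continuous across elements because neighbouring elements share the nodal derivative coefficient.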