How to design conditional score-based, diffusion or normalizing flow models? - deep-learning

Loosely speaking, I would write an unconditional model as y = f(z), with y in R^n (or R^(n,m) or R^(n,m,d)) and z in R^q, where z is drawn from some probability distribution. For Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) I was able to find examples of how to design a conditional variant. The conditional model could then be written as y = f(u,z), with y and z as above and u in R^p. Note that I am interested in continuous conditions. But how can such a conditional model be designed for diffusion, normalizing-flow or score-based models? Take diffusion models as an example: would we gradually add Gaussian noise to the condition u at each step as well, and in this way "forget" the condition more and more? Or would we keep the full condition?

Related

Can 1D CNNs infer a feature from two other included features?

I'm using a 1D CNN on temporal data. Let's say that I have two features A and B. The ratio between A and B (i.e. A/B) is important - let's call this feature C. I'm wondering if I need to explicitly calculate and include feature C, or can the CNN theoretically infer feature C from the given features A and B?
I understand that in deep learning, it's best to exclude highly-correlated features (such as feature C), but I don't understand why.
The short answer is no. A standard layer such as Conv or Dense only performs a matrix multiplication (plus a bias), so it will not capture this A/B relationship on its own.
To simplify the discussion, let us assume that your input feature is two-dimensional, where the first dimension is A and the second is B. Applying a Conv layer to this feature simply learns a weight matrix w and bias b
y = w * [f_A, f_B] + b = w_A * f_A + w_B * f_B + b
As you can see, a layer of this form is affine in f_A and f_B, so by itself it cannot reproduce the ratio operation between A and B.
You don't have to use the feature C in the same way as feature A and B. Instead, it may be a better idea to keep feature C as an individual input, because its dynamic range may be very different from those of A and B. This means that you can have a multiple-input network, where each input has its own feature extraction layers and the resulting features from both inputs can be concatenated together to predict your target.
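To make the point about a single Conv/Dense output concrete, here is a minimal sketch in plain Python. The weights, the two test points and the eps guard are arbitrary choices of mine, not anything from the question. It relies on one property: for any affine function, the average of the values at two points equals the value at their midpoint, which is not true of A/B. The last helper precomputes C so that it can be fed in as its own input.

```python
def affine(f_a, f_b, w_a=0.3, w_b=-1.2, b=0.5):
    """One output of a Conv/Dense layer before any activation (weights are arbitrary)."""
    return w_a * f_a + w_b * f_b + b


def ratio(f_a, f_b):
    """The target feature C = A / B."""
    return f_a / f_b


def midpoint_gap(fn, p=(1.0, 1.0), q=(1.0, 3.0)):
    """Average of fn at p and q minus fn at their midpoint; zero for any affine fn."""
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return (fn(*p) + fn(*q)) / 2 - fn(*mid)


def ratio_channel(a_series, b_series, eps=1e-8):
    """Precompute C per time step; values of b with |b| <= eps are replaced by eps."""
    return [a / (b if abs(b) > eps else eps) for a, b in zip(a_series, b_series)]


print(midpoint_gap(affine))  # approximately 0
print(midpoint_gap(ratio))   # clearly non-zero
print(ratio_channel([2.0, 9.0], [4.0, 3.0]))
```

The output of ratio_channel can then be scaled separately from A and B before it goes into its own input branch.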

how to predict query topics using word-topic matrix?

I'm implementing LDA in Java. I know how the algorithm works. At the end of training (after the given number of iterations) I get 2 matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (query) I want to use these matrices (or any other way) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs Sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters of the documents you used to learn the model constant.
This is specified in equations 84 and 85 of Parameter Estimation for Text Analysis.
I guess there has to be a similar approach in VI LDA.
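A rough sketch of that fold-in idea, in Python rather than Java to keep it short. It simplifies the approach above in one way that should be treated as an assumption: instead of adding the query's counts to the training topic-word counters, it uses an already normalized, frozen topic-word matrix, which is close to the same thing when the training counts are much larger than the query. The function name fold_in and the parameters alpha, n_iter and seed are made up for this example. It also assumes every query word has non-zero probability under at least one topic (i.e. the matrix is smoothed or out-of-vocabulary words were dropped).

```python
import random


def fold_in(doc, topic_word, alpha=0.1, n_iter=50, seed=0):
    """Estimate the topic mixture of one unseen document by Gibbs sampling.

    doc        : list of word ids
    topic_word : topic_word[k][w] = p(word w | topic k), frozen from training
    """
    rng = random.Random(seed)
    n_topics = len(topic_word)
    # Random initial topic for every token, plus per-topic counts for this document only.
    z = [rng.randrange(n_topics) for _ in doc]
    counts = [0] * n_topics
    for k in z:
        counts[k] += 1

    for _ in range(n_iter):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1  # remove the token's current assignment
            weights = [topic_word[k][w] * (counts[k] + alpha) for k in range(n_topics)]
            z[i] = rng.choices(range(n_topics), weights=weights)[0]
            counts[z[i]] += 1  # only the query's own counters change

    # Smoothed, normalized document-topic vector for the query.
    total = len(doc) + n_topics * alpha
    return [(counts[k] + alpha) / total for k in range(n_topics)]


# Toy model: topic 0 only emits words 0 and 1, topic 1 only words 2 and 3.
toy_topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
print(fold_in([0, 1, 0, 1, 0, 0, 1, 1, 0, 1], toy_topic_word))
```

In practice you would average over several samples after a burn-in period instead of using only the final state.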

What is the difference between RSE and MSE?

I am going through Introduction to Statistical Learning in R by Hastie and Tibshirani. I came across two concepts: RSE and MSE. My understanding is like this:
RSE = sqrt(RSS/(N-2))
MSE = RSS/N
Now I am building 3 models for a problem and need to compare them. While MSE comes intuitively to me, I was also wondering whether calculating RSS/(N-2), which according to the above is RSE^2, would be of any use.
I am not sure which to use where.
RSE is an estimate of the standard deviation of the residuals, and therefore also of the observations around the fitted line. That is why it equals sqrt(RSS/df), where df is the residual degrees of freedom. In your case, a simple linear model estimates 2 parameters, so df = N-2.
MSE is mean squared error observed in your models, and it's usually calculated using a test set to compare the predictive accuracy of your fitted models. Since we're concerned with the mean error of the model, we divide by n.
I think RSE ⊂ MSE (i.e. RSE is part of MSE).
In general, MSE = RSS / degrees of freedom.
MSE for a single set of data (X1, X2, ..., Xn) would be RSS/N,
or more accurately RSS/(N-1)
(since one degree of freedom is used up by estimating the mean).
But in a linear regression of Y on X with two terms (intercept and slope), two parameters are estimated, so the degrees of freedom are N-2,
thus your MSE = RSS/(N-2),
and its square root is what is called the RSE.
And in a model with more parameters, meaning one has a collection of many βs (more than 2 terms, e.g. y ~ β0 + β1*X1 + β2*X2 + ...), one can penalize the model further by subtracting the number of parameters from the denominator:
MSE = RSS/(N-p) (p is the number of fitted parameters)
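As a quick numeric check of how the quantities relate, here is a small sketch. The observations and fitted values are made up, and n_params = 2 assumes a simple linear model with an intercept and one slope.

```python
import math


def rss(y, y_hat):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))


def mse(y, y_hat):
    """Mean squared error: RSS divided by the number of observations."""
    return rss(y, y_hat) / len(y)


def rse(y, y_hat, n_params=2):
    """Residual standard error: sqrt(RSS / (N - p)); p = 2 for intercept + one slope."""
    return math.sqrt(rss(y, y_hat) / (len(y) - n_params))


y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.0, 2.0, 3.0, 4.0, 7.0]  # made-up fitted values
print(rss(y, y_hat), mse(y, y_hat), rse(y, y_hat))
```

For comparing models on held-out data, mse on the test set is the usual choice; rse is on the scale of the response and describes the fit on the training data.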

Can I use autoencoder for clustering?

In the code linked below, an autoencoder is used for supervised clustering or classification, because the data has labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have its labels?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
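library(h2o)  # assumes h2o.init() has been called and tfidf is an H2OFrame
# Train an autoencoder on columns 2 to 564, squeezing them through two hidden nodes
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
# One row per input row, one column per hidden node of layer 1
f <- h2o.deepfeatures(m, tfidf, layer = 1)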
The h2o.deepfeatures() call extracts the hidden-node activations (the deep features) for each row. f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize: suppose you have a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions and plot the result on a scatter plot, this clustering becomes clearer. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method is not restricted to two binary classes; you could train on as many different classes as you wish. Two polarized classes are just easier to visualize.
This method is also not limited to two output dimensions; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain high-dimensional spaces to such a small space.
In cases where the encoded (clustered) layer has more dimensions, it is not as easy to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
There are a couple of ways to determine which class an encoded vector belongs to. One is to feed the encoded vectors into a k-nearest-neighbours (kNN) algorithm. The other, which I prefer, is to pass them to a standard feed-forward neural network trained with backpropagation. Note that, depending on your data, you may find that feeding the raw data straight into the backpropagation network is sufficient.
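For the unlabeled case asked about in the question, one option (my suggestion, not something either answer spells out) is to run an ordinary clustering algorithm such as k-means on the encoded vectors. Below is a minimal plain-Python sketch; the encoded points are made up and stand in for the rows of an encoder's output, and the evenly spaced initialization is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, k, n_iter=20):
    """Plain k-means on a list of equal-length tuples; returns (labels, centroids)."""
    # Deterministic init: take k evenly spaced rows as starting centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up 2-D encodings: two visibly separated groups.
encoded = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(encoded, k=2)
print(labels)
```

Whether the resulting clusters are meaningful depends on how well the autoencoder separated the data in the first place.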

Applying a Kalman filter on a leg follower robot

I was asked to create a leg follower robot (I already did it), and in the second part of this assignment I have to develop a Kalman filter in order to improve the robot's following behaviour. The robot measures the distance from itself to the person and also the angle (a relative angle, because the reference is the robot itself, not absolute x-y coordinates).
I have a serious doubt about this assignment. Everything I have read and every example I have seen of the Kalman filter has been in one dimension (the distance travelled by a car, or a rock falling from a building), and according to the task I would have to apply it in 2 dimensions. Is it possible to apply a Kalman filter like this?
If it is possible to compute a Kalman filter in 2 dimensions, then I understand that what I am asked to do is to follow the legs in a linearized way, even though a person walks irregularly (with random movements). My doubt here is how to set up the state matrix. Could anyone please tell me how to do it, or where I can find more information about this?
thanks.
Well you should read up on Kalman Filter. Basically what it does is estimate a state through its mean and variance separately. The state can be whatever you want. You can have local coordinates in your state but also global coordinates.
Note that the latter will certainly result in nonlinear system dynamics, in which case you could use the Extended Kalman Filter, or to be more correct the continuous-discrete Kalman Filter, where you treat the system dynamics in a continuous manner and the measurements in discrete time.
Example with global coordinates:
Assume you have a small cubic mass which can drive forward with velocity v. You could simply model the dynamics in local coordinates only, where your state s would be s = [v], which is a linear model.
But you could also incorporate the global coordinates x and y, assuming we are moving on a plane only. Then you would have s = [x, y, phi, v]'. We need phi to keep track of the current orientation, since the cube can only move forward with respect to its orientation. Let's define phi as the angle between the cube's forward direction and the x-axis. In other words: with phi=0 the cube would move along the x-axis, with phi=90° it would move along the y-axis.
The nonlinear system dynamics with global coordinates can then be written as
s_dot = [x_dot, y_dot, phi_dot, v_dot]'
with
x_dot = cos(phi) * v
y_dot = sin(phi) * v
phi_dot = ...
v_dot = ... (Newton's Law)
In the EKF (Extended Kalman Filter) prediction step you would use the (discretized) equations above to predict the mean of the state, and the linearized (and discretized) equations to predict the variance.
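A minimal sketch of such a prediction step for s = [x, y, phi, v], in plain Python. Since the equations for phi_dot and v_dot are left open above, this sketch assumes both are zero (constant heading and speed between steps) and uses a forward-Euler discretization with step dt. Those choices, as well as the helper names, are mine.

```python
import math


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def ekf_predict(s, P, Q, dt):
    """One EKF prediction step for s = [x, y, phi, v] with forward-Euler discretization.

    Assumption: phi_dot = 0 and v_dot = 0 (constant heading and speed) between steps.
    """
    x, y, phi, v = s
    # Mean: push the state through the nonlinear model.
    s_pred = [x + dt * v * math.cos(phi), y + dt * v * math.sin(phi), phi, v]
    # Jacobian of the discretized model with respect to the state, evaluated at s.
    F = [
        [1.0, 0.0, -dt * v * math.sin(phi), dt * math.cos(phi)],
        [0.0, 1.0, dt * v * math.cos(phi), dt * math.sin(phi)],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    # Covariance: propagate through the linearized model and add process noise Q.
    FPFt = matmul(matmul(F, P), transpose(F))
    P_pred = [[FPFt[i][j] + Q[i][j] for j in range(4)] for i in range(4)]
    return s_pred, P_pred


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
print(ekf_predict([0.0, 0.0, 0.0, 1.0], identity, zeros, dt=1.0))
```

Note how the uncertainty in phi and v leaks into the position entries of the predicted covariance through the off-diagonal terms of F.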
There are two things to keep in mind when you decide what your state vector s should look like:
1. You might be tempted to use my linear example s = [v] and then integrate the velocity outside of the Kalman Filter in order to obtain the global coordinate estimates. This would work, but you would lose the awesomeness of the Kalman Filter since you would only integrate the mean of the state, not its variance. In other words, you would have no idea what the current uncertainties for your global coordinates are.
2. The second step of the Kalman Filter, the measurement or correction update, requires that you can describe your sensor output as a function of your states. So you may have to add states to your representation just so that you can express your measurements correctly as z[k] = h(s[k], w[k]), where z are the measurements and w is a noise vector with Gaussian distribution.
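For the leg follower in the question, where the sensor delivers a distance and a relative angle, a measurement function could look like the sketch below. It assumes, purely for illustration, that the state holds the person's Cartesian position [x, y] in the robot frame with the x-axis pointing forward, and it leaves out the noise term w. The Jacobian is what an EKF correction step would need; it is undefined when the person is exactly at the origin.

```python
import math


def h(s):
    """Predicted measurement [distance, angle] for a person at s = [x, y] in the robot frame."""
    x, y = s
    return [math.hypot(x, y), math.atan2(y, x)]


def h_jacobian(s):
    """Partial derivatives of h with respect to [x, y], as needed by an EKF update."""
    x, y = s
    r2 = x * x + y * y
    r = math.sqrt(r2)
    return [[x / r, y / r], [-y / r2, x / r2]]


print(h([3.0, 4.0]))
print(h_jacobian([3.0, 4.0]))
```

A larger state (for example with velocities added) would only add zero columns to this Jacobian, since the sensor does not observe them directly.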