How to get the probability of each vector belonging to each cluster? - nltk

I use the following code to create clusters. I would like to get the probability of each vector belonging to each cluster. How to do this?
import numpy as np
from nltk import cluster
from nltk.cluster import euclidean_distance
vectors = [np.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
clusterer = cluster.KMeansClusterer(2, euclidean_distance)
clusters = clusterer.cluster(vectors, assign_clusters=True, trace=False)

You could get such probabilities from a Gaussian mixture model in scikit-learn. Note that mixture.GMM was renamed to mixture.GaussianMixture in scikit-learn 0.18; in the current API, predict_proba returns the posteriors:
from sklearn import mixture
model = mixture.GaussianMixture(n_components=2)  # one component per cluster
model.fit(vectors)
model.predict_proba(vectors)
According to the docs, this returns the posterior probabilities of each mixture component for each observation. But of course this won't help if the clustering doesn't converge for your data.

Are you talking about:
the assignments kmeans made to vectors from your vectors variable or
the assignment of a new vector to an existing cluster?
1. The K-means assignments
Simply print the clusters variable. If you see [0, 0, 1, 1], it means [3, 3] and [1, 2] (the first two) were assigned to cluster 0, and [4, 2] and [4, 0] (the last two) to cluster 1. There's no probability here.
2. Assigning a new vector to an existing cluster
Since you're using KMeans, you first need to know the centroid of each cluster. The nltk API treats this as private information: the interesting variable (_means) is prefixed with an underscore. It could change in the future, but you can still read it if you want to.
The NLTK algorithm is randomized, so you will get different centroids each time. As I said before, you can see the assignments with print(clusters) and the centroids with print(clusterer._means). Let's say you got the assignment [0, 0, 1, 1] with centroids [2, 2.5] and [4, 1]. A new vector (say [1, 2]) would be assigned to the closest existing cluster. Again, it makes little sense to talk about probability here. If you really wanted to, you could compute the distance to every centroid and turn those distances into probabilities with a softmax, as in the sketch below.
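Here is a minimal sketch of that idea (this is not part of the nltk API; it reuses the fitted clusterer from above and reads the private _means attribute):
import numpy as np
new_vector = np.array([1, 2])
centroids = clusterer._means  # private attribute, may change between nltk versions
distances = np.array([np.linalg.norm(new_vector - c) for c in centroids])
scores = np.exp(-distances)            # smaller distance -> larger score
probabilities = scores / scores.sum()  # softmax over negative distances
print(probabilities)                   # exact values vary from run to run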


Pytorch: How do I deal with different input sizes within one batch?

I am implementing something closely related to the DeepSets architecture on point clouds:
https://arxiv.org/abs/1703.06114
That means I am working with a set of inputs (coordinates), have fully connected layers process each of those separately, and then perform average pooling over them (to then do further processing).
The input for each sample i is a tensor of shape [L_i, 3], where L_i is the number of points and the last dimension is 3 because each point has x, y, z coordinates. Crucially, L_i depends on the sample, so I have a different number of points per instance. When I put everything into a batch, I currently have the input in the shape [B, L, 3], where L is larger than L_i for all i. The individual samples are padded with 0's. The issue is that the 0's are not ignored by the network; they are processed and fed into the average pooling. Instead, I would like the average pooling to only consider actual points (not padded 0's). I do have another array which stores the lengths [L_1, L_2, L_3, L_4...], but I am not sure how to use it.
My question is: how do you handle different input sizes within one batch in the most graceful manner?
This is how the model is defined:
encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))
x = encoder(x)        # applied per point: [B, L, 3] -> [B, L, 128]
x = x.max(dim=1)[0]   # pool over the points dimension L
decoder = ...
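One way to use that lengths array for the pooling step (a sketch, assuming the encoder output x has shape [B, L, 128] and lengths is a [B] tensor of true point counts; the function name is illustrative, not from the original code):
import torch
def masked_mean(x, lengths):
    # x: [B, L, D] padded encoder output; lengths: [B] number of real points
    B, L, D = x.shape
    # mask[b, l] is True for real points and False for padding
    mask = torch.arange(L, device=x.device)[None, :] < lengths[:, None]
    x = x * mask.unsqueeze(-1).float()              # zero out the padded rows
    return x.sum(dim=1) / lengths[:, None].float()  # average over real points only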

Defining a Keras function

I have recently started to learn Deep Learning and CNNs. I have come across the following code which defines a simple CNN.
Can anyone help me understand how these lines work?
loss = layer_output[:, :, :, 0] - What is the result of this? The network has not been trained yet, so the weights (kernels) have not been computed. What data is it going to return? Does the 0 represent the first kernel?
iterate = K.function([input_img], [loss, grads]) - There is not much documentation available on the Keras site. What I understand is that iterate is a function which takes an input tensor and returns a list of tensors, the first one being the loss and the second the grads. But they are defined elsewhere!
Define Input Image with these dimensions:
img_data = np.random.uniform(size=(1, 250, 250, 3))
There is a simple CNN, which has one convolutional layer. It uses two 3x3 kernels.
input = Input(shape=(250, 250, 3), name='input_1')
First_Conv2D = Conv2D(2, kernel_size=(3, 3), padding="same", name='conv2d_1', activation='relu')(input)
flat = Flatten(name='flatten_1')(First_Conv2D)
output = Dense(2, name='dense_1', activation='softmax')(flat)
model = Model(inputs=[input], outputs=[output])
layer_dict = dict([(layer.name, layer) for layer in model.layers[0:]])
layer_output = layer_dict['conv2d_1'].output
input_img = model.input
# Calculate loss and gradient.
loss = layer_output[:, :, :, 0]
grads = K.gradients(loss, input_img)[0]
# Define a Keras function
iterate = K.function([input_img], [loss, grads])
# Call iterate function
loss_value, grads_value = iterate([img_data])
Thank You.
This looks like a nasty dissection of Keras as an API; I reckon it leads to more confusion than to an introduction to deep learning. Anyway, addressing your questions:
All tensors are symbolic, meaning that until we run a session they do not contain any values; they instead define a directed computation graph. loss = layer_output[:, :, :, 0] is a slicing operation that takes the first element of the last dimension, returning another tensor with 3 dimensions. Only when you run the session with actual inputs do the tensors hold values on which these operations run. The operations are almost identical to those on NumPy ndarrays, which are not symbolic and do contain values, so you can build an intuition there.
K.function just glues the inputs to the outputs, returning a single operation that, when given the inputs, follows the computation graph from the inputs to the defined outputs. In this case, given a list containing a single input, it returns a list of two outputs, the loss and the gradients. Remember that these are still symbolic: if you try to print one, you'll just get a description of what it is, its shape, and its data type.
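To make the symbolic/concrete distinction tangible, here is a small illustration using the variables from the question (the printed tensor description is indicative; the exact string depends on the backend version):
print(loss)   # symbolic: something like Tensor("strided_slice:0", shape=(?, 250, 250), dtype=float32)
loss_value, grads_value = iterate([img_data])
print(loss_value.shape)   # (1, 250, 250): a concrete NumPy array with actual values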

What is the difference between performing upsampling together with a stride-1 transposed convolution, and a strided transposed convolution only?

I noticed in a number of places that people use something like this, usually in fully convolutional networks, autoencoders, and similar:
model.add(UpSampling2D(size=(2, 2)))
model.add(Conv2DTranspose(f, kernel_size=k, padding='same', strides=(1, 1)))
I am wondering what the difference is between that and simply:
model.add(Conv2DTranspose(f, kernel_size=k, padding='same', strides=(2, 2)))
Links towards any papers that explain this difference are welcome.
Here and here you can find a really nice explanation of how transposed convolutions work. To sum up both of these approaches:
In your first approach, you are first upsampling your feature map:
[[1, 2], [3, 4]] -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
and then you apply a classical convolution (as Conv2DTranspose with stride=1 and padding='same' is equivalent to Conv2D).
In your second approach you are first un(max)pooling your feature map:
[[1, 2], [3, 4]] -> [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]
and then apply a classical convolution (with the given kernel_size, filters, etc.).
A fun fact: although these approaches are different, they share something in common. Transposed convolution is meant to approximate the gradient of convolution, so the first approach approximates the gradient of sum pooling whereas the second approximates the gradient of max pooling. This makes the first approach produce slightly smoother results.
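To see the two pre-convolution tensors concretely, here is a small NumPy illustration (a sketch: UpSampling2D performs the nearest-neighbour repeat, while a stride-2 transposed convolution effectively sees the zero-stuffed version):
import numpy as np
x = np.array([[1, 2], [3, 4]])
# Approach 1: nearest-neighbour upsampling, as done by UpSampling2D.
upsampled = x.repeat(2, axis=0).repeat(2, axis=1)
# [[1 1 2 2], [1 1 2 2], [3 3 4 4], [3 3 4 4]]
# Approach 2: zero-stuffing, as seen by a stride-2 transposed convolution.
stuffed = np.zeros((4, 4), dtype=x.dtype)
stuffed[::2, ::2] = x
# [[1 0 2 0], [0 0 0 0], [3 0 4 0], [0 0 0 0]]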
Other reasons why you might see the first approach are:
Conv2DTranspose (and its equivalents) are relatively new in Keras, so for a long time the only way to perform learnable upsampling was to use UpSampling2D,
the author of Keras, Francois Chollet, used this approach in one of his tutorials,
in the past, equivalents of transposed convolution seemed to work awfully in Keras due to some API inconsistencies.
I just want to point out a couple of things that you mentioned. UpSampling2D is not a learnable layer, since it has literally zero parameters.
Also, the fact that Francois Chollet used the first approach in one of his examples does not by itself justify using it.

Mask specific elements in a final layer in PyTorch

I am now reproducing the following model, which outputs an action and uses a filter to rule out inappropriate candidates.
https://arxiv.org/abs/1702.03274
In this model, the output is filtered after the last softmax layer. Let's assume action_size == 3, so the output after the dense & softmax layers looks like this:
output: [0.1, 0.7, 0.2]
filter: [0, 1, 1]
output*filter: [0, 0.7, 0.2]
But in PyTorch, LogSoftmax is preferred with NLLLoss, so my output looks like the following, and multiplying by the filter no longer makes sense:
output: [-5.4, -0.2, -4.9]
filter: [0, 1, 1]
output*filter: [0, -0.2, -4.9]
So PyTorch doesn't recommend vanilla Softmax. How should I apply the mask to eliminate specific actions?
Or is there a categorical cross-entropy loss function that works with vanilla Softmax?
This module doesn't work directly with NLLLoss, which expects the log to be computed between the Softmax and itself. Use LogSoftmax instead (it's faster and has better numerical properties).
http://pytorch.org/docs/master/nn.html#torch.nn.Softmax
The output of LogSoftmax is simply the log of the output of Softmax. That means you can just call torch.exp(output_from_logsoftmax) to get the same values as you would get from Softmax.
So, if I'm reading your question correctly, you would compute LogSoftmax, feed it into NLLLoss for training, and exponentiate it to get the probabilities you use for your filtering, as in the sketch below.
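A minimal sketch of that idea (the logits and the mask here are made up for illustration):
import torch
import torch.nn.functional as F
logits = torch.tensor([[2.0, 5.0, 1.5]])   # raw scores for action_size == 3
log_probs = F.log_softmax(logits, dim=1)   # feed this into nn.NLLLoss when training
probs = torch.exp(log_probs)               # identical to what Softmax would give
mask = torch.tensor([[0.0, 1.0, 1.0]])     # rule out the first action
filtered = probs * mask                    # roughly [0.00, 0.93, 0.03]
print(filtered)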

Inception style convolution

I need to keep the model as small as possible to deploy an image classifier that can run efficiently in an app (accuracy is not really critical for me).
I recently started with deep learning and don't have much experience, hence I'm currently playing with the CIFAR-10 example.
I tried to replace the first two 5x5 convolutional layers with two 3x3 convolutions each, as described in the Inception paper.
Unfortunately, when I classify the test set, I get around 0.1 correct classifications (random chance).
This is the modified code of the first layer (the second is similar):
with tf.variable_scope('conv1') as scope:
    kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
                                            stddev=1e-4, wd=0.0)
    kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 1],
                                            stddev=1e-4, wd=0.0)
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    conv_l1 = tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1], padding='SAME')
    conv_l2 = tf.nn.depthwise_conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
    bias = tf.nn.bias_add(conv_l2, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)
Is it correct?
It seems you're attempting to compute 64 features (for each 3x3 patch) in the first convolutional layer and feed this directly into the second convolutional layer, with no intermediate pooling layer. Convolutional neural networks typically have a structure of stacked convolutional layers, followed by contrast normalization and max pooling.
To reduce processing overheads, researchers have experimented with moving from fully connected to sparsely connected architectures, hence the creation of the Inception architecture. However, whilst these yield good results for high-dimensional inputs, you may be expecting too much from the 32x32 pixels of CIFAR-10 in TensorFlow.
Therefore, I think the issue is less about patch size and more to do with the overall architecture. This code is a known good starting point. Get it working, then start reducing parameters until it breaks.
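As a side note on the factorization itself: the "one 5x5 replaced by two 3x3" trick from the Inception paper stacks two full convolutions with a nonlinearity in between, whereas the snippet above uses a depthwise convolution with no ReLU in the middle. A rough sketch in the style of the tutorial code (reusing its _variable_with_weight_decay and _variable_on_cpu helpers; the stddev value is illustrative):
with tf.variable_scope('conv1') as scope:
    kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
                                            stddev=5e-2, wd=0.0)
    # First 3x3 convolution, followed by a nonlinearity.
    conv_l1 = tf.nn.relu(tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1],
                                      padding='SAME'))
    kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 64],
                                            stddev=5e-2, wd=0.0)
    # Second 3x3 convolution: a full convolution, not a depthwise one.
    conv_l2 = tf.nn.conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    conv1 = tf.nn.relu(tf.nn.bias_add(conv_l2, biases), name=scope.name)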