Is Next-Frame Video Prediction with Convolutional LSTMs a regression problem or a classification problem?
Why regression/classification?
Why do we use the last Conv3D layer?
I would consider this closer to a regression problem than a classification problem, since its inputs are all the previous frames, from which it learns the trend or function to fit; in this case it learns the direction in which the MNIST digit is moving and then predicts the most likely next location.
Since it is NOT trying to classify a set of candidate digit positions as next_location or NOT_next_location, it doesn't look like a classification problem.
The last layer is defined as:
x = layers.Conv3D(filters=1, kernel_size=(3, 3, 3), activation="sigmoid", padding="same")(x)
It essentially takes in all past 2D frames (individual MNIST images), i.e. a stack of shape (frames, height, width), and compresses them to predict the next single frame.
If you go to the Colab notebook linked in the Keras tutorial you mentioned, you can run the first 4 cells and add a model.summary() to see this:
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, None, 64, 64, 1) 0
]
conv_lstm2d (ConvLSTM2D) (None, None, 64, 64, 64) 416256
batch_normalization (BatchN (None, None, 64, 64, 64) 256
ormalization)
conv_lstm2d_1 (ConvLSTM2D) (None, None, 64, 64, 64) 295168
batch_normalization_1 (Batc (None, None, 64, 64, 64) 256
hNormalization)
conv_lstm2d_2 (ConvLSTM2D) (None, None, 64, 64, 64) 33024
conv3d (Conv3D) (None, None, 64, 64, 1) 1729
=================================================================
Total params: 746,689
Trainable params: 746,433
Non-trainable params: 256
The Conv3D layer here outputs a prediction frame of dimensions (64, 64, 1) for every time step; the last one is the predicted next frame.
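For intuition, here is a minimal sketch (not from the tutorial; it assumes model is the network summarized above and frames is a batch of past frames shaped (batch, time, 64, 64, 1)). The sigmoid head predicts per-pixel intensities in [0, 1], i.e. a regression-style target rather than class labels:
pred = model.predict(frames)     # (batch, time, 64, 64, 1), values in [0, 1]
next_frame = pred[:, -1, ...]    # last time step = the predicted next frame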
Connor Shorten has made a video explanation as well: YouTube Tutorial Link
Related
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.layers import (Input, Conv2D, MaxPool2D, Flatten, Dense,
                                     Reshape, UpSampling2D, Activation)
from tensorflow.keras.models import Model
from tensorflow.keras import backend

# Load data set
(X_train, _), (X_test, _) = fashion_mnist.load_data()
# Define the input shape
input_shape = (28, 28, 1)
latent_dim = 16
# Define the number of kernels for each convolutional layer
encoder_conv_kernels = [64, 32, 16]
decoder_conv_kernels = [16, 32, 64]
# Define the kernel size for all convolutional layers
kernel_size = (3, 3)
# Define the pool size for all max pooling layers
pool_size = (2, 2)
# Define the up sampling factors for all up sampling layers
up_sampling_factors = (2, 2)
# Define the encoder model
inputs = Input(shape=input_shape, name='encoder_input')
x = inputs
for filters in encoder_conv_kernels:
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               activation='relu',
               padding='same')(x)
    x = MaxPool2D(pool_size=pool_size)(x)
# Read and preserve the dimensions of the tensor
shape = backend.int_shape(x)
# Then we have a flattening layer
x = Flatten()(x)
latent_outputs = Dense(latent_dim, name='latent_vector')(x)
# Define the encoder model
encoder = Model(inputs=inputs, outputs=latent_outputs, name='encoder')
encoder.summary()
Below is what I got from the summary; as you can see, 7,7 became 3,3:
Model: "encoder"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
encoder_input (InputLayer) [(None, 28, 28, 1)] 0
conv2d_111 (Conv2D) (None, 28, 28, 64) 640
max_pooling2d_51 (MaxPoolin (None, 14, 14, 64) 0
g2D)
conv2d_112 (Conv2D) (None, 14, 14, 32) 18464
max_pooling2d_52 (MaxPoolin (None, 7, 7, 32) 0
g2D)
conv2d_113 (Conv2D) (None, 7, 7, 16) 4624
max_pooling2d_53 (MaxPoolin (None, 3, 3, 16) 0
g2D)
flatten_10 (Flatten) (None, 144) 0
latent_vector (Dense) (None, 16) 2320
=================================================================
Total params: 26,048
Trainable params: 26,048
Non-trainable params: 0
_________________________________________________________________
But in the decoder, 3,3 became 6,6 instead of 7,7.
# Define the decoder model
# First we have the input layer
latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
x = Reshape((shape[1], shape[2], shape[3]))(x)
for filters in decoder_conv_kernels[::-1]:
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               activation='relu',
               padding='same')(x)
    x = UpSampling2D(size=up_sampling_factors)(x)
# Define the output layer with sigmoid activation function
# then we add one more convolutional layer to control the channel dimension
x = Conv2D(filters=1,
           kernel_size=kernel_size,
           padding='same')(x)
# and one activation later with the sigmoid activation function
outputs = Activation('sigmoid', name='decoder_output')(x)
# Define the decoder model
decoder = Model(inputs=latent_inputs, outputs=outputs, name='decoder')
decoder.summary(line_length=110)
Below is what I got from the summary:
Model: "decoder"
______________________________________________________________________________________________________________
Layer (type) Output Shape Param #
==============================================================================================================
decoder_input (InputLayer) [(None, 16)] 0
dense_11 (Dense) (None, 144) 2448
reshape_12 (Reshape) (None, 3, 3, 16) 0
conv2d_114 (Conv2D) (None, 3, 3, 64) 9280
up_sampling2d_48 (UpSampling2D) (None, 6, 6, 64) 0
conv2d_115 (Conv2D) (None, 6, 6, 32) 18464
up_sampling2d_49 (UpSampling2D) (None, 12, 12, 32) 0
conv2d_116 (Conv2D) (None, 12, 12, 16) 4624
up_sampling2d_50 (UpSampling2D) (None, 24, 24, 16) 0
conv2d_117 (Conv2D) (None, 24, 24, 1) 145
decoder_output (Activation) (None, 24, 24, 1) 0
==============================================================================================================
Total params: 34,961
Trainable params: 34,961
Non-trainable params: 0
______________________________________________________________________________________________________________
How can I go from 3,3 to 7,7 instead of 6,6 in the decoder? What I expected is that the decoder output would be 28,28,1 instead of 24,24,1. Thanks!
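For reference, one possible way to get back to 28,28,1 (a sketch, not the only fix): MaxPooling with the default padding='valid' floor-divides 7 to 3, and UpSampling2D only doubles, so 3 can never come back as 7 on its own. Padding the 6x6 map back to 7x7 right after the first upsampling makes the remaining two upsamplings give 14 and then 28 (Conv2DTranspose with output_padding, or cropping after a larger upsample, would also work):
from tensorflow.keras.layers import ZeroPadding2D

latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
x = Reshape((shape[1], shape[2], shape[3]))(x)               # (3, 3, 16)
for i, filters in enumerate(decoder_conv_kernels[::-1]):
    x = Conv2D(filters=filters, kernel_size=kernel_size,
               activation='relu', padding='same')(x)
    x = UpSampling2D(size=up_sampling_factors)(x)            # 3->6, 7->14, 14->28
    if i == 0:
        x = ZeroPadding2D(padding=((1, 0), (1, 0)))(x)       # 6x6 -> 7x7
x = Conv2D(filters=1, kernel_size=kernel_size, padding='same')(x)
outputs = Activation('sigmoid', name='decoder_output')(x)
decoder = Model(inputs=latent_inputs, outputs=outputs, name='decoder')  # (28, 28, 1)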
I've been stuck on this for hours now and can't figure out where the mistake is or what I am doing wrong. It's basically a simple neural network. The original data was a list of lists where each element is an image; from my understanding I converted it to an np.array (called x below). I printed all the shapes, but I still can't figure out how to fix it.
The training data contains 4057 images of size 32*32, so each image has 1024 values. I have 27 classes, so that is the size of the last output layer.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# reg (regularizer) and dropout (rate) are defined elsewhere, not shown here
image_shape = (1024, 1)
classes_num = 27
batch = 256
epoch = 50
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=image_shape,kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(512,kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(512,kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(classes_num, activation='softmax'))
model.summary()
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
# the original data was a list of lists, so I made an np.array from it
x=np.array([np.array(xi) for xi in splitDataDict['train_x']])
y=np.array([np.array(xi) for xi in splitDataDict['train_y']])
val_x=np.array([np.array(xi) for xi in splitDataDict['val_x']])
val_y=np.array([np.array(xi) for xi in splitDataDict['val_y']])
print(x.shape)
print(y.shape)
print(val_x.shape)
print(val_y.shape)
history = model.fit(x, y, validation_data=(val_x, val_y), epochs=epoch, batch_size=batch)
Shapes
x = (4057, 1024, 1)
y= (4057,)
val_x = (508, 1024, 1)
val_y = (508,)
Output
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 1024, 1024) 2048
_________________________________________________________________
dense_2 (Dense) (None, 1024, 512) 524800
_________________________________________________________________
dense_3 (Dense) (None, 1024, 512) 262656
_________________________________________________________________
dense_4 (Dense) (None, 1024, 27) 13851
=================================================================
Total params: 803,355
Trainable params: 803,355
Non-trainable params: 0
_________________________________________________________________
Error
ValueError: Error when checking target: expected dense_4 to have 3 dimensions, but got array with shape (4057, 1)
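For what it's worth, a minimal sketch of one likely fix (assuming splitDataDict, reg and dropout are defined as in the question, and that y holds integer class ids 0..26): because input_shape=(1024, 1) is 2-D, the Dense layers are applied along the length-1024 axis and the output is (None, 1024, 27) instead of (None, 27), and the (4057,) integer labels don't match categorical_crossentropy, which expects one-hot targets. Feeding flat 1024-vectors and using sparse_categorical_crossentropy addresses both:
# flatten each image to a 1024-vector
x = x.reshape(len(x), 1024)              # (4057, 1024)
val_x = val_x.reshape(len(val_x), 1024)  # (508, 1024)

model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(1024,), kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(512, kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(512, kernel_regularizer=reg))
model.add(Dropout(dropout))
model.add(Dense(classes_num, activation='softmax'))

# y is (4057,) with integer class ids, so sparse_categorical_crossentropy
# avoids having to one-hot encode the labels
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(x, y, validation_data=(val_x, val_y),
                    epochs=epoch, batch_size=batch)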
I am fairly new to Deep Learning, but I managed to build a multi-branch Image Classification architecture yielding quite satisfactory results.
Not so important: I am working on KKBox customer churn (https://kaggle.com/c/kkbox-churn-prediction-challenge/data) where I transformed customer behavior, transactions and static data into heatmaps and try to classify churners based on that.
The classification itself works just fine. My issue comes in when I try to apply LIME to see where the results are coming from. When following the code here: https://marcotcr.github.io/lime/tutorials/Tutorial%20-%20images.html, with the exception that I use a list of inputs [members[0],transactions[0],user_logs[0]], I get the following error: AttributeError: 'list' object has no attribute 'shape'
What springs to mind is that LIME is probably not made for multi-input architectures such as mine. On the other hand, Microsoft Azure has a multi-branch architecture as well (http://www.freepatentsonline.com/20180253637.pdf?fbclid=IwAR1j30etyDGPCmG-QGfb8qaGRysvnS_f5wLnKz-KdwEbp2Gk0_-OBsSepVc) and they allegedly use LIME to interpret their results (https://www.slideshare.net/FengZhu18/predicting-azure-churn-with-deep-learning-and-explaining-predictions-with-lime).
I have tried concatenating the images into a single input, but that approach yields far worse results than the multi-input one. LIME does work for it, though (even if not as comprehensibly as for usual image recognition).
The DNN architecture:
import keras
from keras.layers import Input, Dropout, Conv2D, GlobalMaxPooling2D, Dense, Activation
from keras.models import Model

# Members
members_input = Input(shape=(61,4,3), name='members_input')
x1 = Dropout(0.2)(members_input)
x1 = Conv2D(32, kernel_size = (61,4), padding='valid', activation='relu', strides=1)(x1)
x1 = GlobalMaxPooling2D()(x1)
# Transactions
transactions_input = Input(shape=(61,39,3), name='transactions_input')
x2 = Dropout(0.2)(transactions_input)
x2 = Conv2D(32, kernel_size = (61,1,), padding='valid', activation='relu', strides=1)(x2)
x2 = Conv2D(32, kernel_size = (1,39,), padding='valid', activation='relu', strides=1)(x2)
x2 = GlobalMaxPooling2D()(x2)
# User logs
userlogs_input = Input(shape=(61,7,3), name='userlogs_input')
x3 = Dropout(0.2)(userlogs_input)
x3 = Conv2D(32, kernel_size = (61,1,), padding='valid', activation='relu', strides=1)(x3)
x3 = Conv2D(32, kernel_size = (1,7,), padding='valid', activation='relu', strides=1)(x3)
x3 = GlobalMaxPooling2D()(x3)
# User_logs + Transactions + Members
merged = keras.layers.concatenate([x1,x2,x3]) # Merged layer
out = Dense(2)(merged)
out_2 = Activation('softmax')(out)
model = Model(inputs=[members_input, transactions_input, userlogs_input], outputs=out_2)
model.compile(optimizer="adam", loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
The attempted LIME utilization:
from lime import lime_image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance([members_test[0],transactions_test[0],user_logs_test[0]], model.predict, top_labels=2, hide_color=0, num_samples=1000)
Model summary:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
transactions_input (InputLayer) (None, 61, 39, 3) 0
__________________________________________________________________________________________________
userlogs_input (InputLayer) (None, 61, 7, 3) 0
__________________________________________________________________________________________________
members_input (InputLayer) (None, 61, 4, 3) 0
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, 61, 39, 3) 0 transactions_input[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout) (None, 61, 7, 3) 0 userlogs_input[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, 61, 4, 3) 0 members_input[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 1, 39, 32) 5888 dropout_2[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 1, 7, 32) 5888 dropout_3[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 1, 1, 32) 23456 dropout_1[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 1, 1, 32) 39968 conv2d_2[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 1, 1, 32) 7200 conv2d_4[0][0]
__________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalMaxPooling2D) (None, 32) 0 conv2d_1[0][0]
__________________________________________________________________________________________________
global_max_pooling2d_2 (GlobalMaxPooling2D) (None, 32) 0 conv2d_3[0][0]
__________________________________________________________________________________________________
global_max_pooling2d_3 (GlobalMaxPooling2D) (None, 32) 0 conv2d_5[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 96) 0 global_max_pooling2d_1[0][0]
global_max_pooling2d_2[0][0]
global_max_pooling2d_3[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 2) 194 concatenate_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 2) 0 dense_1[0][0]
==================================================================================================
Hence my question: does anyone have experience with multi-input DNN architecture and LIME? Is there a workaround I am not seeing? Is there another interpretable model I could use?
Thank you.
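Not a definitive answer, but one workaround sometimes used with multi-input image models (a sketch under the assumption that the three heatmaps share the same height, as they do here with 61): concatenate the three inputs along the width axis into a single 61 x 50 x 3 image for LIME, and split it back apart inside the prediction wrapper that LIME calls:
import numpy as np
from lime import lime_image

def split_and_predict(images):
    # images: (num_samples, 61, 50, 3) perturbed copies generated by LIME
    members      = images[:, :, :4, :]     # width 4
    transactions = images[:, :, 4:43, :]   # width 39
    user_logs    = images[:, :, 43:, :]    # width 7
    return model.predict([members, transactions, user_logs])

# single image for LIME: (61, 4+39+7, 3) = (61, 50, 3)
combined = np.concatenate(
    [members_test[0], transactions_test[0], user_logs_test[0]], axis=1)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(combined.astype('double'),
                                         split_and_predict,
                                         top_labels=2, hide_color=0,
                                         num_samples=1000)
The caveat is that LIME's superpixel segmentation knows nothing about the boundaries between the three concatenated heatmaps, so explanations that straddle them need to be read with care.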
I've trained a GRU with Keras. I get the following error when I run:
nxt = model.predict([features,embedding_matrix[enc_map[cur]]])
ValueError: Error when checking : expected input_2 to have shape (512,) but got array with shape (1,)
But
features.shape
(512,)
And
embedding_matrix[enc_map[cur]].shape
(50,)
Here's the summary of the model:
model.summary()
________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
================================================================================
input_2 (InputLayer) (None, 512) 0
________________________________________________________________________________
input_1 (InputLayer) (None, 50) 0
________________________________________________________________________________
merge_1 (Merge) (None, 562) 0 input_2[0][0]
input_1[0][0]
________________________________________________________________________________
reshape_1 (Reshape) (None, 1, 562) 0 merge_1[0][0]
________________________________________________________________________________
gru_1 (GRU) (None, 128) 265344 reshape_1[0][0]
_______________________________________________________________________________
dense_1 (Dense) (None, 50) 6450 gru_1[0][0]
================================================================================
Total params: 271,794
Trainable params: 271,794
Non-trainable params: 0
The model has two inputs: input_2 expects a numpy array with shape (any, 512) and input_1 one with shape (any, 50), i.e. each input needs a leading batch dimension.
Check the shapes of the X_train arrays you used for training; the arrays passed to predict must follow the same rules.
Passing the two 1-D vectors as they are (or wrapping them in a single np.array) doesn't satisfy this, because features is (512,) and the embedding vector is (50,).
You need to reshape each of them to add the batch dimension, for example by reshaping to (512, 1) and transposing:
features = features.reshape(512, 1).T    # shape (1, 512)
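A minimal sketch of the full call under that assumption (features, embedding_matrix, enc_map and cur as in the question; reshape(1, 512) is equivalent to reshape(512, 1).T), doing the same for the 50-dimensional embedding:
nxt = model.predict([features.reshape(1, 512),
                     embedding_matrix[enc_map[cur]].reshape(1, 50)])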
In a Keras implementation, I once saw the last two fully connected layers defined as follows:
outX = Dense(300, activation='relu')(outX)
outX = Flatten()(outX)
predictions = Dense(1,activation='linear')(outX)
Between the two Dense layers there is a Flatten layer. Why must we add a Flatten operation between two fully connected layers? Is that always required?
Short answer: a Flatten layer doesn't have any parameters to learn itself. However, adding a Flatten layer to the model can increase the number of learnable parameters of the model.
Example: try to figure out the difference between these two models:
1) Without Flatten:
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Model

inp = Input(shape=(20, 10,))
A = Dense(300, activation='relu')(inp)
#A = Flatten()(A)
A = Dense(1, activation='relu')(A)
m = Model(inputs=inp, outputs=A)
m.summary()
Output:
input_9 (InputLayer) (None, 20, 10) 0
dense_20 (Dense) (None, 20, 300) 3300
dense_21 (Dense) (None, 20, 1) 301
Total params: 3,601
Trainable params: 3,601
Non-trainable params: 0
2) With Flatten:
inp = Input(shape=(20,10,))
A = Dense(300, activation='relu')(inp)
A = Flatten()(A)
A = Dense(1, activation='relu')(A)
m = Model(inputs=inp,outputs=A)
m.summary()
Output:
input_10 (InputLayer) (None, 20, 10) 0
dense_22 (Dense) (None, 20, 300) 3300
flatten_9 (Flatten) (None, 6000) 0
dense_23 (Dense) (None, 1) 6001
Total params: 9,301
Trainable params: 9,301
Non-trainable params: 0
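To see where the numbers come from: Dense(300) acts on the last axis of the (20, 10) input, so it learns 10*300 + 300 = 3300 parameters and produces a (20, 300) output. Without Flatten, the following Dense(1) again acts per row and only needs 300*1 + 1 = 301 parameters, giving a (20, 1) output, i.e. one value per row. With Flatten, all 20*300 = 6000 values feed a single Dense(1), which needs 6000*1 + 1 = 6001 parameters and produces a single scalar per sample.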
Finally, whether or not to add a Flatten layer depends on the data at hand. Having more parameters to learn can lead to a more accurate model, or it can cause overfitting. So one answer is: try both and choose the better one.