Where should I put the previously predicted sequence in an LSTM for optical character recognition (OCR)?

I am trying to build an optical character recognition system that can recognize handwritten sentences using the LSTM cell.
Now what I have understood from literature is that you need to give two inputs to the LSTM cell: one is the image that you are trying to recognize and the second is the sequence of words it has already predicted. So for example if I had an image that read "I love machine learning", I would create the following pairs of input:
Image + startseq
Image + startseq + I
Image + startseq + I + love
So for each input you want the LSTM to predict the next word, i.e. "I", "love", "machine" for the above sequences.
The problem I'm having is that I can't figure out how to input the image AND the previous sequence to the LSTM cell. Do I divide my image (a 2-D matrix) into row/column vectors and send them to the LSTM one at a time, and only after that send in the previous sequence of words? But this way I'll have quite long input sequences, which might lead to long convergence times.
I know image captioning systems vectorize the input image using a pretrained neural net, but can that be done for optical character recognition as well, or would it cause accuracy issues?

No, you don't have to feed the recognized words back into the LSTM. You only feed an input (feature) sequence, and the LSTM learns to propagate relevant information through this sequence.
You should think in terms of an input sequence and an output sequence when talking about recurrent neural networks (RNNs).
The input to an RNN at time-step t is (see the recurrence below):
state of memory cell at t-1
input element at t
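Written as a vanilla-RNN recurrence (a generic textbook formulation, just to make those two inputs explicit):
h_t = tanh(W_x * x_t + W_h * h_{t-1} + b)
where x_t is the input element at time-step t and h_{t-1} is the state carried over from the previous step; the output at step t is then computed from h_t.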
An LSTM has a more advanced internal structure than a vanilla RNN, which allows more robust training. But from a user's perspective it works just like a vanilla RNN: you feed in a sequence and the LSTM computes an output sequence for you.
When doing handwriting recognition, you usually extract a feature sequence from the input image (e.g. by using convolutional layers).
Then, you feed this feature sequence into LSTM layers.
You map the LSTM output sequence to a character-probability matrix, which is then decoded into the final text by a CTC layer.
Here is a short tutorial on how to build a handwriting recognition system; it should give you an idea of which data flows into the LSTM and which data flows out of it (see "Data": "CNN output" and "RNN output"):
https://towardsdatascience.com/2326a3487cd5
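As a rough illustration of that data flow, here is a minimal Keras-style sketch (the 32x128 input size, layer sizes and character-set size are made-up placeholders, not taken from the tutorial):

import tensorflow as tf
from tensorflow.keras import layers

num_chars = 80                                     # placeholder character-set size

# Input image: height x width x channels (placeholder size)
image = layers.Input(shape=(32, 128, 1))

# CNN: extract features from the image
x = layers.Conv2D(32, 3, padding='same', activation='relu')(image)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)                 # -> (8, 32, 64)

# Turn the width axis into the time axis: one feature vector per horizontal position
x = layers.Permute((2, 1, 3))(x)                   # -> (32, 8, 64)
x = layers.Reshape((32, 8 * 64))(x)                # -> sequence of 32 feature vectors

# LSTM: propagate relevant information along the feature sequence
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Character-probability matrix: one distribution per time step (+1 for the CTC blank)
char_probs = layers.Dense(num_chars + 1, activation='softmax')(x)

model = tf.keras.Model(image, char_probs)
# Training would use a CTC loss (e.g. tf.keras.backend.ctc_batch_cost), and the
# probability matrix is decoded into text with e.g. tf.nn.ctc_greedy_decoder.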

Related

Transfer learning from image classifiers (e.g. AlexNet) to binary classification of generated data charts

I have written code that transforms time series data (medical EEGs) into a scatter chart. I have labels available for a training set noting the presence or absence of 'seizures' in the data. I can see the seizures clearly (as a human) in the scatter plots. I thought it would be straightforward to adapt a pretrained image classifier (AlexNet) to the (seizure, no_seizure) binary classification. I have a training set of 500 chart images. The model is not converging.
I replaced the final AlexNet layer before training:
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)
Do you have any advice for helping me with this challenge? Intuitively I thought classification of scatter charts would be easier than photograph image classification.
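For reference, the surrounding setup is usually something along these lines (a sketch assuming torchvision's pretrained AlexNet; freezing the feature extractor is optional and just one common choice, not necessarily what was done here):

import torch
import torchvision

# Load a pretrained AlexNet from torchvision
model = torchvision.models.alexnet(pretrained=True)

# Optionally freeze the convolutional feature extractor so only the head trains
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final classification layer with a 2-class head (seizure, no_seizure)
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)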

Is it possible to feed the output back to the input in an artificial neural network?

I am currently designing an artificial neural network for a problem with a decay curve.
For example, building a model for predicting the durability of some material. The inputs may include environmental conditions like temperature and humidity.
However, those alone are not adequate to predict the durability of the material. For such a problem, I think it is better to use the durability output of previous time slots as one of the current inputs when predicting the durability of the next time slot.
However, I do not know how to train a model which feeds the output back into the input, since that input column only has its initial value before training.
For this case:
Method 1 (failed)
I tried to copy the predicted durability of the current row into the input durability of the next row. However, this prevents loss.backward() from working, so the gradient cannot be computed and the weights cannot be updated. The gradient function became "CopySlices" instead of "MSELoss" when I copied the predicted output into the next row of the input data.
Method 2 "fill the input column with expected output"
In this method, I fill the blank input column with expected output (row-1) before training the model. Filling the input column with expected output of previous row is only done for training. For real prediction, I will feed the predicted output to the input. In this case, I am successful to train a overfitting model with MSELoss.
Moreover, I do not believe it is a right method as it uses the expected output as the input no matter how bad it predict. I strongly believed that it is not a right method.
Therefore, I want to ask whether it is possible to feed output to input in linear regression problem using artificial neural network.
I apologize for uploading no code here as I am not convenient to upload the full code here. It may be confidential.
It looks like you need an RNN (recurrent neural network). This tutorial is pretty helpful for understanding an RNN: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
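To make that concrete, here is a minimal PyTorch-style sketch (all sizes and names are made up): the model sees only the environmental conditions at each time step, and the dependence on previous durability values is carried by the recurrent hidden state, so you never have to copy predictions back into the input table yourself.

import torch
import torch.nn as nn

class DurabilityRNN(nn.Module):
    # Predicts durability at every time step from environmental conditions;
    # the LSTM hidden state carries information from previous time steps.
    def __init__(self, n_features=2, hidden_size=32):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):              # x: (batch, time, n_features)
        out, _ = self.rnn(x)           # out: (batch, time, hidden_size)
        return self.head(out)          # (batch, time, 1)

model = DurabilityRNN()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy data: e.g. temperature and humidity per time step, durability as target
x = torch.randn(8, 20, 2)
y = torch.randn(8, 20, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                        # gradients flow through the recurrence
optimizer.step()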

One-hot encoding to word2vec embedding

I'm trying to create vectors for categorical information that I have at hand. This information is intended to be used to aid a seq2seq network for NLP purposes (like summarization).
To get the idea, maybe an example would be of help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed sample text to an LSTM network token-by-token (like seq2seq networks). I'm using pre-trained GloVe embeddings as my original embeddings which are fed into the network, but also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use the one-hot embeddings (0-1 binary). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
The problem is that I have an extreme number of categories (around 20,000). After searching the Internet, it seemed to me that people suggest using word2vec instead of one-hots. But I can't grasp how word2vec would represent the categorical features in this case. Does anybody have a clearer idea?
Word2Vec can't be used for classification on its own; it is just the underlying algorithm.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents, each with a unique id assigned to it. After training, it has built relations between the documents similar to those word2vec builds between words. When you then give it an unknown document, it will tell you the top n most similar ones, and if your documents have previously defined tags you can assume the unknown document can be labeled the same way.
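A minimal gensim (4.x) sketch of that workflow, using the category word lists above as the "documents" (all parameters are arbitrary):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each category's word list is treated as a document with its own tag
categories = [
    ["animal", "shark", "sea", "ocean"],
    ["animal", "tiger", "jungle", "mountains"],
    ["animal", "eagle", "sky", "mountains"],
]
docs = [TaggedDocument(words, [i]) for i, words in enumerate(categories)]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# Infer a vector for unseen text and find the most similar known categories
vec = model.infer_vector(["shark", "attacks", "off", "florida"])
print(model.dv.most_similar([vec], topn=2))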

What is the difference between a denoising autoencoder and a conventional autoencoder?

To train the denoising autoencoder, I used x + n as the input data and x as the output data (x: original data, n: noise). After training was complete, I obtained noise-removed data through the denoising autoencoder (x_test + n_test -> x_test).
Then, as a test, I trained an autoencoder with the input and output data set to the same values, just like a conventional autoencoder (x -> x).
As a result, I obtained noise-removed data similar to the denoising autoencoder's in the test phase.
Why is noise removed even by the conventional autoencoder?
Please tell me the difference between these two autoencoders.
An autoencoder's purpose is to map high-dimensional data (e.g. images) to a compressed form (the hidden representation), and to reconstruct the original image from that hidden representation.
A denoising autoencoder, in addition to learning to compress data (like a plain autoencoder), learns to remove noise from images, which allows it to perform well even when the inputs are noisy. So denoising autoencoders are more robust than plain autoencoders, and they tend to learn more useful features from the data.
One of the uses of autoencoders was to find a good initialization for deep neural networks (in the late 2000s). However, with good initializations (e.g. Xavier) and activation functions (e.g. ReLU), that advantage has disappeared. Now they are used more in generative tasks (e.g. the variational autoencoder).
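The whole difference is in how the training pairs are constructed, which a short sketch makes clear (a generic dense Keras autoencoder; sizes and noise level are arbitrary):

import numpy as np
from tensorflow.keras import layers, models

def build_autoencoder(dim):
    # Encoder compresses to a hidden representation, decoder reconstructs the input
    inp = layers.Input(shape=(dim,))
    hidden = layers.Dense(32, activation='relu')(inp)
    out = layers.Dense(dim, activation='linear')(hidden)
    model = models.Model(inp, out)
    model.compile('adam', 'mse')
    return model

x = np.random.rand(1000, 100)            # stand-in for the clean data
n = 0.1 * np.random.randn(1000, 100)     # stand-in for the noise

# Conventional autoencoder: reconstruct x from x
ae = build_autoencoder(100)
ae.fit(x, x, epochs=10, verbose=0)

# Denoising autoencoder: reconstruct x from x + n
dae = build_autoencoder(100)
dae.fit(x + n, x, epochs=10, verbose=0)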

How to massage inputs into Keras framework?

I am new to Keras, and despite reading the documentation and the examples folder, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot of training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token vector and a label vector (of the same length), using token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different lengths you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and the most commonly used.
If you want to use future information with RNNs, you want a bidirectional model, where you concatenate two RNNs moving in opposite directions. An RNN uses a representation of all previous information when, e.g., predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that, e.g. 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem, but it's not tested, so it might not run. Instead of using integers to represent your data, I suggest one-hot encoding if possible, and then binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed, Bidirectional
from keras.preprocessing import sequence
# Make sure all sequences are of the same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# Add a feature dimension so each sample has shape (maxlen, 1)
X_train = X_train[:, :, None]
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN (a forward and a backward LSTM whose outputs are
# concatenated); return_sequences=True keeps one output per time step
bidirectional_lstm = Bidirectional(LSTM(128, return_sequences=True))(inputs)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)