Labeling strategy for tesseract-ocr (OCR)

I'm training an LSTM model using Tesseract OCR. When labeling the data in jTessBoxEditor, some words are not clear (e.g. 'blood type', 'eyes color'). Can I just ignore them and only label the clear words like 'BROWN' and 'NONE'? And will that impact the training result?

Related

Transfer learning from image classifiers (e.g. AlexNet) to binary classification of generated data charts

I have written code that transforms time series data (medical EEGs) into a scatter chart. I have labels available for a training set noting the presence or absence of 'seizures' in the data. I can see the seizures clearly (as a human) in the scatter plots. I thought it would be straightforward to adapt a pre-trained image classifier (AlexNet) to do the (seizure, no_seizure) binary classification. I have a training set of 500 chart images. The model is not converging.
I replaced the final AlexNet layer before training:
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)
Do you have any advice for helping me with this challenge? Intuitively I thought classification of scatter charts would be easier than photograph image classification.
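For reference, a fuller version of that head replacement might look like the sketch below (assuming torchvision's pretrained AlexNet; freezing the convolutional features is a common choice with only 500 images, not something stated in the question):

import torch
import torchvision.models as models

# Load AlexNet pretrained on ImageNet
model = models.alexnet(pretrained=True)

# Optionally freeze the convolutional feature extractor (assumption: useful with a small training set)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a 2-class head (seizure, no_seizure)
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)

# Pass only the trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)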

Giving pretokenized input to sentiment classifier

I am using the sentiment classifier in Python, following this demo.
Is it possible to give pre-tokenized text as input to the predictor? I would like to be able to use my own custom tokenizer.
There are two AllenNLP sentiment analysis models, and they are both tightly tied to their tokenizations. The GloVe-based one needs tokens that correspond to the pre-trained GloVe embeddings, and similarly the RoBERTa one needs tokens (word pieces) that correspond to its pretraining. It does not really make sense to use these models with a different tokenizer.
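For context, the usual way to call one of these predictors is to hand it the raw sentence and let the bundled tokenizer do the splitting (a sketch; the archive path is a placeholder for whichever sentiment model you load):

from allennlp.predictors.predictor import Predictor

# Placeholder path: substitute the archive of the sentiment model you are using
predictor = Predictor.from_path("path/to/sentiment-model.tar.gz")

# The predictor tokenizes the raw string itself, with the tokenizer the model was trained with
result = predictor.predict(sentence="This movie was surprisingly good.")
print(result["label"], result["probs"])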

One-hot encoding to word2vec embedding

I'm trying to create vectors for categorical information that I have at hand. This information is intended to aid a seq2seq network for NLP purposes (like summarization).
To get the idea, an example might help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed the sample text to an LSTM network token by token (as in seq2seq networks). I'm using pre-trained GloVe embeddings as my original embeddings fed into the network, but I also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use the one-hot embeddings (0-1 binary). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
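In PyTorch terms, that concatenation might look roughly like this (names and sizes are illustrative):

import torch

glove_dim, num_categories = 300, 20000

glove_shark = torch.randn(glove_dim)            # stand-in for the pre-trained GloVe vector of "shark"
category_vector = torch.zeros(num_categories)   # one slot per category
category_vector[0] = 1.0                        # "shark" belongs to category 1

rnn_input = torch.cat([glove_shark, category_vector])   # shape: (300 + 20000,)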
The problem is that I have an extremely large number of categories (around 20,000). After searching the Internet, it seems that people suggest using word2vec instead of one-hot vectors. But I can't grasp the underlying idea of how word2vec can represent the categorical features in this case. Does anybody have a clearer idea?
Word2Vec can't be used for classification. It is just the underlying algorithm.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents, each with a unique id assigned to it. After training, it builds relations between the documents similar to those word2vec builds between words. Then, when you give it an unknown document, it will tell you the top n most similar documents, and if your documents have previously defined tags you can assume that the unknown document can be labeled the same way.
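As a rough illustration of that idea, here is a minimal gensim sketch in which each category's attribute list is treated as a tiny tagged document (names and sizes are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each category becomes a small "document" of its attributes, tagged with a category id
categories = [
    ["animal", "shark", "sea", "ocean"],
    ["animal", "tiger", "jungle", "mountains"],
    ["animal", "eagle", "sky", "mountains"],
]
docs = [TaggedDocument(words=attrs, tags=[f"cat_{i}"]) for i, attrs in enumerate(categories)]

model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen attribute list and find the most similar known categories
vec = model.infer_vector(["animal", "whale", "ocean"])
print(model.dv.most_similar([vec], topn=2))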

Best way to handle OOV words when using pretrained embeddings in PyTorch

I am using word2vec pretrained embeddings in PyTorch (following the code here). However, it does not seem to handle unseen words. Is there a good way to solve this?
FastText builds character n-gram vectors as part of model training. When it encounters an OOV word, it sums the character n-gram vectors of the word to produce a vector for the word. You can find more detail here.
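A small gensim sketch of that behavior (the corpus here is a toy stand-in; in practice you would train on your own text or load pretrained FastText vectors):

from gensim.models import FastText

# Toy corpus; with real data you would use your actual training sentences
sentences = [
    ["the", "patient", "was", "discharged"],
    ["blood", "pressure", "was", "normal"],
]

# min_n/max_n control the character n-gram sizes used to build subword vectors
model = FastText(sentences=sentences, vector_size=100, min_n=3, max_n=5, min_count=1, epochs=10)

# "dischargeable" never appears in the corpus, but FastText composes a vector
# for it from its character n-grams instead of failing
oov_vector = model.wv["dischargeable"]
print(oov_vector.shape)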

Where should I put the previously predicted sequence in an LSTM for optical character recognition systems

I am trying to build an optical character recognition system that can recognize handwritten sentences using the LSTM cell.
Now, what I have understood from the literature is that you need to give two inputs to the LSTM cell: one is the image that you are trying to recognize, and the second is the sequence of words it has already predicted. So, for example, if I had an image that read "I love machine learning", I would create the following pairs of input:
Image + startseq
Image + startseq + I
Image + startseq + I + love
So for each input you want the LSTM to predict the next word i.e. I, love, machine for the above sequences.
The problem I'm having is that I can't figure out how to input the image AND the previous sequence to the LSTM cell. Do I divide my image (a 2-D matrix) into row/column vectors and send them to the LSTM one at a time, and after I'm done with that, send in the previous sequence of words? But this way I'll have quite long input sequences, which might lead to long convergence times.
I know image captioning tasks vectorize input images using pretrained neural nets but can that be done for optical character recognition systems, i.e. would that cause accuracy issues?
No, you don't have to feed recognized words back into the LSTM. You only feed an input (feature) sequence, and the LSTM learns to propagate relevant information through this sequence.
You should think in terms of an input sequence and an output sequence when talking about Recurrent Neural Networks (RNNs).
The inputs to an RNN at time-step t are:
the state of the memory cell at t-1
the input element at t
An LSTM has a more advanced internal structure than a vanilla RNN, which allows more robust training. But from a user perspective it works just like a vanilla RNN: you input a sequence and the LSTM computes an output sequence for you.
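In PyTorch terms, the sequence-in, sequence-out view looks like this (a toy illustration, not the recognizer itself):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(1, 100, 64)      # one feature sequence of length 100
out, (h, c) = lstm(x)            # out: (1, 100, 128) -- one output element per input step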
When doing handwriting recognition, you usually extract a feature sequence from the input image (e.g. by using convolutional layers).
Then, you feed this feature sequence into LSTM layers.
You map the output sequence to a character-probability matrix which is then decoded into the final text by the CTC layer.
Here is a short tutorial on how to build a handwriting recognition system; it should give you an idea of which data (see "Data": "CNN output" and "RNN output") flows into the LSTM and which data flows out of it:
https://towardsdatascience.com/2326a3487cd5
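To make the data flow concrete, here is a minimal PyTorch sketch of such a CNN -> LSTM -> CTC pipeline (layer sizes and shapes are illustrative, not the tutorial's exact architecture):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    # Rough CNN -> LSTM -> CTC setup for line-level handwriting recognition
    def __init__(self, num_chars):
        super().__init__()
        # CNN: collapse the image height into one feature vector per horizontal position
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=256,
                           num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_chars + 1)   # +1 for the CTC blank symbol

    def forward(self, x):                        # x: (batch, 1, 32, width)
        f = self.cnn(x)                          # (batch, 64, 8, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (batch, width/4, 64*8)
        out, _ = self.rnn(f)                     # (batch, width/4, 512)
        return self.fc(out)                      # per-time-step character logits

model = CRNN(num_chars=80)
images = torch.randn(4, 1, 32, 128)                          # batch of 32x128 text-line images
logits = model(images).log_softmax(2).permute(1, 0, 2)       # CTC expects (time, batch, classes)
targets = torch.randint(1, 81, (4, 10))                      # dummy label sequences (0 is blank)
ctc = nn.CTCLoss(blank=0)
loss = ctc(logits, targets,
           torch.full((4,), logits.size(0), dtype=torch.long),   # input (time-step) lengths
           torch.full((4,), 10, dtype=torch.long))               # target lengths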