Can an OCR run in a split second if it is highly targeted? (Small dictionary)

I am looking for an open source OCR engine (maybe Tesseract) that matches words against a dictionary. For example, I know that this OCR will only be used to search for certain names. Imagine I have a written master guest list and I want to scan it in under a second and check the result against a database of names.
I understand that a traditional OCR can attempt to read every letter, and that I could then cross-reference the results with the 100 names, but this takes too long. If the OCR focused on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?

It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as an individual symbol.
So when you train or calibrate your OCR, train it with images of whole words, the same way you would for an individual symbol.
(If this feature is not directly available in your OCR, first map images of whole words to a unique symbol each, and later transform that symbol into the final word string.)
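The "Jach" to "Jack" inference from the question can also be done as a post-processing step on ordinary OCR output. This is a minimal sketch rather than the whole-word training described above, and it does not make the OCR itself faster. It uses Python's standard difflib; the name list, the function name snap_to_known_name and the 0.6 cutoff are assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical guest list; in practice this would come from the database.
KNOWN_NAMES = ["Jack", "Jane", "Maria", "Omar"]


def snap_to_known_name(ocr_word, names=KNOWN_NAMES, cutoff=0.6):
    """Return the known name most similar to ocr_word, or None if none is close."""
    # Compare case-insensitively, but return the name as stored.
    by_lower = {name.lower(): name for name in names}
    matches = get_close_matches(ocr_word.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


print(snap_to_known_name("Jach"))
print(snap_to_known_name("Xqzv"))
```

With only about 100 names, this lookup is cheap compared with the recognition step, so the dictionary check itself is unlikely to be the bottleneck.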

Related

How do I fill a list with all the world's phone prefixes in Dart on Flutter?

I'd like to implement an app with Dart on Flutter. This is my first time using the language, and my first time running into this problem.
My app has to work with a mobile phone number. I would like to reject phone numbers that lack a prefix, as well as numbers with more digits than expected. For example, in Italy there are at most 10 digits after +39 (0039). I thought of separating the two parts to make it easier to check the lengths (one field where you select the country and another where you enter the number).
Do you know of a JSON file that contains exactly the prefix of each country, the length of the telephone number (excluding the prefix), the name, and optionally the flag and the country abbreviation (Italy, green-white-red, IT)?
Searching the web a little, I saw that Flutter seems to provide something for this already with .demoTextFieldEnterITPhoneNumber, through GalleryLocalizations, but I didn't quite understand whether it checks a particular regular expression for each country or not. Could I copy and paste a number, for example? Will the country be recognized automatically?
In the end I think that such a thorough check is not possible, so I would settle for two fields: a list of countries that fills in the prefix automatically according to the selection, and a field where the user types the number. If the number is copied and pasted, I would check whether the string also contains a +prefix.
Thank you very much. I really need this, since my app will mainly revolve around a correct value for this field. :)
Try the international_phone_input or country_code_picker Flutter packages. They are quite easy to use.
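If you would rather do the check described in the question by hand, the logic is small: normalize the pasted string, look for a known +prefix, and compare the remaining digit count with the maximum for that country. The sketch below is in Python only to show the logic, which should carry over to Dart. The table holds a single entry taken from the question (Italy, +39, at most 10 digits); the table layout, the function names and the sample number are assumptions made for illustration.

```python
# Hypothetical lookup table; only the Italian entry comes from the question.
COUNTRIES = {
    "IT": {"name": "Italy", "prefix": "+39", "max_digits": 10},
}


def split_pasted_number(raw, countries=COUNTRIES):
    """Return (country_code, national_digits); country_code is None if no known prefix."""
    # Keep only digits and '+', so spaces and dashes in pasted text are ignored.
    compact = "".join(ch for ch in raw if ch.isdigit() or ch == "+")
    # Treat a leading 00 as an international '+', e.g. 0039 -> +39.
    if compact.startswith("00"):
        compact = "+" + compact[2:]
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    for code, info in sorted(countries.items(), key=lambda kv: -len(kv[1]["prefix"])):
        if compact.startswith(info["prefix"]):
            return code, compact[len(info["prefix"]):]
    return None, compact.lstrip("+")


def is_valid_length(code, digits, countries=COUNTRIES):
    """Check that the national part is numeric and not longer than allowed."""
    return digits.isdigit() and 0 < len(digits) <= countries[code]["max_digits"]


code, digits = split_pasted_number("+39 347 123 4567")
print(code, digits, is_valid_length(code, digits))
```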

How to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of names, and then use the Levenshtein algorithm to compare each line with the list of names in my database. If I get a reasonably close match, I accept that name.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest. My guess is that the distance matches for 4 and 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not of the handwriting of the people coming in, so if there are any changes to the sheet that would help, please let me know.
Since your goal is to get names only, I would suggest restricting tessedit_char_whitelist to the characters you expect (for example "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that the output does not contain unexpected characters such as \\) [ . Add the lowercase letters as well if names may be written in mixed case.
Your initial approach of calculating the Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract the text regions of interest (ROIs) around them. In the best case you will be able to extract separate characters, but even if you can't, you will get better results and will be able to tell the result names apart line by line.
You should also try the other recommended steps for improving output quality, which you can find in the Tesseract OCR wiki.
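The Levenshtein matching step discussed above can be sketched as follows. This is a minimal version that assumes the OCR output has already been split into lines. The roster, the function names and the 0.4 tolerance are assumptions made for illustration ("Lori Davis" and "Adam Chavez" are invented names, not read from the sample sheet).

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(ocr_line, roster, max_ratio=0.4):
    """Return the roster name closest to ocr_line, or None if nothing is close."""
    # Drop list numbers and punctuation noise before comparing.
    cleaned = "".join(ch for ch in ocr_line if ch.isalpha() or ch.isspace())
    cleaned = cleaned.strip().lower()
    if not cleaned:
        return None
    distance, name = min((levenshtein(cleaned, n.lower()), n) for n in roster)
    # Accept only if the distance is small relative to the longer string.
    if distance <= max_ratio * max(len(cleaned), len(name)):
        return name
    return None


roster = ["Kim Taylor", "Lori Davis", "Adam Chavez"]  # hypothetical class list
for line in ["4. KIM TAYLOE", "5. LN] Davis", "wau HQ"]:
    print(repr(line), "->", best_match(line, roster))
```

Scaling the threshold by the string length avoids accepting a short garbage line just because every name is within a few edits of it.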

How to train tesseract and how to recognize multiple columns

My task is to convert a PDF containing images into a txt or csv file so it can be stored in a database. I am trying to use OCR on images like the one attached.
The results are as poor as the following:
`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
The phone number (944 355019) is especially important. Here it looks close to correct, but it still has wrong digits, which makes the whole thing useless.
After much reading I still do not know how to train Tesseract. I am following these instructions, among others, which leave me with doubts such as:
They talk about getting a sample of the fonts to train on. I only have an image, so how do I get the exact font in order to generate the training data?
More often than not, the text ends up somewhere other than where you would expect to find it. I read that this is because Tesseract does OCR on a per-column basis (and then I read that it does not, so I am confused). Which one is it, and how do I make it write the text out row by row?

Association Rules for Text File

I am a student using Rapidminer, and I am doing a project using Yummly's What's Cooking dataset (https://www.kaggle.com/c/whats-cooking/data). The dataset has 20 different cuisine types (e.g. Italian, Chinese, Indian, etc.).
Our goal is to develop a data mining model that identifies the cuisine type of future dishes by analyzing the ingredient list of the dish. We are using association rules to do so. However, I keep getting "no rules found" and have no idea why. I think it has something to do with my attributes being formatted as text and with not using the Nominal to Binominal operator, but I am not sure how to fix it.
Currently my process looks like....
data -> select attributes -> FP growth -> create association rules
Can you help?
According to the documentation for the FP-Growth operator, all the attributes in the example set need to be binominal.
I'll admit that I haven't looked at the data directly because I didn't want to register an account on Kaggle, so I'm not sure exactly how it's formatted. You would probably want to set the cuisine type as the label and have each remaining attribute represent one ingredient that appears in one or more of the recipes. Each dish would have a 1 in a column if that ingredient is used and a 0 if it is not. (Depending on the original format of the data, since you mentioned it's text, you may want to check out the text processing extension, which can create an example set like the one I just described.) Then, if you convert the 0s and 1s to binominal, you should be able to use FP-Growth.
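To make the target layout concrete, here is a small Python sketch of the table described above: one row per dish, the cuisine as the label, and one true/false column per ingredient. It is only an illustration of the shape of the data, not a RapidMiner process. The record format (a "cuisine" string plus an "ingredients" list) and the two sample dishes are assumptions, since the answer above was written without looking at the actual file.

```python
# Hypothetical records; the real dataset's format is assumed, not verified.
recipes = [
    {"cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
    {"cuisine": "indian", "ingredients": ["cumin", "tomato", "rice"]},
]


def to_binary_table(recipes):
    """Return (ingredient columns, rows) with one boolean column per ingredient."""
    # Every ingredient seen in any recipe becomes a column.
    vocabulary = sorted({ing for r in recipes for ing in r["ingredients"]})
    rows = []
    for r in recipes:
        present = set(r["ingredients"])
        row = {"cuisine": r["cuisine"]}  # the label column
        row.update({ing: ing in present for ing in vocabulary})
        rows.append(row)
    return vocabulary, rows


vocabulary, rows = to_binary_table(recipes)
print(vocabulary)
for row in rows:
    print(row)
```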

Named Entity Recognition using NLTK. Relevance of extracted keywords

I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?
If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list
sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]
Then, the ne classifier would be called like this
nltk.ne_chunk(tagged_sentence)
It returns a Tree. The classified words will appear as Tree nodes inside the main structure.
Each of those nodes is labelled with the entity type, such as PERSON, ORGANIZATION or GPE.
To find the most relevant terms, you have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, plain frequency could be enough.
Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once you have loaded it and have a Text object (or any list of tokens), call the following. Note that in NLTK 3 keys() is no longer ordered by frequency, so most_common() is used instead:
relevant_terms_sorted_by_freq = [word for word, count in nltk.FreqDist(corpus).most_common()]
Finally, you could filter out all words in relevant_terms_sorted_by_freq that don't belong to a NE list of words.
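The filtering step can be sketched with the standard library alone. This is a minimal example that uses collections.Counter in place of FreqDist (FreqDist in NLTK 3 is built on Counter) and assumes you have already collected the named-entity words into a set. The token list, the entity set and the function name are made up for illustration.

```python
from collections import Counter


def rank_entities_by_frequency(tokens, entity_words):
    """Return (word, count) pairs for entity words, most frequent first."""
    counts = Counter(tokens)
    # most_common() yields words in descending order of frequency.
    return [(word, n) for word, n in counts.most_common() if word in entity_words]


# Hypothetical tokenized text and hypothetical NE words taken from ne_chunk output.
tokens = ["The", "U.N.", "met", "in", "Paris", ".", "The", "U.N.", "voted", "."]
entity_words = {"U.N.", "Paris"}

print(rank_entities_by_frequency(tokens, entity_words))
```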
NLTK also offers a free online version of a complete book, which I find a good place to start.