how to convert/match a handwritten list of names? (HWR) - ocr

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of the names and then use the Levenshtein algorithm to compare each line with the list of names in my database; if I get a reasonably close match, I treat that name as the one written.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest. My guess is that the distance matches for 4 and 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not of the handwriting of the folks coming in, so if there are any changes I can make to the sheet that would help, please let me know.

Since your goal is to get names only, I would suggest restricting tessedit_char_whitelist to the characters you expect, here uppercase English letters, digits and the period ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789."), so that the output does not contain unexpected characters such as \\) [ .
Your initial approach of calculating the Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
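As a rough illustration of the matching step, here is a minimal Python sketch. The name list, the `best_match` helper and the `max_ratio` cutoff are assumptions made for the example, not values from the question; the cutoff in particular would need tuning on real sheets.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def best_match(ocr_line, known_names, max_ratio=0.4):
    """Return the closest known name, or None if nothing is close enough.

    max_ratio is a hypothetical cutoff: edit distance divided by the
    length of the longer string must not exceed it.
    """
    # Keep only letters and spaces, so list numbers and OCR noise are dropped.
    cleaned = "".join(ch for ch in ocr_line.upper()
                      if ch.isalpha() or ch == " ").strip()
    if not cleaned:
        return None
    best = min(known_names, key=lambda n: levenshtein(cleaned, n.upper()))
    dist = levenshtein(cleaned, best.upper())
    if dist / max(len(cleaned), len(best)) <= max_ratio:
        return best
    return None


# Hypothetical class list; the OCR lines are taken from the question.
names = ["Kim Taylor", "Liz Davis", "Amy Chan"]
for line in ["4. KIM TAYLOE", "5. LN] Davis", "2' wau ) [HQ"]:
    print(repr(line), "->", best_match(line, names))
```

Normalising the distance by string length keeps short garbage lines from matching short names by accident.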
I would also suggest preprocessing your image. For example, you can remove the horizontal lines and extract text ROIs (regions of interest) around them. In the best case you will be able to extract separate characters, but even if you cannot, you will get better results and will be able to tell the resulting names apart line by line.
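A minimal sketch of the line-removal idea in plain Python, assuming the scan has already been binarised into rows of 0 (paper) and 1 (ink). The function name `split_text_bands` and the `line_fill` threshold are illustrative only; in practice an image library would do the binarisation and cropping.

```python
def split_text_bands(image, line_fill=0.8):
    """Find vertical bands of handwriting between printed rules.

    image: list of rows, each row a list of 0 (paper) / 1 (ink).
    A row whose ink covers at least line_fill of the width is treated
    as a printed horizontal rule and ignored.
    Returns a list of (start_row, end_row) pairs, end exclusive.
    """
    width = len(image[0])
    bands, start = [], None
    for y, row in enumerate(image):
        ink = sum(row)
        is_rule = ink >= line_fill * width
        has_text = ink > 0 and not is_rule
        if has_text and start is None:
            start = y                      # a text band begins
        elif not has_text and start is not None:
            bands.append((start, y))       # band ended at a rule or blank row
            start = None
    if start is not None:
        bands.append((start, len(image)))
    return bands


# Tiny synthetic "sheet": two handwritten entries, each above a printed rule.
sheet = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # printed rule
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],   # printed rule
]
print(split_text_bands(sheet))
```

Each band can then be cropped and passed to the OCR engine on its own. Handwriting that crosses a rule would still be cut at the rule, so leaving generous spacing between the lines of the sheet should help.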
You should also try the other recommended steps for improving output quality, which you can find in the Tesseract OCR wiki (link).

Related

solr highlighting not working for long words

I am using Solr 6.2. I want to get the matched words by using highlight option.
When I search for the word "miss" I get the highlights, but I can't get any highlights for the word "missing".
For Example:
when I search with "miss" I can get the below results:
http://localhost:8983/solr/logbook1/select?debugQuery=on&defType=dismax&defType=edismax&hl.fl=*&hl=on&indent=on&q=miss&rows=5&wt=json
"highlighting":{
"3246a347-874a-44e2-bb3d-949a358f435d":{
"String1":["IN REFERENCE CABIN LOG PAGE 22838. TWO EXTENSION SEAT BELT <em>MISS</em> ING"]},
"46a340f8-949f-41fe-b2ee-c1936bfc6b4f":{
"String1":["IN REFERENCE CABIN LOG PAGE 22838. TWO EXTENSION SEAT BELT <em>MISS</em> ING"]},
"df6eef1c-971d-48f7-a93a-07874011ae5b":{
"String1":["ACCESS PANEL 343EB ON R/H HORIZONTAL STAB FOUND WITH SCREW <em>MISS</em> ING AND LOOSE"]},
"9a124f6d-f32b-4e24-beb2-11f7aa22894d":{
"String1":["AFT GALLEY # 4 COFFEE MAKER SHIELDS ON COMPT 419 - 420 ARE <em>MISS</em> ING."]}},
When I search with missing, I am getting no result as below:
http://localhost:8983/solr/logbook1/select?debugQuery=on&defType=dismax&defType=edismax&hl.fl=*&hl=on&indent=on&q=missing&rows=5&wt=json
"highlighting":{
"0d2963a7-adea-40ab-af0a-bb8fe069c4d9":{},
"9f23f4c0-6989-471d-8c61-4016a8e38813":{},
"c77b6be1-547c-43fe-94f0-ae5c0849eab4":{},
"f5792594-7fd2-42b5-92c4-03257c05adba":{},
"68d9251a-74d9-409e-84ec-a67a0eb94866":{}},
I have checked the fragsize options. Please advise whether there is anything else to configure.
1) I assume you already have a lowercase filter on your indexed field, since the search returns both upper- and lower-case results.
2) Have you added an extra space between "miss" and "ing"? If so, remove it and try again.
3) Check your stop-word dictionary to make sure you haven't accidentally added "missing" to it, since stop words are ignored in searches.
4) Try the Analyzer in Solr to see how it transforms your search term; the analyzer is available in the Solr console.
Have you set indexed and stored to true?
To me it looks like there are probably different token-handling settings at index time and at search time. Take a look at your schema.xml and try to use the same settings for indexing and searching.
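To compare index-time and query-time tokens outside the admin console, one option is Solr's field analysis request handler. The sketch below only builds the request URL. It assumes the `/analysis/field` handler is reachable on the core, and it reuses the core name `logbook1` and field name `String1` from the question; `field_analysis_url` is a helper invented for the example.

```python
from urllib.parse import urlencode


def field_analysis_url(base, core, field, indexed_text, query_text):
    """Build a URL for Solr's field analysis handler (assumed enabled).

    The response lists the tokens produced by each step of the field's
    index-time and query-time analyzer chains.
    """
    params = {
        "analysis.fieldname": field,         # field whose analyzers are used
        "analysis.fieldvalue": indexed_text,  # text run through the index chain
        "analysis.query": query_text,         # text run through the query chain
        "analysis.showmatch": "true",
        "wt": "json",
    }
    return f"{base}/{core}/analysis/field?{urlencode(params)}"


url = field_analysis_url(
    "http://localhost:8983/solr", "logbook1", "String1",
    "TWO EXTENSION SEAT BELT MISS ING", "missing",
)
print(url)
```

If the index-side tokens come out as "miss" and "ing" while the query side yields "missing", the stored text itself contains the extra space and no highlighter setting will make the two match.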

How to train tesseract and how to recognize multiple columns

My task is to convert a PDF containing images into a txt or csv file to store in a database. I am trying to use OCR on images like the one attached.
The results are as poor as the following:
`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
Of special importance is the phone number (944 355019), it seems close to correct but it still has wrong digits which makes the whole thing useless.
After much reading I still do not know how to train Tesseract. I am following these instructions, among others, which leave me with doubts such as:
It talks about getting a sample of the fonts to train on. I only have an image, so how do I get the exact font in order to generate the training data?
More often than not, the text ends up moved from where you would expect to find it. I just read that this is because Tesseract does OCR on a per-column basis (and then I read that it does not, so I am confused). So which one is it, and how can I make it write the output horizontally?

Extract aligned sections of FASTA to new file

I've already looked here and in other forums, but couldn't find the answer to my question. I want to design baits for a target enrichment sequencing approach and have the output of a MarkerMiner search for orthologous loci from four different genomes, with A. thaliana as the reference. These output alignments are separate FASTA files, one for each annotated A. thaliana gene, with the sequences from my datasets aligned to it.
I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.
However, now, I'm stumped.
My alignments are gappy, since the input data is mostly RNAseq whereas the Reference contains the introns as well. So it looks like this :
AT01G1234567
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC
MyData1
CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC
MyData2
CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC
To effectively design baits I now need to extract all the aligned parts from the file, so that I will end up with separate files; or separate alignments within the file; for the parts that are aligned between MyData and the Reference sequence with all the gappy parts excluded. There are about 1300 of these fasta files, so doing it manually is no option.
I have a bit of programming experience in python and with Linux command line tools, however I am completely lost on how to go about this. I would appreciate a hint, on what kind of tools are out there I could use or what kind of algorithm I need to come up with.
Thank you.
Cheers
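One possible starting point, sketched in Python under two assumptions: every record header starts with ">" as in standard FASTA, and all sequences in a file have the same aligned length. The helper names and the toy alignment are made up for illustration; the real files would be read from disk in a loop over the roughly 1300 alignments.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def aligned_blocks(reference, sample, min_len=1):
    """Return gap-free runs of `sample` as (start, end, ref_seg, sample_seg).

    start/end are 0-based alignment columns, end exclusive.
    Runs shorter than min_len are skipped.
    """
    blocks, start = [], None
    for i, ch in enumerate(sample):
        if ch != "-" and start is None:
            start = i
        elif ch == "-" and start is not None:
            if i - start >= min_len:
                blocks.append((start, i, reference[start:i], sample[start:i]))
            start = None
    if start is not None and len(sample) - start >= min_len:
        end = len(sample)
        blocks.append((start, end, reference[start:end], sample[start:end]))
    return blocks


# Toy alignment (hypothetical data, not taken from the question).
toy = """>REF
ATCGATCGATGC
>MyData1
--CGAT---TGC
"""
seqs = read_fasta(toy)
for block in aligned_blocks(seqs["REF"], seqs["MyData1"]):
    print(block)
```

Each block can then be written out as its own small FASTA file, and `min_len` can be raised to the bait length so that fragments too short to be useful are dropped.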

COBOL code to replace characters by html entities

I want to replace the characters '<' and '>' by &lt; and &gt; with COBOL. I was wondering about the INSPECT statement, but it looks like this statement can only be used to translate one character into another. My intention is to replace all HTML special characters by their HTML entities.
Can anyone figure out some way to do it? Maybe looping over the string and testing each char is the only way?
GnuCOBOL or IBM COBOL examples are welcome.
My best attempt so far is something like this: (http://ideone.com/MKiAc6)
       IDENTIFICATION DIVISION.
       PROGRAM-ID. HTMLSECURE.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       77 INPTXT PIC X(50).
       77 OUTTXT PIC X(500).
       77 I PIC 9(4) COMP VALUE 1.
       77 P PIC 9(4) COMP VALUE 1.
       PROCEDURE DIVISION.
           *> Clear the output buffer and reset the write position.
           MOVE SPACES TO OUTTXT
           MOVE 1 TO P
           MOVE '<SCRIPT> TEST TEST </SCRIPT>' TO INPTXT
           *> Visit every character, including the last one.
           PERFORM VARYING I FROM 1 BY 1
                   UNTIL I > LENGTH OF INPTXT
               EVALUATE INPTXT(I:1)
                   WHEN '<'
                       MOVE "&lt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN '>'
                       MOVE "&gt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN OTHER
                       MOVE INPTXT(I:1) TO OUTTXT(P:1)
                       ADD 1 TO P
               END-EVALUATE
           END-PERFORM
           DISPLAY OUTTXT
           STOP RUN
           .
GnuCOBOL (yes, another name branding change) has an intrinsic function extension, FUNCTION SUBSTITUTE.
move function substitute(inptxt, ">", "&gt;", "<", "&lt;") to where-ever-including-inptxt
It takes a subject string followed by pairs of patterns and replacements. (These are not regex patterns, just straight text matching.) See http://opencobol.add1tocobol.com/gnucobol/#function-substitute for some more details. The patterns and replacements can all be different lengths.
As intrinsic functions return anonymous COBOL fields, the result of the function can be used to overwrite the subject field, without worry of sliding overlap or other "change while reading" problems.
COBOL is a language of fixed-length fields. So no, INSPECT is not going to be able to do what you want.
If you need this for an IBM Mainframe, your SORT product (assuming sufficiently up-to-date) can do this using FINDREP.
If you look at the XML processing possibilities in Enterprise COBOL, you will see that they do exactly what you want (I'd guess). GnuCOBOL can also readily interface with lots of other things. If you are writing GnuCOBOL for running on a non-Mainframe, I'd suggest you ask on the GnuCOBOL part of SourceForge.
Otherwise, yes, it would come down to looping through the data. Once you clarify what you want a bit more, you may get examples of that if you still need them.

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open source OCR engine (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a master guest list (written) and I want to scan this list in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR engine can attempt to read every letter and that I could then just cross-reference the results with the 100 names, but this takes too long. If the OCR focused on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as an individual symbol.
So when you train or calibrate your OCR, train it with images of whole words, the same way you would for an individual symbol.
(If this feature is not directly available in your OCR, first map the images of whole words to unique symbols and later transform each symbol into the final word string.)
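Short of retraining the engine, the "Jach" to "Jack" inference from the question can also be approximated as a post-processing step on ordinary OCR output. This is a different technique from the whole-word training described above: a minimal sketch using Python's standard `difflib`, with a made-up guest list and an arbitrary `cutoff`.

```python
import difflib


def snap_to_dictionary(word, dictionary, cutoff=0.6):
    """Return the dictionary entry most similar to `word`, or None.

    cutoff is difflib's similarity threshold in [0, 1]; 0.6 is the
    library default and only a starting point.
    """
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


guest_list = ["Jack", "Jane", "John"]   # hypothetical database of names
print(snap_to_dictionary("Jach", guest_list))
print(snap_to_dictionary("Xyzzy", guest_list))
```

With only about 100 names the lookup itself is negligible, so the time budget is dominated by the OCR pass rather than by the dictionary check.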