How to train tesseract and how to recognize multiple columns - ocr

I have the task of converting a PDF with images into a txt or csv file to store in a database. I am trying to run OCR on images like the one attached.
The results are as poor as the following:
`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
Of special importance is the phone number (944 355019); it seems close to correct, but it still has wrong digits, which makes the whole thing useless.
After much reading I still do not know how to train Tesseract. I am following these instructions, among others, which leave me with doubts such as:
They talk about getting a sample of the fonts to train on. I only have an image, so how do I get the exact font to somehow generate the training data?
More often than not, the text ends up moved from where you would expect to find it. I just read that this is because Tesseract does OCR on a per-column basis (and then I read that it does not, so I am confused). Which one is it, and how do I make it write the text out horizontally?
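For the column question specifically, Tesseract's layout analysis can be steered with its page segmentation mode (-psm in Tesseract 3.x, --psm in 4+). A minimal sketch using the pytesseract wrapper (the wrapper and the file name are assumptions) that treats the page as one uniform block of text, which often keeps multi-column pages in reading order:

```python
# Sketch: controlling Tesseract's layout analysis via page segmentation mode.
# Assumes pytesseract and Pillow are installed; the file name is hypothetical.
from PIL import Image
import pytesseract

img = Image.open("directory_page.png")

# --psm 6: assume a single uniform block of text (no column detection).
# --psm 4: assume a single column of text of variable sizes.
# (Use a single dash, -psm, with Tesseract 3.x.)
text = pytesseract.image_to_string(img, config="--psm 6")
print(text)
```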

Related

Google OCR not detecting tables with their structure

I followed a tutorial and tried using Google OCR to convert an image into text. I have a table of the form "Text value value value", but Google OCR is reading it as
Text
Value
Value
Value
Is there a way to read it as it is? Without losing the text to value relation?
I was facing the same problem. After quite a bit of research I found out that there is something called a table OCR. The Vision API's TEXT_DETECTION and DOCUMENT_TEXT_DETECTION are not table OCRs, meaning they are not suitable for reproducing tabular data; you would need a lot of OpenCV image preprocessing for that. Instead, you can make use of one of these:
FREE Table OCR API
GitHub open-source table OCRs developed on Tesseract
They make sure that your text-value relation isn't broken.
Their OCR'ed output of nutrition-facts data would be:
Nutrition Facts
blah blah boo
Total Fat 0g 0% (on the same line)
Sodium 0mg 0% (on the same line)
...
Hence you can keep "\t" as the delimiter and reproduce the table.
Hope my answer is helpful. :)
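As a quick illustration of that last step, here is a minimal Python sketch (the file names are made up) that splits such tab-delimited OCR output back into rows and writes a CSV:

```python
# Sketch: rebuild rows from table-OCR output that keeps "\t" as the
# cell delimiter, one table row per line. File names are hypothetical.
import csv

with open("ocr_output.txt", encoding="utf-8") as src, \
        open("table.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for line in src:
        cells = line.rstrip("\n").split("\t")
        writer.writerow(cells)   # e.g. ["Total Fat", "0g", "0%"]
```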

how to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of the names, and then use the Levenshtein algorithm to compare each line with the list of names in my database; if I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest. My guess is that the distance matches for 4 & 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of the folks coming in, so if there are any changes I can make to the sheet to help, please let me know.
Since your goal is to get names only, I would suggest reducing tessedit_char_whitelist to the characters you expect ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you do not get unexpected characters like \\) [ in the output.
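For example, with the pytesseract wrapper (an assumption; the same whitelist can also be passed on the command line with -c) this looks like:

```python
# Sketch: restricting Tesseract's character set with a whitelist.
# Assumes pytesseract and Pillow; works with the legacy (3.x) engine
# the asker is using. The image name comes from the question.
from PIL import Image
import pytesseract

whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789."
text = pytesseract.image_to_string(
    Image.open("simple.png"),
    config=f"-psm 4 -c tessedit_char_whitelist={whitelist}",
)
print(text)
```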
Your initial approach of calculating Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
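As a sketch of the matching step, Python's standard difflib gives you close matching without a dedicated Levenshtein package (the roster and OCR lines below are just examples):

```python
# Sketch: match each OCR'ed line against a known roster using fuzzy
# string matching. Uses stdlib difflib; names and lines are made up.
import difflib

roster = ["ANNE CLARK", "ELI COOK", "WALLY LANE", "KIM TAYLOR", "LINDA DAVIS"]

def best_match(ocr_line, cutoff=0.6):
    """Return the closest roster name, or None if nothing is close enough."""
    hits = difflib.get_close_matches(ocr_line.upper(), roster, n=1, cutoff=cutoff)
    return hits[0] if hits else None

for line in ["KIM TAYLOE", "LN] Davis", "Mzflé! Ha K"]:
    print(line, "->", best_match(line))
```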
I would also suggest running some preprocessing on your image, as in the sketch below. For example, you can remove the horizontal lines and extract text ROIs around them. In the best case you will be able to extract separated characters, but even if you don't, you will get better results and will be able to distinguish the resulting names line by line.
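A sketch of the line-removal idea with OpenCV morphology (the kernel width and file names are assumptions to tune per image):

```python
# Sketch: detect and erase the horizontal ruling lines of a sign-in sheet
# with OpenCV morphology, so Tesseract sees only the handwriting.
# Kernel width and file names are assumptions to tune per image.
import cv2

img = cv2.imread("simple.png", cv2.IMREAD_GRAYSCALE)
# Binarize: text and lines become white on black.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# A wide, 1-pixel-tall kernel keeps only long horizontal runs (the ruling).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Subtract the detected lines, then invert back to black-on-white.
cleaned = cv2.bitwise_and(binary, cv2.bitwise_not(lines))
cv2.imwrite("simple_nolines.png", cv2.bitwise_not(cleaned))
```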
You should also try the other recommended output-quality-improvement steps, which you can find in the Tesseract OCR wiki (link).

Extract aligned sections of FASTA to new file

I've already looked here and in other forums but couldn't find the answer to my question. I want to design baits for a target enrichment sequencing approach, and I have the output of a MarkerMiner search for orthologous loci from four different genomes, with A. thaliana as a reference. These output alignments are separate FASTA files for each annotated A. thaliana gene, with the sequences from my datasets aligned to it.
I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.
However, now, I'm stumped.
My alignments are gappy, since the input data is mostly RNA-seq whereas the reference contains the introns as well. So it looks like this:
>AT01G1234567
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC
>MyData1
CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC
>MyData2
CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC
To effectively design baits, I now need to extract all the aligned parts from each file, so that I end up with separate files (or separate alignments within a file) for the parts that are aligned between MyData and the reference sequence, with all the gappy parts excluded. There are about 1300 of these FASTA files, so doing it manually is not an option.
I have a bit of programming experience in Python and with Linux command-line tools, but I am completely lost on how to go about this. I would appreciate a hint on what kind of tools are out there that I could use, or what kind of algorithm I need to come up with.
Thank you.
Cheers
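One possible approach, sketched in plain Python: read the aligned FASTA, find the runs of columns where no sequence has a gap, and write each run out as its own alignment block. File names, equal-length aligned sequences, and '-' as the gap character are assumptions:

```python
# Sketch: split an alignment into the blocks where no sequence has a gap.
# Pure Python; assumes equal-length aligned sequences and '-' as the gap.
def read_fasta(path):
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}

def ungapped_blocks(seqs):
    """Yield (start, end) runs of columns where every sequence has a base."""
    length = len(next(iter(seqs.values())))
    start = None
    for col in range(length):
        if all(s[col] != "-" for s in seqs.values()):
            if start is None:
                start = col
        elif start is not None:
            yield start, col
            start = None
    if start is not None:
        yield start, length

seqs = read_fasta("AT01G1234567.fasta")
for i, (start, end) in enumerate(ungapped_blocks(seqs), 1):
    with open(f"AT01G1234567_block{i}.fasta", "w") as out:
        for name, seq in seqs.items():
            out.write(f">{name}_{start}_{end}\n{seq[start:end]}\n")
```

Wrapped in a loop over the ~1300 files (e.g. with glob), this would emit one FASTA per contiguous ungapped block.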

Generating truth tables for basic logic circuits

Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values – Don 12 mins ago
How many inputs and how many outputs in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
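Sticking with the file format from the question, a brute-force simulation is small enough to sketch; here it is in Python (the asker mentioned C#, but the structure carries over one-to-one). It assumes gates are listed in evaluation order and primary inputs are single letters:

```python
# Sketch: brute-force truth table for a netlist of lines like
# "1 XOR1 XOR A B" (<number> <name> <type> <inputs...>).
# Assumes gates appear in evaluation order; inputs are single letters.
from itertools import product

GATES = {
    "AND":  lambda vals: all(vals),
    "OR":   lambda vals: any(vals),
    "XOR":  lambda vals: sum(vals) % 2 == 1,
    "NOT":  lambda vals: not vals[0],
    "NAND": lambda vals: not all(vals),
    "NOR":  lambda vals: not any(vals),
}

netlist = [
    "1 XOR1 XOR A B",
    "2 SUM XOR 1 C",
]

gates = [line.split() for line in netlist]
# Primary inputs: alphabetic tokens that are not gate numbers.
primary = sorted({tok for g in gates for tok in g[3:] if tok.isalpha()})

print(" ".join(primary), "|", " ".join(g[1] for g in gates))
for bits in product([0, 1], repeat=len(primary)):
    signals = dict(zip(primary, (bool(b) for b in bits)))
    for num, name, gtype, *ins in gates:
        out = GATES[gtype]([signals[i] for i in ins])
        signals[num] = signals[name] = out   # reachable by number or name
    row = [str(int(signals[p])) for p in primary]
    outs = [str(int(signals[g[1]])) for g in gates]
    print(" ".join(row), "|", " ".join(outs))
```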

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open-source OCR (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a written master guest list and I want to scan this list in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter, and that I could then cross-reference the results with the 100 names, but this takes too long. If the OCR focused on just those 100 words and nothing else, it should be able to do all of this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, sort of as an individual symbol.
So when you train / calibrate your OCR, train it with images of whole words, similar to how you would for an individual symbol.
(If this feature is not directly available in your OCR, then first map images of whole words to a unique symbol, and later transform that symbol into the final word string.)
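If the OCR engine doesn't expose word-level symbols directly, the mapping idea above can be approximated with plain template matching; here is a rough OpenCV sketch (the per-name template images and the score threshold are assumptions you would prepare and tune yourself):

```python
# Sketch: treat each known name as one "symbol" by sliding a rendered
# template image of that name over the scanned page. Assumes you have
# prepared one template image per name at roughly the right scale.
import cv2

page = cv2.imread("guest_list.png", cv2.IMREAD_GRAYSCALE)
names = ["JACK", "KIM", "LINDA"]          # the small, known dictionary

for name in names:
    template = cv2.imread(f"templates/{name}.png", cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, location = cv2.minMaxLoc(result)
    if score > 0.7:                        # threshold is a guess to tune
        print(f"{name} found at {location} (score {score:.2f})")
```

With only ~100 names, a pass like this runs in well under a second on a single page, which is the split-second budget the question asks about.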