Tesseract receipt scanning advice needed

I have struggled off and on with Tesseract for various OCR projects, and today I found a use case I thought would be a slam dunk for it, but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see if anyone has advice on how to solve this task.
My wife came to me this morning and asked if there was any way she could easily scan her receipts from Wal-Mart and, over time, build a history of spending by category and by specific item, so that we could do some trending and easily deep-dive into where the spending is going. At first this felt like a very tall order, but after doing some digging I found a few things that make me feel it is within reach:
Wal-Mart receipts are, in general, very well structured and easy to read. They even include the UPC for every item (potential for lookups against a UPC database?), appear to classify food items with an F or I (not sure what the difference is), and have a tax code column as well that may prove useful should I learn the secrets of what the codes mean.
I further discovered that there is some kind of Wal-Mart item lookup API that I may be able to get access to, which would prove useful for the UPC lookups.
They have a smartphone app that lets you scan a QR code printed on every receipt. The app looks up a "TC" code from the receipt and pulls down the entire itemized receipt from their servers. It shows an excellent graphical representation of the receipt, including thumbnail pictures of all the items, the cost, etc. If this app would simply categorize and summarize the receipt, I would be done! But alas, that's not the purpose of the app ....
The final piece of the puzzle is that you can export a computer-generated PNG image of the receipt in case you want to save it and throw away the paper version. This, to me, is the money shot, as these PNGs are computer-created and therefore not subject to the issues surrounding photographing or scanning a paper receipt.
An example of one of these (slightly edited to white out some areas but otherwise exactly as obtained from the app) is here:
https://postimg.cc/image/s56o0wbzf/
You can see that the important part of the text is perfectly aligned in 5 columns, and that is ultimately what this question is about: how to get Tesseract to accurately OCR this into text. I have lots of ideas where to take it from there, but it all starts with the OCR!
The closest I have come myself is this example here:
http://pastebin.com/nuZJBVg8
I used -psm 6 and a character whitelist to force it to uppercase letters, numbers, and a few symbols only:
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#()/*#%-.
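(That is, the whitelist line above lives in a plain-text config file that gets passed as the last argument; "wal" below is just a name I made up for a file dropped into tessdata/configs:)
tesseract receipt.png receipt_ocr -psm 6 wal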
At first glance, the OCR seems to almost match. But as you dig deeper you will see that it fails pretty horribly overall. 3s and 8s are almost always wrong; same with 6s and 5s. Then there are times it completely skips over characters or just starts to fall apart (like line 31+ in the example), seeing 2s as 1s or dropping characters entirely. The price for SO PIZZA on line 33 should be "2.82" but comes out as "32".
I have tried some pre-processing on the image to thicken up the characters and make sure it's pure black and white, but none of my efforts got any closer than the raw image from Wal-Mart plus the above settings.
Ideally, since this is such a well-structured PNG that is presumably always the same width, I would love to be able to define the columns by pixel widths so that Tesseract would treat each column independently. I tried to research this, but the UZN files I've seen mentioned don't make sense to me as far as pixel widths go, and height seems to be a factor, which wouldn't work on these since the height is always going to be variable.
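For reference, the closest I came to decoding the UZN format is one region per line - left top width height label, in pixels, as far as I can tell - saved next to the image (e.g. receipt_150.uzn) and picked up with -psm 4. A column-per-region file might look like this (coordinates invented), and that fourth field is exactly the variable-height problem:
0   0   280 1500 Text
280 0   150 1500 Text
430 0   60  1500 Text
490 0   100 1500 Text
590 0   80  1500 Text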
In addition, I need to figure out how to train Tesseract to recognize the numbers 100% accurately (the letters aren't really important). I started researching how to train the program, but to be honest it got over my head pretty quickly, as the training documentation is scoped toward recognizing entire languages, not just 10 digits.
The ultimate end-game solution would be a pipeline of commands that took the original PNG from the app and gave me back a CSV with the 5 columns of data from the important part of the receipt. I don't expect that out of this question, but any assistance guiding me towards it would be greatly appreciated! At this point I just don't feel like being whipped by Tesseract once again, and so I am determined to find a way to master it!

I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else ever finds it useful.
I did not have to do any image splitting; instead I used regexes, since the Wal-Mart receipts are so predictable.
I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find & replace:
# -----------------------------------------------------------------
# Script: ParseReceipt.ps1
# Author: Jim Sanders
# Date: 7/27/2015
# Keywords: tesseract OCR ImageMagick CSV
# Comments:
# Used to convert a Wal-mart receipt image to a CSV file
# -----------------------------------------------------------------
param(
    [Parameter(Mandatory=$true)] [string]$image
) # end param
# create output and temporary files based on input name
$base = (Get-ChildItem -Filter $image -File).BaseName
$csvOutfile = $base + ".txt"
$upscaleImage = $base + "_150.png"
$ocrFile = $base + "_ocr"
# upscale to 150% to ensure the OCR works consistently
convert $image -resize 150% $upscaleImage
# perform the OCR to a temporary file
tesseract $upscaleImage -psm 6 $ocrFile
# column headers for the CSV
$newline = "Description,UPC,Type,Cost,TaxType`n"
$newline | Out-File $csvOutfile
# read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
$lines = Get-Content "$ocrFile.txt"
Foreach ($line in $lines) {
    # This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
    $newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
    $newline | Out-File -Append $csvOutfile
}
# clean up temporary files
del $upscaleImage
del "$ocrFile.txt"
The resulting file needs to be opened in Excel and run through the text-to-columns feature so that Excel doesn't ruin the UPC codes by auto-converting them to numbers. This is a well-known problem I won't dive into; there are a multitude of ways to handle it, and I settled on this slightly more manual one.
I would have been happiest to end up with a simple .csv I could double-click, but I couldn't find a great way to do that without mangling the UPC codes even further, e.g. by wrapping them in this format:
"=""12345"""
That does work, but I wanted the UPC code to be just the digits, as text, in Excel, in case I am later able to do a lookup against the Wal-Mart API.
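(For the record, if anyone prefers that route, it would be a one-line tweak inside the loop above - an untested sketch, matching the comma-delimited 12-digit UPC field:)
# wrap each UPC field as ="..." so Excel keeps it as text
$newline = $newline -replace '(?<=,)(\d{12})(?=,)', '"=""$1"""'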
Anyway, here is how they look after importing and some quick formatting:
https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png
I still need to do some garbage cleanup on the rows that aren't line items, but that only takes a few seconds, so it doesn't bother me too much.
Thanks for the nudge in the right direction @RevJohn; I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!

Text recognition on receipts is one of the hardest problems for OCR to handle.
The reasons are numerous:
receipts are printed on cheap paper with cheap printers - to make them cheap, not readable!
they have a very large amount of dense text (especially Wal-Mart receipts)
existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
receipt structure, which is something between tabular and freeform, is hard for any layout-analysis engine to handle.
Your best bet is to perform the following:
Analyse the input images. If they are hard to read with your own eyes, they will be hard for Tesseract as well.
Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot, and so does cleaning up existing noise (see the examples after this list).
Tesseract training. It's not that hard to do :)
OCR result postprocessing to reconstruct the layout.
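For the preprocessing step, a couple of ImageMagick one-liners along those lines (the file names are placeholders):
convert receipt.png -resize 200% receipt_2x.png
convert receipt.png -colorspace Gray -despeckle -threshold 50% receipt_clean.png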
Layout reconstruction is best performed by analysing the geometry of the results, not by regexes, since regexes run into problems as soon as the OCR makes errors. Using geometry you find, for example, a good candidate for a UPC number, draw a line through the centers of its characters, and then you know exactly which price belongs to that UPC.
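As a minimal sketch of that idea (not our engine, just an illustration in the OP's PowerShell against Tesseract's hOCR output, which tags every word with a bbox; the file names and the ~10px row band are assumptions):
tesseract receipt_150.png receipt hocr
$hocr  = Get-Content receipt.hocr -Raw
$words = [regex]::Matches($hocr, 'bbox (\d+) (\d+) (\d+) (\d+)[^>]*>([^<]+)<') | ForEach-Object {
    [pscustomobject]@{
        X    = [int]$_.Groups[1].Value
        Y    = ([int]$_.Groups[2].Value + [int]$_.Groups[4].Value) / 2   # vertical center of the box
        Text = $_.Groups[5].Value.Trim()
    }
} | Where-Object { $_.Text }
# bucket the vertical centers into ~10px bands; each band is one receipt row
$words | Group-Object { [math]::Round($_.Y / 10) } | ForEach-Object {
    ($_.Group | Sort-Object X | ForEach-Object Text) -join ' '
}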
Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.
The company I work with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:
pod try PPBlinkOCR

Related

Formatting wide output via 'column' (or similar) command(s)

This question actually asks for the 'inverse' of the solution discussed here; namely, I would like to wrap the long column (column 4) onto multiple lines. In effect, the output should look like:
cat test.csv | column -s"," -t -c5
col1 col2 col3 col4 col5
1 2 3 longLineOfText 5
ThatIWantTo
InspectAndWould
LikeToWrap
(excuse the u.u.o.c. - useless use of cat - duplicated over here :) )
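(For concreteness, here is a plain-awk sketch that produces roughly this shape - the column widths are hardcoded, so it shows the target behaviour rather than a solution I'm happy with:)
awk -F',' '{
    w = 15                               # wrap width for column 4
    printf "%-5s %-5s %-5s %-15s %-5s\n", $1, $2, $3, substr($4, 1, w), $5
    for (i = w + 1; i <= length($4); i += w)
        printf "%-5s %-5s %-5s %-15s %-5s\n", "", "", "", substr($4, i, w), ""
}' test.csv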
The solution would ideally:
make use of standard *nix text-processing utilities (e.g. column, paste, pr, which are usually present on any modern Linux machine nowadays, typically coming from the coreutils package);
avoid jq, as it is not necessarily present on every (production) system;
not overheat the brain: yes... I am looking mainly at you, awk & co. gurus :). "Normal" awk / perl / sed is fine;
as a special bonus, a solution using vim would be even more welcome (again, no brain smoke please), since that would allow for syntax colouring as well.
The background: I want to be able to make sense of the output of docker history, so as a last resort even some Go template magic would suit, as would using jq.
In extreme cases, if the benefits of ease of remembering and use outweigh the inconvenience of downloading a new utility (preferably self-contained / statically linked) onto the server, that is OK too, as is using JSON-processing commands (in which case Python's json module would be preferred).
Thanks!
LE:
Please keep in mind that docker's output has the columns separated by several spaces, which unfortunately confuses most commands :(

Pari/Gp directing output

Is there an easy, convenient way to direct output in Pari/GP to a file? My aim is to get the full decimal expansion of 2^400000-1, either on screen or in a text file.
(23:37) gp > 2^400000-1
%947 = 996014342993......(4438 digits)......609762267975[+++]
The GP terminal output gives this, which is not the goal. Basic output redirection does not work either. Any ideas? Thanks.
(23:38) gp > 2^400000-1 > output.txt
There is a manual online, but it does not say much about output, except for the variable TeXstyle. I am unsure how to work with that, though.
Quick and easy is to just do print(2^400000-1), and then you can cut and paste. Otherwise, write(filename, 2^400000-1) if you want it in a file.
Some other possibilities:
writebin(filename, 2^400000-1) writes the object's binary structure to a file: this is faster than traditional output (which implies a binary-to-decimal conversion), and loading it into another session will be faster as well. This is useful for a huge atomic write.
C-style output: fileopen, then successive filewrite calls allow many writes to a file referenced by a descriptor (which avoids re-opening / flushing / closing the file after each write). This is useful for a large write operation done through many tiny writes to a given file, e.g., character by character.
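A minimal sketch of that descriptor-based variant, in a recent gp (the file name is arbitrary):
n = fileopen("digits.txt", "w");
filewrite(n, 2^400000-1);  \\ repeat filewrite for many small writes
fileclose(n);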

OCR tesseract: trained data creation issue for special type of fonts (using Jtessboxeditor)

I am unable to create proper trained data for Windows non-native fonts, i.e., for CATIA drafting fonts.
Even when some of the alphanumerics are recognized, letters made of disconnected parts, like "i, j", and special symbols like Ø (Phi), ° (degree), and ± (plus-minus), are not recognized properly. Their box-file values are improper.
jTessBoxEditor is the tool we used to train and create the trained data for Tesseract.
I would appreciate any assistance with this. Thanks.
I also need these 3 characters - though it might be too late to answer this.
It may not be of much help in all situations, but the Norwegian .traineddata file does include the Ø (Phi) character, and that trained-data file has helped me with this character.
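Using it is then just a matter of the language flag, e.g. (assuming nor.traineddata is installed in tessdata):
tesseract drawing.png output -l nor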
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small; if the inside of the character is clear, Tesseract might be able to decipher it.
Now the most difficult: the ± (plus-minus). I haven't cracked this one yet, and this may be a very woolly approach, but I was thinking: the plus-minus is always recognized as a plain + (plus) only.
I can use this to my advantage.
I could use Tesseract's engine, which exposes PageSegMode.SingleChar, to detect each individual character, and Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is - you can later reassemble all the characters into a string.
Then I could run an ImageMagick comparison to calculate how similar the detected plus character is to an image of either a plus or a plus-minus. The one with the most similarity tells you which character it is.
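Something along these lines with ImageMagick's compare - the file names are made up, and a lower RMSE means a closer match:
compare -metric RMSE glyph.png plus_template.png null: 2>&1
compare -metric RMSE glyph.png plusminus_template.png null: 2>&1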
With my approach, I still have to parse the recognised text and transform it into something usable.
The Ø (Phi) character, for example, may be detected as lower-case when I want it upper-case.
Or the degree sign is detected as an apostrophe when the expected result is the degree sign.
Another transformation: when I detect a dimension, the decimal may be incorrectly recognized with a comma, but I want the decimal separator to be a dot (1,99 → 1.99).

In relative terms, how fast should TCL on Windows 10 be?

I have the latest Tcl build from ActiveState installed on a desktop and a laptop, both running Windows 10. I'm new to Tcl, and a novice developer, and my reason for learning Tcl is to enhance my value on the F5 platform. I figured a good first step would be to stop the occasional work I do in VBScript and port it to Tcl. Learning the language itself is coming along all right, but I'm worried my project isn't viable due to performance. My VBScripts absolutely destroy my Tcl scripts in performance. I didn't expect that outcome, as my understanding was that Tcl was so "fast" and that's why it was chosen by F5 for iRules, etc.
So the question is: am I doing something wrong? Is the port for Windows just not quite there? Or perhaps I misunderstood the way in which Tcl is fast, and it's not fast for file-parsing applications?
My test application is a firewall log parser: take a log with 6 million hits and find the unique src/dst/port/policy entries and count them, split into accept and deny. Opening the file and reading the lines is fine; Tcl processes 18k lines/second while VBScript does 11k. As soon as I do anything with the data, though, the tide turns. I need to break the four pieces of data noted above out of the line that was read and put them in an array. I've "split" the line and done a for-next to read and match each part of the line; that's the slowest. I've done a regexp with subvariables that extracts all four elements in a single statement, and that's much faster, but it's twice as slow as doing four regexps with a single variable each and then cleaning the excess data from the match away with trims. Even this method, though, is four times slower than VBScript with ad-hoc splits/for-next matching and trims. On my desktop I get 7k lines/second with Tcl and 25k with VBScript.
Then there's the array. I assume that because my 3-dimensional array isn't a real array, searching through 3x as many entries is slowing it down. I may try to break up the array so it's looking through a third of the data it does currently. But the truth is, by the time the script gets to the point where there are a couple hundred entries in the array, it has dropped from processing 7k lines/second to less than 2k. My VBScript drops from about 25k lines to 22k. And so I don't see much hope.
I guess what I'm looking for in an answer, from those with Tcl and general programming experience, is: is Tcl natively slower than VB and other scripting languages for what I'm doing? Is it the Windows port that's slowing it down? What kinds of applications is Tcl "fast" at, or good at? If I need to try a different kind of project than reading and manipulating data from files, I'm open to that.
edited to add code examples as requested:
while { [gets $infile line] >= 0 } {
    # some other commands I'm cutting out for the sake of space, they don't contribute to slowness
    regexp {srcip=(.*)srcport.*dstip=(.*)dstport=(.*)dstint.*policyid=(.*)dstcount} $line -> srcip dstip dstport policyid
    # the above was unexpectedly slow. the fastest way to extract data I've found so far:
    regexp {srcip=(.*)srcport} $line srcip
    set srcip [string trim $srcip "cdiloprsty="]
    regexp {dstip=(.*)dstport} $line dstip
    set dstip [string trim $dstip "cdiloprsty="]
    regexp {dstport=(.*)dstint} $line dstport
    set dstport [string trim $dstport "cdiloprsty="]
    regexp {policyid=(.*)dstcount} $line a policyid
    set policyid [string trim $policyid "cdiloprsty="]
}
Here is the array search that really bogs down after a while:
set start [array startsearch uList]
while {[array anymore uList $start]} {
    incr f
    # "key" returns the NAME of the association and uList(key) the VALUE associated with that name
    set key [array nextelement uList $start]
    if {$uCheck == $uList($key)} {
        ##puts "$key CONDITION MET"
        set flag true
        adduList $uCheck $key $flag2
        set flag2 false
        break
    }
}
Your question is still a bit broad in scope.
F5 has published some comments on why they chose Tcl and how it is fast for their specific use cases. That is actually quite different from a log-parsing use case: they do all the heavy lifting in C code (via custom commands) and use Tcl mostly as a fast dispatcher and for a bit of flow control. And Tcl is really good at that, compared to various other languages.
For things like log parsing, Tcl is often beaten in performance by languages like Python and Perl in simple benchmarks. There are a variety of reasons for that; here are some of them:
Tcl uses a different regexp style (DFA-based), which is more robust for nasty patterns but slower for simple ones.
Tcl has a more abstract I/O layer than, for example, Python, and usually converts the input to Unicode, which has some overhead if you do not disable it (via fconfigure).
Tcl has proper multithreading instead of a global lock, which costs around 10-20% performance for single-threaded use cases.
So how to get your code fast(er)?
Try a more specific regular expression; those greedy .* patterns are bad for performance.
Try to use string commands instead of regexp; a string first followed by string range can be faster than a regexp for these simple patterns (see the sketch after this list).
Use a different structure for that array; you probably want either a dict or some form of nested list.
Put your code inside a proc, do not put it all in a toplevel script and use local variables instead of globals to make the bytecode faster.
If you want, use one thread for reading lines from the file and multiple threads for extracting data, as in a typical producer-consumer pattern.
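To make the string-command and dict suggestions concrete, an untested sketch (needs Tcl 8.5+ for dict; field names lifted from your regexps):
# string commands instead of regexp: find "name=" and take everything up to the next space
proc extractField {line name} {
    set start [string first ${name}= $line]
    if {$start < 0} { return "" }
    incr start [string length ${name}=]
    set end [string first " " $line $start]
    if {$end < 0} { set end [string length $line] }
    string range $line $start [expr {$end - 1}]
}

proc countEntries {path} {
    set counts [dict create]            ;# one flat dict instead of a simulated 3-D array
    set f [open $path r]
    fconfigure $f -translation binary   ;# skip encoding conversion on input
    while {[gets $f line] >= 0} {
        set key [list [extractField $line srcip] [extractField $line dstip] \
                      [extractField $line dstport] [extractField $line policyid]]
        dict incr counts $key           ;# counting is a single hash lookup
    }
    close $f
    return $counts
}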

What does "the composition of UNIX byte streams" mean?

In the opening page of the book "Lisp In Small Pieces", there is a paragraph that goes like this:
Based on the idea of "function", an idea that has matured over several centuries of mathematical research, applicative languages are omnipresent in computing; they appear in various forms, such as the composition of Un*x byte streams, the extension language for the Emacs editor, as well as other scripting languages.
Can anyone elaborate a bit on "the composition of Unix byte streams"? What does it mean, and how is it related to applicative/functional programming?
Thanks,
/bruin
My guess is that this is a reference to something like a pipe under Unix/Linux:
cal | wc
The | symbol is what creates a pipe between two applications. A pipe is a feature provided by the kernel, so pipes can be used wherever the applications are written against this kind of kernel API.
In this example, cal is just the utility that prints a calendar, and wc is a utility that counts lines, words and characters in the input you pass to it; here the input is the result of piping cal into wc. This makes things easier for you because it's more functional: you only care about what each application does, not about, for example, what the name of an argument is or where to allocate a temporary file to store the input/output in between.
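Seen that way, each command is a function from an input byte stream to an output byte stream, and | composes them - e.g. this computes wc(sort(cal())) with no intermediate file named anywhere:
cal | sort | wc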
Without pipes you would have to do something like:
cal > temp.txt
wc temp.txt
rm temp.txt
to obtain pretty much the same information. This second solution could also generate problems: for example, what if temp.txt already exists? What rationale should your script follow to pick a name for the temporary file? What if another process modifies the file between the two calls to cal and wc?