Tesseract OCR ignores lines containing asterisks

I have an image created from an old fax document (the font is specific). Generally Tesseract works pretty well with this input, except in one case: when a line starts with many leading asterisks '*', it is ignored.
The result produced by the OCR differs depending on the psm:
psm 1: an empty page
psm 6: For queries please contact NA, KRKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KKK KK KK KK
In every case the title "COMMENT" is skipped.
But when I manually removed all the '*' characters from the image in Paint, the OCR worked fine. I have no idea how to make the OCR work without preprocessing the image. Can someone explain this?

Try this: tesseract 9UIKs.png - --psm 4 --oem 0
Which produces:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkkkk COMMENT kkkkxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
For queries p1ease contact NA.
XXXXXKKKKXXXVXKKKKKXXXXXKXXXXXXXXXXXXKXXXXXXXXXXXXXXXXKKXXXXXXXXXXXXXXXXXKXXXX.
You will need a language model with support for the legacy engine (from here: https://github.com/tesseract-ocr/tessdata).
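If you drive Tesseract from Python, the same invocation can be sketched via subprocess. The helper names are my own and the `-` output target sends the recognized text to stdout, mirroring the command above; this is a sketch, not part of the original answer:

```python
import subprocess

def build_tesseract_cmd(image_path, psm=4, oem=0):
    # "-" sends the recognized text to stdout; --oem 0 selects the
    # legacy engine, which needs a tessdata file with legacy support
    return ["tesseract", image_path, "-", "--psm", str(psm), "--oem", str(oem)]

def ocr_legacy(image_path):
    result = subprocess.run(build_tesseract_cmd(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```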

Is it possible to give text format hints in google vision api?

I'm trying to detect handwritten dates isolated in images.
In the Cloud Vision API, is there a way to give hints about the type?
For example: the only text present will be dd/mm/yy, with d, m and y being digits.
The only thing I found is language hints in the documentation.
Sometimes I get results that include letters like O instead of 0.
There is no way to give hints about the type, but you can filter the output using the client libraries. I downloaded detect.py and requirements.txt from here and modified detect.py (in def detect_text, after line 283):
response = client.text_detection(image=image)
texts = response.text_annotations

# Import regular expressions
import re

print('Date:')
dateStr = texts[0].description
# Test case for letter replacement:
# dateStr = "Z3 OZ/l7"
# print(dateStr)
# Replace letters commonly mistaken for digits
dateStr = dateStr.replace("O", "0")
dateStr = dateStr.replace("Z", "2")
dateStr = dateStr.replace("l", "1")
dateList = re.split(r' |;|,|/|\n', dateStr)
dd = dateList[0]
mm = dateList[1]
yy = dateList[2]
date = dd + '/' + mm + '/' + yy
print(date)
# for text in texts:
#     print('\n"{}"'.format(text.description))
#     vertices = (['({},{})'.format(vertex.x, vertex.y)
#                  for vertex in text.bounding_poly.vertices])
#     print('bounds: {}'.format(','.join(vertices)))
# [END migration_text_detection]
# [END def_detect_text]
Then I launched detect.py inside the virtual environment using this command line:
python detect_dates.py text qAkiq.png
And I got this:
23/02/17
There are a few letters that can be mistaken for numbers, so using str.replace("letter", "number") should fix the wrong identifications. I added the most common cases for this example.
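The replace chain can be generalized into a small helper. The character map and the date regex below are my assumptions (not part of the Vision API), so extend them for your own data:

```python
import re

# Letters the OCR commonly mistakes for digits (assumed map, extend as needed)
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "Z": "2", "l": "1", "I": "1", "S": "5"})

def normalize_date(raw):
    cleaned = raw.translate(OCR_DIGIT_FIXES)
    # Accept any non-digit run as a separator between day, month and year
    m = re.search(r"(\d{1,2})\D+(\d{1,2})\D+(\d{2,4})", cleaned)
    if not m:
        raise ValueError("no date found in %r" % raw)
    return "/".join(m.groups())
```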

OCR tesseract: trained data creation issue for special type of fonts (using Jtessboxeditor)

I am unable to create proper trained data for Windows non-native fonts, i.e., for CATIA drafting fonts.
Even if some of the alphanumerics are recognized, letters made of disconnected parts like "i, j", and special symbols like Ø (Phi), ° (degree) and ± (plus-minus), are not recognized properly. Their box file values are improper.
jTessBoxEditor is the tool we used to train and create trained data for Tesseract.
I would appreciate any assistance with this. Thanks.
I also need these 3 characters, though it might be too late to answer this.
It may not help in all situations, but the Norwegian .traineddata file does include the Ø (Phi) character; this trained data file has helped me with it.
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small. If the inside of the character is clear, Tesseract might be able to decipher it.
Now the most difficult one, the ± (plus-minus). I haven't cracked this one yet, and this may be a very woolly approach, but I was thinking: the plus-minus is always recognized as + (plus) only.
I can use this to my advantage.
I could use Tesseract's engine, which exposes PageSegMode.SingleChar, to detect each individual character, and Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is; you can later reassemble all the characters into a string.
Then I could run ImageMagick to calculate how similar the plus character found is to an image of either a plus or a plus-minus. The one with the most similarity tells you which character it is.
With my approach, I still have to parse the recognised text and transform it into something usable.
The Ø (Phi) character, for example, may be detected as lower-case, but I will want it upper-case.
Or the degree is detected as an apostrophe, but the expected result is the degree sign.
Another transformation applies when I detect a dimension: a decimal may be incorrectly recognized with a comma, but I will want the decimal separator to be a dot (1,99 → 1.99).
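The transformations described above could be sketched as a small post-processing step. The exact replacements here are my assumptions about typical CATIA dimension text; in particular, the apostrophe-to-degree replacement is only safe when you know the text is a dimension:

```python
import re

def postprocess_dimension(text):
    # Promote lower-case o-slash to the upper-case diameter symbol
    text = text.replace("ø", "Ø")
    # In a dimension, an apostrophe is usually a misread degree sign
    text = text.replace("'", "°")
    # Normalize the decimal separator from comma to dot (1,99 -> 1.99)
    return re.sub(r"(\d),(\d)", r"\1.\2", text)
```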

Tesseract receipt scanning advice needed

I have struggled off and on with Tesseract for various OCR projects, and I found a use case today which I thought would be a slam dunk for it, but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see if anyone else has advice on how to solve this task.
My wife came to me this morning and asked if there was any way she could easily scan her receipts from Wal-Mart and over time build a history of prices spent on categories and specific items, so that we could do some trending and easily deep dive on where the spending is going. At first I felt this was a very tall order, but after doing some digging I found a few things that make me feel it is within reach:
Wal-Mart receipts are in general, very well structured and easy to read. They even include the UPC for every item (potential for lookups against a UPC database?) and appear to classify food items with an F or I (not sure what the difference is) and have a tax code column as well that may prove useful should I learn the secrets of what the codes mean.
I further discovered that there is some kind of Wal-Mart item lookup API that I may be able to get access to which would prove useful in the UPC lookup.
They have an app for smart phones that lets you scan a QR code printed on every receipt. That app looks up a "TC" code off the receipt and pulls down the entire itemized receipt from their servers. It shows you an excellent graphical representation of the receipt including thumbnail pictures of all the items and the cost, etc. If this app would simply categorize and summarize the receipt, I would be done! But alas, that's not the purpose of the app ....
The final piece of the puzzle is that you can export a computer-generated PNG image of the receipt in case you want to save it and throw away the paper version. This to me is the money shot, as these PNGs are computer-created and therefore not subject to the issues surrounding taking a picture of or scanning a paper receipt.
An example of one of these (slightly edited to white out some areas but otherwise exactly as obtained from the app) is here:
https://postimg.cc/image/s56o0wbzf/
You can see that the important part of the text is perfectly aligned in 5 columns and that is ultimately what this question is about. How to get Tesseract to accurately OCR this into text. I have lots of ideas where to take it from here, but it all starts with the OCR!
The closest I have come myself is this example here:
http://pastebin.com/nuZJBVg8
I used psm6 and a character limiting set to force it to do uppercase + numbers + a few symbols only:
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#()/*#%-.
At first glance, the OCR seems to almost match. But as you dig deeper you will see that it fails pretty horribly overall. 3s and 8s are almost always wrong. Same with 6s and 5s. Then there are times it just completely skips over characters or just starts to fall apart (like line 31+ in the example). It starts seeing 2s as 1s, or even just missing characters. The SO PIZZA on line 33 should be "2.82" but comes out as "32".
I have tried doing some pre-processing on the image to thicken up the characters and make sure it's pure black and white but none of my efforts got any closer than the raw image from Wal-Mart + the above commands.
Ideally since this is such a well structured PNG which is presumably always the same width I would love if I could define the columns by pixel widths so that Tesseract would treat each column independently. I tried to research this but the UZN files I've seen mentioned don't translate to me as far as pixel widths and they seem like height is a factor which wouldn't work on these since the height is always going to be variable.
In addition, I need to figure out how to train Tesseract to recognize the numbers 100% accurately (the letters aren't really important). I started researching how to train the program but to be honest it got over my head pretty quickly as the scope of training in the documentation is more for having it recognize entire languages not just 10 digits.
The ultimate end game solution would be a pipeline chain of commands that took the original PNG from the app and gave me back a CSV with the 5 columns of data from the important part of the receipt. I don't expect that out of this question, but any assistance guiding me towards it would be greatly appreciated! At this point I just don't feel like being whipped by Tesseract once again and so I am determined to find a way to master her!
I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else finds it useful.
I did not have to do any image splitting and instead used a regex since the Wal-mart receipts are so predictable.
I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find & replace:
# -----------------------------------------------------------------
# Script: ParseReceipt.ps1
# Author: Jim Sanders
# Date: 7/27/2015
# Keywords: tesseract OCR ImageMagick CSV
# Comments:
# Used to convert a Wal-mart receipt image to a CSV file
# -----------------------------------------------------------------
param(
[Parameter(Mandatory=$true)] [string]$image
) # end param
# create output and temporary files based on input name
$base = (Get-ChildItem -Filter $image -File).BaseName
$csvOutfile = $base + ".txt"
$upscaleImage = $base + "_150.png"
$ocrFile = $base + "_ocr"
# upscale by 150% to ensure OCR works consistently
convert $image -resize 150% $upscaleImage
# perform the OCR to a temporary file
tesseract $upscaleImage -psm 6 $ocrFile
# column headers for the CSV
$newline = "Description,UPC,Type,Cost,TaxType`n"
$newline | Out-File $csvOutfile
# read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
$lines = Get-Content "$ocrFile.txt"
Foreach ($line in $lines) {
# This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
$newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
$newline | Out-File -Append $csvOutfile
}
# clean up temporary files
del $upscaleImage
del "$ocrFile.txt"
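For reference, the comma-wrapping step could be sketched in Python as well; the patterns mirror the PowerShell replaces above, and the example receipt line in the usage is made up:

```python
import re

def receipt_line_to_csv(line):
    # Wrap the 12-digit UPC in commas, consuming the surrounding spaces
    line = re.sub(r"\s(\d{12})\s", r",\1,", line)
    # Wrap the price (e.g. 2.82) in commas the same way
    line = re.sub(r"\s(\d+\.\d{2})\s?", r",\1,", line)
    return line
```

For a line like "GV CHEESE 007874235678 F 2.82 N" this yields the five CSV columns directly.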
The resulting file needs to be opened in Excel and then have the Text to Columns feature run so that the UPC codes aren't ruined by auto-conversion to numbers. This is a well known problem I won't dive into, but there are a multitude of ways to handle it; I settled on this slightly more manual one.
I would have been happiest to end up with a simple .csv I could double click but I couldn't find a great way to do that without mangling the UPC codes even more like by wrapping them in this format:
"=""12345"""
That does work but I wanted the UPC code to be just the digits alone as text in Excel in case I am able to later do a lookup against the Wal-mart API.
Anyway, here is how they look after importing and some quick formatting:
https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png
I still need to do some garbage cleaning on the rows that aren't line items but that all only takes a few seconds so doesn't bother me too much.
Thanks for the nudge in the right direction @RevJohn; I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!
Text recognition on receipts is one of the hardest problems for OCR to handle.
The reasons are numerous:
receipts are printed on cheap paper with cheap printers - to make them cheap, not readable!
they have a very large amount of dense text (especially Wal-Mart receipts)
existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
receipt structure, which is something between tabular and freeform, is hard for any layout engine to handle.
Your best bet is to perform the following:
Analyse the input images. If they are hard to read with your eyes, they will be hard for Tesseract to read as well.
Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot. Cleaning up existing noise also helps.
Tesseract training. It's not that hard to do :)
OCR result postprocessing to ensure correct layout.
Layout analysis is best performed on the geometry of the results, not with regexes. Regexes have problems when the OCR output contains errors. Using geometry, for example, you find a good candidate for the UPC number, draw a line through the centers of its characters, and then you know exactly which price belongs to that UPC.
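The geometric matching idea can be sketched like this; `Word` is a hypothetical container for one OCR word and its bounding box (it could be filled from Tesseract's TSV/hOCR output), and the row tolerance is an assumption you would tune:

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    top: int
    height: int

    @property
    def center_y(self):
        return self.top + self.height / 2

def pair_upc_with_price(words, row_tolerance=10):
    # Match each 12-digit UPC with the first price whose vertical
    # center lies on (roughly) the same receipt row
    upcs = [w for w in words if re.fullmatch(r"\d{12}", w.text)]
    prices = [w for w in words if re.fullmatch(r"\d+\.\d{2}", w.text)]
    pairs = {}
    for upc in upcs:
        for price in prices:
            if abs(upc.center_y - price.center_y) <= row_tolerance:
                pairs[upc.text] = price.text
                break
    return pairs
```

Unlike a pure regex pass over the text dump, this keeps working even when the OCR garbles characters elsewhere on the line, because the pairing relies on positions rather than on the line's exact textual shape.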
Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.
The company I'm working with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:
pod try PPBlinkOCR

How to print Bar code 39 on star tsp future PRNT

Hi, I am working with the Star TSP100 futurePRNT and have a sample example to print data, which works fine. It uses the code "\x1b\x1d\x61\x1" to align to center. I need to know what encoding this is, and how to print a Code 39 barcode using such codes.
It's a hexadecimal code specified by the document linked below.
As an example I'll deconstruct this command:
1B = ASCII Escape character. This denotes the beginning of the command.
1D = ASCII Group Separator character. This is part of the command.
61 = ASCII lower-case letter 'a'. This is part of the command.
1 = Value. This is the only parameter of the command.
Consulting the linked specification, 1B 1D 61 n is the command to specify position alignment, where n is the parameter. With n = 1 it sets center horizontal alignment (page 3-32).
The command to specify a barcode can be found on pages 3-43 through 3-46. It follows a similar format: there will be a code specifying the task, then a number of variable parameters to select the correct format, and finally you'll send your data.
Latest Line Mode Specification: http://www.starasia.com/%5CDownload%5CManual%5Cstarline_cm_en.pdf
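In Python, the raw command bytes can be assembled directly. Only the alignment command below comes from the question; the device path is an assumption, and the exact barcode bytes must be taken from pages 3-43 through 3-46 of the linked specification:

```python
ESC = b"\x1b"
GS = b"\x1d"

def align_center():
    # ESC GS 'a' n -- set horizontal alignment; n = 1 means center (page 3-32)
    return ESC + GS + b"a" + b"\x01"

payload = align_center() + b"RECEIPT TEXT\n"
# The Code 39 barcode command follows the same pattern: a command prefix,
# parameter bytes selecting type/size, then the barcode data (see the spec).
# with open("/dev/usb/lp0", "wb") as printer:  # device path is an assumption
#     printer.write(payload)
```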

Use LibreOffice to convert HTML to PDF from Mac command in terminal?

I'm trying to convert an HTML file to a PDF using the Mac terminal.
I found a similar post and used the code it provided, but I kept getting nothing: I could not find the output file anywhere after issuing this command:
./soffice --headless --convert-to pdf --outdir /home/user ~/Downloads/*.odt
I'm using Mac OS X 10.8.5.
Can someone show me a terminal command line that I can use to convert HTML to PDF?
I'm trying to convert an HTML file to a PDF by using the Mac terminal.
OK, here is an alternative way to convert (X)HTML to PDF on the Mac command line. It does not use LibreOffice at all and should work on all Macs.
This method (ab)uses a filter from the Mac's print subsystem, called xhtmltopdf. This filter is usually not meant to be used by end-users but only by the CUPS printing system.
However, if you know about it, know where to find it and know how to run it, there is no problem with doing so:
The first thing to know is that it is not in any desktop user's $PATH. It is in /usr/libexec/cups/filter/xhtmltopdf.
The second thing to know is that it requires a specific syntax and order of parameters to run; otherwise it won't. Called with no parameters at all (or with the wrong number of parameters), it emits a small usage hint:
$ /usr/libexec/cups/filter/xhtmltopdf
Usage: xhtmltopdf job-id user title copies options [file]
Most of these parameter names show that the tool is clearly related to printing. The command requires at least 5 parameters, with an optional 6th. If only 5 parameters are given, it reads its input from <stdin>, otherwise from the 6th parameter, a file name. It always emits its output to <stdout>.
The only CLI params which are interesting to us are number 5 (the "options") and the (optional) number 6 (the input file name).
When we run it on the command line, we have to supply 5 dummy or empty parameters first, before we can put the input file's name. We also have to redirect the output to a PDF file.
So, let's try it:
/usr/libexec/cups/filter/xhtmltopdf "" "" "" "" "" my.html > my.pdf
Or, alternatively (this is faster to type and easier to check for completeness, using 5 dummy parameters instead of 5 empty ones):
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html > my.pdf
While we are at it, we could try to apply some other CUPS print subsystem filters on the output: /usr/libexec/cups/filter/cgpdftopdf looks like one that could be interesting. This additional filter expects the same sort of parameter number and orders, like all CUPS filters.
So this should work:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 "" \
> my.pdf
However, piping the output of xhtmltopdf into cgpdftopdf is only interesting if we try to apply some "print options". That is, we need to come up with some settings in parameter no. 5 which achieve something.
Looking up the CUPS command line options on the CUPS web page suggests a few candidates:
-o number-up=4
-o page-border=double-thick
-o number-up-layout=tblr
do look like they could be applied while doing a PDF-to-PDF transformation. Let's try:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 \
"number-up=4 page-border=double-thick number-up-layout=tblr" \
> my.pdf
Here are two screenshots of results I achieved with this method. Both used as input files two HTML files which were identical, apart from one line: it was the line which referenced a CSS file to be used for rendering the HTML.
As you can see, the xhtmltopdf filter is able to (at least partially) take into account CSS settings when it converts its input to PDF:
Starting with 3.6.0.1, you would need unoconv on the system to convert documents.
Using unoconv with MacOS X
LibreOffice 3.6.0.1 or later is required to use unoconv under MacOS X. This is the first version distributed with an internal python script that works. No version of OpenOffice for MacOS X (3.4 is the current version) works because the necessary internal files are not included inside the application.
I just had the same problem, but I found this LibreOffice help post. It seems that headless mode won't work if you've got LibreOffice (the usual GUI version) running too. The fix is to add an -env option, e.g.
libreoffice "-env:UserInstallation=file:///tmp/LibO_Conversion" \
--headless \
--invisible \
--convert-to csv file.xls