Can I test Tesseract OCR from the Windows command line? - ocr

I am new to Tesseract OCR. I converted an image to TIF and tried to run it through Tesseract from cmd in Windows to see the output, but I couldn't. Can you help me? What command should I use?
Here is my sample image:

The simplest tesseract.exe syntax is tesseract.exe inputimage output-text-file.
This assumes that tesseract.exe has been added to the PATH environment variable.
You can add the -psm N argument if the text in your image is particularly hard to recognize.
The regular syntax (without any -psm switch) works well enough on the image you attached, unless you need a higher level of accuracy.
Note that non-English characters (such as the symbol next to "prescription") are not recognized; my default installation contains only the English training data.
Here's the tesseract syntax description:
C:\Users\vish\Desktop>tesseract.exe
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
And here's the output for your image (NOTE: when I downloaded it, it was converted to a PNG image):
C:\Users\vish\Desktop>tesseract.exe ECL8R.png out.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
C:\Users\vish\Desktop>type out.txt.txt
1 Project Background
A prescription (R) is a written order by a physician or medical doctor to a pharmacist in the form of
medication instructions for an individual patient. You can't get prescription medicines unless someone
with authority prescribes them. Usually, this means a written prescription from your doctor. Dentists,
optometrists, midwives and nurse practitioners may also be authorized to prescribe medicines for you.
It can also be defined as an order to take certain medications.
A prescription has legal implications; this means the prescriber must assume his responsibility for the
clinical care ofthe patient.
Recently, the term "prescriptionΓÇ¥ has known a wider usage being used for clinical assessments,
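The command-line syntax above can also be driven from a script. The following is a minimal sketch, not part of the original answer: the helper names build_tesseract_command and run_tesseract are hypothetical, and it assumes tesseract is on the PATH. The page segmentation flag is a parameter because the 3.02 usage text shown above spells it -psm, while the Tesseract 4+ commands elsewhere on this page spell it --psm.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang=None, psm=None,
                            psm_flag="-psm", exe="tesseract"):
    """Build the argument list: exe imagename outputbase [-l lang] [-psm N]."""
    cmd = [exe, str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # psm_flag is "-psm" for 3.x as in the usage text; pass "--psm" for newer builds.
        cmd += [psm_flag, str(psm)]
    return cmd


def run_tesseract(image, output_base, **options):
    """Run Tesseract and return the recognized text (requires tesseract on PATH)."""
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)
    # Tesseract appends ".txt" to the output base itself, hence out.txt.txt above.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, run_tesseract("ECL8R.png", "out", psm=6) would write out.txt and return its contents. Reading the file as UTF-8 instead of using type will likely also avoid the garbled closing quote seen in the console output.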

Related

how to label training data for Tesseract

I want to train my own model to detect and recognize ID cards with Tesseract. I want to extract key information such as the name and ID from them. The data looks like: [sample of data]
The training introduction only covers input text consisting of a single line. I'm confused about how to train the detection model in Tesseract, and whether I should label single characters or the whole text line in each box. (https://github.com/tesseract-ocr/tesstrain)
One-by-one character replacement from image to text is based on training in groups.
So in the first Tesseract training test sample, the idea is to let Tesseract understand that the ch ligature is to be output as two letters, that δ is to be a lowercase d, that f is k, that Uber is Aber, and so on.
However, that does not correct the spelling of words without a dictionary of accepted character permutations. You therefore need to either train all the words you expect (for example, 123 is allowed but 321 is not) or else allow all numbers.
The problem then is whether ¦ should be i, |, l, 1, or !. Only human judgment of the context is likely to settle what is 100% correct, especially with italics: is / an i, |, l, 1, or !, or is it an italic /?
The clearer the characters are in contrast to the background, the better the result will usually be. Well-defined void space within a character helps to distinguish B from 8, so resolution can also help or hinder.
= INT 3O 80 S~A MARIA
A dictionary entry of BO and STA would possibly help in this case.
Oh, I think I get it. Tesseract doesn't need a detection model to get the position of the text line; it recognizes each blob (letter) and uses the position of each letter to locate the text line.
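On the labeling question: the tesstrain repository linked in the question works with whole line images, each paired with a plain-text transcription that has the same base name and the extension .gt.txt. A minimal sketch of writing such a transcription is below; the helper name write_ground_truth and the file name card_001.tif are hypothetical, and only plain .tif or .png line images are assumed.

```python
from pathlib import Path


def write_ground_truth(line_image, transcription):
    """Write the transcription of one line image next to it as <name>.gt.txt."""
    image_path = Path(line_image)
    # "card_001.tif" -> "card_001.gt.txt" (assumes a single extension such as .tif or .png)
    gt_path = image_path.with_name(image_path.stem + ".gt.txt")
    # One text line per file: the label is the whole line, not individual characters.
    gt_path.write_text(transcription.strip() + "\n", encoding="utf-8")
    return gt_path
```

Calling write_ground_truth("card_001.tif", "MARIA") would create card_001.gt.txt containing the single line MARIA.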

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in German. My training dataset consists of 10,000 tweets, which have been manually annotated into the classes neutral, positive and negative.
The dataset is quite unevenly distributed (approximate counts):
3000 positive
2000 negative
5000 neutral
The tweets contain #names, https links, numbers, punctuation marks, and smileys such as :3, :D and :).
The interesting thing is, if I remove them with the following code during Data Cleaning, the F1 score gets worse. Only the removal of https links (if I do it alone) leads to a small improvement.
import re
import string

# Remove links, hashtags, smileys, punctuation and numbers from a tweet.
def remove_punct(text):
    text = re.sub(r'http\S+', '', text)  # remove links
    text = re.sub(r'#\S+', '', text)     # remove hashtags (#name)
    text = re.sub(r':\S+', '', text)     # remove smileys starting with ":" such as :) :D :(
    text = "".join(char for char in text if char not in string.punctuation)
    text = re.sub(r'[0-9]+', '', text)   # remove numbers
    return text

# data is assumed to be the pandas DataFrame holding the tweets; add a cleaned column.
data['Tweet_clean'] = data['Tweet'].apply(remove_punct)
data.head(40)
Steps like stop word removal or lemmatization also tend to make the results worse. Is this because I am doing something wrong, or can BERT actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets, and their sentence structure and language use are different. Would you still recommend adding these records to my original dataset?
There are about 3000 records in German.
My last question:
Should I reduce all classes to the size of the smallest class and thus balance the dataset?
BERT can handle punctuation, smileys, etc. Smileys in particular contribute a lot to sentiment analysis, so don't remove them. It would also be reasonable to replace #mentions and links with special tokens, because the model will probably never see those specific ones again in the future.
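A minimal sketch of that replacement is shown here. The placeholder strings [URL] and [HASHTAG] and the function name are assumptions chosen for illustration; for the model to treat them as single tokens they would probably also need to be registered as special tokens in the tokenizer.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def replace_with_placeholders(text):
    """Replace links and #names with placeholder tokens; smileys and punctuation stay."""
    text = URL_RE.sub("[URL]", text)          # links first, so a "#" inside a URL is not touched
    text = HASHTAG_RE.sub("[HASHTAG]", text)  # then #names
    return text
```

With this, "Toll! :D #Berlin https://t.co/abc" becomes "Toll! :D [HASHTAG] [URL]", so the smiley is still available to the model.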
If your model is designed for tweets, I suggest that you first fine-tune BERT on the additional corpus and then fine-tune it on the Twitter corpus, or do both simultaneously. More training samples are generally better.
No, it is better to use class weights instead of downsampling.
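One common way to derive such class weights is inverse class frequency, total / (number of classes * class count), which is the same heuristic scikit-learn uses for class_weight="balanced". The sketch below applies it to the approximate counts from the question; the function name is hypothetical, and the resulting weights would then be passed to a weighted loss in whatever training framework is used.

```python
def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) for each class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Approximate class sizes taken from the question.
weights = class_weights({"positive": 3000, "negative": 2000, "neutral": 5000})
```

The rarest class (negative) gets the largest weight and the most frequent class (neutral) the smallest, so no tweets have to be discarded.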
Based on this paper (by Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTMs at generalizing to punctuation. Looking at the experimental results in the paper, I would keep the punctuation.
I couldn't find anything solid on smiley faces; however, after some experiments with the HuggingFace API, I didn't notice much difference with or without them.

commands to predict the language with fastText in Linux

For language identification, I am using the following tutorial :
Fasttext language detection tutorial
After executing the command from the tutorial:
./fasttext test langdetect.bin valid.txt
I get the following output:
N 10000
P#1 0.967
R#1 0.967
After this, which commands will predict the language? How do I enter text in other languages?
I am very new to language detection. I could find ample tutorials for prediction in Python, but not for the Linux command line.
Thanks in advance.
Language detection is a particular case of text classification using supervised models (here you can find the tutorial).
According to the tutorial, you can predict on new examples by typing:
./fasttext predict-prob langdetect.bin - -1 0.5
(we want as many predictions as possible (argument -1) and only labels with a probability higher than or equal to 0.5)
and then typing the sentence.
If you have a txt file with sentences to be classified, you can type:
$ ./fasttext predict-prob langdetect.bin test.txt k
where k is the number of classes to show.
This cheatsheet may also be useful.
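If the command-line output needs to be consumed by a script, the following sketch wraps the same predict-prob call and parses its output. It assumes that each output line alternates labels and probabilities and that labels carry fastText's default __label__ prefix; the function names and the default paths are hypothetical.

```python
import subprocess

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_predictions(line):
    """Turn '__label__deu 0.98 __label__nld 0.01' into [('deu', 0.98), ('nld', 0.01)]."""
    tokens = line.split()
    pairs = zip(tokens[0::2], tokens[1::2])  # (label, probability) pairs
    return [
        (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label, float(prob))
        for label, prob in pairs
    ]


def predict_language(sentence, model="langdetect.bin", fasttext="./fasttext"):
    """Pipe one sentence into './fasttext predict-prob model - -1 0.5' via stdin."""
    result = subprocess.run(
        [fasttext, "predict-prob", model, "-", "-1", "0.5"],
        input=sentence + "\n", capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return parse_predictions(lines[0]) if lines else []
```

predict_language("Guten Morgen") would then return a list of (language, probability) pairs for the labels that pass the 0.5 threshold.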

Define multiple columns in tesseract OCR parameters?

I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.
Following suggestions on other questions, I've tried playing with --psm and hocr without great success.
Working with a JPG I've posted on GitHub, and converting it into a text-embedded PDF with the command tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf, I get this result:
Clearly the engine is making one block containing the indented lines and another containing the flush lines.
Confirming this is the text output of the flush lines:
Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.
to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.
application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——
Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)
You can use
--psm 4 --oem 1
or --psm 4 --oem 3
to get better text and accuracy.
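If the page segmentation modes still read across columns, the fallback the question mentions (cutting up the image) can at least be automated. The sketch below only computes the crop boxes; it assumes six columns of equal width, which may not hold for a real scan, and the function name is hypothetical. Each box is in (left, top, right, bottom) order and could be passed to an image library's crop function before running Tesseract on each strip.

```python
def column_boxes(page_width, page_height, n_columns=6):
    """Return (left, top, right, bottom) boxes for n equal-width columns."""
    # Column edges in pixels, rounded so that neighbouring boxes share an edge.
    edges = [round(i * page_width / n_columns) for i in range(n_columns + 1)]
    return [(edges[i], 0, edges[i + 1], page_height) for i in range(n_columns)]
```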

Tesseract confuses two numbers

I'm writing an application to scan numbers from an image.
The numbers use the OCR-B font and may also contain + and > characters.
This is my source image:
The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCRB training files for Tesseract, I decided to train it myself.
I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.
Then I did all steps described here to create the other necessary files.
Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1s are mistaken for 8s and vice-versa. The command used to process the image was
$ tesseract esr2c.tif ocrb-esr2c -l ocrb
and the output for the source image was
0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20
If you swap all 1s and 8s and compare it to the source image, the output would be correct (except for the last two letters which I can ignore).
How could this happen? Did I make a mistake in the training process? How can I fix it?
It's likely that your box file has incorrect values (characters) for 1 and 8 somewhere. You can verify this using the jTessBoxEditor program. If so, correct them, regenerate the language data file, and try again.
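The same check can be scripted as a rough cross-check. The sketch below assumes the usual box file layout of one box per line (symbol left bottom right top page), one character per box, and boxes in the same order as the text used to render the training image. The function names are hypothetical.

```python
def read_box_symbols(box_lines):
    """Return the symbol column of a box file, skipping blank or malformed lines."""
    symbols = []
    for line in box_lines:
        parts = line.split()
        if len(parts) >= 5:  # symbol plus at least four coordinates
            symbols.append(parts[0])
    return symbols


def find_mismatches(box_lines, expected_text):
    """Compare box symbols with the training text (whitespace ignored), in order."""
    expected = [ch for ch in expected_text if not ch.isspace()]
    actual = read_box_symbols(box_lines)
    return [(i, want, got)
            for i, (want, got) in enumerate(zip(expected, actual))
            if want != got]
```

For a box file whose symbols read 0, 8, 1 against a training text of "0 1 8", this reports positions 1 and 2, which is the kind of 1/8 swap described in the question.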
I trained Tesseract 2.04 for the OCR-A Extended font after a month of effort. It works very well, with above 90% accuracy at font size 14.
The training image should be a high-contrast image.
Use the GIMP image editor and do the following:
Menu Colors -> Info -> Histogram: read the Std Deviation value.
Colors -> Threshold: enter the Std Deviation value as the threshold value.
Save the image.
Use it for training.
Check and edit your box file using "qt-box-editor-1.06.exe". It is very easy to use.
Check all boxes and the characters in them.
This is very important: somewhere your box file has incorrect characters for 1 and 8.
Run the other commands.
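The GIMP thresholding steps above can be approximated in code as follows. This is only a sketch of the idea on a flat list of grayscale values (0 to 255): it assumes the population standard deviation and that pixels at or above the threshold become white, neither of which has been checked against GIMP's exact behavior. The function name is hypothetical.

```python
from statistics import pstdev


def binarize_by_std(pixels):
    """Threshold grayscale values (0-255) at their standard deviation."""
    threshold = pstdev(pixels)  # assumption: population standard deviation
    # Pixels at or above the threshold become white, the rest black.
    return [255 if p >= threshold else 0 for p in pixels]
```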