Define multiple columns in Tesseract OCR parameters?

I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it rigidly looks for six columns.
Following suggestions on other questions, I've tried playing with --psm and hocr without great success.
Working with a JPG I've posted on GitHub and converting it into a text-embedded PDF using the command tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf, I get this result:
Clearly the engine is making one block containing the indented lines, and another containing the flush lines.
Confirming this is the text output of the flush lines:
Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.
to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.
application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——
Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)

You can use --psm 4 with --oem 1 or --oem 3 to get better text and accuracy; an example invocation is below.
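Building on the command from the question (note: the --oem switch exists in Tesseract 4.x; --oem 1 selects the LSTM engine and --oem 3 the default engine selection), the suggestion would look like this:
tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 4 --oem 1 pdf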

Related

Convert EM4x02 ID to Hitag2 Value

I've been working on an RFID project to produce our own RFID cards to work on our existing timeclocks and readers.
I've got most of the work done, and have been able to successfully write a Hitag2 card using the value of pages 4 & 5 from another card (basically copying the card), then changing the config bit, which makes it act like an EM4x02 and allows our readers to read it.
What I'm struggling with is relating the hex code on pages 4/5 to the output you get when scanning it as an EM4x.
The values of the hitag page 4/5 are FF800000/003EDF10. This translates to 0000001EBC when read as an EM4x.
Does anybody have an idea on how this translation is done? I've tried using the methods in RFIDIOT but that doesn't seem to work for this.
I've managed to find out how this is done after finding a Hitag2 datasheet from 1999 (the only one I could find that explains the bits when the Hitag is in public mode A).
Firstly, convert the number you want on the EM4 card to hex.
Convert that hex into binary.
Split the binary into 4 bit chunks, then work out the even parity for each section and add it to the end of each chunk. (So you'll end up with 5 bits per chunk)
Then, work out the even parity of each column in the data (i.e. the first character of all chunks, then the second, etc., ignoring the parity bits you added) and add these 4 bits to the binary string.
Then add the correct amount of zeros at the start to ensure the data section has 50 bits.
Once you have the data section sorted, add 9 bits of 1 to the beginning (header) and a final 0 to the very end of the binary.
Your whole binary string should be 64 bits long.
Convert this to hex and split it in half. You can then write these onto pages 4/5 of a Hitag2 card.
You then need to change the configuration bit to 0x02 for the tag to work in public mode A.
Just thought I would send you the diagram of how this works. (Diagram: EM4x tag data.)
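A minimal Python sketch of the steps above (the function name is mine; the check value is the pages 4/5 / EM4x pair quoted in the question):

def em4x_to_hitag2_pages(em4x_id_hex):
    # 40-bit EM4x ID, zero-padded at the front as described above
    bits = bin(int(em4x_id_hex, 16))[2:].zfill(40)
    nibbles = [bits[i:i + 4] for i in range(0, 40, 4)]
    # row parity: one even-parity bit appended to each 4-bit chunk (5 bits per chunk)
    rows = ''.join(n + str(n.count('1') % 2) for n in nibbles)  # 50-bit data section
    # column parity over the 4 data columns (the added row-parity bits are ignored)
    cols = ''.join(str(sum(int(n[c]) for n in nibbles) % 2) for c in range(4))
    stream = '1' * 9 + rows + cols + '0'  # 9-bit header + data + column parity + stop bit
    assert len(stream) == 64
    word = '%016X' % int(stream, 2)
    return word[:8], word[8:]  # contents of pages 4 and 5

print(em4x_to_hitag2_pages('0000001EBC'))  # -> ('FF800000', '003EDF10')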

Evenly spaced points along a line while Plotting CSV data with gnuplot

I would like to plot a csv file with gnuplot. Instead of a line, I want to use points that are evenly distributed along the path of the curve. However, the data in the csv file is not evenly distributed, e.g. like this
x,p
0,2
1,4
1.1,4.2
1.2,4.4
2.8,7.6
2.85,7.7
4,10
It should be possible to achieve this, but how?
Here is an example graph where I plot every nth point. Because my numerical solution is so good :-) you only see one line, so I would like to have markers on the one curve. But the points should be distributed equidistantly (the current distribution is only due to the nature of the analytical solution).
In the development version of gnuplot (5.5) this can be done exactly as asked. smooth path does what you would expect, and pn 7 tells it to place exactly 7 evenly spaced points.
$DATA << EOD
0,2
1,4
1.1,4.2
1.2,4.4
2.8,7.6
2.85,7.7
4,10
EOD
set log y
set key top left
set datafile separator comma
plot $DATA smooth path with lp pn 7 title "smooth path pn 7", \
$DATA with points pt 6 ps 2 title "original points"
The current version 5.4 does not offer smooth path, but if your data points are sufficiently close to lying on a smooth curve, one of the other smoothing options, e.g. smooth mcsplines, may be acceptable; see the sketch below.
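For instance, reusing the $DATA block and settings above (an untested sketch: set samples controls how many points the smoothed curve is resampled at, and the spacing is even in x rather than along the path):

set samples 7
plot $DATA smooth mcsplines with linespoints pt 7 title "smooth mcsplines", \
$DATA with points pt 6 ps 2 title "original points"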
For the record, I think this is not a good thing to do. It is dishonest to hide the actual data points while showing artificially even points instead. It misleads the viewer as to where the curve is reliable and where it may not be.

Machine Learning for gesture recognition with Myo Armband

I'm trying to develop a model to recognize new gestures with the Myo Armband. (It's an armband with 8 electrical sensors that can recognize 5 hand gestures.) I'd like to record the sensors' raw data for a new gesture and feed it to a model so it can recognize it.
I'm new to machine/deep learning and I'm using CNTK. I'm wondering what would be the best way to do it.
I'm struggling to understand how to create the trainer. The input data looks something like this. I'm thinking of using 20 sets of these 8 values (each between -127 and 127), so one label is the output for 20 sets of values.
I don't really know how to do that; I've seen tutorials where images are linked with their labels, but it's not the same idea. And even after the training is done, how can I prevent the model from recognizing this one gesture no matter what I do, since it's the only one it has been trained on?
An easy way to get you started would be to create 161 columns (8 columns for each of the 20 time steps + the designated label). You would rearrange the columns like
emg1_t01, emg2_t01, emg3_t01, ..., emg8_t20, gesture_id
This will give you the right 2D format to use different algorithms in sklearn, as well as a feed-forward neural network in CNTK. You would use the first 160 columns to predict the 161st one.
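Here is a minimal sketch of that 2D setup with sklearn, using synthetic stand-in data (the classifier choice is mine, purely illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in: 50 recordings, each flattened to emg1_t01 ... emg8_t20
X = np.random.randint(-127, 128, size=(50, 160))  # values in [-127, 127]
y = np.random.randint(0, 5, size=50)              # gesture_id column

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
print(clf.predict(X[:3]))  # predicted gesture ids for the first 3 recordings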
Once you have that working you can model your data to better represent the natural time series order it contains. You would move away from a 2D shape and instead create a 3D array to represent your data.
The first axis shows the number of samples
The second axis shows the number of time steps (20)
The third axis shows the number of sensors (8)
With this shape you're all set to use a 1D convolutional model (CNN) in CNTK that traverses the time axis to learn local patterns from one step to the next.
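With the flat layout above, moving to this 3D shape is a single numpy reshape, as in this sketch (synthetic data again; the column order emg1_t01 ... emg8_t20 means all 8 sensors are grouped per time step):

import numpy as np

# synthetic stand-in: 50 recordings flattened to 160 columns, as above
X = np.random.randint(-127, 128, size=(50, 160))

# axis 0: samples, axis 1: time steps (20), axis 2: sensors (8)
X_3d = X.reshape(-1, 20, 8)
print(X_3d.shape)  # (50, 20, 8)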
You might also want to look into RNNs which are often used to work with time series data. However, RNNs are sometimes hard to train and a recent paper suggests that CNNs should be the natural starting point to work with sequence data.

Can I test Tesseract OCR in the Windows command line?

I am new to Tesseract OCR. I tried converting an image to TIF and running it through Tesseract using cmd in Windows to see the output, but I couldn't get it to work. Can you help me? What command should I use?
Here is my sample image:
The simplest tesseract.exe syntax is tesseract.exe inputimage output-text-file.
The assumption here is that tesseract.exe has been added to the PATH environment variable.
You can add the -psm N argument if your input image is particularly hard to recognize.
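For instance, with the image from this question (an assumption on my part that a single uniform block of text, -psm 6 per the list below, suits it):
tesseract.exe ECL8R.png out -psm 6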
I see that the regular syntax (without any -psm switch) works well enough with the image you attached, unless you need a higher level of accuracy.
Note that non-English characters (such as the symbol next to "prescription") are not recognized; my default installation only contains the English training data.
Here's the tesseract syntax description:
C:\Users\vish\Desktop>tesseract.exe
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
And here's the output for your image (NOTE: When I downloaded it, it converted to a PNG image):
C:\Users\vish\Desktop>tesseract.exe ECL8R.png out.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
C:\Users\vish\Desktop>type out.txt.txt
1 Project Background
A prescription (R) is a written order by a physician or medical doctor to a pharmacist in the form of
medication instructions for an individual patient. You can't get prescription medicines unless someone
with authority prescribes them. Usually, this means a written prescription from your doctor. Dentists,
optometrists, midwives and nurse practitioners may also be authorized to prescribe medicines for you.
It can also be defined as an order to take certain medications.
A prescription has legal implications; this means the prescriber must assume his responsibility for the
clinical care ofthe patient.
Recently, the term "prescription” has known a wider usage being used for clinical assessments,

Tesseract confuses two numbers

I'm writing an application to scan numbers from an image.
The numbers are using the OCR-B font and may also contain + and > characters.
This is my source image:
The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCRB training files for Tesseract, I decided to train it myself.
I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.
Then I did all steps described here to create the other necessary files.
Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1s are mistaken for 8s and vice-versa. The command used to process the image was
$ tesseract esr2c.tif ocrb-esr2c -l ocrb
and the output for the source image was
0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20
If you swap all 1s and 8s and compare it to the source image, the output would be correct (except for the last two letters which I can ignore).
How could this happen? Did I do some mistake in the training process? How can I fix it?
It's likely that somewhere in your box file there are incorrect values (characters) for 1 and 8. You can verify this using the jTessBoxEditor program. If so, correct them, regenerate the language data file (a command sketch below), and try again.
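For the legacy (3.x-era) training pipeline, regenerating the data file looks roughly like this (a sketch, not verified against your setup: the file names assume a language called ocrb and a training image named ocrb.font.exp0.tif, and a font_properties file is assumed to exist):

tesseract ocrb.font.exp0.tif ocrb.font.exp0 box.train
unicharset_extractor ocrb.font.exp0.box
mftraining -F font_properties -U unicharset -O ocrb.unicharset ocrb.font.exp0.tr
cntraining ocrb.font.exp0.tr
# rename the generated inttemp, pffmtable, shapetable and normproto to ocrb.inttemp etc., then:
combine_tessdata ocrb.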
I trained Tesseract 2.04 for the OCR-A Extended font after a month of effort. It works very well, showing above 90% accuracy with font size 14.
The training image should be a high-contrast image.
Use the "GIMP" image editor and do the following:
Menu Colors -> Info -> Histogram: read the Std Deviation value
Colors -> Threshold: enter that Std Deviation value as the threshold value
Save the image
Use it for training.
Check and edit your box file using "qt-box-editor-1.06.exe". It is very easy to use.
Check all the boxes and the characters in them.
This is very important: somewhere in your box file there are incorrect characters for 1 and 8.
Then run the other training commands.