tesseract fails at simple number detection - ocr

I want to perform OCR on images like this one: a table of numerical data with commas as decimal separators.
The image is not noisy, the contrast is good, and the text is black on a white background.
As an additional preprocessing step, to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues), and pass only that single-cell image to tesseract.
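A minimal sketch of that per-cell preprocessing (the filename cell.png, the padding width, and the character whitelist are illustrative choices, not taken from the question):

import cv2
from pytesseract import image_to_string

cell = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)       # one cropped table cell
_, binarized = cv2.threshold(cell, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
padded = cv2.copyMakeBorder(binarized, 20, 20, 20, 20,    # white border around the cell
                            cv2.BORDER_CONSTANT, value=255)
print(image_to_string(padded, config="--psm 7 -c tessedit_char_whitelist=0123456789,"))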
I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:
Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.
There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a python script with pytesseract that loops over all combinations of --oem from 0 to 3 and all values of --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
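Such a sweep might look roughly like this (cell.png is a placeholder filename; combinations that tesseract rejects are simply skipped):

import itertools
import cv2
import pytesseract
from pytesseract import image_to_string

img = cv2.imread("cell.png")   # one of the cell images
for oem, psm, lang in itertools.product(range(4), range(14), ("eng", "deu")):
    try:
        txt = image_to_string(img, lang=lang, config=f"--oem {oem} --psm {psm}")
    except pytesseract.TesseractError:
        continue               # skip combinations that tesseract rejects
    print(oem, psm, lang, repr(txt.strip()))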
Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".
Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
Any suggestions on what else might help improve the output quality of tesseract here?
My tesseract version:
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE

Solution
Divide the image into horizontal row slices
Apply division-normalization to each row
Set psm to 6 (Assume a single uniform block of text.)
Read
The table in the original image consists of six rows of values. On each iteration we take one row slice, apply division-normalization, and read it; the slice height is derived from h/5 and then adjusted, as shown below.
We need to set the slice start and end indexes carefully.
import cv2
from pytesseract import image_to_string
img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
start_index = 0
end_index = int(h/5)
Question: Why do we declare start and end indexes?
We want to read a single row in each iteration. Let's look at an example:
The current image height and width are 645 and 1597 pixels.
Divide the image based on the indexes:

| start-index | end-index       |
|-------------|-----------------|
| 0           | 129             |
| 129         | 258 (129 + 129) |
| 258         | 387 (258 + 129) |
| 387         | 516 (387 + 129) |
Let's see whether these crops are readable.
No, they are not suitable for reading; maybe a small adjustment to the slice height will help:
| start-index | end-index      |
|-------------|----------------|
| 0           | 129 - 20 = 109 |
| 109         | 218            |
| 218         | 327            |
| 327         | 436            |
| 436         | 545            |
| 545         | 654            |
Now they are suitable for reading.
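These adjusted start and end indexes follow the same arithmetic as the script at the end (h = 645):

h = 645
step = int(h / 5) - 20       # 109
start, end = 0, step
for _ in range(6):
    print(start, end)        # 0 109, 109 218, ..., 545 654
    start, end = end, end + step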
When we apply the division-normalization to each row (using the same start and end indexes as above), the rows become clean enough for tesseract.
Now when we read:
1,7 | 57 | 71 | 59 | .70 | 65
| 57 | 1,5 | 71 | 59 | 70 | 65
| 71 | 59 | 1,3 | 57 | 70 | 60
| 71 | 59 | 56 | 1,3 | 70 | 60
| 72 | 66 | 71 | 59 | 1,2 | 56
| 72 | 66 | 71 | 59 | 56 | 4,3
Code:
import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
# print(img.shape[:2])

start_index = 0
end_index = int(h / 5) - 20

for i in range(0, 6):
    # print("{}->{}".format(start_index, end_index))
    gry_crp = gry[start_index:end_index, 0:w]       # crop the current row
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)  # large blur estimates the background
    div = cv2.divide(gry_crp, blr, scale=192)       # division-normalization
    txt = image_to_string(div, config="--psm 6")
    print(txt)
    start_index = end_index                         # advance to the next row
    end_index = start_index + int(h / 5) - 20

Related

Print sum of rows and other row value for each column in awk

I have a csv file structured as the one below:
| Taiwan | | US |
| ASUS | MSI | DELL | HP
------------------------------------------
CPU | 50 | 49 | 43 | 65
GPU | 60 | 64 | 75 | 54
HDD | 75 | 70 | 65 | 46
RAM | 60 | 79 | 64 | 63
assembled| 235 | 244 | 254 | 269
and I have to use an awk script to print a comparison between the sum of the prices of the individual computer parts (rows 3 to 6) and the assembled computer price (row 7), also displaying the country each brand comes from. The printed result in the terminal should be something like:
Taiwan Asus 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
Where the third column is the sum of the CPU, GPU, HDD and RAM prices and the fourth column is the assembled price from row 7 for each computer brand.
So far I have been able to sum the individual columns by adapting the solution provided in the post linked below, but I don't know how to display the result in the desired format. Could anyone help me with this? I'm a bit desperate at this point.
Sum all values in each column bash
This is the content of the original csv file represented at the top of this message:
,Taiwan,,US,
,ASUS,MSI,DELL,HP
CPU,50,49,43,65
GPU,60,64,75,54
HDD,75,70,65,46
RAM,60,79,64,63
assembled,235,244,254,269
Thank you very much in advance.
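For reference, the requested comparison boils down to this arithmetic (a small Python sketch, not the awk answer below; file.csv stands for the csv shown above):

import csv

with open("file.csv", newline="") as fh:
    rows = list(csv.reader(fh))

countries = rows[0][1:]     # ['Taiwan', '', 'US', '']
brands = rows[1][1:]        # ['ASUS', 'MSI', 'DELL', 'HP']
parts = rows[2:6]           # the CPU, GPU, HDD and RAM rows
assembled = rows[6][1:]     # the assembled prices

country = ""
for i, brand in enumerate(brands):
    country = countries[i] or country           # carry the last non-empty country forward
    total = sum(int(r[i + 1]) for r in parts)   # sum of the four part prices
    print(country, brand, total, assembled[i], sep="\t")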
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
NR == 2 {
    # p[] still holds row 1 (the country row) here: pair each brand with its
    # country, carrying the country forward across the empty cells
    for (i=2; i<=NF; i++) {
        corp[i] = (p[i] == "" ? p[i-1] : p[i]) OFS $i
    }
}
NR > 2 {
    # p[] holds the previous row, so by the last record the totals cover the
    # CPU, GPU, HDD and RAM rows but not the final "assembled" row
    for (i=2; i<=NF; i++) {
        tot[i] += p[i]
    }
}
{ split($0,p) }   # remember the current row for the next record
END {
    # p[] now holds the last row: the assembled prices
    for (i=2; i<=NF; i++) {
        print corp[i], tot[i], p[i]
    }
}
$ awk -f tst.awk file
Taiwan ASUS 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269

How to flip the trigonometric circle around the 45° angle?

I am looking for a mathematical function capable of flipping the trigonometric circle around the "axis" located on the 45° vector (or pi/4 radians).
Such as :
|---------|---------|
| x | f(x) |
|---------|---------|
| 0 | 90 |
| 45 | 45 |
| 90 | 0 |
| 135 | 315 |
| 180 | 270 |
| 225 | 225 |
| 270 | 180 |
| 315 | 135 |
|---------|---------|
Just to give a few examples.
So basically I need to turn a compass rose into a trigonometric circle.
I only found things like "180 - angle", but that is not the kind of rotation I'm looking for.
Is it possible?
The main difficulty in your problem is the "jump" in the result, where f(90) should be 0 but f(91) should be 359. A solution is to use a jumpy operator--namely, the modulus operator, which is often represented by the % character.
In a language like Python, where the modulus operator always returns a positive result, you could use
def f(x):
    return (90 - x) % 360
Some languages can return a negative result if the number before the operator is negative. You can treat that by adding 360 before taking the modulus, and this should work in all languages with a modulus operator:
def f(x):
    return (450 - x) % 360
You can demonstrate either function with:
for a in range(0, 360, 45):
    print(a, f(a))
which prints what was desired:
0 90
45 45
90 0
135 315
180 270
225 225
270 180
315 135
If your language does not have the modulus operator, it can be simulated by using the floor or int function. Let me know if you need a solution using either of those instead of modulus.
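For completeness, a sketch of that floor-based variant (same mapping, no % operator):

import math

def f(x):
    # (450 - x) mod 360, written with floor instead of the % operator
    y = 450 - x
    return y - 360 * math.floor(y / 360)

print(f(0), f(90), f(135))   # 90 0 315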

read csv with column-name x value pairs

I have a long (csv) file with "column-name x value" pairs which I would like to read into a pandas.DataFrame
user_id col val
00008901 1 55
00008901 2 66
00011501 1 77
00011501 3 88
00011501 4 99
The result should look like this:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I tried to read it into a list and create a DataFrame from it, but pandas crashed as I have 4.5 million elements.
What's the best way to do that? Ideally directly with read_csv.
First use read_csv to create the DataFrame:
df = pd.read_csv('file.csv')
Then use set_index with unstack:
df1 = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
Another solution is pivot, replacing NaN with 0 by fillna and finally casting to int:
df1 = df.pivot(index='user_id', columns='col', values='val').fillna(0).astype(int)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
If you get the error:
"ValueError: Index contains duplicate entries, cannot reshape"
It means you have some duplicates, so the fastest solution is groupby with unstack and an aggregate function like mean or sum:
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
It is easier to see with a slightly changed csv:
print (df)
user_id col val
0 8901 1 55
1 8901 2 66
2 11501 1 77 > duplicates -> 11501 and 1
3 11501 1 151 > duplicates -> 11501 and 1
4 11501 3 88
5 11501 4 99
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 114 0 88 99
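As an aside, the desired output keeps the leading zeros of user_id, but read_csv parses that column as integers by default (hence 8901 above). If the zeros matter, the column can be read as a string, for example:

import pandas as pd

# read user_id as a string so leading zeros such as 00008901 survive
df = pd.read_csv('file.csv', dtype={'user_id': str})
df1 = df.set_index(['user_id', 'col'])['val'].unstack(fill_value=0)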
Actually I thought I had no duplicates, but found out that I really have some ...
I could not use .mean() as it is a categorical value, but I solved the problem by first looking at the sorted table and then just keeping the last entry, then applying the (great!) solution above, which I still have to fully understand ;-)
df = df.sort_values(['user_id','col'])   # optional, for debugging
df.drop_duplicates(subset=['user_id','col'], keep='last', inplace=True)
df_table = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
You can't directly read in the required structure using read_csv. But you can use pivot_table function to convert to the desired structure.
df = pd.read_csv('filepath/your.csv')
df = pd.pivot_table(df, index='user_id', columns='col', values='val', aggfunc='mean').reset_index()
The output will be like
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I don't think it is possible to use read_csv to parse the csv file.
You can create a data structure such as dictionary and use it to create a dataframe:
import pandas as pd
from collections import defaultdict
import csv

columns = 4
delimiter = ','
data_dict = defaultdict(lambda: [0] * columns)   # each user starts with a row of zeros

with open("my_csv.csv") as csv_file:
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    for row in reader:
        row_id = row["user_id"]
        col = int(row["col"]) - 1      # columns are 1-based in the file
        val = int(row["val"])
        data_dict[row_id][col] = val

df = pd.DataFrame(list(data_dict.values()), index=list(data_dict.keys()),
                  columns=range(1, columns + 1))
For a csv file that contains:
user_id,col,val
00008901,1,55
00008901,2,66
00011501,1,77
00011501,3,88
00011501,4,99
The output is:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99

How to query out hours that are more than 72 hours in a derived table?

I am trying to query only the root causes with more than 72 hours and, when a root cause has 72 hours or more, add those hours up. For example:
I have root cause A = 78 hours and root cause B = 100 hours; since these two are more than 72, they should add up to 178 hours as "MNPT". Anything that is less than 72 adds up to make up routine NPT.
I am using a derived table to query this, but the outcome still displays the hours including those that are less than 72.
Select operation_uid, sum (npt_duration) as mnpt from fact_npt_root_cause where npt_duration>72 group by root_cause_code having sum (npt_duration)>72
See this table
| ROOT CAUSE CODE | NPT Duration |
|-----------------|--------------|
| A               | 23           |
| B               | 78           |
| C               | 45           |
| D               | 100          |
| E               | 90           |
When the root cause value is more than 72 hours, add up those values, for example:
root cause codes B, D, E = 78 + 100 + 90 = 268 as MNPT
When the root cause value is less than 72 hours, add up those values: 23 + 45 = 68 as routine NPT
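For illustration only, the grouping described above amounts to this (a small Python sketch over the sample values, not the SQL being asked for):

# Sample rows from the table above: (root_cause_code, npt_duration)
rows = [("A", 23), ("B", 78), ("C", 45), ("D", 100), ("E", 90)]

mnpt = sum(d for _, d in rows if d > 72)          # 78 + 100 + 90 = 268
routine_npt = sum(d for _, d in rows if d <= 72)  # 23 + 45 = 68
print(mnpt, routine_npt)                          # 268 68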
I'm not sure what you want to do, but querying operation_uid and grouping by root_cause_code assumes that you always have the same operation_uid for a given root_cause_code... Don't you rather mean:
SELECT operation_uid,
       SUM(npt_duration) AS mnpt
FROM fact_npt_root_cause
WHERE npt_duration > 72
GROUP BY operation_uid, root_cause_code
HAVING SUM(npt_duration) > 72;

MySQL - Usage of 'Allow NULL'

Does usage of NULL to fill empty cells on my tables make searches / queries faster?
For example, is this
90 2 65 2011-04-08 NULL NULL
134 2 64 2011-04-13 NULL 07:00:00
135 2 64 2011-04-13 NULL 07:00:00
136 2 64 2011-04-13 NULL 22:45:00
137 2 64 2011-04-14 NULL 19:30:00
better than
90 2 65 2011-04-08
134 2 64 2011-04-13 07:00:00
135 2 64 2011-04-13 07:00:00
136 2 64 2011-04-13 22:45:00
137 2 64 2011-04-14 19:30:00
If anyone could tell me any specific advantage to using NULL (performance, good practice, etc) that would be much appreciated.
There is a semantic difference.
NULL means "value unknown or not applicable".
If this describes the data for that column, then use it.
The empty string means "value known, and it's 'nothingness'".
If this describes the data for that column, then use it. Of course, this applies only to strings; it's unusual to have such a scenario for other data types, but usually for numeric fields a value of 0 would be appropriate here.
In short, it depends mostly on what your fields mean. Worry about performance and profiling later, when you know what data you're representing. (NULL vs "" is never going to be your bottleneck, though.)