How to improve OCR accuracy? - ocr

I have two images, shown below. A.png is read perfectly by tesseract, but B.png gets terribly bad accuracy even though it is similar to A.png. How can I improve the accuracy? I have no idea where to start debugging.
A.png
B.png
Run OCR
# tesseract -v
tesseract 4.1.1-rc2-22-g08899
# tesseract A.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
第 3 期 決算 公告 令 和 2 年 2 月 7 日
大 阪 市 中 央 区 南 新町 一 丁目 3 番 10 号
株 式 会 社 Link_Mobile
代表 取締 役 佐々 木 勉
貸借 対照 表 の 要旨 (平成 31 年 3 月 31 日 現在 }
# tesseract B.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
。 人 加計
区 三 6 番 12 号
中 野 駅 前 ビル 5 | 、
am 人 mw
に て
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 }
Update 1
Were both scanned using the same scanner, and at the same resolution?
Yes. Both images were cut out of the same original PDF.
Are you taking advantage of any APIs which Tesseract exposes for pre-processing the images before doing OCR?
No, I did not know about that. I am looking into it now.

It improved. I read the Tesseract documentation and rescaled the image.
Rescaling
Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
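The rescaling itself can be done with Pillow; here is a minimal sketch (the scale factor of 3 is an assumption on my part, pick whatever brings the effective resolution to roughly 300 dpi), saving the result as the B2.png used in the run below:
from PIL import Image

img = Image.open("B.png")
# Upscale so the characters are large enough for tesseract
# (aiming for an effective resolution of roughly 300 dpi).
scale = 3  # assumed factor; adjust to your input size
img = img.resize((img.width * scale, img.height * scale), Image.LANCZOS)
img.save("B2.png")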
Rescaled image
Run OCR
# tesseract B2.png stdout -l jpn --psm 6
第 54 期 決 算 公 告 _ 令 和 2 年 1 月 29 日
東京 都 中 野 区 中 野 三 丁目 36 番 12 号
中 野 駅 前 ビル 5 F
株 式 会 社 コ ー エ ー テ クニ カ
代表 取締 役 小 空 _ 修
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 )

Related

Python OCR Tesseract, find a certain word in the image and return me the coordinates

I'd like your help. I've been trying for a few months to write code that finds a given word in an image and returns the coordinates where that word appears in the image.
I was trying this using OpenCV and Tesseract OCR, but I was not successful. Could someone here in the community help me?
I'll leave an image here as an example:
Here is something you can start with:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'C:\<path-to-your-tesseract>\Tesseract-OCR\tesseract.exe'
img = Image.open("img.png")
data = pytesseract.image_to_data(img, output_type='dict')
boxes = len(data['level'])
for i in range(boxes):
    if data['text'][i] != '':
        print(data['left'][i], data['top'][i], data['width'][i], data['height'][i], data['text'][i])
If you have difficulties with installing pytesseract see: https://stackoverflow.com/a/53672281/18667225
Output:
153 107 277 50 Palavras
151 197 133 37 com
309 186 154 48 R/RR
154 303 126 47 Rato
726 302 158 47 Resto
154 377 144 50 Rodo
720 379 159 47 Arroz
152 457 160 48 Carro
726 457 151 46 Ferro
154 532 142 50 Rede
726 534 159 47 Barro
154 609 202 50 Parede
726 611 186 47 Barata
154 690 124 47 Faro
726 685 288 50 Beterraba
154 767 192 47 Escuro
726 766 151 47 Ferro
I managed to find the solution and I'll post it here for you:
import pytesseract
import cv2
from pytesseract import Output
pytesseract.pytesseract.tesseract_cmd = r'C:\<path-to-your-tesseract>\Tesseract-OCR\tesseract.exe'
filepath = 'image.jpg'
image = cv2.imread(filepath, 1)
# converting image to grayscale image
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# converting to binary image by Thresholding
# this step is necessary for a color image, because if you skip it
# tesseract will not be able to detect the text correctly and will give an incorrect result
threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# displays the image
cv2.imshow('threshold image', threshold_img)
# Holds the output window until the user presses a key
cv2.waitKey(0)
# Destroying windows present on the screen
cv2.destroyAllWindows()
# setting parameters for tesseract
custom_config = r'--oem 3 --psm 6'
# now feeding image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
# Rectangle color (red, in BGR); "vermelho" is Portuguese for red
vermelho = (0, 0, 255)
# Print all the keys that were found
print(details.keys())
print(details['text'])
# Loop over all detected texts
for i in range(len(details['text'])):
    # If it finds the text "UNIVERSIDADE", print the coordinates and draw a rectangle around the word
    if details['text'][i] == 'UNIVERSIDADE':
        print(details['text'][i])
        print(f"left: {details['left'][i]}")
        print(f"top: {details['top'][i]}")
        print(f"width: {details['width'][i]}")
        print(f"height: {details['height'][i]}")
        cv2.rectangle(image, (details['left'][i], details['top'][i]), (details['left'][i]+details['width'][i], details['top'][i]+details['height'][i]), vermelho)
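To actually see the rectangle, you can display or save the annotated image afterwards (a small addition of mine, not part of the original answer; the output file name is arbitrary):
# show the annotated image and also write it to disk
cv2.imshow('word found', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
cv2.imwrite('result.jpg', image)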

Decoding a hex file

I would like to use a web service that delivers a binary file with some data. I know the result, but I don't know how I can decode this binary with a script.
Here is my binary:
https://pastebin.com/3vnM8CVk
0a39 0a06 3939 3831 3438 1206 4467 616d
6178 1a0b 6361 7264 6963 6f6e 5f33 3222
0d54 6865 204f 6c64 2047 7561 7264 2a02
....
Some parts are in ASCII, so they are easy to catch; at the end of the file you get the vehicle names in ASCII plus some data. It should be kill/victory/battle/XP/money data, but I don't understand how to decode these hex values. I tried to compare two vehicles that have the same number of kills, but I don't see any match.
Is there a way to decode this data?
Thanks :)
After a year I started looking for a solution again, so here is the packet structure as I guessed it (I still don't know what the parts between [ ] are for):
[52 37 08 01 10] 4E [18] EA [01 25] AB AA AA 3E [28] D4 [01 30] EC [01 38] 88 01 [40] 91 05 [48] 9F CA 22 [50] F5 C2 9A 02 [5A 12]
Victories     : 4E
Battles       : EA
Victory Ratio : AB AA AA 3E
Deaths        : D4
Respawns      : EC
Air target    : 88 01
Ground Target : 91 05
Xp            : 9F CA 22
Money earned  : F5 C2 9A 02
So here is the result :
Victory : 78
Battles : 234
Victory Ratio : ? (should be around 33%)
Deaths : 212
Respawns : 236
Air Target : 136
Ground Target : 657
Xp : ? (should be around 566.56k)
Money : ? (should be around 4.63M)
Is there a special way to calculate the result of a long hex value like this?
F5 C2 9A 02 (should be around 4.63M)
Let me tell you a bit more:
I know the result, but I don't know how to calculate it from these hex values in the packet.
If I check a packet with a small amount of money or XP, so that each value fits in a single hex byte:
[52 1E 08 01 10] 01 [18] [01 25] 00 00 80 3F [28] 01 [30] 01 [48] 24 [50] 6E [5A 09]
6E = 110 Money earned
24 = 36 XP earned
Another example:
[52 21 08 01 10] 02 [18] 03 [25] AB AA 2A 3F [28] 02 [30] 03 [40] 01 [48] 78 [50] C7 08 [5A 09]
XP earned = hex 78 = 120
Money earned = hex C7 08 = 705
How can C7 08 give 705 in decimal?
Here is the full content just in case, but I know how to isolate the parts I need, so I don't have to decode all of this hex data:
https://pastebin.com/vAKPynNb
What you are asking is essentially how to reverse engineer a binary file. There are already plenty of threads on SO:
Reverse engineer a binary dictionary file to extract strings
Tools to help reverse engineer binary file formats
https://reverseengineering.stackexchange.com/questions/3495/what-tools-exist-for-excavating-data-structures-from-flat-binary-files
http://www.iwriteiam.nl/Ha_HTCABFF.html
The final takeaway from all of them is that there is no single ready-made solution; you need to spend the effort to figure it out. There are tools to help you, but don't expect a magic-wand tool to hand you the structure/data.
Any kind of file read operation is done in text or binary mode with basic file handles, and some languages offer typed reads of int, float, etc., or arrays of them.
The complex operations behind these reads are almost always hidden from normal users, so the user has to learn more when it comes to reading and writing data structures.
In this case, OFFSET and SEEK are the key words: one must find the right values and act accordingly. Once the data is read, it must also be converted to a suitable data type.
The following code shows the basics of these operations: it writes some data, then reads a block back to recover the numbers. It is written in PHP since the OP commented on the question that they use PHP.
The offset is calculated from these byte sizes to be 11 (1 + 2 + 4 + 4): char: 1 byte, short: 2 bytes, int: 4 bytes, float: 4 bytes.
<?php
$filename = "testdata.dat";
$filehandle = fopen($filename, "w+");
$data=["test string","another test string",77,777,77777,7.77];
fwrite($filehandle,$data[0]);
fwrite($filehandle,$data[1]);
$numbers=array_slice($data,2);
fwrite($filehandle,pack("c1s1i1f1",...$numbers));
fwrite($filehandle,"end"); // gives 3 to offset
fclose($filehandle);
$filename = "testdata.dat";
$filehandle = fopen($filename, "rb+");
$offset=filesize($filename)-11-3;
fseek($filehandle,$offset);
$numberblock= fread($filehandle,11);
$numbersback=unpack("c1a/s1b/i1c/f1d",$numberblock);
var_dump($numbersback);
fclose($filehandle);
?>
Once this example is understood, the rest is to find the data structure in the file in question. I have written another example, but it uses assumptions; I leave it to readers to work out which assumptions I made here. Be careful though: I know nothing about the real structure, and the values will not be correct.
<?php
$filename = "testfile";
$filehandle = fopen($filename, "rb");
$offset=17827-2*41; //filesize minus 2 user area
fseek($filehandle,$offset);
print $user1 = fread($filehandle, 41);echo "<br>";
$user1pr=unpack("s1kill/s1victory/s1battle/s1XP/s1Money/f1Life",$user1);
var_dump($user1pr); echo "<br>";
fseek($filehandle,$offset+41);
print $user2 = fread($filehandle, 41);echo "<br>";
$user2pr=unpack("s1kill/s1victory/s1battle/i1XP/i1Money/f1Life",$user2);
var_dump($user2pr); echo "<br>";
echo "<br><br>";
$repackeduser2=pack("s3i2f1",$user2pr["kill"],$user2pr["victory"],
$user2pr["battle"],$user2pr["XP"],$user2pr["Money"],
$user2pr["Life"]
);
print $user2 . "<br>" .$repackeduser2;
print "<br>3*s1=6bytes, 2*i=6bytes, 1*f=*bytes (machine dependent)<br>";
print pack("s1",$user2pr["kill"]) ."<br>";
print pack("s1",$user2pr["victory"]) ."<br>";
print pack("s1",$user2pr["battle"]) ."<br>";
print pack("i1",$user2pr["XP"]) ."<br>";
print pack("i1",$user2pr["Money"]) ."<br>";
print pack("f1",$user2pr["Life"]) ."<br>";
fclose($filehandle);
?>
PS: pack and unpack use machine-dependent sizes for some data types such as int and float, so be careful when working with them. Read the official PHP pack and PHP unpack manuals.
This looks like the hexdump of a binary file. Some methods of converting the hex to strings resulted in the same scrambled output; only some lines are readable, like this...
Dgamaxcardicon_32" The Old Guard
As @Tarun Lalwani said, you would have to know the structure of this data to get it in plaintext.
If you have access to the raw binary, you could try using strings: https://serverfault.com/questions/51477/linux-command-to-find-strings-in-binary-or-non-ascii-file
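One more decoding worth trying on the numeric groups, purely an assumption on my part (nothing in the dump confirms the format): the bracketed bytes look like Protocol Buffers field tags, and the expected values decode cleanly as little-endian base-128 "varints". A minimal Python sketch, where decode_varint is a hypothetical helper of mine, not a library function:
import struct

def decode_varint(data):
    # little-endian base-128: 7 value bits per byte, high bit set = "more bytes follow"
    value, shift = 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return value

print(decode_varint(bytes.fromhex("F5C29A02")))   # 4628853 -> close to the expected ~4.63M money
print(decode_varint(bytes.fromhex("9FCA22")))     # 566559  -> close to the expected ~566.56k XP
# the four bytes after [25] read as a little-endian 32-bit float:
print(struct.unpack("<f", bytes.fromhex("ABAAAA3E"))[0])  # 0.333... -> the ~33% victory ratio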

rjson::fromJSON returns only the first item

I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.
I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
Here's an example that mimics my problem:
> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$3x
[1] 20
$6y
[1] 23
The desired output is something like:
# a data frame for both variables
3x 6y
1 20 23
2 60 50
or,
# a data frame for each variable
3x
1 20
2 60
6y
1 23
2 50
What you are looking for is actually a combination of lapply and some application of rbind or related.
I'll extend your data a little, just to have more than 2 elements.
eg <- c("{\"3x\": 20, \"6y\": 23}",
"{\"3x\": 60, \"6y\": 50}",
"{\"3x\": 99, \"6y\": 72}")
library(jsonlite)
Using base R, we can do
do.call(rbind.data.frame, lapply(eg, fromJSON))
# X3x X6y
# 1 20 23
# 2 60 50
# 3 99 72
You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.
Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps
eg <- c(eg, "{\"3x\": 99}")
then you'll notice that the base R solution no longer works by default.
do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
# numbers of columns of arguments do not match
There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:
library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
# `3x` `6y`
# <int> <int>
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA
Though you cannot access them quite as directly with the dollar operator, you can still use [[ or backticks.
eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99

Universal App with localized app name crashes

After a small update to my app, it crashes quite often with the following stack trace:
Frame Image Function Offset
0 KERNELBASE.dll RaiseException 0x00000036
1 mrmcorer.dll Microsoft::Resources::ReportFatalException 0x00000052
2 mrmcorer.dll Windows::ApplicationModel::Resources::Core::CResourceManagerFactory::get_Current 0x0002b1b0
3 mrmcorer.dll Windows::ApplicationModel::Resources::Core::CResourceManagerFactory::GetCurrentResourceManagerInternal 0x0000004a
4 mrmcorer.dll _GetResourceManagerForCurrentApplicationInternal 0x00000026
5 mrmcorer.dll GetStringValueForManifestField 0x00000140
6 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplication::SetAppDisplayName 0x00000048
7 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplication::InitializeApplicationServer 0x0000012a
8 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplicationFactory::RunInternal 0x00000192
9 twinapi.appcore.dll Windows::ApplicationModel::Core::CoreApplicationFactory::Run 0x00000012
10 backgroundTaskHost.exe main 0x0000006c
11 backgroundTaskHost.exe __mainCRTStartup 0x000000ec
12 backgroundTaskHost.exe mainCRTStartup 0x0000000e
13 ntdll.dll RtlUserThreadStart 0x00000016
The app name is localized in several languages; the default language is "en".
I localized the "Display name" and "Package display name" in the Appxmanifest like this: "ms-resource:AppName".
Unfortunately, I have not been able to reproduce the crash yet.

Automatically sum numeric columns and print total

Given the output of git ... --stat:
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)
I wanted to produce the sum of the numeric columns but preserve the formatting of the line. In the interest of generality, I produced this awk script that automatically sums any numeric columns and produces a summary line:
{
    for (i = 1; i <= NF; ++i) {
        if ($i + 0 != 0) {
            numeric[i] = 1;
            total[i] += $i;
        }
    }
}
END {
    # re-use non-numeric columns of last line
    for (i = 1; i <= NF; ++i) {
        if (numeric[i])
            $i = total[i]
    }
    print
}
Yielding:
44 files changed, 1080 insertions(+), 338 deletions(-)
Awk has several features that simplify the problem, like automatic string->number conversion, all arrays as associative arrays, and the ability to overwrite auto-split positional parameters and then print the equivalent lines.
Is there a better language for this hack?
Perl - 47 char
Inspired by ChristopheD's awk solution. Used with the -an command-line switch. 43 chars + 4 chars for the command-line switch:
$i-=@a=map{($b[$i++]+=$_)||$_}@F}{print"@a"
I can get it to 45 (41 + -ap switch) with a little bit of cheating:
$i=0;$_="Ctrl-M@{[map{($b[$i++]+=$_)||$_}@F]}"
Older, hash-based 66 char solution:
@a=(),s#(\d+)(\D+)#$b{$a[@a]=$2}+=$1#gefor<>;print map$b{$_}.$_,@a
Ruby — 87
puts ' '+[*$<].map(&:split).inject{|i,j|[0,3,5].map{|k|i[k]=i[k].to_i+j[k].to_i};i}*' '
Python - 101 chars
import sys
print" ".join(`sum(map(int,x))`if"A">x[0]else x[0]for x in zip(*map(str.split,sys.stdin)))'
Using reduce is longer at 126 chars
import sys
print" ".join(reduce(lambda X,Y:[str(int(x)+int(y))if"A">x[0]else x for x,y in zip(X,Y)],map(str.split,sys.stdin)))
AWK - 63 characters
(in a bash script, $1 is the filename provided as command line argument):
awk -F' ' '{x+=$1;y+=$4;z+=$6}END{print x,$2,$3,y,$5,z,$7}' $1
One could of course also pipe the input in (would save another 3 characters when allowed).
This problem is not challenging or difficult... it is "cute" though.
Here is a solution in Python:
import sys
r = []
for s in sys.stdin:
    r = map(lambda x,y:(x or 0)+int(y) if y.isdigit() else y, r, s.split())
print ' '.join(map(str, r))
What does it do? It keeps a running tally in r while proceeding line by line: it splits the line, and then for each element of the list, if it is a number, it adds it to the tally, otherwise it keeps it as a string. At the end everything is re-mapped to strings and joined with spaces for printing.
Alternative, more "algebraic" implementation, if we did not care about reading all input at once:
import sys
def totalize(l):
    try: r = str(sum(map(int,l)))
    except: r = l[-1]
    return r
print ' '.join(map(totalize, zip(*map(str.split, sys.stdin))))
What does this one do? totalize() takes a list of strings and tries to calculate the sum of the numbers; if that fails, it simply returns the last one. zip() is fed a matrix that is a list of rows, each of them being a list of the column items in that row; zip transposes the matrix so it becomes a list of columns, then totalize is invoked on each column and the results are joined as before.
At the expense of making your code slightly longer, I moved the main parsing into the BEGIN clause so the main clause is only processing numeric fields. For a slightly larger input file, I was able to measure a significant improvement in speed.
BEGIN {
    getline
    for (i = 1; i <= NF; ++i) {
        # need to test for 0, too, in this version
        if ($i == 0 || $i + 0 != 0) {
            numeric[i] = 1;
            total[i] = $i;
        }
    }
}
{
    for (i in numeric) total[i] += $i
}
END {
    # re-use non-numeric columns of last line
    for (i = 1; i <= NF; ++i) {
        if (numeric[i])
            $i = total[i]
    }
    print
}
I made a test file using your data and doing paste file file file ... and cat file file file ... so that the result had 147 fields and 1960 records. My version took about 1/4 as long as yours. On the original data, the difference was not measurable.
JavaScript (Rhino) - 183 154 139 bytes
Golfed:
x=[n=0,0,0];s=[];readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){x[n]+=+b;s[n++]=c;n%=3});print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2])
Readable-ish:
x=[n=0,0,0];
s=[];
readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){
    x[n]+=+b;
    s[n++]=c;
    n%=3
});
print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2]);
PHP 152 130 Chars
Input:
$i = "
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)";
Code:
$a = explode(" ", $i);
foreach($a as $k => $v){
    if($k % 7 == 0)
        $x += $v;
    if(3-$k % 7 == 0)
        $y += $v;
    if(5-$k % 7 == 0)
        $z += $v;
}
echo "$x $a[1] $a[2] $y $a[4] $z $a[6]";
Output:
44 files changed, 1080 insertions(+), 338 deletions(-)
Note: explode() will require that there is a space char before the new line.
Haskell - 151 135 bytes
import Char
c a b|all isDigit(a++b)=show$read a+read b|True=a
main=interact$unwords.foldl1(zipWith c).map words.filter(not.null).lines
... but I'm sure it can be done better/smaller.
Lua, 140 bytes
I know Lua isn't the best golfing language, but compared by the size of the runtimes, it does pretty well I think.
f,i,d,s=0,0,0,io.read"*a"for g,a,j,b,e,c in s:gmatch("(%d+)(.-)(%d+)(.-)(%d+)(.-)")do f,i,d=f+g,i+j,d+e end print(table.concat{f,a,i,b,d,c})
PHP, 176 166 164 159 158 153
for($a=-1;$a<count($l=explode("
",$i));$r=explode(" ",$l[++$a]))for($b=-1;$b<count($r);$c[++$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b]);echo join(" ",$c);
This would, however, require the whole input in $i... A variant with $i replaced by $_POST["i"], so the input could be sent through a textarea, has 162 chars:
for($a=-1;$a<count($l=explode("
",$_POST["i"]));$r=explode(" ",$l[$a++]))for($b=0;$b<count($r);$c[$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b])$b++;echo join(" ",$c);
This is a version with NO HARDCODED COLUMNS.