tesseract fails to form shapetable - ocr

i am attempting to extract OCR data of a 3-digit counter within a video via tesseract 4.1.1 on Kubuntu 21.04. (full tesseract version string below.) i am failing to add characters during the shapetable phase, and no other troubleshooting has worked for me -- i turn to you with humble heart. n.b.: the images are of a small pixel font, which takes up the entirety of my source image
image preparation and collation
from the source videos, i: crop to only the counter, invert, grayscale, dump at 1 fps, and then increase resolution by 1000% to 780x180 resolution. the results are individual frames such as this. i take a section of sequential numbers counting down from 500 (without any duplicates or blank images) and combine them into a .tif. (i can't upload the file here, but find the set of images mosaic'd together here)
i import this file into jTessBoxEditor as, for example, type_3.font.exp0.tif. i run tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox to create a .box file, with understandably nonsensical results.
with the hand-chosen source frames and the consistent positions, i'm able to edit the .box file with known box sizes, quantities, like so:
5 0 0 240 180 0
0 270 0 510 180 0
0 540 0 780 180 0
4 0 0 240 180 1
9 270 0 510 180 1
9 540 0 780 180 1
4 0 0 240 180 2
9 270 0 510 180 2
8 540 0 780 180 2
4 0 0 240 180 3
9 270 0 510 180 3
7 540 0 780 180 3
...
i load the edited .box into jTessBoxEditor to check that it indeed matches my data. this is a 131-page .tif, meaning roughly 40 training samples per digit.
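Since the box positions are fixed and the counter values are known, the .box lines can also be generated rather than edited by hand. This is only a sketch under those assumptions: the box coordinates are copied from the sample lines above, and the helper name box_lines is made up for illustration.

```python
# Sketch: emit box-file lines for pages whose digits and box geometry are known.
# BOXES is copied from the sample .box lines above; box_lines is a hypothetical helper.

BOXES = [(0, 0, 240, 180), (270, 0, 510, 180), (540, 0, 780, 180)]


def box_lines(values, boxes=BOXES):
    """Return one 'char left bottom right top page' line per digit."""
    lines = []
    for page, value in enumerate(values):
        digits = f"{value:0{len(boxes)}d}"  # zero-pad to one digit per box
        if len(digits) != len(boxes):
            raise ValueError(f"{value} does not fit in {len(boxes)} digits")
        for digit, (left, bottom, right, top) in zip(digits, boxes):
            lines.append(f"{digit} {left} {bottom} {right} {top} {page}")
    return lines


if __name__ == "__main__":
    # Counter running down from 500, one value per page (131 pages).
    for line in box_lines(range(500, 500 - 131, -1)):
        print(line)
```

Redirecting the printed lines into the .box file would replace the manual editing step, provided every page really follows the countdown with no skipped frames.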
training steps (where the problems begin)
i create font_properties and load it with font 0 0 0 0 0. Please note that i've also tried type_3 0 0 0 0 0 and type_3.font.exp0 0 0 0 0 0, with no difference on the below results
i input tesseract type_3.font.exp0.tif type_3.font.exp0 nobatch box.train and a training file is created; however, each page is listed as blank (is this normal?). e.g.:
Page 108
Warning: Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 2263
Empty page!!
i input unicharset_extractor font_name.font.exp0.box with success -- the resulting extraction contains the characters i've identified, with some extra lines
13
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
5 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 0 # 0 [30 ]0
4 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 4 # 4 [34 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 9 # 9 [39 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 8 # 8 [38 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 8 2 8 7 # 7 [37 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 9 2 9 6 # 6 [36 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 10 2 10 3 # 3 [33 ]0
2 8 0,255,0,255,0,0,0,0,0,0 Common 11 2 11 2 # 2 [32 ]0
1 8 0,255,0,255,0,0,0,0,0,0 Common 12 2 12 1 # 1 [31 ]0
but i know that failure has come for me when shapeclustering -F font_properties -U unicharset -O type_3.unicharset type_3.font.exp0.tr
results in
Reading type_3.font.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
...
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Master shape_table:Number of shapes = 0 max unichars = 0 number with multiple unichars = 0
It has not recognized any shapes at all.
my plea:
what have i missed?? what can i do to pass these 10 humble characters to tesseract?
full version string (installed via apt)
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.5

Automatically converting a 1-level list to a nested list

Here we have a link where there is a table:
http://pitzavod.ru/products/upakovka/
When I read it with pd.read_html I do get a list, but it 1.) is not nested, thus when converted to a dataframe it is not readable, 2.) contains integers 0 to number of rows in the table on the website.
The list I get looks like:
[ 0 1 \
0 Показатели Марка целлюлозы
1 ОСН NaN
2 Механическая прочность при размоле в мельнице ... 10 000 740 520
3 Степень делигнификации, п.е. 28 - 45
4 Сорность - число соринок в условной массе 500г... 6500
5 Влажность, % не более 20
2
0 Методы испытаний
1 NaN
2 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 ГОСТ 10070
4 ГОСТ 14363.3
5 ГОСТ 16932 ]
Is there a way to easily clean this pandas output, or do I need to parse the website properly? Thank you.
That's because read_html always returns a list (even if the number of tables is 1).
pandas.read_html: Read HTML tables into a list of DataFrame objects.
You need to index it with [0]:
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]
Output (showing the last two columns) :
1 2
0 Марка целлюлозы Методы испытаний
1 ОСН Методы испытаний
2 10 000 740 520 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 28 - 45 ГОСТ 10070
4 6500 ГОСТ 14363.3
5 20 ГОСТ 16932

Encode data into 1 byte

I have to encode data to 1 byte. I have the following data as of now.
size - 500 ml and 1 litre
Frequency - 0 to 12
% - 0-100
So i decided to break the data into the following -
0 0 0 0 0 0 0 0
1st bit - Size - 0 for 500ml and 1 for 1 litre
2-5 bits - Frequency - 0 to 12 (0000 for 0 and 1100 for 12)
I am not sure how to fit the % into this layout. Am I going about this the wrong way? Is there any other way to do it? Any direction is highly appreciated.
You are left with 3 bits. You need to store a value between 0 and 100, which needs at least 7 bits (2^7 = 128), so size, frequency, and an exact percentage together would take 1 + 4 + 7 = 12 bits. However, if you only need 8 different percentage values, you can get away with using 3 bits.
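To make the 3-bit option concrete, here is a rough sketch in Python. It assumes the "1st bit" is the most significant bit and that rounding the percentage down to one of 8 coarse levels is acceptable; the function names pack and unpack and the bucketing rule are illustrative choices, not something given in the question.

```python
# Sketch: 1 bit size, 4 bits frequency, 3 bits for a coarse percentage bucket.
# The bucketing is lossy and is an assumption about what precision is acceptable.

def pack(size_is_litre, frequency, percent):
    """Pack the three fields into a single integer in the range 0..255."""
    if not 0 <= frequency <= 12:
        raise ValueError("frequency must be 0..12")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be 0..100")
    bucket = percent * 7 // 100  # 0..7, i.e. 8 coarse levels
    return (int(size_is_litre) << 7) | (frequency << 3) | bucket


def unpack(byte):
    """Recover size, frequency, and the approximate percentage."""
    size_is_litre = bool(byte >> 7)
    frequency = (byte >> 3) & 0b1111
    bucket = byte & 0b111
    percent = round(bucket * 100 / 7)  # representative value of the bucket
    return size_is_litre, frequency, percent


if __name__ == "__main__":
    encoded = pack(True, 12, 100)
    print(encoded, unpack(encoded))
```

The size and frequency round-trip exactly; only the percentage loses precision, because each bucket covers roughly 14 percentage points.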

Formatting JSON data in R

I'm really new to working with JSON data, so I had a question about formatting.
Here's the link to the data I was trying to work with
I was using JSONlite and did this:
shot <- paste0("http://stats.nba.com/stats/playerdashptshotlog?DateFrom=&DateTo=&",
  "GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&",
  "Outcome=&Period=0&PlayerID=202322&Season=2014-15&SeasonSegment=&",
  "SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=")
I then did fromJSON:
json_data <- fromJSON(paste(readLines(shot), collapse=""))
This gives me the data in a list. My issue (although for all I know I messed up working towards this) is trying to create a data frame out of this info. I was able to make a data frame with code I read under similar questions on the site, but it is all of the data in just one column. Any recommendations would be appreciated!
Thanks
Normally, the first thing to do when you get JSON is to look at its structure.
str(json_data)
Doing so will reveal that your data has a very simple structure: it is a data frame with rows and a line of headers, wrapped in some metadata about what each column means. Using $ will allow you to address those specific components. In other words, your specific JSON is already a data frame structure; all you have to do is take it out of the JSON.
library(jsonlite)
json_data <- fromJSON(paste(readLines(shot), collapse=""))
str(json_data)
mydf <- data.frame(json_data$resultSets$rowSet)
colnames(mydf) <- unlist(json_data$resultSets$headers)
You ought to get something like this:
head(mydf)
GAME_ID MATCHUP LOCATION W FINAL_MARGIN SHOT_NUMBER PERIOD
1 0021401215 APR 14, 2015 - WAS # IND A L -4 1 1
2 0021401215 APR 14, 2015 - WAS # IND A L -4 2 1
3 0021401215 APR 14, 2015 - WAS # IND A L -4 3 1
4 0021401215 APR 14, 2015 - WAS # IND A L -4 4 1
5 0021401215 APR 14, 2015 - WAS # IND A L -4 5 1
6 0021401215 APR 14, 2015 - WAS # IND A L -4 6 1
GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE SHOT_RESULT
1 10:33 7.7 0 1 25 3 missed
2 8:41 14 10 9.6 10.7 2 made
3 6:42 14.9 11 9.7 18.2 2 missed
4 5:16 19 3 3.5 4.2 2 made
5 4:45 19.8 3 3.7 3.3 2 missed
6 3:08 13.5 10 9.7 18 2 missed
CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS
1 Hill, George 201588 4.3 0 0
2 Hill, George 201588 5.7 1 2
3 Hill, George 201588 3 0 0
4 Miles, CJ 101139 4 1 2
5 Hill, Solomon 203524 3 0 0
6 Hill, George 201588 4.5 0 0
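The reshaping idea (pair the headers with each entry of rowSet) is not specific to R. As a rough illustration, here is the same step in Python on a tiny hand-written payload. The payload only mimics the headers/rowSet layout described above, with three columns taken from the output shown, and treating resultSets as a list is an assumption; rows_as_dicts is a hypothetical helper name.

```python
import json

# Miniature payload mimicking the headers/rowSet layout (an assumption about
# the real response); the values are the first two shots from the output above.
payload = json.loads("""
{"resultSets": [{"headers": ["GAME_ID", "SHOT_NUMBER", "SHOT_RESULT"],
                 "rowSet": [["0021401215", 1, "missed"],
                            ["0021401215", 2, "made"]]}]}
""")


def rows_as_dicts(result_set):
    """Pair each row in rowSet with the column names in headers."""
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


records = rows_as_dicts(payload["resultSets"][0])

if __name__ == "__main__":
    for record in records:
        print(record)
```

Each record is then one row keyed by column name, which is the same shape the data.frame plus colnames step produces in R.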

GNUPLOT: Joining different series of points with vectors

I have a file with data in 2 columns, X and Y. The data comes in blocks separated by blank lines. I want to join the points (given by their x and y coordinates in the file) in each block using vectors. I'm trying to use these functions:
prev_x = NaN
prev_y = NaN
dx(x) = (x_delta = x-prev_x, prev_x = ($0 > 0 ? x : 1/0), x_delta)
dy(y) = (y_delta = y-prev_y, prev_y = ($0 > 0 ? y : 1/0), y_delta)
which I've taken from Plot lines and vector in graphical gnuplot (first answer). The command to plot would be plot for[i=0:5] 'Field_lines.txt' every :::i::i u (prev_x):(prev_y):(dx($1)):(dy($2)) with vectors.
The problem is that the output includes the point (0,0) even though it's not in the file. I don't think I understand exactly what the functions dx and dy do, or how they are used with the option using (prev_x):(prev_y):(dx($1)):(dy($2)), so an explanation would help me a lot in trying to fix this.
This is the file:
#1
0 5
0 4
0 3
0.4 2
0.8 1
0.8 1
#2
2 5
2 4
2 3
2 2
2 1
2 0
#3
4 5
4.2 4
4.5 3
4.6 2
4.7 1
4.7 0
#4
7 5
7.2 4
7.5 3
7.9 2
7.9 1
7.9 0
#5
9 5
9 4
9.2 3
9.5 2
9.5 1
9.5 0
#6
11 7
12 6
13 5
13.3 4
13.5 3
13.5 2
13.6 1
14 0
Thanks!
I'm not completely sure what the real problem is, but I think you cannot rely on the columns in the using statement being evaluated from left to right, and your check $0 > 0 in dx and dy comes too late in my opinion.
I usually put all the assignments and conditionals in the first column, and that also works fine in your case:
set offsets 1,1,1,1
unset key
prev_x = prev_y = 1
plot for [i=0:5] 'Field_lines.txt' every :::i::i \
u (x_delta = prev_x-$1, prev_x=$1, y_delta=prev_y-$2, prev_y=$2, ($0 == 0 ? 1/0 : prev_x)):(prev_y):(x_delta):(y_delta) with vectors backhead
Also, to draw a vector from the j-th row to the point in the following row, you must invert the definition of x_delta and use backhead to draw the vectors in the correct direction.
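If it helps to see what the using expression computes, here is a rough Python equivalent, offered only as a sketch. It assumes blocks are separated by blank lines and that lines starting with # are comments; parse_blocks and block_vectors are hypothetical names. For every row after the first in a block it produces (x, y, prev_x - x, prev_y - y), which is what the plot command above feeds to with vectors.

```python
def parse_blocks(text):
    """Split two-column data into blocks separated by blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:  # a blank line closes the current block
                blocks.append(current)
                current = []
            continue
        if line.startswith("#"):  # comment line, e.g. "#1"
            continue
        x, y = map(float, line.split()[:2])
        current.append((x, y))
    if current:
        blocks.append(current)
    return blocks


def block_vectors(points):
    """Return (x, y, dx, dy) for each row after the first, with deltas pointing back."""
    vectors = []
    for (prev_x, prev_y), (x, y) in zip(points, points[1:]):
        vectors.append((x, y, prev_x - x, prev_y - y))
    return vectors


if __name__ == "__main__":
    sample = "#1\n0 5\n0 4\n0 3\n\n#2\n2 5\n2 4\n"
    for block in parse_blocks(sample):
        print(block_vectors(block))
```

The first row of each block yields no vector, which corresponds to the ($0 == 0 ? 1/0 : prev_x) test, and the deltas point from the current row back to the previous one, which is why backhead is needed to put the arrow head at the current point.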

How to build vtkPolyData based on the information within a txt file

I have a txt file which contains a set of 3 Dimensional data points and I would like to create a vtkPolyData based on those points.
In the file, the first line gives the number of points, in my case 6 x 6. After that come the actual coordinates of each point. The content of the file looks like this.
6 6
1 1 3
2 1 3.4
3 1 3.6
4 1 3.6
5 1 3.4
6 1 3
1 2 3
2 2 3.8
3 2 4.2
4 2 4.2
5 2 3.8
6 2 3
1 3 3
2 3 3
3 3 3
4 3 3
5 3 3
6 3 3
1 4 3
2 4 3
3 4 3
4 4 3
5 4 3
6 4 3
1 5 3
2 5 3.8
3 5 4.2
4 5 4.2
5 5 3.8
6 5 3
1 6 3
2 6 3.4
3 6 3.6
4 6 3.6
5 6 3.4
6 6 3
How can I build a vtkPolyData structure with a txt file with this data?
It looks to me like you have a regularly gridded series of points, right? If so, vtkImageData might be a better choice. You can always use a geometry filter afterwards to convert to polydata if you really need it that way.
1. Create a vtkImageData instance.
2. Set its dimensions to (6, 6, 1) (the third dimension is ignored).
3. Set its data type to an appropriate type (float or double, I guess).
4. Call AllocateScalars().
5. If in C++, call GetScalarPointer() and cast it to the data type set in step 3. This pointer will point to an array of size 36. You can just fill each point as you would normally.
6. If in another language (Tcl/Python/Java), call SetScalarComponentFromFloat on the image data, with the arguments (x, y, 0, 0, value). The first 0 is the 3rd dimension and the second is for the first component.
This will give you a grid, and it'll be far more memory efficient than a polydata.
If you want to visualize only the points, use a vtkDataSetMapper, and setup the actor's property with SetRepresentationToPoints(), setting an appropriate point size. That will do a simple job of visualization.
Are these examples useful? In particular, this does generation of points and polygons, so it should be possible to adapt. The core seems to be (with lots left out):
# ...
vtkPolyData shell
vtkFloatPoints points
vtkCellArray strips
# Generate points...
loop {
...
points InsertPoint $k $x0 $x1 $x2
}
shell SetPoints points
points Delete
# Generate triangles/polygons...
loop {
strips InsertNextCell $NP2
# ...
strips InsertCellPoint [expr $kb +$ke ]
# ...
strips InsertCellPoint [expr $kb +$ke ]
}
shell SetStrips strips
strips Delete
# ...
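For the file in the question, the "generate points" and "generate polygons" steps can be worked out in plain Python before touching VTK. This is a sketch only: it assumes the layout shown in the question (first line gives nx and ny, then one x y z triple per line with x varying fastest), and read_grid and quad_cells are hypothetical helper names. Feeding the results into a points object and a cell array, as the Tcl outline does, is left out.

```python
def read_grid(text):
    """Parse 'nx ny' followed by nx*ny lines of 'x y z' into a point list."""
    tokens = text.split()
    nx, ny = int(tokens[0]), int(tokens[1])
    values = [float(token) for token in tokens[2:]]
    if len(values) != 3 * nx * ny:
        raise ValueError("expected %d coordinates, got %d" % (3 * nx * ny, len(values)))
    points = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return nx, ny, points


def quad_cells(nx, ny):
    """Return point-index quadruples for every grid cell, x varying fastest."""
    cells = []
    for j in range(ny - 1):
        for i in range(nx - 1):
            p = j * nx + i  # index of the cell's first corner
            cells.append((p, p + 1, p + 1 + nx, p + nx))
    return cells


if __name__ == "__main__":
    sample = "2 2\n1 1 3\n2 1 3.4\n1 2 3\n2 2 3.8\n"
    nx, ny, points = read_grid(sample)
    print(points)
    print(quad_cells(nx, ny))
```

For the 6 x 6 file this gives 36 points and 25 quads; each point would go into the points object and each quadruple into the cell array as one polygon.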