I'm trying to fine-tune Tesseract 4.1.1 on my own data, following this guide. I want it to detect and recognize text in boxes like this:
I have generated a number of such images, each with a corresponding .box file containing the bounding boxes and their text. To reproduce the issue, I'll show my pipeline using a single image. Here is the .box file for the image above:
0 1804 1659 1858 1813 0
5 1804 1659 1858 1813 0
9 1804 1659 1858 1813 0
9 1266 715 1334 1169 0
7 1266 715 1334 1169 0
8 1266 715 1334 1169 0
3 1266 715 1334 1169 0
6 1266 715 1334 1169 0
8 1266 715 1334 1169 0
0 1266 715 1334 1169 0
5 1266 715 1334 1169 0
3 1266 715 1334 1169 0
2 876 303 930 607 0
7 876 303 930 607 0
2 876 303 930 607 0
8 876 303 930 607 0
2 876 303 930 607 0
2 876 303 930 607 0
8 1671 120 1725 224 0
0 1671 120 1725 224 0
5 300 1278 354 1482 0
2 300 1278 354 1482 0
3 300 1278 354 1482 0
7 300 1278 354 1482 0
7 917 1451 975 1605 0
6 917 1451 975 1605 0
4 917 1451 975 1605 0
1 1058 1310 1132 1716 0
9 1058 1310 1132 1716 0
8 1058 1310 1132 1716 0
7 1058 1310 1132 1716 0
7 1058 1310 1132 1716 0
1 1058 1310 1132 1716 0
8 1058 1310 1132 1716 0
6 1058 1310 1132 1716 0
3 998 76 1070 382 0
4 998 76 1070 382 0
4 998 76 1070 382 0
8 998 76 1070 382 0
3 998 76 1070 382 0
6 998 76 1070 382 0
3 722 548 776 652 0
2 722 548 776 652 0
7 1782 1332 1838 1586 0
7 1782 1332 1838 1586 0
2 1782 1332 1838 1586 0
6 1782 1332 1838 1586 0
2 1782 1332 1838 1586 0
1 714 140 768 244 0
2 714 140 768 244 0
0 220 500 278 754 0
5 220 500 278 754 0
5 220 500 278 754 0
6 220 500 278 754 0
6 220 500 278 754 0
8 1676 1052 1742 1406 0
4 1676 1052 1742 1406 0
5 1676 1052 1742 1406 0
9 1676 1052 1742 1406 0
1 1676 1052 1742 1406 0
2 1676 1052 1742 1406 0
4 1676 1052 1742 1406 0
5 357 161 419 317 0
1 357 161 419 317 0
4 357 161 419 317 0
9 1424 848 1480 952 0
8 1424 848 1480 952 0
0 438 324 498 478 0
6 438 324 498 478 0
9 438 324 498 478 0
8 1503 1246 1559 1700 0
1 1503 1246 1559 1700 0
8 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
3 1503 1246 1559 1700 0
0 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
4 1503 1246 1559 1700 0
8 1553 477 1609 581 0
4 1553 477 1609 581 0
3 527 258 581 512 0
7 527 258 581 512 0
7 527 258 581 512 0
9 527 258 581 512 0
1 527 258 581 512 0
6 1665 1592 1727 1748 0
8 1665 1592 1727 1748 0
3 1665 1592 1727 1748 0
5 595 1362 651 1766 0
9 595 1362 651 1766 0
3 595 1362 651 1766 0
9 595 1362 651 1766 0
4 595 1362 651 1766 0
3 595 1362 651 1766 0
3 595 1362 651 1766 0
1 595 1362 651 1766 0
I have also converted the image to .tiff format and placed it in the same directory as the .box file. Let's say the directory contains 87.tiff and 87.box.
Next, I generate the 87.lstmf file using
tesseract 87.tiff 87 lstm.train
Then I extract the LSTM model using
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/rus.traineddata lstm_model/rus.lstm
Then I create a train.txt file containing a single line: 87.lstmf
Finally, I create a bash script, train.sh:
/usr/bin/lstmtraining \
--model_output output/fine_tuned \
--continue_from lstm_model/rus.lstm \
--traineddata /usr/share/tesseract-ocr/4.00/tessdata/rus.traineddata \
--train_listfile train.txt \
--eval_listfile train.txt \
--max_iterations 400 \
--debug_level -1
When I run it, I get the following log:
$ bash train.sh
Loaded file lstm_model/rus.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/rus.lstm
Loaded 1/1 lines (1-1) of document 87.lstmf
Loaded 1/1 lines (1-1) of document 87.lstmf
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
The message "Compute CTC targets failed!" repeats indefinitely until I interrupt the script.
What am I doing wrong? I'm also concerned about the message "Loaded 1/1 lines (1-1)", since the image contains multiple bounding boxes.
For LSTM training on images, the .box files may need to be in the newer LSTM format (these can be generated by running tesseract with the lstmbox argument); see:
TrainingTesseract-4.00
In that format, the end of each line of text is marked with a special line:
<tab> <left> <bottom> <right> <top> <page>
Then regenerate the .lstmf file with lstm.train.
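To make the end-of-line marker concrete, here is a rough Python sketch that post-processes a .box file like the one in the question. It assumes that consecutive entries with identical coordinates belong to the same text line, and that the marker line can reuse that line's coordinates; both are assumptions for illustration, and the function name add_end_of_line_markers is made up here, not part of Tesseract.

```python
def add_end_of_line_markers(box_lines):
    """Append a tab marker line after each run of boxes that share coordinates."""
    out = []
    previous = None  # coordinates of the run currently being emitted
    for line in box_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Fields: <char> <left> <bottom> <right> <top> <page>
        coords = line.split(" ")[-5:]
        if previous is not None and coords != previous:
            # The coordinates changed, so the previous text line has ended.
            out.append("\t " + " ".join(previous))
        out.append(line)
        previous = coords
    if previous is not None:
        # Close the final text line as well.
        out.append("\t " + " ".join(previous))
    return out


sample = [
    "0 1804 1659 1858 1813 0",
    "5 1804 1659 1858 1813 0",
    "9 1266 715 1334 1169 0",
]
for row in add_end_of_line_markers(sample):
    print(repr(row))
```

Two adjacent text lines that happen to have exactly the same box would be merged by this sketch, so check the result by hand before training.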
First, define the function b(n):
? b(n) = lcm(vector(n, i, i))/n
Then define c(n):
? c(n)=sum(j=1,n,sum(i=1,n,(-1)^(i+j)/(i+j-1)))
Finally, define d(n):
? d(n)=factor(denominator(c(n))/b(n))~
and evaluate it at 202:
? d(202)
The result is:
%8 =
[3 7 17 19 31 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293
307 311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401]
[1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
What does the -1 in the factorization result indicate?
You are factoring a rational number. Note that type(denominator(c(202))/b(202)) is t_FRAC rather than t_INT, because denominator(c(202))/b(202) = <some big number>/31. So the -1 is simply the exponent of the prime 31, which appears in the denominator; there is no bug here.
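The same behaviour can be reproduced outside PARI/GP. The following Python sketch factors a positive rational number into prime powers, with primes of the denominator getting negative exponents. The helper names factor_int and factor_rational are made up for this illustration, and 21/31 is just a small stand-in for the large fraction above.

```python
from fractions import Fraction


def factor_int(n):
    """Trial-division factorization of a positive integer into {prime: exponent}."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def factor_rational(q):
    """Factor a positive rational; denominator primes get negative exponents."""
    q = Fraction(q)  # Fraction is always stored in lowest terms
    result = factor_int(q.numerator)
    for prime, exponent in factor_int(q.denominator).items():
        result[prime] = result.get(prime, 0) - exponent
    return dict(sorted(result.items()))


print(factor_rational(Fraction(21, 31)))  # {3: 1, 7: 1, 31: -1}
```

Here 31 gets exponent -1 for the same reason as in the d(202) output: it divides the denominator, not the numerator.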
I have a tls file such as:
1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995
1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338
848 202 1116 791 1114 236 183 186 150 1016 1258 84 952 1202 988 866
946 155 210 980 896 875 925 613 209 746 147 170 577 942 475 850
1500 322 43 95 74 210 1817 1631 1762 128 181 716 171 1740 145 1123
3074 827 117 2509 161 206 2739 253 2884 248 3307 2760 2239 1676 1137 3055
183 85 143 197 243 72 291 279 99 189 30 101 211 209 77 198
175 149 259 372 140 250 168 142 146 284 273 74 162 112 78 29
169 578 97 589 473 317 123 102 445 217 144 398 510 464 247 109
3291 216 185 1214 167 495 1859 194 1030 3456 2021 1622 3511 222 3534 1580
2066 2418 2324 93 1073 82 102 538 1552 962 91 836 1628 2154 2144 1378
149 963 1242 849 726 1158 164 1134 658 161 1148 336 826 1303 811 178
3421 1404 2360 2643 3186 3352 1112 171 168 177 146 1945 319 185 2927 2289
543 462 111 459 107 353 2006 116 2528 56 2436 1539 1770 125 2697 2432
1356 208 5013 4231 193 169 3152 2543 4430 4070 4031 145 4433 4187 4394 1754
5278 113 4427 569 5167 175 192 3903 155 1051 4121 5140 2328 203 5653 3233
How can I read it into a list of lists of Int in Haskell?
I have tried a few options but could not manage to do it. I am very new to Haskell, so please be patient.
First break your input into lines using lines:
let test = "1 2 3 4\n 5 6 7 \n 4 2 5"
let rows = lines test --literally "lines test"! Beautiful, eh?
Result:
["1 2 3 4"," 5 6 7 "," 4 2 5"] :: [[Char]]
Then, extract individual numbers as strings using words:
let nums_as_strings = map words rows
Result:
[["1","2","3","4"],["5","6","7"],["4","2","5"]] :: [[[Char]]]
The last thing to do is convert these strings to integers with read:
let numbers = map (map read) nums_as_strings :: [[Int]]
Result:
[[1,2,3,4],[5,6,7],[4,2,5]] :: [[Int]]
Or, squashed into one line:
let numbers = map (map read) (map words $ lines test) :: [[Int]]
Example with your data:
Prelude> let test = "1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995\n1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338"
Prelude> map (map read) (map words $ lines test) :: [[Int]]
[[1224,926,1380,688,845,109,118,88,1275,1306,91,796,102,1361,27,995],[1928,2097,138,1824,198,117,1532,2000,1478,539,1982,125,1856,139,475,1338]]
You may need to take care of empty lines, but that's really simple.
import System.IO

readListOfLists :: Handle -> IO [[Int]]
readListOfLists handle = do
    contents <- hGetContents handle
    let ls :: [String]
        ls = lines contents
        ws :: [[String]]
        ws = map words ls
        res :: [[Int]]
        res = map (map read) ws
    return res
Or you can write the same code in one line:
readListOfLists :: Handle -> IO [[Int]]
readListOfLists = fmap (map (map read . words) . lines) . hGetContents
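To use it (hGetContents reads lazily, so print the table before closing the handle):
do
    handle <- openFile fileName ReadMode
    table <- readListOfLists handle
    print table    -- forces the lazily read contents while the handle is still open
    hClose handle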
I have a table similar to the one below.
The table shows player points:
s: main player points
d: sub main points
date: the date on which they were calculated.
I want to filter out rows whose s and d are the same as those of the adjacent row. The date kept should be the most recent one among the identical rows.
For example, here we should skip ri 13, as it is the same as ri 12. Also skip ri 15, 19, 20, 21, 22, 23 and so on. But rows 28, 29, 30, 31 should not be skipped or grouped.
I'm asking because GROUP BY does not work in my case. Any ideas?
Table example:
ri date s d
1 2016-05-23 4 355
2 2016-05-16 4 352
3 2016-05-09 4 349
4 2016-05-02 4 352
5 2016-04-25 4 358
6 2016-04-18 4 359
7 2016-04-11 4 200
8 2016-04-04 4 201
9 2016-03-21 4 198
10 2016-03-07 4 199
11 2016-02-29 4 201
12 2016-02-22 4 203
13 2016-02-15 4 203
14 2016-02-08 4 200
15 2016-02-01 4 200
16 2016-01-18 4 201
17 2016-01-11 4 198
18 2016-01-04 4 183
19 2015-12-28 4 183
20 2015-12-21 4 183
21 2015-12-14 4 183
22 2015-12-07 4 183
23 2015-11-30 4 183
24 2015-11-23 4 182
25 2015-11-16 4 149
26 2015-11-09 4 148
27 2015-11-02 4 145
28 2015-10-26 4 109
29 2015-10-19 4 110
30 2015-10-12 4 109
31 2015-10-05 4 110
32 2015-09-28 4 106
33 2015-09-21 4 108
34 2015-09-14 4 109
35 2015-08-31 5 108
36 2015-08-24 5 108
37 2015-08-17 5 136
38 2015-08-10 5 136
39 2015-08-03 4 123
40 2015-07-27 4 122
41 2015-07-20 4 125
42 2015-07-13 4 126
43 2015-06-29 4 130
44 2015-06-22 4 128
45 2015-06-15 4 126
46 2015-06-08 4 120
47 2015-05-25 9 120
48 2015-05-18 9 122
49 2015-05-11 9 121
50 2015-05-04 9 119
51 2015-04-27 9 122
52 2015-04-20 10 124
53 2015-04-13 9 173
54 2015-04-06 9 172
55 2015-03-23 8 174
56 2015-03-09 7 89
57 2015-03-02 7 89
58 2015-02-23 7 92
59 2015-02-16 7 96
60 2015-02-09 8 93
61 2015-02-02 9 88
62 2015-01-19 4 89
63 2015-01-12 4 89
64 2015-01-05 4 94
Could be you need a join:
select a.*, b.*
from my_table as a
inner join my_table as b on a.ri != b.ri
where (a.d - b.d) = 0;
This can be done using not exists. It selects the first row of each run of consecutive rows that have the same s and d, assuming the ri values are consecutive integers as in your example.
select *
from tablename t1
where not exists (select 1 from tablename t2
where t1.ri = t2.ri+1 and t1.s = t2.s and t1.d = t2.d)
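To check the behaviour on a few rows from the question, here is a small sketch that runs the same not exists query through Python's sqlite3 module. SQLite is only a stand-in here (the question does not say which database is used), and the function name first_of_each_run is made up for the example.

```python
import sqlite3


def first_of_each_run(rows):
    """rows: iterable of (ri, date, s, d). Returns the ri values that are kept."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tablename (ri INTEGER, date TEXT, s INTEGER, d INTEGER)"
    )
    conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT ri
        FROM tablename t1
        WHERE NOT EXISTS (SELECT 1 FROM tablename t2
                          WHERE t1.ri = t2.ri + 1 AND t1.s = t2.s AND t1.d = t2.d)
        ORDER BY ri
    """
    kept = [row[0] for row in conn.execute(query)]
    conn.close()
    return kept


# Rows 11 to 16 from the example table.
sample = [
    (11, "2016-02-29", 4, 201),
    (12, "2016-02-22", 4, 203),
    (13, "2016-02-15", 4, 203),
    (14, "2016-02-08", 4, 200),
    (15, "2016-02-01", 4, 200),
    (16, "2016-01-18", 4, 201),
]
print(first_of_each_run(sample))  # [11, 12, 14, 16]
```

Rows 13 and 15 are dropped because the row directly before each has the same s and d; row 16 stays even though d = 201 already appeared at row 11, since the two are not adjacent.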
I have a table like this.
id day1 day2 day3
1 411 523 223
2 413 554 245
3 417 511 209
4 420 515 232
5 422 522 212
6 483 567 212
7 456 512 256
8 433 578 209
9 438 532 234
10 418 555 223
11 460 510 263
12 453 509 245
13 441 524 233
14 430 543 261
15 456 582 222
16 444 524 241
17 478 511 211
18 421 583 222
I want to select all the IDs that have duplicate values in day2.
I'm doing
select day2,count(*) from resultater group by day2 having count(*)>1;
Is it possible to list all the IDs within the groups?
select day2,count(*), group_concat(id)
from resultater
group by day2
having count(*)>1;
should do the trick.
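For a quick sanity check, the query can be tried on a few rows from the table using Python's sqlite3 module. SQLite is used here only as a convenient stand-in that also provides group_concat; the function name ids_with_duplicate_day2 is made up for the example, and the grouping is on the day2 column.

```python
import sqlite3


def ids_with_duplicate_day2(rows):
    """rows: iterable of (id, day1, day2, day3). Maps each repeated day2 to its ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resultater (id INTEGER, day1 INTEGER, day2 INTEGER, day3 INTEGER)"
    )
    conn.executemany("INSERT INTO resultater VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT day2, COUNT(*), GROUP_CONCAT(id)
        FROM resultater
        GROUP BY day2
        HAVING COUNT(*) > 1
    """
    result = {}
    for day2, _count, ids in conn.execute(query):
        # GROUP_CONCAT returns a comma-separated string; its order is not guaranteed.
        result[day2] = sorted(int(i) for i in ids.split(","))
    conn.close()
    return result


# Rows 3, 4, 13, 16 and 17 from the example table.
sample = [
    (3, 417, 511, 209),
    (4, 420, 515, 232),
    (13, 441, 524, 233),
    (16, 444, 524, 241),
    (17, 478, 511, 211),
]
print(ids_with_duplicate_day2(sample))
```

With these rows, day2 = 511 is shared by ids 3 and 17, and day2 = 524 by ids 13 and 16; id 4 does not appear because its day2 value is unique.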