I think I must be fundamentally misunderstanding something here, but the documentation for Making Box Files 4.0 states:
The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.
However, it then goes on to link to a box file that has character-by-character boxes, e.g.:
T 112 4663 140 4696 0
e 140 4662 160 4686 0
s 163 4662 179 4686 0
s 182 4661 198 4686 0
e 200 4661 220 4685 0
r 221 4662 238 4685 0
a 239 4661 260 4685 0
c 261 4661 281 4685 0
t 281 4661 296 4691 0
Can someone explain this apparent discrepancy?
There are tab characters (\t) that mark the ends of lines. If you read further down in that documentation, it states just that.
T 112 4663 140 4696 0
e 140 4662 160 4686 0
s 163 4662 179 4686 0
s 182 4661 198 4686 0
e 200 4661 220 4685 0
r 221 4662 238 4685 0
a 239 4661 260 4685 0
c 261 4661 281 4685 0
t 281 4661 296 4691 0
296 4661 311 4696 0
O 311 4661 344 4696 0
C 347 4661 377 4696 0
R 378 4661 414 4695 0
414 4694 415 4695 0
A 110 4575 146 4609 0
b 145 4574 167 4610 0
o 171 4573 193 4598 0
u 195 4573 219 4596 0
t 220 4573 234 4603 0
234 4602 235 4603 0
The LSTM training does not really need individual character coordinates.
The confusion arises from imprecise wording in the Tesseract wiki, the old textline box example file, and the fact that "Multiple formats of box files are accepted by Tesseract4".
Please see #2357 for details and the examples provided by @shreeshrii.
Related
I'm trying to fine-tune Tesseract 4.1.1 on my own specific data according to this guide. I want it to be able to detect and recognize text in boxes like this:
I have generated a number of such images and, corresponding to them, .box files containing bounding boxes with text. To reproduce my issue, I'm going to show my pipeline using only one image. Here is the .box file for the image above:
0 1804 1659 1858 1813 0
5 1804 1659 1858 1813 0
9 1804 1659 1858 1813 0
9 1266 715 1334 1169 0
7 1266 715 1334 1169 0
8 1266 715 1334 1169 0
3 1266 715 1334 1169 0
6 1266 715 1334 1169 0
8 1266 715 1334 1169 0
0 1266 715 1334 1169 0
5 1266 715 1334 1169 0
3 1266 715 1334 1169 0
2 876 303 930 607 0
7 876 303 930 607 0
2 876 303 930 607 0
8 876 303 930 607 0
2 876 303 930 607 0
2 876 303 930 607 0
8 1671 120 1725 224 0
0 1671 120 1725 224 0
5 300 1278 354 1482 0
2 300 1278 354 1482 0
3 300 1278 354 1482 0
7 300 1278 354 1482 0
7 917 1451 975 1605 0
6 917 1451 975 1605 0
4 917 1451 975 1605 0
1 1058 1310 1132 1716 0
9 1058 1310 1132 1716 0
8 1058 1310 1132 1716 0
7 1058 1310 1132 1716 0
7 1058 1310 1132 1716 0
1 1058 1310 1132 1716 0
8 1058 1310 1132 1716 0
6 1058 1310 1132 1716 0
3 998 76 1070 382 0
4 998 76 1070 382 0
4 998 76 1070 382 0
8 998 76 1070 382 0
3 998 76 1070 382 0
6 998 76 1070 382 0
3 722 548 776 652 0
2 722 548 776 652 0
7 1782 1332 1838 1586 0
7 1782 1332 1838 1586 0
2 1782 1332 1838 1586 0
6 1782 1332 1838 1586 0
2 1782 1332 1838 1586 0
1 714 140 768 244 0
2 714 140 768 244 0
0 220 500 278 754 0
5 220 500 278 754 0
5 220 500 278 754 0
6 220 500 278 754 0
6 220 500 278 754 0
8 1676 1052 1742 1406 0
4 1676 1052 1742 1406 0
5 1676 1052 1742 1406 0
9 1676 1052 1742 1406 0
1 1676 1052 1742 1406 0
2 1676 1052 1742 1406 0
4 1676 1052 1742 1406 0
5 357 161 419 317 0
1 357 161 419 317 0
4 357 161 419 317 0
9 1424 848 1480 952 0
8 1424 848 1480 952 0
0 438 324 498 478 0
6 438 324 498 478 0
9 438 324 498 478 0
8 1503 1246 1559 1700 0
1 1503 1246 1559 1700 0
8 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
3 1503 1246 1559 1700 0
0 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
5 1503 1246 1559 1700 0
4 1503 1246 1559 1700 0
8 1553 477 1609 581 0
4 1553 477 1609 581 0
3 527 258 581 512 0
7 527 258 581 512 0
7 527 258 581 512 0
9 527 258 581 512 0
1 527 258 581 512 0
6 1665 1592 1727 1748 0
8 1665 1592 1727 1748 0
3 1665 1592 1727 1748 0
5 595 1362 651 1766 0
9 595 1362 651 1766 0
3 595 1362 651 1766 0
9 595 1362 651 1766 0
4 595 1362 651 1766 0
3 595 1362 651 1766 0
3 595 1362 651 1766 0
1 595 1362 651 1766 0
I have also converted the image to .tiff format and placed it in the same directory as the .box file. Let's say we have 87.tiff and 87.box inside the directory.
Next I generate the 87.lstmf file using
tesseract 87.tiff 87 lstm.train
Next I extract the model using
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/rus.traineddata lstm_model/rus.lstm
Next I create a train.txt file containing the single line: 87.lstmf
Finally, I create a bash script, train.sh:
/usr/bin/lstmtraining \
--model_output output/fine_tuned \
--continue_from lstm_model/rus.lstm \
--traineddata /usr/share/tesseract-ocr/4.00/tessdata/rus.traineddata \
--train_listfile train.txt \
--eval_listfile train.txt \
--max_iterations 400 \
--debug_level -1
And when I run it, I get the following logs:
$ bash train.sh
Loaded file lstm_model/rus.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/rus.lstm
Loaded 1/1 lines (1-1) of document 87.lstmf
Loaded 1/1 lines (1-1) of document 87.lstmf
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
The message "Compute CTC targets failed!" repeats infinitely until i interrupt the script.
What am i doing wrong? I'm also concerned about message "Loaded 1/1 lines (1-1)" since i have multiple bounding boxes on the image.
For LSTM training on images you may need to have the .box files in the new LSTM format (these can be generated by running tesseract with the lstmbox config argument):
TrainingTesseract-4.00
So after each line of text, mark the end of the line with a special entry:
<tab> <left> <bottom> <right> <top> <page>
and then run lstm.train.
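For example, a sketch using the 87.tiff image from the question (assuming the lstmbox config that ships with Tesseract 4 is installed):
# regenerate 87.box in the line-based LSTM box format
tesseract 87.tiff 87 lstmbox
# then rebuild 87.lstmf from the corrected tiff/box pair
tesseract 87.tiff 87 lstm.train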
So I need some coordinates. If you open this page: https://www.inatur.no/jakt/5891ac4ee4b0f06b0de95636/blafjellet-jaktfelt-smaviltjakt-i-lierne there is a map option that shows a polygon over an area.
I want to get those paths as normal lat/long coordinates, but I can't find any solution to this.
In the HTML I find this code:
<path fill="url(#dojoxUnique1)" stroke="rgb(206, 80, 53)" stroke-opacity="1" stroke-width="2.6666666666666665" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="4" path="M 789,616 789,616 636,580 567,563 510,549 499,547 514,412 522,354 528,309 533,280 541,223 547,180 568,199 584,210 598,213 612,211 620,212 644,218 673,232 707,245 751,232 911,289 957,307 984,340 982,341 985,348 983,353 979,359 973,361 964,369 963,366 960,366 955,363 953,366 957,370 959,381 957,386 961,387 962,389 959,394 961,405 957,405 951,408 949,405 945,404 943,409 945,414 948,417 949,423 948,428 944,431 942,434 946,437 945,450 930,458 928,465 919,472 919,474 913,481 911,482 898,503 891,505 888,508 884,503 876,504 858,514 851,540 847,547 800,571 789,616 Z" d="M 789 616 789 616 636 580 567 563 510 549 499 547 514 412 522 354 528 309 533 280 541 223 547 180 568 199 584 210 598 213 612 211 620 212 644 218 673 232 707 245 751 232 911 289 957 307 984 340 982 341 985 348 983 353 979 359 973 361 964 369 963 366 960 366 955 363 953 366 957 370 959 381 957 386 961 387 962 389 959 394 961 405 957 405 951 408 949 405 945 404 943 409 945 414 948 417 949 423 948 428 944 431 942 434 946 437 945 450 930 458 928 465 919 472 919 474 913 481 911 482 898 503 891 505 888 508 884 503 876 504 858 514 851 540 847 547 800 571 789 616Z" stroke-dasharray="none" dojoGfxStrokeStyle="solid" fill-rule="evenodd"></path>
Does someone know how I can convert these to lat/long coordinates?
I have a tls file such as:
1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995
1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338
848 202 1116 791 1114 236 183 186 150 1016 1258 84 952 1202 988 866
946 155 210 980 896 875 925 613 209 746 147 170 577 942 475 850
1500 322 43 95 74 210 1817 1631 1762 128 181 716 171 1740 145 1123
3074 827 117 2509 161 206 2739 253 2884 248 3307 2760 2239 1676 1137 3055
183 85 143 197 243 72 291 279 99 189 30 101 211 209 77 198
175 149 259 372 140 250 168 142 146 284 273 74 162 112 78 29
169 578 97 589 473 317 123 102 445 217 144 398 510 464 247 109
3291 216 185 1214 167 495 1859 194 1030 3456 2021 1622 3511 222 3534 1580
2066 2418 2324 93 1073 82 102 538 1552 962 91 836 1628 2154 2144 1378
149 963 1242 849 726 1158 164 1134 658 161 1148 336 826 1303 811 178
3421 1404 2360 2643 3186 3352 1112 171 168 177 146 1945 319 185 2927 2289
543 462 111 459 107 353 2006 116 2528 56 2436 1539 1770 125 2697 2432
1356 208 5013 4231 193 169 3152 2543 4430 4070 4031 145 4433 4187 4394 1754
5278 113 4427 569 5167 175 192 3903 155 1051 4121 5140 2328 203 5653 3233
How can I read it into a list of lists of Int in Haskell?
I have tried a few options but could not manage to do it. I am very new to Haskell, so please be patient.
First break your input into lines using lines:
let test = "1 2 3 4\n 5 6 7 \n 4 2 5"
let rows = lines test --literally "lines test"! Beautiful, eh?
Result:
["1 2 3 4"," 5 6 7 "," 4 2 5"] :: [[Char]]
Then, extract individual numbers as strings using words:
let nums_as_strings = map words rows
Result:
[["1","2","3","4"],["5","6","7"],["4","2","5"]] :: :: [[[Char]]]
The last thing to do is convert these strings to integers with read:
let numbers = map (map read) nums_as_strings :: [[Int]]
Result:
[[1,2,3,4],[5,6,7],[4,2,5]] :: [[Int]]
Or, squashed into one line:
let numbers = map (map read) (map words $ lines test) :: [[Int]]
Example with your data:
Prelude> let test = "1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995\n1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338"
Prelude> map (map read) (map words $ lines test) :: [[Int]]
[[1224,926,1380,688,845,109,118,88,1275,1306,91,796,102,1361,27,995],[1928,2097,138,1824,198,117,1532,2000,1478,539,1982,125,1856,139,475,1338]]
You may need to take care of empty lines, but that's really simple.
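For example, a minimal sketch: words turns a blank line into an empty list, so filtering out the null rows drops empty lines before read ever sees them:
let numbers = map (map read) (filter (not . null) (map words $ lines test)) :: [[Int]]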
import System.IO
readListOfLists :: Handle -> IO [[Int]]
readListOfLists handle = do
  contents <- hGetContents handle
  let ls :: [String]
      ls = lines contents
      ws :: [[String]]
      ws = map words ls
      res :: [[Int]]
      res = map (map read) ws
  return res
Or you can write the same code in one line:
readListOfLists :: Handle -> IO [[Int]]
readListOfLists = fmap (map (map read . words) . lines) . hGetContents
To use it:
do
  handle <- openFile fileName ReadMode
  table <- readListOfLists handle
  print table -- force the lazily-read contents before closing the handle
  hClose handle
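Alternatively, a minimal sketch that sidesteps manual handle management and the lazy-IO ordering pitfall entirely (readTable is just an illustrative name):
readTable :: FilePath -> IO [[Int]]
readTable path = map (map read . words) . lines <$> readFile path
-- usage: readTable "data.txt" >>= print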
I am developing a website, with a web font, that runs on Apache.
In Google Chrome on Android (desktop and iOS are fine) I am seeing these weird characters. I first thought of an encoding problem, but the characters are not replacing any existing character; they just pop up in between characters.
How do I solve this?
Solved: I had hidden characters in the text, probably from copy-pasting it. I removed the text and typed it again by hand.
Without more details, it's a bit hard to tell.
One way to diagnose this problem is to use a command like od to perform a data dump on the file and find out what occupies that space.
You can do that by running, for example, cat index.html | od -cb, and get output that looks like this:
0000000 < h t m l > \n < b o d y > \n
074 150 164 155 154 076 012 040 040 074 142 157 144 171 076 012
0000020 < p > S a f e t y a n
040 040 040 040 074 160 076 123 141 146 145 164 171 040 141 156
0000040 d s e c u r i t y a r e p
144 040 163 145 143 165 162 151 164 171 040 141 162 145 040 160
0000060 r i o r i t y o n e < / p > \n
162 151 157 162 151 164 171 040 157 156 145 074 057 160 076 012
0000100 < / b o d y > \n < / h t m l
040 040 074 057 142 157 144 171 076 012 074 057 150 164 155 154
0000120 > \n
076 012
0000122
Then you'll be able to better determine what is going on. For example, a zero-width space (U+200B), a common copy-and-paste artifact, shows up in the octal output as the byte sequence 342 200 213.
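If hidden characters like that do turn up, one way to strip them in place is with GNU sed, whose \xHH escapes match raw bytes (a sketch targeting the zero-width space; adjust the byte sequence to whatever od reveals):
# delete every UTF-8 zero-width space (U+200B = bytes e2 80 8b)
sed -i 's/\xe2\x80\x8b//g' index.html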
I'm a complete novice when it comes to SQL, just getting started.
I need help writing a query to update values in my SQL table:
Two tables: Members, Chapters
Concerned with three columns in Chapters table: CHAP_NO, FA, FA2
i.e.
CHAP_NO FA FA2
111 1234567 2345689
222 2234567 4567899
333 3225545
444
555 2358878 4566665
666 4568799
777 4566878 1233666
888 1119998
999 3555879 6544799
etc. . .
Each value is a unique identifier
Concerned with two columns in Members table: MEMB_NO, CURR_CHAP
i.e.
MEMB_NO CURR_CHAP
1234567 665
5468787 664
4577789 122
4578767 233
7775666 588
4114748 787
etc. . .
Is it possible to automate an update so that when a MEMB_NO appears as FA or FA2 in the Chapters table, that member's CURR_CHAP value in the Members table is updated to the corresponding CHAP_NO value?
From the above example data, I need MEMB_NO '1234567' to have his CURR_CHAP updated to '111' because he is listed as FA for CHAP_NO '111'.
I really need to do this for similar MS SQL and MySQL databases if possible. If this can't be automated, I need help writing a query to manually update the Members table as shown above, using a two-column set of update data:
MEMB_NO CURR_CHAP
5470011 547
5030038 545
3880188 544
1140753 543
4130019 543
5420011 542
5410010 541
2590511 540
4190109 540
4180296 539
5380020 538
5370012 537
1050859 536
4390125 535
4860144 535
5330009 533
5330061 533
1080746 532
2060321 531
1750750 529
4250135 528
8070013 528
1080645 527
5270053 527
2580695 526
2440073 525
2440163 525
5240010 524
4980035 523
2120380 522
4000418 521
3270185 520
4350210 519
4610218 518
5160004 516
1610450 515
5150065 515
5130046 513
5130050 513
5120047 512
1940306 510
2500170 510
5090087 509
5080014 508
1270803 505
1381026 505
2260505 504
3900106 504
5030006 503
1770526 501
1780355 501
5000017 500
4980037 498
2380411 497
4970019 497
4960044 496
4960127 496
4950012 495
4950095 495
1720409 494
2260867 494
2300466 493
3990055 492
4920204 492
1311252 491
2100252 491
1750592 490
1760563 490
2520403 489
4890051 489
4870076 487
4870143 487
4860153 486
1670856 485
4840054 484
4840143 484
2920024 483
4830136 483
1751087 482
1790828 481
1970128 481
2050815 480
4800027 480
1870246 478
3210174 478
4770100 477
4760124 476
4760126 476
1350640 475
2280722 475
2200077 474
3410230 474
4730100 473
4250159 470
4250156 470
3790179 464
4630164 463
4630139 463
2210062 461
4610188 461
4210110 460
4870065 459
4500246 450
1110937 449
1110934 449
2280501 447
4450323 445
4440114 444
4410135 441
4410216 441
1600799 435
2280449 435
4080089 431
4310132 431
1780525 427
4270190 427
4260502 426
4260550 426
4250467 425
4250485 425
4210328 421
4190230 419
4180005 418
4180341 418
4250232 417
4130004 413
4110444 411
4090133 409
4080308 408
4430119 408
4070279 407
4070443 407
1650354 405
1670725 404
2240204 402
2870319 400
3990114 399
3980014 398
4050073 398
3170399 397
3970348 397
1760487 395
4180191 395
1800443 394
2580288 394
1280499 393
3930227 393
3780058 391
3900377 390
2590362 389
1720492 385
1720398 384
2840325 383
3710142 381
3800235 380
3780407 378
1760459 375
1730026 373
3710306 371
3710228 371
1051294 370
3700332 370
3670174 367
1780583 359
4640038 359
1280614 358
2580373 358
3570449 357
3530560 353
3500046 350
3490275 349
3490244 349
3320203 348
3480310 348
4210188 346
3440364 344
4490223 344
1750642 342
3990257 342
1790541 341
3370562 337
3370738 337
1870336 334
3340382 334
1950674 333
1460619 328
3280586 328
4250013 326
1340705 324
2590495 324
2870029 322
3030290 322
1880232 321
2280415 321
3200547 320
3200568 320
3180132 318
3180178 318
3930433 317
4850072 317
2870449 315
3150168 315
1390763 313
3120170 312
3110048 311
3110110 311
3070267 307
3500231 306
3980122 306
1160708 305
3050510 305
2280197 304
3040348 304
1060785 303
1340760 303
3020534 302
3980151 301
2990239 299
1770425 297
2950573 295
2280513 294
2320434 287
2870594 287
4110133 284
4260131 278
2770221 277
2770366 277
2760484 276
2750397 275
2580694 272
1751006 267
4010252 267
2660235 266
2780335 265
2640326 264
3840125 263
1270872 259
2590690 259
2580728 258
2030556 257
4600151 257
2550390 255
4440010 255
2520461 252
4130095 252
3910117 250
2490314 249
1361032 247
1900370 247
2440211 244
2440101 244
1730150 243
1440258 242
2420062 242
1350511 238
2380559 238
1800598 237
2350417 235
2340372 234
2320453 232
2590582 232
2120104 230
2280696 228
3480122 227
1111011 226
2260626 226
3230234 222
2270200 221
4470101 221
3010326 219
2180334 218
2170591 217
1620648 213
2120524 212
3010424 212
3130060 210
2070261 207
2070313 207
1640858 206
1620684 205
2030573 203
2030810 203
1270589 201
1111015 200
1990448 199
1950384 195
1920328 192
1920684 192
1750798 188
1880607 188
1870445 187
1850587 185
2960295 185
1800721 180
1791166 179
3990116 178
3130119 177
4170034 177
1051172 176
1380942 176
1751011 175
4500021 175
2840346 174
3460307 174
1730027 173
4070275 173
1110986 171
1670586 167
1111222 166
2060385 164
1560459 163
1740135 162
3130093 161
1600695 160
1600682 160
1350600 159
1590341 159
1580464 158
1570742 157
1570761 157
4440077 156
1520404 152
4700010 152
3390033 147
4170240 145
4730144 143
4250191 142
1400502 140
2170212 140
1360713 139
3040299 139
1800519 136
1270930 135
1720638 134
1800462 133
3930387 133
1111000 131
1311274 131
1360547 128
2260776 128
4830091 127
1800431 123
1280523 122
1750851 122
1291052 121
3850165 121
1180219 118
1180477 118
2240110 116
2870263 116
3900143 114
1111488 111
1490386 111
1060765 110
1780463 110
3200394 108
5050015 108
3870219 105
I think this should work (note that the multi-table UPDATE syntax and the IGNORE keyword are MySQL-specific):
UPDATE IGNORE Chapters, Members
SET Members.CURR_CHAP = Chapters.CHAP_NO
WHERE Members.MEMB_NO = Chapters.FA
   OR Members.MEMB_NO = Chapters.FA2
UPDATE members m
INNER JOIN chapters c ON (m.memb_no = c.fa OR m.memb_no = c.fa2)
SET m.curr_chap = c.chap_no
This query worked for me in MySQL:
UPDATE Members m
INNER JOIN Chapters c
    ON (m.memb_no = c.FA OR m.memb_no = c.FA2)
SET CURR_CHAP = c.CHAP_NO
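The answers above use MySQL's multi-table UPDATE syntax. Since the question also covers MS SQL, here is a sketch of the SQL Server equivalent (assuming the same table and column names; T-SQL joins in an UPDATE via a FROM clause):
UPDATE m
SET m.CURR_CHAP = c.CHAP_NO
FROM Members m
INNER JOIN Chapters c
    ON m.MEMB_NO = c.FA
    OR m.MEMB_NO = c.FA2;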