Training tesseract - shapeclustering issue

I'm trying to train tesseract (adding a new, digits-only font) following the instructions here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
What I've done:
Created a PDF with sample text, converted it to TIFF, and ran tesseract num.dot.exp0.tif num.dot.exp0 batch.nochop makebox digits. Then I edited the generated box file, correcting the wrong detections.
Ran tesseract in training mode with tesseract num.dot.exp0.tif num.dot.exp0 nobatch box.train, and extracted the unicharset with unicharset_extractor num.dot.exp0.box.
Created the font_properties file: echo "num.dot.exp0 0 0 0 0 0" > font_properties
Everything was OK so far: the .box and unicharset files are correct, and num.dot.exp0.tr was generated.
Then I ran shapeclustering -F font_properties -U unicharset num.dot.exp0.tr and got the following error:
Reading num.dot.exp0.tr ...
*** glibc detected *** shapeclustering: double free or corruption (!prev): 0x098c52e0 ***
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x75ee2)[0x82eee2]
/usr/lib/i386-linux-gnu/libstdc++.so.6(_ZdlPv+0x1f)[0x77d51f]
/usr/lib/i386-linux-gnu/libstdc++.so.6(_ZdaPv+0x1b)[0x77d57b]
shapeclustering(_ZN13GenericVectorIiE5clearEv+0x8b)[0x8050949]
shapeclustering(_ZN13GenericVectorIiED1Ev+0x2b)[0x805056b]
/usr/lib/libtesseract.so.3(_ZN9tesseract17TrainingSampleSet14SetupFontIdMapEv+0x137)[0x488699]
/usr/lib/libtesseract.so.3(_ZN9tesseract17TrainingSampleSet22OrganizeByFontAndClassEv+0x22)[0x48823c]
/usr/lib/libtesseract.so.3(_ZN9tesseract13MasterTrainer24ReplaceFragmentedSamplesEv+0x1d7)[0x477ebd]
/usr/lib/libtesseract.so.3(_ZN9tesseract13MasterTrainer15PostLoadCleanupEv+0x47)[0x47587b]
shapeclustering[0x804e2b9]
shapeclustering(main+0x5f)[0x804cb13]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7d24d3]
shapeclustering[0x804ca21]
(...)
00cba000-00cc1000 rw-p 0039c000 08:01 4465015 /usr/lib/libtesseract.so.3.0.2
00cc1000-00d5c000 rw-p 00000000 00:00 0
00ef8000-00f22000 r-xp 00000000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
00f22000-00f23000 r--p 00029000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
00f23000-00f24000 rw-p 0002a000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
08048000-08056000 r-xp 00000000 08:01 4464615 /usr/bin/shapeclustering
08056000-08057000 r--p 0000d000 08:01 4464615 /usr/bin/shapeclustering
08057000-08058000 rw-p 0000e000 08:01 4464615 /usr/bin/shapeclustering
093c5000-094cf000 rw-p 00000000 00:00 0 [heap]
b779a000-b77a0000 rw-p 00000000 00:00 0
b77b6000-b77ba000 rw-p 00000000 00:00 0
bfb6c000-bfb8d000 rw-p 00000000 00:00 0 [stack]
Aborted (core dumped)
After the crash, an empty shapetable file is created.
Have I done something wrong? Any clues as to why this is happening?
I'm using tesseract 3.02.

I managed to find the problem. I should have used echo "dot 0 0 0 0 0" > font_properties instead of echo "num.dot.exp0 0 0 0 0 0" > font_properties
shapeclustering worked properly after that. font_properties needs the font name alone ("dot", in my case), not the complete file base name ("num.dot.exp0").
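As a minimal sketch of the naming rule above, assuming the training files follow the lang.fontname.expN pattern used in this question, the font_properties entry can be derived from the file base name. The helper names font_name_from_base and font_properties_line are hypothetical, and the five zeros simply mirror the flags used in this answer.

```python
def font_name_from_base(base_name: str) -> str:
    """Return the font name from a base name such as 'num.dot.exp0'.

    Assumes the <lang>.<fontname>.exp<N> naming pattern (an assumption
    taken from the file names in this question).
    """
    parts = base_name.split(".")
    if len(parts) != 3 or not parts[2].startswith("exp"):
        raise ValueError(f"unexpected training file name: {base_name!r}")
    return parts[1]


def font_properties_line(base_name: str, flags=(0, 0, 0, 0, 0)) -> str:
    """Build one font_properties line: the font name followed by five flags."""
    name = font_name_from_base(base_name)
    return " ".join([name, *map(str, flags)])


if __name__ == "__main__":
    # Prints the line that would go into font_properties for this question.
    print(font_properties_line("num.dot.exp0"))
```

The printed line can then be written to font_properties before running shapeclustering.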

I was getting the same issue, and solved it by making sure that the font name in the font_properties file is the same as the one in eng.Imagefile.tr.
echo "NewFont 0 0 0 0 0" > font_properties
shapeclustering -F font_properties -U unicharset eng.Imagefile.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.Imagefile.tr

Related

Documentation error in U-Boot for QEMU MIPS

I am trying to emulate a router I found online (just for practice), and this router uses U-Boot as its bootloader.
I want to understand how to use QEMU and U-Boot to create an embedded Linux machine.
The U-Boot source tree includes a document that explains exactly how to run an embedded Linux system using QEMU and U-Boot (u-boot/doc/board/emulation/qemu_mips.rst).
The following quote is step 6 of that documentation:
Generate Ide Disk
# dd of=ide bs=1k cout=100k if=/dev/zero
# sfdisk -C 261 -d ide
# partition table of ide
unit: sectors
ide1 : start= 63, size= 32067, Id=83
ide2 : start= 32130, size= 32130, Id=83
ide3 : start= 64260, size= 4128705, Id=83
ide4 : start= 0, size= 0, Id= 0
To be clear, this is copied and pasted from the documentation file.
The problem is that sfdisk does not have a -C option, so the sfdisk command is invalid.
Has anyone encountered this and found a solution?
Thanks!

You can use the following commands to create the partitioned disk image:
dd of=ide bs=1k count=100k if=/dev/zero
# Create partition table
sudo sfdisk ide << EOF
label: dos
label-id: 0x6fe3a999
device: image
unit: sectors
image1 : start= 63, size= 32067, Id=83
image2 : start= 32130, size= 32130, Id=83
image3 : start= 64260, size= 4128705, Id=83
EOF
I have posted a patch to correct the documentation: https://lists.denx.de/pipermail/u-boot/2020-January/395133.html
You can download it via the mbox link from https://patchwork.ozlabs.org/patch/1216937
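The numbers in the partition table can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes 512-byte sectors and the conventional 255 heads by 63 sectors-per-track geometry, neither of which is stated in the documentation quoted above, and the function names are made up for this example.

```python
SECTOR_SIZE = 512  # bytes per sector (assumption)

# (start, size) in sectors, copied from the partition table above
PARTITIONS = [(63, 32067), (32130, 32130), (64260, 4128705)]


def end_sectors(partitions):
    """Return the first sector after each partition (start + size)."""
    return [start + size for start, size in partitions]


def image_sectors(block_size=1024, count=100 * 1024, sector_size=SECTOR_SIZE):
    """Number of sectors in an image created with dd bs=1k count=100k."""
    return block_size * count // sector_size


def chs_sectors(cylinders, heads=255, sectors_per_track=63):
    """Total sectors for a cylinder/head/sector geometry (255/63 assumed)."""
    return cylinders * heads * sectors_per_track


if __name__ == "__main__":
    print("partition ends:", end_sectors(PARTITIONS))
    print("image sectors: ", image_sectors())
    print("261 cylinders: ", chs_sectors(261))
```

Under these assumptions the three partitions are contiguous and end at sector 4192965, which equals 261 x 255 x 63, so the old -C 261 option appears to have described a disk of that size. The 100 MiB image itself holds only 204800 sectors, so it may be worth checking how your sfdisk version treats the third partition.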

Chrome Crashing Report Trace

Whenever I open a web link that is hosted on another system, Chrome crashes (32-bit Windows 8, 3 GB RAM). The same link works fine in other browsers.
Following is the trace:
24bdf6a4 55b07bbd e0000008 00000001 00000001 KERNELBASE!RaiseException+0x58
24bdf6c0 55b07b9a 216eb000 24bdf764 5506ffa0 chrome_child!base::`anonymous namespace'::OnNoMemory+0x1d [C:\b\c\b\win_clang\src\base\process\memory_win.cc # 44]
24bdf6cc 5506ffa0 216eb000 54a2d3dc 24bdf6d8 chrome_child!base::TerminateBecauseOutOfMemory+0xa [C:\b\c\b\win_clang\src\base\process\memory_win.cc # 53]
24bdf764 5506facd 24bdf79c 216eb000 0000003e chrome_child!discardable_memory::ClientDiscardableSharedMemoryManager::AllocateLockedDiscardableSharedMemory+0x19e [C:\b\c\b\win_clang\src\components\discardable_memory\client\client_discardable_shared_memory_manager.cc # 373]
24bdf7b0 56403990 24bdf808 216eac00 c40b8000 chrome_child!discardable_memory::ClientDiscardableSharedMemoryManager::AllocateLockedDiscardableMemory+0x207 [C:\b\c\b\win_clang\src\components\discardable_memory\client\client_discardable_shared_memory_manager.cc # 212]
24bdf824 564032f2 24bdf910 24bdfa18 24bdfab0 chrome_child!cc::SoftwareImageDecodeCache::GetExactSizeImageDecode+0x90 [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 605]
24bdf8b4 56404777 24bdf910 24bdfa18 24bdfab0 chrome_child!cc::SoftwareImageDecodeCache::DecodeImageInternal+0x1b2 [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 477]
24bdf94c 56403e13 24bdf9cc 24bdfa18 24bdfab0 chrome_child!cc::SoftwareImageDecodeCache::GetDecodedImageForDrawInternal+0x327 [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 548]
24bdfb9c 564032de 24bdfca0 050ca638 050ca6d0 chrome_child!cc::SoftwareImageDecodeCache::GetScaledImageDecode+0x1c3 [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 728]
24bdfc2c 56402dfe 24bdfca0 050ca638 050ca6d0 chrome_child!cc::SoftwareImageDecodeCache::DecodeImageInternal+0x19e [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 479]
24bdfcd4 564068c0 050ca638 050ca6d0 00000000 chrome_child!cc::SoftwareImageDecodeCache::DecodeImage+0x10e [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 408]
24bdfd48 54d711bc 030a8a64 030a8a48 050ca610 chrome_child!cc::`anonymous namespace'::SoftwareImageDecodeTaskImpl::RunOnWorkerThread+0x70 [C:\b\c\b\win_clang\src\cc\tiles\software_image_decode_cache.cc # 119]
24bdfd98 54add477 00000001 030a8a48 0311111c chrome_child!content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired+0x6a [C:\b\c\b\win_clang\src\content\renderer\categorized_worker_pool.cc # 361]
24bdfdb0 54add417 0311111c 030a8a60 031110c8 chrome_child!content::CategorizedWorkerPool::RunTaskWithLockAcquired+0x39 [C:\b\c\b\win_clang\src\content\renderer\categorized_worker_pool.cc # 340]
24bdfdcc 56ae5d06 0311111c 030a8ab0 24bdfe30 chrome_child!content::CategorizedWorkerPool::Run+0x2b [C:\b\c\b\win_clang\src\content\renderer\categorized_worker_pool.cc # 232]
24bdfddc 54add371 031110c8 0000001a 00000000 chrome_child!content::`anonymous namespace'::CategorizedWorkerPoolThread::Run+0x14 [C:\b\c\b\win_clang\src\content\renderer\categorized_worker_pool.cc # 35]
24bdfe30 55af9679 031110c8 000002b8 000002b8 chrome_child!base::SimpleThread::ThreadMain+0x1c1 [C:\b\c\b\win_clang\src\base\threading\simple_thread.cc # 68]
24bdfe54 774aef8c 03133dc0 24bdfea0 7735367a chrome_child!base::`anonymous namespace'::ThreadFunc+0xb9 [C:\b\c\b\win_clang\src\base\threading\platform_thread_win.cc # 92]
24bdfe60 7735367a 03133dc0 405b4b08 00000000 kernel32!BaseThreadInitThunk+0xe
24bdfea0 7735364d 55af95c0 03133dc0 ffffffff ntdll!__RtlUserThreadStart+0x70
24bdfeb8 00000000 55af95c0 03133dc0 00000000 ntdll!_RtlUserThreadStart+0x1b
Is there any reason why this error occurs?

Kernel exception log understanding

I am getting an exception like this:
------------[ cut here ]------------
WARNING: at kernel/workqueue.c:806 wq_worker_waking_up+0x74/0x88()
Modules linked in: dc_incdhad1(OF) pvrsrvkm(OF) corelockr(OF) topazkm(OF) vdecdd(OF) imgvideo(OF) tty_hci(F) st_drv(F) wl12xx(F) wlcore(F)
CPU: 1 PID: 277 Comm: kworker/u8:2 Tainted: GF O 3.10.60 #13
Workqueue: pvr_workqueue MISRWrapper [pvrsrvkm]
Stack : 8de85380 80685fd0 8009f7a4 00000326 8068af7c 8077ee40 80715d00 81c08e40
80715cfc 805d7638 80685fd0 800a5bf0 80715d00 00000000 806b193c 8df33bfc
8df33bfc 8de85380 80685fd0 800a0bd4 80688620 00000000 00000151 802eb040
00000000 00000000 00000000 00000000 00000000 00000000 5f727670 6b726f77
75657571 00000065 00000000 00000000 8ea35280 8e848100 c1a5a07c 8df33ba8
...
Call Trace:
[<8005d420>] show_stack+0x64/0x7c
[<800811d4>] warn_slowpath_common+0x78/0xa8
[<8008128c>] warn_slowpath_null+0x18/0x24
[<8009f7a4>] wq_worker_waking_up+0x74/0x88
[<800b35cc>] ttwu_do_activate.constprop.80+0x60/0x80
[<800b58ec>] try_to_wake_up+0x1d4/0x348
[<805dc44c>] __mutex_unlock_slowpath+0x3c/0x54
[<c1a73098>] DisableSGXClocks+0x58/0x108 [pvrsrvkm]
[<c1a72d00>] SysDevicePrePowerState+0x44/0x54 [pvrsrvkm]
---[ end trace f8733685011c8bb9 ]---
Is there any way to read this log and reach a definite conclusion? The log changes on the next boot, but the first 5 lines remain the same.

Format output file from a bash script into HTML table format

I have a script that gets data from multiple sources and I want to format its output to HTML table format.
Edited:
The format at the moment:
[Environment Name]
[Back end version]
[DB Version]
[event1 status] [event2 status] [event schema] [nodes] [node_no] [vpool] [ver] [node_ip]
The list at the moment:
grid-dev
BE version: 6.0
Database version: 10
DISABLED DISABLED dev_1 3 01 1 10.0.19-MariaDB 10.101.666.11:3306
grid-test
BE version: 7.0
Database version: 11
ENABLED ENABLED test_1 2 02 4 10.0.17-MariaDB 10.108.777.14:3306
grid-test
BE version: 7.0
Database version: 11
SLAVESIDE_DISABLE SLAVESIDE_DISABLE test_2 1 02 3 10.0.17-MariaDB 10.108.777.47:3306
grid-staging
BE version: 6.0
Database version: 10
DISABLED DISABLED staging_1 2 02 4 10.0.18-MariaDB 10.109.888.22:3306
and I want to format it as an HTML table, something like this:
ENVIRONMENT BACKEND_VERSION DB_VERSION EVENT1 EVENT2 SCHEMA NODES NODE_NO VPOOL VERSION IP
----------------------------------------------------------------------------------------------------------------------------------------------------------
grid-dev 6 10 DISABLED DISABLED dev_1 3 01 1 10.0.19-MariaDB 10.101.666.11:3306
grid-test 7 11 ENABLED ENABLED test_1 2 02 4 10.0.17-MariaDB 10.108.777.14:3306
grid-test 7 11 SLAVES... SLAVESI... test_2 2 01 3 10.0.17-MariaDB 10.108.777.47:3306
grid-staging 6 10 DISABLED DISABLED stag_1 2 02 4 10.0.18-MariaDB 10.109.888.22:3306
Is it possible to do this using a bash script? Any help will be appreciated. I am new to bash and HTML, so I am stuck.
My attempt using the code from the answer:
awk 'BEGIN{print "ENVIRONMENT BACKEND_VERSION DB_VERSION EVENT1 EVENT2 SCHEMA NODES NODE_NO VPOOL VERSION IP" } NF==1{env=$0; t=1; next;} t==1{t++; be=$3; next;} t==2{t++; db=$3; next;} t==3{printf "%s %s %s %s\n", env, be, db, $0; env="#";be="#";db="#";}' < "$output" | column -t | tr '#' ' ' >> "$dbstats"
The output is:
ENVIRONMENT BACKEND_VERSION DB_VERSION EVENT1 EVENT2 SCHEMA NODES NODE_NO VPOOL VERSION IP
grid-dev56.0 136 grid_dev Database version: 138
DISABLED DISABLED grid_systest 3 03 1 10.0.19-MariaDBgrid-systest56.0
Database version: 138
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_systest 3 01 1 10.0.19-MariaDBgrid-systest56.0
Database version: 138
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_systest 3 02 1 10.0.19-MariaDBgrid-staging56.0
Database version: 136
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_staging 3 03 1 10.0.19-MariaDBgrid-staging56.0
Database version: 136
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_staging 3 02 1 10.0.19-MariaDBgrid-staging56.0
Database version: 136
ENABLED ENABLED grid_staging 3 01 1 10.0.19-MariaDBgrid-production56.0
Database version: 136
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_production 3 03 1 10.0.19-MariaDBgrid-production56.0
Database version: 136
SLAVESIDE_DISABLED SLAVESIDE_DISABLED grid_production 3 02 1 10.0.19-MariaDBgrid-production56.0
Database version: 136
DISABLED SLAVESIDE_DISABLED grid_production 3 01 1 10.0.19-MariaDB
Thanks

$ awk 'BEGIN{print "Environment BackEndVersion DBVersion EventName Status Schema" } NF==1{env=$0; t=1; next;} t==1{t++; be=$3; next;} t==2{t++; db=$3; next;} t==3{printf "%s %s %s %s\n", env, be, db, $0; env="#";be="#";db="#";}' <input_file | column -t | tr '#' ' '
Environment BackEndVersion DBVersion EventName Status Schema
grid-dev 6.0 10 swap DISABLED dev_1
busy DISABLED dev_1
grid-test 7.0 11 swap ENABLED test_1
busy ENABLED test_1
grid-staging 6.0 10 swap DISABLED staging_1
busy DISABLED staging_1
grid-production 5.0 9 swap ENABLED prod
busy ENABLES prod
After you edit your question with your attempts, please comment on this answer and I will add an explanation.
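The awk one-liner above is a small state machine: a one-field line starts a new environment, the next two lines supply the back-end and database versions, and every later line is a data row. A rough Python sketch of the same idea follows. The function name flatten_records is hypothetical, and two details differ from the awk version by choice: the version is taken from the last field of its line, and the environment columns are repeated on every row (as in the table the question asks for) rather than blanked.

```python
def flatten_records(lines):
    """Turn the multi-line listing into one row per data line.

    Each row is [environment, be_version, db_version, *data_fields].
    """
    rows = []
    env = be = db = ""
    state = 0  # 0: waiting for env, 1: expect BE line, 2: expect DB line, 3: data rows
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            env, state = fields[0], 1
        elif state == 1:
            be, state = fields[-1], 2   # e.g. "BE version: 6.0" -> "6.0"
        elif state == 2:
            db, state = fields[-1], 3   # e.g. "Database version: 10" -> "10"
        elif state == 3:
            rows.append([env, be, db, *fields])
    return rows


if __name__ == "__main__":
    sample = [
        "grid-dev",
        "BE version: 6.0",
        "Database version: 10",
        "DISABLED DISABLED dev_1 3 01 1 10.0.19-MariaDB 10.101.666.11:3306",
    ]
    for row in flatten_records(sample):
        print(" ".join(row))
```

Like the awk version, this treats any one-field line as an environment name, so it assumes data rows always contain more than one field.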

With the format above, it is possible to convert the table to HTML using:
awk -v header=1 'BEGIN{OFS="\t"; print "<html><body><table>" }
{
    # Escape & first so the entities added for < and > are not escaped again
    gsub(/&/, "\\&amp;")
    gsub(/</, "\\&lt;")
    gsub(/>/, "\\&gt;")
    print "\t<tr>"
    for(f = 1; f <=NF; f++) {
        if(NR == 1 && header) {
            printf "\t\t<th>%s</th>\n", $f
        }
        else printf "\t\t<td>%s</td>\n", $f
    }
    print "\t</tr>"
}
END {
    print "</table></body></html>"
}' "$FORMATED_TABLE"
This could be useful for someone looking to convert the output into HTML.
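For comparison, here is a minimal Python sketch of the same conversion, assuming the table has already been split into rows of string fields. The function name rows_to_html is hypothetical; html.escape from the standard library handles the &, < and > escaping.

```python
from html import escape


def rows_to_html(rows, header=True):
    """Render a list of rows (lists of strings) as a simple HTML table.

    When header is true, the first row is emitted with <th> cells.
    """
    out = ["<html><body><table>"]
    for index, row in enumerate(rows):
        tag = "th" if header and index == 0 else "td"
        out.append("\t<tr>")
        for cell in row:
            # escape() replaces &, < and > (and quotes) with HTML entities
            out.append(f"\t\t<{tag}>{escape(cell)}</{tag}>")
        out.append("\t</tr>")
    out.append("</table></body></html>")
    return "\n".join(out)


if __name__ == "__main__":
    table = [
        ["ENVIRONMENT", "BACKEND_VERSION", "DB_VERSION"],
        ["grid-dev", "6.0", "10"],
    ]
    print(rows_to_html(table))
```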

I know this is a late answer, but it may help those searching for a way to convert bash command output to an HTML table. There is a simple script for this at https://sourceforge.net/projects/command-output-to-html-table/ which can convert any command output or file to a nice HTML table. You can specify the delimiter, including special ones such as tabs and newlines, and the resulting HTML table has a search box at the top.
Just download the script, extract it, and run the following command:
cat test.txt | { cat ; echo ; } | ./tabulate.sh -d " " -t "My Report" -h "My Report" > test.html
This assumes that fields are separated by a space character, as in the other solution: https://stackoverflow.com/a/31245048/16923394
If the delimiter is a tab character, change -d " " to -d $'\t' above.
The generated output file is available here: https://sourceforge.net/projects/my-project-files/files/test.html/download

Compiling csv into one master file then output error?

I am trying to do something for my company. Basically, what I need to do is:
Compile all the CSV files in a folder into one master file.
From the master file, output any error codes found in it to the user.
The key point is to make it automated: I only want to press one button or do one step, and it should do steps 1 and 2 for me immediately.
The problem is that I have no idea what software or language I should be using or looking at. It would be great if someone could enlighten me on how I should approach this.
Note: I have limited knowledge of such things but am willing to learn.
====
Edit:
To give a better example,
File1.csv
Voltage Ampere Power Error ID
==============================================
6V 3A 6W 18-ABB 000123
8V 2A 7W 0 123991
8V 10A 25W 25-ASB 461233
10V 23A 10W 18-ABB 248811
1V 2A 9W 0 321881
File2.csv
Voltage Ampere Power Error ID
==============================================
6V 4A 6W 0 312313
3V 5A 7W 0 123312
2V 10A 5W 25-ASB 461643
1V 2A 10W 18-ABB 656474
11V 2A 9W 0 124242
What I want to achieve:
Compile File1 and File2 into one master.csv as below,
master.csv
File1
Voltage Ampere Power Error ID
==============================================
6V 3A 6W 18-ABB 000123
8V 2A 7W 0 123991
8V 10A 25W 25-ASB 461233
10V 23A 10W 18-ABB 248811
1V 2A 9W 0 321881
File2
Voltage Ampere Power Error ID
==============================================
6V 4A 6W 0 312313
3V 5A 7W 0 123312
2V 10A 5W 25-ASB 461643
1V 2A 10W 18-ABB 656474
11V 2A 9W 0 124242
The master.csv must contain each file name when it is compiled. From master.csv, find and isolate the machine IDs with the error code 18-ABB or 25-ASB (the code will vary, but 0 means no error) into a new file called, for example, outputerror.csv.
The headers (Voltage etc.) need to be carried forward to the new outputerror.csv file.
Hence, outputerror.csv should look like this:
outputerror.csv
Voltage Ampere Power Error ID
==============================================
File1
6V 3A 6W 18-ABB 000123
8V 10A 25W 25-ASB 461233
10V 23A 10W 18-ABB 248811
File2
2V 10A 5W 25-ASB 461643
1V 2A 10W 18-ABB 656474

Updated
@ECHO OFF
REM Delete any old output files, ignoring any error messages
DEL MASTER.CSV ERROR.CSV 2>NUL:
REM Keep track of file number in FNUM
SET /A FNUM=1
REM Loop through all files whose names look like "2015-03-01.CSV"
FOR %%A IN ( *-*-*.csv ) DO (
SET FNAME=%%A
CALL :PROCESSFILE
SET /A FNUM+=1
)
GOTO :EOF
REM ######################################################################
REM PROCESSFILE SUBROUTINE
REM ######################################################################
:PROCESSFILE
SET /A LNUM=1
REM New file, append its name to MASTER
ECHO %FNAME% >> MASTER.CSV
FOR /F "tokens=*" %%L IN (%FNAME%) DO (
SET LINE=%%L
CALL :PROCESSLINE
SET /A LNUM+=1
)
GOTO :EOF
REM ######################################################################
REM PROCESSLINE SUBROUTINE
REM ######################################################################
:PROCESSLINE
FOR /F "tokens=1-5 delims=," %%T in ("%LINE%") DO (
ECHO %LINE% >> MASTER.CSV
IF %LNUM% EQU 1 (
REM Output header line to ERROR if processing first file
IF %FNUM% EQU 1 ECHO %LINE% >> ERROR.CSV
REM Output filename to ERROR for all files
ECHO %FNAME% >> ERROR.CSV
) ELSE (
REM Output lines where field 4 is not "-" to ERROR
IF NOT "%%W" == "-" ECHO %LINE% >> ERROR.CSV
)
)
GOTO :EOF

This is actually MUCH easier using awk: in fact, it is only 2 lines of code! I would suggest downloading awk.exe from here. It is INCREDIBLY powerful and will help with any scripting or text processing task.
The manual is available here.
The whole thing then becomes many lines of comments and 2 lines of code (the third line and the last line), which you run just the same as my other all-Windows solution.
@ECHO OFF
REM Print the contents of all CSV files whose names look like a date, e.g. 2012-11-01.csv, and add their name in ahead of line 3
awk "FNR==3{print FILENAME}1" *-*-*.csv > MASTER.CSV
REM From MASTER.CSV, print the following lines out to file ERROR.CSV:
REM ... first 3 lines, i.e. Record Number < 4
REM ... any lines containing "CSV" or "csv"
REM ... no lines with "Voltage" or "="
REM ... any lines with field4 != "0"
awk "NR<4 || /csv/ || /CSV/{print;next} /Voltage|=/{next} $4!=\""0\""" MASTER.CSV > ERROR.CSV
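Since the question asks which language to look at, here is a rough Python sketch of the same two steps. It is not a drop-in replacement for the scripts above: it assumes every file starts with two header lines (column names and a separator line), that fields are separated by whitespace as in the sample, and that the error code is the fourth field. The function names load_folder, build_master and extract_errors are made up for this example.

```python
from pathlib import Path


def load_folder(folder):
    """Read every *.csv file in folder into {file name: list of lines}."""
    return {
        path.name: path.read_text().splitlines()
        for path in sorted(Path(folder).glob("*.csv"))
    }


def build_master(files):
    """Concatenate all files, writing each file name before its contents."""
    master = []
    for name, lines in files.items():
        master.append(name)
        master.extend(lines)
    return master


def extract_errors(files, error_field=3, ok_value="0"):
    """Keep the header once, then each file name and its rows with an error."""
    errors = []
    for index, (name, lines) in enumerate(files.items()):
        header, data = lines[:2], lines[2:]  # assumes two header lines
        if index == 0:
            errors.extend(header)
        errors.append(name)
        for row in data:
            fields = row.split()
            if len(fields) > error_field and fields[error_field] != ok_value:
                errors.append(row)
    return errors


if __name__ == "__main__":
    files = load_folder(".")
    Path("master.txt").write_text("\n".join(build_master(files)) + "\n")
    Path("outputerror.txt").write_text("\n".join(extract_errors(files)) + "\n")
```

The outputs are written with a .txt extension here so that a second run does not pick them up as input; adjust the names and the glob pattern to suit your folder.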
