Tweak tesseract for better detection of URLs in image - ocr

I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.
As you can see, the image is as clear as it can be.
When running tesseract test2.png stdout it returns http:II11111111111111111111111111111111111
1111111111111111111.coml
Which is close, but not correct.
When setting the tessedit_char_whitelist parameter to htp:/1.com it recognises the string correctly (but I want more general recognition of URLs as well).
Passing in a pattern file that looks like below using command line tesseract test2.png stdout --user-patterns ./patterns.txt
\n\*://\n\*
http://\n\*
\n\*.com
doesn't help with recognition. It still prefers I over /. (More details about the pattern file )
I have also tried to set the parameters ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for trailing /, but it doesn't seem to work. Specifying --user-words doesn't help either. (With "words" being http://)
Is there a way of specifying char priority for tesseract?
Version info:
$ tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

You can achieve this by adding the following line to your unicharambigs
file:
3 : I I 3 : / / 1
Extract the unicharambigs file with combine_tessdata -e eng.traineddata eng.unicharambigs
Edit the unicharambigs file, e.g. with nano eng.unicharambigs (make sure to use tabs after both 3s and the second /).
Overwrite the unicharambigs file in the traineddata file with the edited version combine_tessdata -o eng.traineddata eng.unicharambigs
Output using the amended traineddata file:
$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml

Related

How to fix warning format JSON deprecation in cucumber 4.1.0 gem?

I've recently updated to the latest Ruby cucumber gem and now getting the following warning:
WARNING: --format=json is deprecated and will be removed after version 5.0.0.
Please use --format=message and stand-alone json-formatter.
json-formatter homepage: https://github.com/cucumber/cucumber/tree/master/json-formatter#cucumber-json-formatter.
I'm using json output later for my reporting. In my cucumber.yml I have the following default profile:
default:
-r features
--expand -f pretty --color
-f json -o reports/cucumber.json
According to the reference https://github.com/cucumber/cucumber/tree/master/json-formatter#cucumber-json-formatter they say to use something like
cucumber --format protobuf:cucumber-messages.bin
cat cucumber-messages.bin | cucumber-json-formatter > cucumber-results.json
and
Trying it out
../fake-cucumber/javascript/bin/fake-cucumber \
--results=random \
../gherkin/testdata/good/*.feature \ |
go/dist/json-formatter-darwin-amd64
But it's not really clear how to do that.
I guess you need to change your cucumber profile to produce protobuf output instead of json, and then add a step to post-process the file into the json you want?
I'd assumed that the 'Trying it out' above was your output from actually trying it out, rather than just a straight cut and paste from the formatter's Github page...

Why does Tesseract fail with "Empty page" with this image?

I have the following screenshot:
I want to extract the manuscript word count, 3.574 in this case, from that image (see red rectangle below).
To do this, I run following script:
magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
tesseract screenshot-cropped.png screenshot-ocred -l eng
The first line cuts out the place with the word count and saves it in screenshot-cropped.png which looks like this:
tesseract screenshot-cropped.png screenshot-ocred -l eng is supposed to recognize the characters and save them as text in screenshot-ocred.txt.
However, it produces the following error:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>ocr.bat
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>tesseract screenshot-cropped.png screenshot-ocred -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Empty page!!
Empty page!!
How can I fix it, i. e. make Tesseract recognize 3.574 and save it in screenshot-ocred.txt?
Note: All of this runs on Windows. Here is the output of magick --version:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick --version
Version: ImageMagick 7.0.10-7 Q16 x64 2020-04-20 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2018 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenCL OpenMP(2.0)
Delegates (built-in): bzlib cairo flif freetype gslib heic jng jp2 jpeg lcms lqr lzma openexr pangocairo png ps raw rsvg tiff webp xml zlib
Adding --psm 7 to the Tesseract call solved the problem (tesseract screenshot-cropped.png screenshot-ocred -l eng --psm 7).

Packer HCL2 config file support

In https://packer.io/guides/hcl/from-json-v1/, it says
Note: Starting from version 1.5.0 Packer can read HCL2 files.
And my packer is packer_1.5.5_linux_amd64.zip which is suppose to be able to read HCL2 files. However, when I tried it, I got
$ packer build -only=docker hcl-example
Failed to parse template: Error parsing JSON: invalid character '#' looking for beginning of value
At line 1, column 1 (offset 1):
1: #
^
==> Builds finished but no artifacts were created.
$ packer build -h
Usage: packer build [options] TEMPLATE
Will execute multiple builds in parallel as defined in the template.
The various artifacts created by the template will be outputted.
Options:
-color=false Disable color output. (Default: color)
-debug Debug mode enabled for builds.
-except=foo,bar,baz Run all builds and post-procesors other than these.
-only=foo,bar,baz Build only the specified builds.
-force Force a build to continue if artifacts exist, deletes existing artifacts.
-machine-readable Produce machine-readable output.
-on-error=[cleanup|abort|ask] If the build fails do: clean up (default), abort, or ask.
-parallel=false Disable parallelization. (Default: true)
-parallel-builds=1 Number of builds to run in parallel. 0 means no limit (Default: 0)
-timestamp-ui Enable prefixing of each ui output with an RFC3339 timestamp.
-var 'key=value' Variable for templates, can be used multiple times.
-var-file=path JSON file containing user variables. [ Note that even in HCL mode this expects file to contain JSON, a fix is comming soon ]
and I don't see any switches from above to switch to HCL2 mode.
What I'm missing here?
$ packer version
Packer v1.5.5
$ cat hcl-example
# the source block is what was defined in the builders section and represents a
# reusable way to start a machine. You build your images from that source.source
"amazon-ebs" "example" {
ami_name = "packer-test"
region = "us-east-1"
instance_type = "t2.micro"
}
[UPDATE:]
To address Matt's comment/concern, I've changed the content of hcl-example to the whole list in https://packer.io/guides/hcl/from-json-v1/, and
mv hcl-example hcl-example.hcl
$ packer validate hcl-example.hcl
Failed to parse template: Error parsing JSON: invalid character '#' looking for beginning of value
At line 1, column 1 (offset 1):
1: #
^
Named it with .pkr.hcl extension solved the problem.

JMeter 2.1.13 not loading properties file

I'm running a test with JMeter 2.1.13 on Ubuntu 14.04, getting the output as csv. I use the following command line in Ubuntu 14.04 to try to get it to read the properties file to add fields to the CSV output
./jmeter -n -p /opt/apache-jmeter-2.13/bin/jmeter.properties -l n1.csv -t Apache-DB.jmx
With the following in the properties file
jmeter.save.saveservice.output_format=csv
jmeter.save.saveservice.print_field_names=true
jmeter.save.saveservice.response_code=true
jmeter.save.saveservice.successful=true
jmeter.save.saveservice.latency=true
jmeter.save.saveservice.connect_time=true
jmeter.save.saveservice.bytes=true
jmeter.save.saveservice.default_delimiter=,
It doesn't seem to pick it up, as no field headers are printed. Here's an example from the first line of the csv file
1448233211742,313,HTTP Request,200,OK,Thread Group 1-1,text,false,209666,1,1,96
I've also tried --propfile instead of -p, which didn't work. Am I doing something wrong or does JMeter not read those configuration options like it should?
Background information / helpful information for others
I have managed to turn on a couple of extra fields using command line switches (just in case anyone finds this on Google). This at puts field labels on the JMeter CSV output.
./jmeter -n -Jjmeter.save.saveservice.print_field_names=true -Jjmeter.save.saveservice.connect_time=true -l n1.csv -t Apache-DB.jmx
For reference here are the JMeter default csv fields
timeStamp,elapsed,label,responseCode,responseMessage, threadName,dataType,success,bytes,grpThreads,allThreads,Latency
The header at the top of jmeter.properties advices:
################################################################################
#
# THIS FILE SHOULD NOT BE MODIFIED
#
# This avoids having to re-apply the modifications when upgrading JMeter
# Instead only user.properties should be modified:
# 1/ copy the property you want to modify to user.properties from jmeter.properties
# 2/ Change its value there
#
################################################################################
Your settings are likely being overridden when default saveservice properties are loaded afterjmeter.properties.
Try putting your properties in user.properties.

convert VMX to OVF using OVFtool

I am trying to convert VMX to OVF format using OVFTool as below, however it gives error:
C:\Program Files\VMware\VMware OVF Tool>ovftool.exe
vi://vcenter.com:port/folder/myfolder/abc.vmx abc.ovf
Error: Failed to open file: https://vcenter.com:port/folder/myfolder/abc.vmx
Completed with errors
Please let me know if you have any solution.
I had a similar situation in vmware fusion trying to use a .vmx that was probably created on windows. I could boot the VM, but any attempt to export the machine with ovftool or use vmware-vdiskmanager bombed out with:
Error: Failed to open disk: source.vmdk
Completed with errors
the diskname was totally valid, path was valid, permissions were valid, and the only clue was running ovftool with:
ovftool --X:logToConsole --X:logLevel=verbose source.vmx dest.ova
Opening VMX source: source.vmx
verbose -[10C2513C0] Opening source
verbose -[10C2513C0] Failed to open disk: ./source.vmdk
verbose -[10C2513C0] Exception: Failed to open disk: source.vmdk. Reason: Disk encoding error
Error: Failed to open disk: source.vmdk
as others suggested, i took a peek in the .vmdk. therein i found 3 other clues:
encoding="windows-1252"
createType="monolithicSparse"
# Extent description
RW 16777216 SPARSE "source.vmdk"
so first i converted the monolithicSparse vmdk to "preallocated virtual disk split in 2GB files":
vmware-vdiskmanager -r source.vmdk -t3 foo.vmdk
then i could edit the "foo.vmdk" to change the encoding, which now looks like:
encoding="utf-8"
createType="twoGbMaxExtentFlat"
# Extent description
RW 8323072 FLAT "foo-f001.vmdk" 0
RW 8323072 FLAT "foo-f002.vmdk" 0
RW 131072 FLAT "foo-f003.vmdk" 0
and finally, after fixing up the source.vmx:
scsi0:0.fileName = "foo.vmdk"
profit:
ovftool source.vmx dest.ova
...
Opening VMX source: source.vmx
Opening OVA target: dest.ova
Writing OVA package: dest.ova
Transfer Completed
Completed successfully
I had a similar problem with OVFTool trying to export to OVF format.
Export failed: Failed to open file: C:\Virtual\test\test.vmx.
First, I opened .VMX file in editor (it's a text file) and made sure that settings like
scsi0:0.fileName = "test.vmdk"
nvram = "test.nvram"
extendedConfigFile = "test.vmxf"
mention proper file names.
Then I noticed this line:
.encoding = "windows-1251"
This is Cyrillic code page, so I modified it to use Western code page
.encoding = "windows-1252"
Then, running OVFTool gave a different error
Export failed: Failed to open disk: test.vmdk.
To fix it I had to open .VMDK file in HEX editor (because it's usually a big binary file), found there the string
encoding = "windows-1251"
(it's somewhere in the beginning of the file), and replaced "1251" with "1252".
And it did the trick!
In my case, was needed repair the disk 'abc.vmdk' before convert the 'abc.vmx' to 'abc.ovf'.
Use this for Linux:
$ /usr/bin/vmware-vdiskmanager -R /home/user/VMware/abc.vmdk
Look this link https://kb.vmware.com/s/article/2019259 for resolved issue in Windows and Linux
Try to run as described below.
C:\Program Files\VMware\VMware OVF Tool>ovftool C:\Win-Test\Win-Test.vmx(location of your vmx file) C:\Win-Test\win-test.ovf (destination)
Maybe ovftool is unable to recognize the path you are giving.
Try with following command:
ovftool --eula#=[path to eula] --X:logToConsole --targetType=OVA --compress=9 vi://[username]:[ESX address] [target address]
Once you provide the ESX address, it will list down the folders you have created in your ESX box. Then you can trigger the command above mentioned again with appending folder name.
If no folder hierarchy present in your box, then it will simply list down vm names.
Retry the same command appending [foldername]/[vmname no vmx file name required]
ovftool --eula#=[path to eula] --X:logToConsole --targetType=OVA --compress=9 vi://[username]:[ESX address]/[foldername if exist]/[vmname no vmx file name required] [target address]
I had this same exact issue. In my case I opened up the VMX file and dropped the IDE and sound controllers from the file and saved. I was then able to convert everything to an OVA using the tool with the standard syntax.
e.g. I dropped:
ide1:0.present = "TRUE"
ide1:0.deviceType = "cdrom-image"
and:
sound.present = "TRUE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
This allowed me to convert the file like normal.
For me opening the .vmx and deleting the following line worked:
sata0:1.deviceType = "cdrom-image"
In my case, this works:
ide1:0.present = "TRUE"
ide1:0.deviceType = "cdrom-image"
I did change true to false and works fine, as cdrom-image not exist, this change permit the format conversion.
if your goal is to move a windows based vm to virtual box you only need to:
uninstall vmware tools from the guest vm
shut down the machine
copy the hd to a new folder
create a new empty vm in virtualbox
mount the hd (the .vmdk file) in that vm
Easy and rapid to do.