Tesseract SetVariable tessedit_char_whitelist in another language

Tesseract's SetVariable whitelist works fine for English. For example, I use this to recognize only digits and letters in an image (excluding special characters such as &*^%!):
_ocr.SetVariable("tessedit_char_whitelist",
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
But I can't do the same thing for Thai:
_ocr.SetVariable("tessedit_char_whitelist","0123456789กขคงจฉ");
Is there a different principle? This does not work: instead of all the specified characters, I get only digits in the output. Tesseract ignores all the Thai letters that I put into the whitelist.
How can I pass this variable correctly?

You might need to install the language package for Thai first. Please refer to the download list here: https://code.google.com/p/tesseract-ocr/downloads/list
Then replace "eng" with "tha" in your code so that the OCR uses the new language data.
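To make the two pieces (language plus whitelist) concrete, here is a minimal sketch that only builds a command line for the tesseract executable, using its -l and -c options. The image name scan.png is hypothetical, and the sketch assumes the Thai traineddata file is already installed; whether the whitelist is honoured can still depend on your Tesseract version.

```python
def tesseract_command(image_path, lang, whitelist):
    """Build an argument list for the tesseract command-line tool.

    Nothing is executed here; pass the list to subprocess.run() yourself.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,                                   # language data to load
        "-c", f"tessedit_char_whitelist={whitelist}",  # restrict output characters
    ]


whitelist = "0123456789กขคงจฉ"
cmd = tesseract_command("scan.png", "tha", whitelist)  # scan.png is a placeholder
print(cmd)
```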

Related

Delimiters in BIML

I am migrating to Biml for SSIS integration of flat, Excel and CSV files, and I want to know all the possible delimiters and text qualifiers we can use, since the documentation doesn't say much about them.
Basically, there are 3 options to choose from:
The “official” ENUM
Allowed values here are:
– CRLF
– CR
– LF
– Semicolon
– Comma
– Tab
– VerticalBar
– UnitSeparator
Use the Hex Code
If you know the ASCII code of your qualifier, you can use it, starting with “_x” and ending with “_”.
A ” would be described by “_x0022_”, for example.
Use the actual character (HTML encoded or escaped)
If you want to (for example) define a ” as your qualifier, you can do so. Just make sure, depending on how you use it, to either encode or escape it. When defining it as an actual Biml property, it has to be encoded.
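For the hex code option, a small helper can generate the value for any single character. This is only a sketch and assumes the form is an underscore, "x", four hex digits and a closing underscore (as in _x0022_ for a double quote); the function name is made up for illustration.

```python
def biml_hex_escape(char):
    """Return the assumed _xHHHH_ form for a single character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Four upper-case hex digits of the character's code point.
    return f"_x{ord(char):04X}_"


print(biml_hex_escape('"'))  # double quote
print(biml_hex_escape("|"))  # vertical bar
```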

Alternative to entering entity references in source code

Google's HTML/CSS Style Guide advises against using entity references:
Do not use entity references.
There is no need to use entity references like &mdash;, &rdquo;, or &#x263a;, assuming the same encoding (UTF-8) is used for files and editors as well as among teams.
<!-- Not recommended -->
The currency symbol for the Euro is &ldquo;&eur;&rdquo;.
<!-- Recommended -->
The currency symbol for the Euro is “€”.
I'm not sure I understand what it is that they are proposing. The only thing I can think of is that they are saying that you should be using your text editor's insert character command (e.g., in Atom, Ctrl-Shift-U, or in Emacs, C-x 8) to enter Unicode characters rather than typing in the literal entity references. Is that it?
The only thing I can think of is that they are saying that you should be using your text editor's insert character command […] rather than typing in the literal entity references. Is that it?
Yes, that's precisely what they're saying.
You don't write &#65; to insert the letter A, after all! There's no more reason to write &auml; for ä, or &hearts; for ♥, when those characters can be represented directly in the HTML file.
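If you want to convince yourself that a reference and the literal character end up as the same text once decoded, here is a quick check with Python's standard html.unescape. The sample pairs are my own, not taken from the style guide.

```python
import html

# Entity reference on the left, the literal character it stands for on the right.
pairs = {
    "&#65;": "A",
    "&auml;": "ä",
    "&hearts;": "♥",
    "&euro;": "€",
}

decoded = {reference: html.unescape(reference) for reference in pairs}
for reference, literal in pairs.items():
    print(reference, "->", decoded[reference], decoded[reference] == literal)
```

So in a UTF-8 file the only difference is how much you have to type and how readable the source is.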

OCR tesseract: trained data creation issue for special type of fonts (using Jtessboxeditor)

I am unable to create proper trained data for non-native Windows fonts, i.e., for CATIA drafting fonts.
Even if some of the alphanumerics are recognized, letters made of disconnected parts such as "i" and "j", and special symbols such as Ø (Phi), ° (degree) and ± (plus-minus), are not recognized properly. The box file values for them are wrong.
jTessBoxEditor is the tool we used to train and create the trained data for Tesseract.
Requesting your assistance with this. Thanks.
I also need these 3 characters - though it might be too late to answer this.
This may not be of much help in all situations, but the Norwegian .traineddata file does include the Ø (Phi) character, and that trained data file has helped me with this character.
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small. If the inside of the character is visibly clear, Tesseract might be able to decipher it.
Now the most difficult one, the ± (plus-minus). I haven't cracked this one yet, and this may be a very woolly approach, but I noticed that the plus-minus is always recognized as a plain plus (+).
I can use this to my advantage.
I could use Tesseract's engine, which exposes PageSegMode.SingleChar, to detect each individual character, and use Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is. The characters can later be reassembled into a string.
Then I could run ImageMagick to compare how similar each plus character found is to a reference image of a plus and of a plus-minus. Whichever reference is more similar tells me which character it is.
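To illustrate the comparison step only, here is a toy sketch. It does not use ImageMagick: plain pixel agreement between tiny hand-made bitmaps stands in for a real image comparison, and the templates and the noisy glyph are invented for the example.

```python
# 5x7 reference bitmaps; '#' is ink, '.' is background.
PLUS = [
    "..#..",
    "..#..",
    "#####",
    "..#..",
    "..#..",
    ".....",
    ".....",
]
PLUS_MINUS = [
    "..#..",
    "..#..",
    "#####",
    "..#..",
    "..#..",
    ".....",
    "#####",
]
TEMPLATES = {"+": PLUS, "±": PLUS_MINUS}


def similarity(a, b):
    """Fraction of pixels on which two equally sized bitmaps agree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same size")
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def classify(glyph):
    """Return the label of the template most similar to the glyph."""
    return max(TEMPLATES, key=lambda label: similarity(glyph, TEMPLATES[label]))


# A plus-minus with one pixel missing, as a cropped character region might look.
noisy = [
    "..#..",
    "..#..",
    "####.",
    "..#..",
    "..#..",
    ".....",
    "#####",
]
print(classify(noisy))
```

With real crops you would first have to scale each character region to the template size, which this sketch skips.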
With my approach, I still have to parse the recognized text and transform it into something usable.
The Ø (Phi) character, for example, may be detected as lower-case, but I want it upper-case.
Or the degree sign is detected as an apostrophe, but the expected result is the degree sign.
Another transformation applies to dimensions: a decimal may be incorrectly recognized with a comma, but I want the decimal separator to be a dot (1,99 becomes 1.99).
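The three clean-up rules above could be sketched like this. The rules are deliberately naive and are my own assumptions about when a replacement is safe: an apostrophe counts as a degree sign only right after a digit, and a comma counts as a decimal separator only between two digits.

```python
import re


def normalize_dimension_text(text):
    """Apply simple post-OCR fixes for drawing dimensions."""
    # Lower-case ø was detected, but the upper-case symbol is wanted.
    text = text.replace("ø", "Ø")
    # An apostrophe directly after a digit is assumed to be a misread degree sign.
    text = re.sub(r"(?<=\d)'", "°", text)
    # A comma between two digits is assumed to be a decimal separator.
    text = re.sub(r"(?<=\d),(?=\d)", ".", text)
    return text


print(normalize_dimension_text("ø1,99 45'"))
```

Note that the comma rule would also rewrite a list such as "1,2", so it only makes sense on text you already know to be a dimension.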

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a CSV file containing Chinese characters in Microsoft Excel, TextWrangler or Sublime Text, some Chinese characters cannot be displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly is shown here:
As you can see, a ? appears in place of the character.
Using the Mac file command, as suggested by http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/, tells me that the CSV file is encoded as UTF-16LE.
I am wondering what the problem is. Why can't I read that specific text?
Is it related to the encoding, or to my laptop settings? Neither macOS nor Windows 10 on the Mac (via Parallels Desktop) displays the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with U+E05E  Private Use Area Character.
For PUA cases that you're sure are the result of HKSCS-compatibility-bodge, you can convert back to proper Unicode using this table.
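As a minimal sketch of that conversion, the snippet below maps Private Use Area code points back to the proper characters with str.translate. The mapping holds only the single pair mentioned above (U+E05E to U+6ED9); a real conversion would need the full table, which is not reproduced here.

```python
# Only the one pair named above; a full HKSCS table has many more entries.
PUA_TO_UNICODE = {
    0xE05E: 0x6ED9,  # U+E05E (Private Use Area) -> U+6ED9 滙
}


def fix_hkscs_pua(text):
    """Replace known Private Use Area code points with proper Unicode."""
    return text.translate(PUA_TO_UNICODE)


mangled = "\ue05e豐金融證券(香港)有限公司"
print(fix_hkscs_pua(mangled))
```

Characters that are not in the mapping pass through unchanged, so it is safe to run over the whole file once you trust the table.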

word2vec : find words similar in a case insensitive manner

I have access to word vectors trained on a text corpus of interest. The issue I am facing is that these vectors are case sensitive, i.e., "Him" is different from "him", which is different from "HIM".
I would like to find the words most similar to "Him" in a case-insensitive manner. I use the distance.c program that comes bundled with the Google word2vec package. Here is where I run into a problem.
Should I pass "Him him HIM" as arguments to the distance.c executable? This would return the set of words closest to the 3 words.
Or should I run the distance.c program separately with each of the 3 arguments ("Him", "him" and "HIM"), and then combine these lists in a sensible way to arrive at the most similar words? Please suggest.
If you want to find similar words in a case-insensitive manner, you should convert all your word vectors to lowercase or uppercase, and then run the compiled version of distance.c.
This is fairly easy to do using standard shell tools.
For example, if your original data is in a file called input.txt, the following will work in most Unix-like shells.
tr '[:upper:]' '[:lower:]' < input.txt > output.txt
If your vectors are in the binary format, you can convert them to text first, then manipulate them as you see fit.
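One thing to watch for: once the text-format vectors are lowercased, "Him", "him" and "HIM" become three rows with the same key. Here is a rough sketch of one way to merge them, by plain averaging. The averaging is my own assumption, not something word2vec does for you, and it ignores how frequent each variant was; the sample rows are invented.

```python
from collections import defaultdict


def merge_case_variants(lines):
    """Merge 'word v1 v2 ...' rows whose words differ only by case.

    Returns {lowercased word: element-wise average of its vectors}.
    The header line of a word2vec text file should be removed beforehand.
    """
    sums = {}
    counts = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        word = parts[0].lower()
        vector = [float(x) for x in parts[1:]]
        if word in sums:
            sums[word] = [a + b for a, b in zip(sums[word], vector)]
        else:
            sums[word] = vector
        counts[word] += 1
    return {w: [x / counts[w] for x in v] for w, v in sums.items()}


sample = ["Him 1.0 0.0", "him 0.0 1.0", "HIM 0.5 0.5", "her 0.0 2.0"]
merged = merge_case_variants(sample)
print(merged)
```

Retraining on a lowercased corpus avoids the merging question entirely, if you have access to the corpus.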