Tesseract OCR doesn't recognize some symbols such as the ^ Circumflex - ocr

I've been trying to use Tesseract to recognize text that contains the circumflex (^), i.e. the power symbol. Tesseract has never recognized it in any of my documents. I tried including the Greek language in case the symbol is supported there, but that didn't work. I've also gone through the official issues posted on GitHub and found nothing there.
Is there any workaround? Any help is greatly appreciated!

There is a 'language pack' for equations. The symbol may be included there. The file is named 'equ.traineddata'. I got it from here: https://github.com/tesseract-ocr/tessdata
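To try the equation pack alongside English, a minimal sketch along these lines may help. It assumes the `tesseract` executable is on your PATH and that `equ.traineddata` has been copied into its tessdata directory; the file name `formula.png` and both function names are hypothetical. Whether `equ` actually covers the circumflex is not confirmed here, so treat this as an experiment rather than a guaranteed fix.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages=("eng", "equ")):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Several language packs can be combined by joining them with '+'.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages=("eng", "equ")):
    """Run Tesseract on an image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only shows the command; call run_ocr("formula.png") to actually run it.
    print(build_tesseract_command("formula.png"))
```

If the output still lacks the symbol, comparing `run_ocr(path, ("eng",))` against `run_ocr(path, ("eng", "equ"))` on the same image shows whether the extra pack changes anything.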

Related

How to make a House/Email symbol code for html5

For example, the house/home symbol is &#8962. I need the most popular symbol codes for a website: contact us, about us, home, etc.
Thanks in advance for any help.
The notation &#8962 (which should really include the semicolon: &#8962;) is just a reference to the character with Unicode code number 8962 in decimal. You can use similar notations for all Unicode characters, so the ultimate reference would be the Unicode Standard, and in practice you might want to look at the Code Charts for symbols there. The symbol denoted by &#8962;, U+2302 HOUSE, is in the Miscellaneous Technical block.
However, most Unicode characters are not supported by most fonts. The real problem with using special characters like “⌂” is with font support and with users’ difficulties in guessing what you mean by such characters (if the users are lucky enough to see them). This is why images are generally recommended for icon-like symbols.
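The relationship between a character, its code number, and its numeric reference can be checked with a short Python sketch using only the standard library. The helper name `numeric_reference` is made up for this illustration.

```python
import html
import unicodedata


def numeric_reference(char):
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(char)};"


house = "\u2302"

print(numeric_reference(house))                        # &#8962;
print(f"U+{ord(house):04X}", unicodedata.name(house))  # U+2302 HOUSE
print(html.unescape("&#8962;") == house)               # True
```

This only tells you which reference to write; it says nothing about whether a visitor's fonts can display the character.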

Is it possible to use English-Hindi converter software in Flex application?

I am using Flex 4.6 and the Anmol Hindi font, but some characters are missing from the keymap, which makes certain words difficult to type.
I want to type Hindi in an easy way, as we do in Gmail. For that I found a program called English to Hindi converter on Indiatyping.com, but I don't know how to use it in my application. I have tried almost all fonts, but a few words remain difficult to type and there are no proper guidelines for typing all the letters, from a, aa, k, kha to ka, kaa, jna, jnaa.
Please help me. I am tired of searching on Google and have not found a proper solution.

Unicode/HTML question about obscure Greek character

I'm putting an old text into HTML. Sometimes it uses Greek terms and phrases. But there's one character I've never seen before. It seems to be a combination of two other characters: small omicron (ο) + small upsilon with perispomeni (ῦ). Here is a PNG illustrating the character, and how it works:
Does anyone know how to put this character into HTML? Can it be found anywhere in Unicode? Has anyone even heard of it?
Thanks.
That's called a ligature. I couldn't find any Unicode character for that one, though there is the Latin version of it:
http://en.wikipedia.org/wiki/Ou_(ligature)
Which mentions the Greek.

What is a good resource for HTML character codes -> glyph and

I've already found a good site to convert HTML character codes to their respective glyphs:
http://www.public.asu.edu/~rjansen/glyph_encoding.html
However, I need a bit more information. Does anyone know of a site like the one above that also provides information on what type of character code it is? Meaning, is it a special character? Is the glyph visible? Etc...
So far I have found some tables with this information, but they aren't as complete as the resource above. I would really like to get my hands on a complete table.
Thanks,
-Ben
HTML Entity Character Lookup
I like FileFormat.Info--e.g.: http://www.fileformat.info/info/unicode/char/20ac/index.htm
The character map on Ubuntu (and I assume most other Linux distros) is fantastic. You can search for any character by its name or description (e.g. "arrow") really easily.
Windows' character map is a poor imitation but kinda works too. It seems to decide that certain fonts (Arial, Verdana etc) can't display some characters, even though they work absolutely fine. (Hint: try MS's more recent font creations like Calibri for better results.)
Once you've found a character you can either:
Copy it and use it directly (requires pages to be UTF-8) like this: ↗
Insert it as a hexadecimal entity. The above character is "U+2197 North East Arrow" so the entity would be &#x2197;
Convert the hex code to decimal (the calculators on Windows and Linux can do this). The above example is &#8599;
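The same lookup and hex-to-decimal conversion can be done in a few lines of Python with the standard `unicodedata` module. This is only a sketch; the `describe` helper and its dictionary keys are invented for the example. The general category it reports (for instance `Sc` for currency symbols or `Cf` for invisible format characters) also gives a rough answer to the "what type of character is it" question.

```python
import unicodedata


def describe(char):
    """Collect the name, general category, and both numeric entities."""
    code_point = ord(char)
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "category": unicodedata.category(char),
        "hex_entity": f"&#x{code_point:X};",
        "dec_entity": f"&#{code_point};",
    }


# Find the character by its official Unicode name.
arrow = unicodedata.lookup("NORTH EAST ARROW")
info = describe(arrow)

print(info["hex_entity"])  # &#x2197;
print(info["dec_entity"])  # &#8599;
```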
Here's a quick, low-footprint way to look them up: &what;

apostrophes coming in as �

I am reading in HTML from a file and displaying it on a web page:
When I look at it in the source I see:
The Club’s summer junior programs
but it shows up as:
The Club�s summer junior program
What is happening here, and why is the � showing up?
Did you set the proper encoding of the html page?
Read here and here.
I'm guessing you (or someone close to you) are copy/pasting from Word and you are seeing the webby effects of Word's [not so] smart quotes. The workaround is to make the declared character encoding match the file's actual encoding, typically utf-8 or windows-1252.
This is definitely a character encoding issue. It means the page says it has X encoding, but actually it has Y.
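The mismatch can be reproduced in a few lines of Python. The sketch below assumes the file was saved as windows-1252 (Python's `cp1252` codec) and then read as UTF-8; the actual encodings in your setup may differ.

```python
text = "The Club\u2019s summer junior programs"

# Word-style curly apostrophe stored as the single byte 0x92 in windows-1252.
raw = text.encode("cp1252")

# Reading those bytes as UTF-8 fails on 0x92, so it becomes U+FFFD (the � sign).
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("cp1252")

print(wrong)
print(right)
```

Declaring the encoding the bytes really use, or re-saving the file as UTF-8 and declaring that, removes the replacement character.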
A very interesting read by Joel on this topic: http://www.joelonsoftware.com/articles/Unicode.html. It is definitely a must-read if you haven't read it already.
It explains pretty well why these problems occur, how they came to be and how to avoid it :).
Maybe you copied the text from a word processor such as MS Word, which changes straight quotes into opening and closing quote characters. When such text is pasted into a text file, it causes these problems.
A simple solution can be to type these quotes again in the text editor.