Tesseract does not recognize single characters


How to reproduce:
Create new image with paint (any size)
Add letter A to this image
Try to recognize -> tesseract will not find any letters
Copy-paste this letter 5-6 times to this image
Try to recognize -> tesseract will find all the letters
Why?

You must set the "page segmentation mode" to "single char".
For example, in Android you do the following:
api.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_CHAR);

The Python code for that configuration looks like this:
import cv2
import pytesseract

img = cv2.imread("path to some image")
text = pytesseract.image_to_string(
    img,
    config=(
        # restrict the output to letters and digits
        "-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        " --psm 10"  # treat the image as a single character
        " -l eng"    # recognition language (osd is only for orientation/script detection)
    ),
)
print(text)
The --psm flag sets the page segmentation mode.
According to the Tesseract documentation, 10 means:
Treat the image as a single character.
So to recognize a single character, you just need the --psm 10 flag.

You need to set Tesseract's page segmentation mode to "single character."

Have you seen this?
https://code.google.com/p/tesseract-ocr/issues/detail?id=581
The bug list shows it as "no longer an issue".
Be sure to have high-resolution images.
If you are resizing the image, be sure to keep a high DPI and don't make it too small.
Be sure to train your Tesseract system.
Call baseApi.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"); before initializing Tesseract.
Also, you may want to look into which font to use with OCR.

Related

Plain text with .gif extension in Ubuntu

Here's the thing: I am using the terminal to do this. With a text editor such as nano, I create a plain text file with the content "GIF89a2017" and save it as rare.gif.
When I run file rare.gif, it gives me this output: rare.gif: GIF image data, version 89a, 12338 x 14129. That indicates a GIF image with a resolution of 12338 x 14129, and that's what I don't understand. Where's that resolution coming from?
Another thing: I thought the extension didn't really decide what type of file it is. For example, when I take a .gif and change its extension to .exe, the file command still recognises it as a GIF image. I'm guessing that in my case it is recognised as a GIF image because it was created with the .gif extension, but I'd like to know why.
Thanks to everyone!

Where's that resolution coming from?
It's coming from the (bogus) GIF89a header you put in the file. The four bytes following "GIF89a" define the width and height, each stored as a 16-bit unsigned little-endian integer. The characters you put there -- 2017 -- are interpreted as:
32 30 ("20") -- 0x3032 = 12338
31 37 ("17") -- 0x3731 = 14129
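The arithmetic above can be double-checked with a short Python sketch. It is only an illustration of the byte layout (the variable names are mine, not part of any GIF tooling): it unpacks the four bytes after the signature as two little-endian unsigned 16-bit integers.

```python
import struct

# The exact bytes from the question: 6-byte signature followed by "2017".
header = b"GIF89a2017"

signature = header[:6]
# "<HH" = two unsigned 16-bit integers, little-endian (low byte first).
width, height = struct.unpack("<HH", header[6:10])

print(signature, width, height)
```

The low byte comes first, which is why "20" (bytes 32 30) reads as 0x3032 rather than 0x3230.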
I'm gonna guess that in the problem that I have it is recognised as a GIF image because it was created with the GIF extension but I'd like to know why.
No, file doesn't look at extensions. It's because the file had a semi-valid GIF header. If you changed the header to something that didn't start with "GIF89a", it would no longer be recognized as a GIF.
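As a rough illustration of content-based detection, here is a minimal Python sketch. It is not how file itself is implemented (file consults a database of magic patterns); sniff_gif is a hypothetical helper that only checks the leading signature bytes and never looks at a file name.

```python
def sniff_gif(data: bytes) -> bool:
    """Return True if the data starts with a GIF signature (GIF87a or GIF89a)."""
    return data[:6] in (b"GIF87a", b"GIF89a")


# The decision depends only on the content, never on a file name or extension.
print(sniff_gif(b"GIF89a2017"))
print(sniff_gif(b"GIF88a2017"))
```

The first call matches the signature and the second does not, which mirrors what happens when the header is edited.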

How to display hidden characters in PhpStorm, especially line separators

I have some special characters in my code; take a look:
  a     It's just shown in frontend with normal characters like an "a".
Now the same characters without any normal characters:
Characters starts here
     Characters ends here
OK, it looks like this editor will not save the empty characters, so try it with a snippet.
<html><p>  </p></html>
The problem is that PhpStorm does not show these characters, not even with
"Settings - Editor - General - Appearance - Show whitespaces" or
"Settings - Editor - General - Appearance - Show method separators".
Only Ctrl+F and Ctrl+R will find these characters.
I think this character is Mac-only :) I'm working on Windows, and I can't test it on a Mac.
EDIT: Sorry, I was able to identify it as "U+2028 : LINE SEPARATOR"
http://www.babelstone.co.uk/Unicode/whatisit.html
The big problem is that PhpStorm doesn't show anything in the code, as if there were no character. But when moving with the arrow keys, the cursor takes 2 steps at this position: the text between 2 tags looks like "><" but is actually "> <".

Based on your update it is now clear what character you have in mind:
Sorry I could identify it as "U+2028 : LINE SEPARATOR" http://www.babelstone.co.uk/Unicode/whatisit.html
Install and use the Zero Width Characters locator 2 plugin: it can detect quite a few invisible characters (e.g. UTF-8 BOM sequence, non-breaking space, Unicode line separator (your case), etc.).
It is implemented as a separate inspection with the highest (Error) severity, so the issues are easy to spot, and you can check a whole folder/project just for them.
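Outside the IDE, the same kind of check can be sketched in a few lines of Python. This is just an illustration, not what the plugin does internally: the SUSPECTS set and the find_invisible helper are my own hypothetical names, and the list of code points is a small assumed sample that you would extend as needed.

```python
import unicodedata

# Assumed sample of invisible code points to flag: line separator,
# paragraph separator, no-break space, zero width space, BOM.
SUSPECTS = {"\u2028", "\u2029", "\u00a0", "\u200b", "\ufeff"}


def find_invisible(text):
    """Return (line, column, code point, Unicode name) for each suspect character."""
    hits = []
    line, col = 1, 1
    for ch in text:
        if ch in SUSPECTS:
            hits.append((line, col, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
        # Only "\n" starts a new line here; str.splitlines() would also
        # split on U+2028 and hide the very character we are looking for.
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1
    return hits


sample = "<p>\u2028</p>"
print(find_invisible(sample))
```

Read a source file with open(path, encoding="utf-8").read() and pass the text in to scan a whole file.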
There is a ticket (Feature Request) to have an option to show invisible characters in the editor.
https://youtrack.jetbrains.com/issue/IDEA-115572 -- watch this ticket (star/vote/comment) to get notified of any progress. Implemented in version 2020.2.
Other related tickets:
https://youtrack.jetbrains.com/issue/IDEA-99899 (your case, as I understand)
https://youtrack.jetbrains.com/issue/IDEA-140567
https://youtrack.jetbrains.com/issue/WEB-13506
UPDATE 2021-11-10:
As of version 2020.2, the IDE can show invisible/special symbols right in the editor.

Markup font-style (italic) in tesseract OCR

I have tesseract-ocr v3.02.02 installed on Windows 7, and have used it via the command line:
1) Output PNG text to a text file: tesseract image.png txtfile
2) Output PNG text to an HTML file: tesseract image.png htmlfile hocr
I need it to mark up any italic text in the output text or HTML file. How do I do this (preferably on the command line, as I have never used it in API mode)?

The hOCR output by Tesseract includes only the word coordinates and confidence values, not font-related information. As such, you will need to either modify the source code to output what you want in command-line mode, or use its API.
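To see what the hOCR file does carry, here is a minimal Python sketch that pulls the bounding box and confidence out of a word element. The SAMPLE fragment is an assumed illustration of the usual ocrx_word markup, not output copied from a real run, and the exact attribute layout can differ between Tesseract versions; parse_words is a hypothetical helper.

```python
import re

# Assumed illustrative fragment in the usual hOCR word markup.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 95'>Hello</span>"
)

WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*"
    r"title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'[^>]*>"
    r"(.*?)</span>"
)


def parse_words(hocr):
    """Return text, bounding box and confidence for each word element."""
    words = []
    for m in WORD_RE.finditer(hocr):
        x0, y0, x1, y1, conf = (int(g) for g in m.groups()[:5])
        words.append({"text": m.group(6), "bbox": (x0, y0, x1, y1), "conf": conf})
    return words


print(parse_words(SAMPLE))
```

Nothing in these attributes says whether a word is italic, which is why the font information has to come from elsewhere.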

Convert .ttf file to .png

Is there any way to convert a TTF file to PNG files? Or any other method to create a sprite from a TTF file in the libGDX framework? Is there an application available for it?

Before running
LibGDX has a built-in tool in the gdx-tools project called Hiero. Just run that project as a Java application, and when asked which class to run, choose Hiero. It lets you take a .ttf file and render the characters you need (at a size given in pixels), and it also generates a file that describes where each character is on the texture. In the program, it's very simple to initialize and use:
BitmapFont font = new BitmapFont(Gdx.files.internal("data/font/font.fnt"));
...
font.draw(spriteBatch, "Text to output", coordX, coordY);
(font.fnt is the file containing the texture positions and other relevant information; it also refers to the .png, which is created in the same folder by default.)
You can take a look at the BitmapFont documentation here.
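To give an idea of what the .fnt file holds, here is a minimal Python sketch that reads glyph rectangles from a text-format bitmap font description. The SAMPLE_FNT content is made up for illustration (it assumes the common "char id=... x=... y=..." text layout), and parse_chars is a hypothetical helper, not part of libGDX.

```python
# Made-up sample in the text-based bitmap font layout (values are illustrative).
SAMPLE_FNT = """\
info face="Arial" size=32
common lineHeight=36 base=29 scaleW=256 scaleH=256 pages=1
page id=0 file="font.png"
chars count=2
char id=65 x=0 y=0 width=20 height=24 xoffset=0 yoffset=5 xadvance=21 page=0 chnl=0
char id=66 x=21 y=0 width=18 height=24 xoffset=2 yoffset=5 xadvance=21 page=0 chnl=0
"""


def parse_chars(fnt_text):
    """Map each character to its (x, y, width, height) rectangle on the texture."""
    glyphs = {}
    for line in fnt_text.splitlines():
        if not line.startswith("char "):
            continue  # skip info/common/page/chars lines
        fields = dict(pair.split("=", 1) for pair in line.split()[1:])
        glyphs[chr(int(fields["id"]))] = (
            int(fields["x"]),
            int(fields["y"]),
            int(fields["width"]),
            int(fields["height"]),
        )
    return glyphs


print(parse_chars(SAMPLE_FNT))
```

Each rectangle is the region of the .png that BitmapFont draws for that character.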
During runtime
A disadvantage of Hiero is that bitmap fonts don't really scale well, so they can look quite bad on different screen resolutions.
Take a look at this answer to a related question:
One solution is to use the FreeType extension to libgdx, as described here. This allows you to generate a bitmap font on the fly from a .ttf font. Typically you would do this at startup time once you know the target resolution.
I haven't personally used it, but it seems like something worth checking out. It looks very simple as well - the example code in the linked answer is 5 lines long.

I finally found a solution to the same problem (TTF to PNG), which I faced too.
Follow the steps below:
1. Convert TTF to SVG
Use a TTF-to-SVG conversion tool to convert your custom or downloaded TTF file to an SVG file.
2. Convert SVG to PNG/PDF/TTF
Go to IcoMoon. In the top left corner there is an Import Icons button; click it and upload your converted SVG file.
In the bottom bar there is a "Generate SVG & More" option; click on it.
Next, click the Settings gear icon near the "Download" option to override the size and output formats (PDF, PNG, etc.), then close the Settings.
Now click Download to get the outputs in a single zip file.

A TTF is a TrueType font. It is not a picture, but a vector-based character set. You can't simply convert it to a picture with a tool.
If you want to view or manipulate TTF files, you can do this with TTF editing tools, for example FontForge ( http://fontforge.sourceforge.net ).

This may be an old question, but I found the following batch file works with ImageMagick 7:
@ECHO OFF
REM Font file, point size, background, output format, and canvas size
set f=wingding.TTF
set ps=800
set bg=white
set ext=png
set s=600x600
set alpha=A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
set num=0 1 2 3 4 5 6 7 8 9
REM Render each character to its own image file
For %%X in (%alpha% %num%) do (
    convert -font %f% -pointsize %ps% -size %s% -background %bg% label:%%X %%X.%ext%
)
pause
exit
NOTE: This conversion only works with a limited selection of font characters. It works well for all capital letters. Just install ImageMagick and make sure it is in your environment path. Include "legacy" commands in your installation.

Html to ansi colored terminal text

I am on Linux and I want to fetch an HTML page from the web and then output it in the terminal. I found that html2text essentially does the job, but it converts my HTML to plain text, whereas I would rather convert it into ANSI-colored text in the spirit of ls --color=auto. Any ideas?

The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
The above example provides a text version of http://example.com/ using 16 colors. The output format can be customized further depending on need.
The -dump option enables dump mode, which just prints the whole page as text, with the link destinations printed out in a kind of "email style".
-dump-color-mode 1 enables coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator, this can go up to ~16 million colors (True Color). The values are documented in elinks.conf(5).
The colors used for output can also be configured, which is documented in elinks.conf(5) too.
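If a browser is not an option, the basic idea (HTML tags mapped to ANSI escape sequences) can be sketched with Python's standard html.parser. This is a different, much cruder technique than what elinks does, and the tag-to-color mapping below is an arbitrary assumption for illustration; AnsiRenderer and html_to_ansi are hypothetical names.

```python
from html.parser import HTMLParser

# Arbitrary, assumed mapping from a few tags to ANSI escape sequences.
ANSI = {
    "b": "\033[1m",        # bold
    "strong": "\033[1m",   # bold
    "a": "\033[34m",       # blue
    "h1": "\033[1;33m",    # bold yellow
}
RESET = "\033[0m"


class AnsiRenderer(HTMLParser):
    """Collect text, wrapping known tags in ANSI color codes."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ANSI:
            self.out.append(ANSI[tag])

    def handle_endtag(self, tag):
        if tag in ANSI:
            self.out.append(RESET)

    def handle_data(self, data):
        self.out.append(data)


def html_to_ansi(html):
    renderer = AnsiRenderer()
    renderer.feed(html)
    renderer.close()
    return "".join(renderer.out)


print(html_to_ansi("<h1>Title</h1><p>Hi <b>there</b>, see <a href='#'>this</a>.</p>"))
```

One known limitation of this sketch: a closing tag resets all styling, so nested styled tags lose the outer style after the inner one ends.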

The w3m browser supports coloring the output text.

You can use the lynx browser to output the text using this command.
lynx -dump http://example.com
