Algorithm to detect dithered image - OCR

I am trying to detect whether a G4-compressed TIFF image will produce good OCR output. Currently, dithered TIFFs yield poor OCR results, so before I send a TIFF to the OCR engine I would like to determine whether the image is dithered. When a TIFF in our set is dithered, Ghostscript was used to perform the dithering.
Is there an algorithm to determine if an image is dithered?
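One heuristic you could try (a sketch of my own, not an established algorithm): error-diffusion and halftone dithering fill gray areas with isolated pixels and checkerboard patterns, while clean bilevel text consists of large solid runs. Measuring the fraction of pixels that differ from all four of their neighbors gives a rough dither score. The function name and the 0.05 cutoff below are placeholders to calibrate on your own scans:

    import numpy as np
    from PIL import Image

    def dither_score(path, threshold=0.05):
        # G4-compressed TIFFs are bilevel, so Pillow decodes them as mode "1".
        img = np.array(Image.open(path).convert("1"))

        # A pixel is "isolated" when it differs from all four neighbors;
        # error-diffusion dithering produces far more of these than text does.
        center = img[1:-1, 1:-1]
        isolated = ((center != img[:-2, 1:-1]) & (center != img[2:, 1:-1]) &
                    (center != img[1:-1, :-2]) & (center != img[1:-1, 2:]))

        score = float(isolated.mean())
        # The 0.05 cutoff is a guess -- calibrate it against known-clean and
        # known-dithered pages from your own document set.
        return score, score > threshold

If Ghostscript produced an ordered (halftone-screen) dither instead, the isolated-pixel count may be lower; a spike in the image's 2D FFT magnitude at the screen frequency would be another signal worth checking.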

Related

Autodesk Viewer raster background

We are looking at using the Forge Viewer for a project to display and mark up plans, etc. (for now just 2D; 3D may be added in the future). Some drawings may only be available in raster formats, e.g. scans of old buildings.
Is there a way to show such a raster image at a certain predefined size in the viewer as a background? Obviously there wouldn't be the ability to snap to anything or pick out individual objects, but it would still be useful when vector data is simply not available.
The only way I can think of is to use Design Automation to create a CAD file, place the image file in it, and then convert the CAD file to SVF. That seems very clunky, though, and I'm not sure it would actually work without testing it.
Is there a better way to display this data in the viewer?
There is no translation support for raster images (PNG, BMP, etc.), so an SVF model cannot be generated from them directly.
You could either
a) use the image inside a format that we support (e.g. DWG, perhaps placing the image in it using the Design Automation API) and translate that to SVF, or
b) load a dummy model as shown here, and then add the image using e.g. three.js functions

Is there any loss of information in converting jpg files to png?

I am working on an image dataset, using deep learning for segmentation. The training images and masks are in JPEG format. I would like to know whether there is any loss of information in converting JPEG to PNG. I searched a bit but couldn't find any relevant information. I am testing whether using PNG images improves segmentation accuracy. Any help is appreciated.
PNG uses lossless compression by default.
However, the PNG standard supports many different bit depths, color spaces and nifty features that can result in information loss. For example, if you convert a standard JPEG file with 24-bit color to a PNG with 8-bit color, you will lose image information.
When using default settings in libraries such as OpenCV or PIL, though, the conversion is lossless.
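To convince yourself for your particular pipeline, you can round-trip one file and compare pixel arrays; a minimal sketch with Pillow and NumPy (the filenames are placeholders):

    import numpy as np
    from PIL import Image

    # Decode the JPEG once (the JPEG loss already happened at encoding time),
    # then round-trip through PNG and verify the pixels are unchanged.
    original = Image.open("train_image.jpg").convert("RGB")
    original.save("train_image.png")
    roundtrip = Image.open("train_image.png").convert("RGB")

    assert np.array_equal(np.array(original), np.array(roundtrip))  # lossless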

What is a Blob in Tesseract OCR

I am learning Tesseract OCR and reading this article, which is based on this article. From the first article:
First step is Adaptive Thresholding, which converts the image into binary images. Next step is connected component analysis which is used to extract character outlines. This method is very useful because it does the OCR of image with white text and black background. Tesseract was probably first to provide this kind of processing. Then after, the outlines are converted into Blobs. Blobs are organized into text lines, and the lines and regions are analyzed for some fixed area or equivalent text size.
Could anyone explain what a Blob is?
From https://tesseract-ocr.repairfaq.org/tess_glossary.html :
Blob
Isolated, small region of the scanned image. It's delineated by the outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improved the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.
Generally, a blob (also called a connected component) is a connected piece (i.e. not broken) of a binary image; in other words, a solid element in a binary image.
Blob finders are a key step in any system that aims at extracting or measuring data from digital images.
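To make that concrete, here is a minimal sketch using OpenCV's connected component analysis on a binarized page; each label it returns is one blob (the filename is a placeholder):

    import cv2

    # Load as grayscale and binarize; the blobs are the white connected regions.
    gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Each label is one blob; stats holds its bounding box and pixel area.
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    for i in range(1, num_labels):  # label 0 is the background
        x, y, w, h, area = stats[i]
        print(f"blob {i}: bbox=({x}, {y}, {w}, {h}), area={area}")

In a clean scan of black text on white paper, most blobs correspond to single characters (or character parts such as the dot of an 'i'), which is exactly what Tesseract then assembles into text lines.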

Minimalistic way to read TIFF image format pixels

We are participating in RoboCup 2015, organized by the German Aerospace Center, in October. Before the tournament we will get a 30x30-pixel TIFF image representing a low-resolution heightmap. My task is to write fast, lightweight, dependency-free code that reads this TIFF image and does some algorithmic work on it.
I googled the TIFF image format and it seems there are some powerful libraries, but is there a simple way of reading just the color values from the file?
I remember a format, I don't know which, where I could just skip the first 30 bytes and then read the color values in RGB. Do you have any code that could do that, or an idea/explanation of how I could achieve it?
As I said, I do not need the filename, image size metadata, etc. I actually don't even know why they have chosen the TIFF format, since it's just a normal grayscale heightmap.
Any help is much appreciated.
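If the file can be supplied uncompressed, the TIFF container itself is simple enough to parse by hand. Below is a minimal sketch (Python standard library only) that assumes an uncompressed, single-strip, 8-bit grayscale TIFF, which a 30x30 heightmap could plausibly be exported as; it will not handle compressed, tiled, or multi-strip files:

    import struct

    def read_tiff_gray8(path):
        with open(path, "rb") as f:
            data = f.read()

        # Header: byte order ("II" little-endian, "MM" big-endian),
        # magic number 42, then the offset of the first IFD.
        endian = "<" if data[:2] == b"II" else ">"
        magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
        assert magic == 42, "not a TIFF file"

        # Walk the Image File Directory: an entry count, then 12-byte entries.
        (num_entries,) = struct.unpack(endian + "H",
                                       data[ifd_offset:ifd_offset + 2])
        tags = {}
        for i in range(num_entries):
            entry = ifd_offset + 2 + 12 * i
            tag, typ, count, value = struct.unpack(endian + "HHII",
                                                   data[entry:entry + 12])
            if typ == 3 and count == 1 and endian == ">":
                value >>= 16  # a lone SHORT is left-justified in the value field
            tags[tag] = value  # small values are stored inline

        assert tags.get(259, 1) == 1, "compressed TIFF; this sketch cannot read it"
        assert tags.get(258, 1) == 8, "not 8 bits per sample"
        width, height = tags[256], tags[257]  # ImageWidth, ImageLength
        strip = tags[273]                     # StripOffsets (single strip assumed)

        # Return the pixels as rows of grayscale values 0-255.
        pixels = data[strip:strip + width * height]
        return [list(pixels[r * width:(r + 1) * width]) for r in range(height)]

This is essentially the "skip the header" trick you remember, except that the pixel data's location is read from the IFD rather than hard-coded, so it survives writers that lay the file out differently.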

Creating a training image for Tesseract OCR

I'm writing a generator for training images for Tesseract OCR.
When generating a training image for a new font for Tesseract OCR, what are the best values for:
The DPI
The font size in points
Should the font be anti-aliased or not
Should the bounding boxes fit snugly, or not
The 2nd question is partly answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images
There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)
Questions 1 and 3: from experience, I've successfully used 300 dpi images and non-anti-aliased fonts. More specifically, I have used the following convert parameters on a training PDF, which generated a satisfactory image:
convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif
But then I tried to add a dotted font to Tesseract and it only detected characters properly when I used a 150 dpi image. So, I don't think there's a general solution, it depends on the kind of fonts you're trying to add.
I found the answer to the 4th question ("Should the bounding boxes fit snugly?").
It seems that fitting the rectangles as tightly as possible gives much better results.
For the others, 12 pt and 300 dpi will be good enough, as @Yaroslav suggested. I think anti-aliasing is better turned off.
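For completeness, here is a minimal sketch of generating such a training image with Pillow, following the numbers above (12 pt at 300 dpi, with anti-aliasing removed by thresholding; the font path, sample text, and output name are placeholders):

    from PIL import Image, ImageDraw, ImageFont

    DPI = 300
    POINT_SIZE = 12
    pixel_size = int(POINT_SIZE * DPI / 72)  # 12 pt at 300 dpi = 50 px

    font = ImageFont.truetype("/path/to/MyFont.ttf", pixel_size)
    img = Image.new("L", (2400, 200), color=255)  # a white strip of "page"
    draw = ImageDraw.Draw(img)
    draw.text((20, 20), "The quick brown fox jumps over the lazy dog",
              font=font, fill=0)

    # Threshold to pure black/white first: convert("1") on its own would apply
    # Floyd-Steinberg dithering to the anti-aliased edges, which is exactly
    # what we want to avoid in a training image.
    binary = img.point(lambda p: 255 if p > 128 else 0).convert("1")
    binary.save("myfont.exp0.tif", dpi=(DPI, DPI))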
A good tool for Tesseract training: http://vietocr.sourceforge.net/training.html
It is a good tool because it has a number of advantages:
bounding boxes on letters can be edited through a GUI-based interface
it automatically creates all the required files
it automatically combines all the files (freq-dawg, word-dawg, user-words (can be an empty file), inttemp, normproto, pffmtable, unicharset, DangAmbigs (can be an empty file), and shapetable) into a single eng.traineddata file
the new training data can be used with an existing Tesseract eng.traineddata file