How to detect whether a TIFF CCITT Fax Group 3 or 4 image has a FillOrder not matching the data encoding - tiff

As the title says. I'm not interested in converting between FillOrder=2 and FillOrder=1. Rather, I have a set of TIFF files where some images were encoded with one setting but "re-tagged" as the other (so the tag's value doesn't match the encoding method).
A human can easily tell that such an image looks wrong: it consists mostly of random horizontal strips, with occasional "point disruptions". Can I write an algorithm that detects images that were encoded or decoded wrongly for this compression method?

It is relatively easy to detect whether Group 3 encoded images have the bits reversed. Each line starts with an EOL code, which is 000000000001 (12 bits), and it is easy to see if that pattern is backwards. Group 4 images are a little harder to detect, but if you control the decoder, you can try to decode a few lines: if there are no errors, then you're probably using the correct bit order.
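For the Group 3 case, here is a minimal sketch of that check in Python (an assumption baked in: the strip data actually begins with an EOL, as TIFF Class F writers produce; plain Modified Huffman strips may contain no EOLs at all):

```python
T4_EOL = "000000000001"  # Group 3 end-of-line code: eleven 0 bits, then a 1

def first_bits(strip: bytes, n_bytes: int = 3, reverse: bool = False) -> str:
    """Render the first bytes of a strip as a bit string, optionally
    reversing the bit order inside each byte (FillOrder 2 vs. 1)."""
    chunks = []
    for b in strip[:n_bytes]:
        s = f"{b:08b}"  # MSB-first view of the byte
        chunks.append(s[::-1] if reverse else s)
    return "".join(chunks)

def guess_fill_order(strip: bytes) -> str:
    """Heuristic: report which per-byte bit order exposes a leading EOL
    (possibly preceded by a few 0 fill bits)."""
    if T4_EOL in first_bits(strip):
        return "FillOrder=1 (MSB-first) matches the data"
    if T4_EOL in first_bits(strip, reverse=True):
        return "FillOrder=2 (LSB-first) matches the data"
    return "no leading EOL found (Group 4, or MH data without EOLs)"
```

If the FillOrder tag in the file disagrees with what guess_fill_order reports, the image was likely re-tagged without being re-encoded.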

Related

tesseract unable to detect characters in simple two-word image

I'm having trouble getting tesseract to recognize any characters in the following image:
When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) and augmenting the existing dictionary with these additional words (LAO and CAUD). Neither approach worked. I have tried tesseract versions 3 and 4, and have built version 5 from source on a Mac; all give the same result.
Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:
The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?
Thanks in advance.
It's a common problem. You will probably need to preprocess the image with rescaling, filters, etc.
Here are some references on how to do that:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://docparser.com/blog/improve-ocr-accuracy/
The solution was to use the right page segmentation mode (PSM). In my case, PSM 6, which treats the image as a single uniform block of text, did the trick.
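For reference, a minimal pytesseract sketch of that fix (the file name is hypothetical; the -c flags reproduce the dictionary experiment from the question):

```python
import pytesseract
from PIL import Image

# --psm 6 treats the page as a single uniform block of text; the -c flags
# additionally disable the dictionaries, as attempted in the question.
img = Image.open("two_words.png")  # hypothetical file name
text = pytesseract.image_to_string(
    img,
    config="--psm 6 -c load_system_dawg=0 -c load_freq_dawg=0",
)
print(text)
```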

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed.
I am trying to OCR a document with irregular structure, a letter looking like this:
The problem is that in the .hocr file generated by Tesseract I get lines consisting of the left and right columns glued together, like "Recipient: Sender:".
What I'd like to achieve is output from the left and right column separated. Using third party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over the default, likely because the structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these, and probably there is a right combination of values, but there are 57 options with "space" in the name... any insight on these would be much appreciated (see the sketch after this list).
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
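On the config-variable idea: here is a sketch of setting those options per run with tesseract's -c flag (no config-file editing needed); the file names and values are placeholders to experiment with:

```python
import subprocess

# Writes letter.hocr; --psm 1 enables automatic page segmentation with OSD,
# and each -c pair overrides one config variable for this run only.
subprocess.run(
    [
        "tesseract", "letter.png", "letter",
        "--psm", "1",
        "-c", "textord_words_maxspace=2.0",  # placeholder value to tune
        "-c", "gapmap_use_ends=1",           # placeholder value to tune
        "hocr",
    ],
    check=True,
)
```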

Usage of encoded images

Recently I found out about encoded images (base64 strings) and they seem really nice. I have a few questions about them:
Should I make all of my images encoded?
If I have a photo encoded, do I have to keep the photo on my website inside some directory?
How much faster is this? Is it really worth converting every image and using it as a string?
If I have a gallery, should I use the encoded images, or just keep it as it is, which in some cases results in hundreds of HTTP requests?
Thanks in advance!
Should I make all of my images encoded?
No, some browsers have limits on data URI length.
From the Mozilla Developer Network:
Although Mozilla supports data URIs of essentially unlimited length, browsers are not required to support any particular maximum length of data. For example, the Opera 11 browser limits data URIs to around 65000 characters.
If I have a photo encoded, do I have to keep the photo on my website inside some directory?
No, you won't need the original image; once you encode it, only the encoded string is required.
How much faster is this? Is it really worth converting every image and using it as a string?
It will save you HTTP requests, but you shouldn't convert every image.
If I have a gallery, should I use the encoded images, or just keep it as it is, which in some cases results in hundreds of HTTP requests?
No, you shouldn't; take a look at lazy loading instead.
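For context, a data URI is nothing more than the file's bytes base64-encoded behind a MIME prefix; a minimal Python sketch (the file name is hypothetical):

```python
import base64

# Encode a small image as a data URI and emit an <img> tag to embed in HTML.
with open("icon.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

data_uri = f"data:image/png;base64,{encoded}"
print(f'<img src="{data_uri}" alt="icon">')
```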
Should I make all of my images encoded?
Not necessarily. With base64 encoding, the size of the images typically grows by ~33% compared to the original resources, they are not cached like regular images, and they always need to be decoded back into images.
Thus it's better to use this technique only to reduce the number of requests for a few small images. Besides, older browsers like IE7 don't accept base64 encoding.
If I have a photo encoded, do I have to keep the photo on my website inside some directory?
No
How much faster is this? Is it really worth converting every image and using it as a string?
No; see the first answer.
If I have a gallery, should I use the encoded images, or just keep it as it is, which in some cases results in hundreds of HTTP requests?
I wouldn't recommend it. You should instead consider using an image preloader or lazy loading.

Using Fractions On Websites

Is it possible to use any fraction symbol on a website, represented as ¼ rather than 1/4 for example?
From what I've gathered, these are the only ones I can use:
½
⅓ ⅔
¼ ¾
Is this right, and why is that? The reason I ask is that I've done a Google web search and can't seem to locate any others, e.g. 2/4.
You can try http://www.mathjax.org/ - it is a JavaScript library for rendering math formulas, if that is what you want.
The image below displays all Unicode-defined fraction symbols. Each of them is treated as a single character. You can use all of them freely, of course, but if you want more, e.g. 123/321, then you should look for a library that can create fractions dynamically.
An option for doing so would be using LaTeX. There is another question (with very good answers) on how to do this.
Image from http://symbolcodes.tlt.psu.edu/bylanguage/mathchart.html#fractions
As I understand it, HTML5 includes MathML, which can represent any fraction you want.
While searching the unicode table I also found these: ⅑ ⅒ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞.
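If you would rather enumerate them programmatically than hunt through code charts, here is a small Python sketch that lists every Unicode vulgar-fraction character along with the numeric HTML entity that embeds it in a page:

```python
import unicodedata

# Scan all code points for characters named "VULGAR FRACTION ..." and print
# the character, its code point, and the numeric HTML entity for it.
for cp in range(0x110000):
    name = unicodedata.name(chr(cp), "")
    if "VULGAR FRACTION" in name:
        print(f"{chr(cp)}  U+{cp:04X}  &#{cp};  {name}")
```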
A web page is built up of text, and that text is encoded in a certain character set. The character set you select determines which characters can be displayed; characters or symbols that don't exist in the character set cannot be displayed.
As shown in Michael's answer, Unicode defines symbols for a number of fractions. These can be displayed without resorting to all kinds of tricks, for example server- or client-side generated small bitmaps showing the desired fraction, or, as indicated by mohammad mohsenipur, a JavaScript library that transforms TeX or MathML.
There are several possibilities:
Use a special character for the fraction. Not possible for 2/4, for example, and font support is problematic for all but the three most common (vulgar) fractions you had found.
Use markup like <sup>2</sup>/<sub>4</sub> (superscript numerator, subscript denominator). Probably messes up your line spacing, and does not look particularly good.
Construct a fraction using some CSS for positioning and size control, using the fraction slash character instead of the common slash. Rather awkward, really, I would say.
Use the OpenType "frac" feature. Rather limited support in browsers and especially in fonts.
MathJax, e.g. \(\frac{2}{4}\) or some more elaborate TeX code to produce a different style for the fraction.
MathML. Verbose, and browser support for MathML inside HTML could be better.
These are explained more and illustrated in my page “Math in HTML (and CSS)”, section Fractions.
The choice thus depends on many factors. It also depends quite a lot on the font; I suggest you test the different options using the font-family declaration you intend to use. Despite the many alternatives, you might end up using just the simple linear notation, like 2/4.

Recognizing superscript characters using OCR

I've started a simple project in which it must take an image containing text with superscripts and then, using OCR (currently tesseract), recognize the superscript characters as well as the normal ones.
For example, we have a chemical formula such as Cl², but when I use tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First of all, to make sure you are NOT overlooking the functionality even though it may be there on an OCR system. Make sure to look at your result test not in plain TXT format, but in some kind of rich text capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if OCR were to give you correct characters, your viewer could have converted it to display it. If you are accessing text result programatically, that is less of an issue because you are supposed to get a proper subscript character value when accessing it directly. Just note that viewers must support it for you to actually see it. If you eliminated this possible post-processing conversion and made sure that no subscript is returned from OCR, then it probably does not support it.
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR engines will see a subscript as any other normal character, if they can see it at all. The OCR you use needs the technical capability to actually produce superscripts/subscripts; many do, but not surprisingly they tend to be commercial OCR systems.
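If you are stuck with an engine that reports only plain characters, one workaround (my own heuristic sketch, not a built-in tesseract feature) is to infer raised, smaller tokens from the box geometry tesseract does report, e.g. on a single-line image:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# Heuristic for a single text line (hypothetical file name): flag a token
# as a superscript when it is clearly smaller than the tallest token and
# its bottom edge sits well above the approximate baseline.
img = Image.open("equation.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

words = [
    (data["text"][i], data["top"][i], data["height"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip()
]
tallest = max(h for _, _, h in words)
baseline = max(top + h for _, top, h in words)  # bottom of the tallest tokens

for text, top, h in words:
    small = h < 0.7 * tallest
    raised = (top + h) < baseline - 0.25 * tallest
    print(text, "(superscript?)" if small and raised else "")
```

Note this works at word level only; splitting a single token like Cl2 into base and exponent would need the character boxes from image_to_boxes instead.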
I made a small test case before answering this question. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) ).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the OCR result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay more attention to image quality than you would for typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters due to too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.