Tesseract OCR failed to recognize these numbers

I am having trouble getting these numbers recognized correctly. With "--psm 7", I got "lOON3S06"; with "--psm 7 digits", I got "0306". The image is clear, the resolution is good, and there is enough white space around the text. I tried both version 4.1.1 and the latest 5 release. Any suggestions?
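For reference, the two attempts described above correspond roughly to the following when driven from Python with pytesseract. This is only a sketch: the file name is a placeholder, and the whitelist variant at the end is an extra idea to try, not something from the attempts above.

    # Sketch of the attempts described above, via pytesseract.
    # "serial.png" is a placeholder file name.
    from PIL import Image
    import pytesseract

    img = Image.open("serial.png")

    # Attempt 1: treat the image as a single text line
    print(pytesseract.image_to_string(img, config="--psm 7"))         # returned "lOON3S06"

    # Attempt 2: single text line with the digits-only config
    print(pytesseract.image_to_string(img, config="--psm 7 digits"))  # returned "0306"

    # A further variant sometimes worth trying: an explicit character whitelist
    print(pytesseract.image_to_string(
        img, config="--psm 7 -c tessedit_char_whitelist=0123456789"))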


Is it possible to grab the 4 numbers from this image using IronOCR?

My friends and I play a game, and they recently changed their images from a white background with black letters to a black background with colorful letters. The old OCR that someone created years ago is pretty useless now, as the accuracy is very low if not 0% (it just took the old OCR ~250 attempts). So my question is: would I be able to extract the text from the following picture?
I have never used IronOCR, and I tried using the default code to get text from an image, but the results were weird.
Thanks in advance!
You can try to segment the image first by color (a histogram analysis will tell you which colors are in the image). Then you can convert the images to black and white and run OCR. You'll get better accuracy.
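A minimal sketch of that idea with OpenCV and pytesseract, assuming the letters are the only strongly saturated pixels; the file name and the HSV thresholds are placeholders to be tuned from the histogram of the real image:

    # Sketch: isolate the colorful letters on the dark background, binarize, then OCR.
    # "board.png" and the HSV thresholds are placeholders.
    import cv2
    import numpy as np
    import pytesseract

    img = cv2.imread("board.png")
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Colorful letters = strongly saturated, reasonably bright pixels.
    mask = cv2.inRange(hsv, np.array([0, 80, 80]), np.array([179, 255, 255]))

    # Tesseract prefers dark text on a white background, so invert the mask.
    bw = cv2.bitwise_not(mask)

    print(pytesseract.image_to_string(bw, config="--psm 7"))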

Windows OS displaying unknown character vertical rectangles in browsers [duplicate]

This question already has answers here: Rectangles instead of whitespace in Chrome
I had an interesting issue pop up today on a couple of websites I host. Specifically, a vertical rectangle character is displaying (mostly) near certain characters in my code (though I tend to use those sparingly), but only on Windows machines. I've only seen one instance where the vertical character has displayed in the middle of a paragraph.
I checked the site in Firefox, Chrome and Safari on Mac and didn't see any issues. Even more interestingly, when I log in to the site backend on the Windows machine, the vertical rectangle character is actually selectable and deletable. No clue why this is happening.
Both sites are running on WordPress. I've attached a screenshot for reference. Thanks!
Unknown Character Reference Screenshot
That is a character that your font doesn't know how to render properly. Try copying and pasting it into Google and see what it says it is. Chances are it is safe to remove it; try retyping that bit of information WITHOUT copying and pasting. Otherwise, try a different font and see if it is still there. See this for more; specifically: The replacement character � (often a black diamond with a white question mark or an empty square box) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. (Bold added)
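If it does turn out to be U+FFFD (or a similar invisible stray), a quick way to confirm and strip it from the stored content is something like the following; the sample string is purely illustrative:

    # Illustrative only: detect and remove the Unicode replacement character U+FFFD
    # from a piece of content pulled from the page or database.
    content = "Some heading\ufffd text"   # placeholder sample

    if "\ufffd" in content:
        print("Replacement character found at index", content.index("\ufffd"))

    cleaned = content.replace("\ufffd", "")
    print(cleaned)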

Algorithm issue with TIFF CCITT Group 4 decompression (T.6)

I work for an engineering design house and we store black and white design drawings in TIFF format compressed with CCITT Group 4 compression.
I am working on a project to improve our software for working with these drawings. I need to be able to load the raw data into my program obviously, so I must decompress it.
I tried using LibTiff but gave up on that rather quickly. It wouldn't build, generating over 2000 errors. I found many obvious syntax errors in the library and concluded it was junk. I spent about 3 hours trying to find the part of the library that implements the CCITT Group 4 codec, but had no luck; that code is an incomprehensible mess.
So it is that I am writing my own codec for the program. I have it mostly working well, but I am stuck on a problem. I cannot find good documentation on this format. There are a lot of good overviews that describe generally how 2D Modified Huffman compression works, but I can't find any that have specific, implementation-level details. So I am trying to work it out by using some of the drawing files as examples.
I have vertical and pass modes working well and my algorithm decompresses about a third of the image properly before it goes off to the wizard and produces garbage.
I traced the problem to the horizontal mode. My algorithm for the horizontal mode expects to see the horizontal mode code 001 followed by a set of makeup codes (optional) and a termination code in the current pen color, followed by another set of makeup codes (optional) and a termination code in the opposite color.
This algorithm worked well for a third of the way through the image, but suddenly I encountered a horizontal mode run where the opposite color comes before the current pen color.
The section of the image is a run of 12 black pixels followed by a run of 22 white pixels.
The code bits from that section are 00100000110000111, which decodes to Horizontal (001), 22 White (0000011), 12 Black (0000111), which as you can see is the opposite of the order in which the pixels appear in the image.
Since my algorithm expects image order listing, it crashes. But the previous 307 instances of horizontal mode in this same image file were all in image order. This is the only reversed one I have found (so far).
Other imaging programs display this file just fine. I tried manually editing the bits in the image file just as a test to put the order in image order and that causes other imaging programs to crash when decoding the image. This leads me to believe they have some way of knowing that it is reversed in that instance.
Anyone know specific implementation level details about this TIFF CCITT G4 encoding which could help me understand how and why the run codes are sometimes reversed?
Thanks
Josh
CCITT G4 horizontal codes are always encoded as a pair, black/white or white/black, and the order depends on the current pen color. A vertical code flips the color for the next run, but a horizontal code (which encodes two runs) leaves the color unchanged afterwards. If the current pen color is black, you decode a black run code followed by a white one; if the current pen color is white, you do the opposite.
Code : 00100000110000111
001 : Horizontal Mode
0000011000 : Black RunLength 17
0111 : White RunLength 2
It is Black first.
Run codes are not reversed.
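Expressed as code, the rule looks roughly like this. decode_run() and the decoder state are hypothetical helpers, not from any particular library; this only sketches the horizontal-mode branch:

    # Sketch of the horizontal-mode rule described above. decode_run() is a
    # hypothetical helper that consumes optional makeup codes plus a terminating
    # code from the bit stream, using the Huffman table for the given color.
    WHITE, BLACK = 0, 1

    def decode_horizontal(bits, pen_color):
        """Return (first_run, second_run); the pen color is unchanged afterwards."""
        # The first run length is coded with the table of the current pen color,
        # the second with the table of the opposite color.
        first = decode_run(bits, color=pen_color)
        second = decode_run(bits, color=WHITE if pen_color == BLACK else BLACK)
        return first, second

    # For the example above the current pen color is black, so:
    #   0000011000 -> black run of 17, then 0111 -> white run of 2.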

"L" characters showing up randomly in text in IE 8

I'm having this problem with L characters showing up in IE 8. It's happening in the Healthcare Professionals block and the bottom two blocks. Any experience with this/clue as to what's wrong? I'm going to start deconstructing the whole page soon and rebuilding it line by line, but it would be great to get an answer as to what the heck the cause is.
Maybe you can refer to this: https://webmasters.stackexchange.com/questions/15709/strange-characters-appearing-on-websites-ascii-unicode
There may be an encoding issue with the content.

How to make tesseract to give relevant results in the presence of noise?

I am using tesseract 3.0.0 and I bumped into the following problem: when there is something too small for tesseract to recognize, it seems it gets merged with other fragments. As a result, nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to tesseract. Over the rectangle is the result (V over T means new line).
The last case is the problematic one. Is there some way to improve tesseract in situations like this?
As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image gets split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only on the text areas and don't face the problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used as a line recognizer only, or a so-called field-level recognizer, when you use it on small snippets of text cut from a bigger image.
I have not followed thoroughly what was introduced in 3.0; probably it is already partially there, but obviously it does not work as expected, as you have just found out.
There is another open-source project, OCRopus, that approached this problem exactly as I described: first Document Analysis (aka segmentation) and only then OCR. Their earlier versions were actually using Tesseract for OCR after the analysis step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down the priority list.
Here's what you actually can do to address your problem:
If your images have a very typical structure, you can try to do some dumb segmentation and cut the text from the image yourself before passing it to Tesseract (a rough sketch follows below, after these options). However, if you expect to support a wide variety of images, just forget it.
You can check OCRopus and see if its segmentation works for your images. If yes, then you can spend some time making OCRopus and Tesseract work together.
Well, if what you do is not just for fun and you value your time, I would recommend thinking about a real OCR engine like ABBYY. You will get much higher accuracy of both segmentation and OCR out of the box, and professional customer support, of course.
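As a rough illustration of the first option (hand-rolled segmentation before field-level OCR), something like the following could work for images with a fixed layout; the file name, kernel size and minimum box size are guesses that would need tuning for the real images:

    # Rough sketch of "dumb" segmentation: binarize, group characters into text
    # blobs, and run Tesseract separately on each cropped block.
    import cv2
    import pytesseract

    img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Dilate horizontally so characters of one text block merge into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    blobs = cv2.dilate(bw, kernel, iterations=1)

    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w < 20 or h < 10:      # skip specks that are clearly not text
            continue
        crop = img[y:y + h, x:x + w]
        print(pytesseract.image_to_string(crop, config="--psm 7"))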
Disclaimer: I work for ABBYY