Good tool but some errors are making me disappointed.
I'm using tesseract for single character recognition
This image recognized as ":"
This image recognized as "."
This image recognized as "."
What can I tweak to improve quality?
Looks like the colors are in reverse somehow. How can I instruct tesseract that text is black?
Related
So me and my friends play a game and they recently changed there images from white background and black letters to black background and colorful letters. and the old ocr that we was created years ago by someone is pretty useless now as the accuracy is very low if not 0% (it just took the old ocr ~250 attempts). So my question would i be able to to extract the text from the following picture
I have never used IronOCR and i tried using the default code to get text from image but the results were weird.
Thanks in advance!
You can try to segment the image first by color (a histogram analysis will tell you colors on the image). Then you can convert the images to b/w and run OCR. You'll get better accuracy.
tesseract ../spliced-time.png spliced-time -l eng --psm 13 --oem 3 txt pdf hocr
Gives me a result of: Al
I am confused if there is more I should be doing, or what would be the best approach for an image like this where the font and alignment should be generally the same. Just the numbers would be different. I was looking at opencv as well, but I feel as though this image shouldnt be that hard with maybe some extra work, or configuration or training to recognize the numbers well.
Image Attachment:
I have a question regarding achieving better recognition results with tesseract. I am using tesseract to recognize serial numbers. The serial numbes consist of only one font-type, characters A-Z, 0-9 and occur in different sizes and lengths.
At the moment I am able to recognize about 40% of the serial number images correct. Images are taken via mobile phone camera. Therefore the image quality isn't the best.
Special problem characters are 8/B, 5/6. Since I am recognizing only serial numbers, I am not using any dictionary improvements and every character is recognized independently.
My question is: Does someone has already experience in achieving better recognition results with training tesseract? How many images would be needed to be able to get good results.
For training tesseract should I use printed and afterwards photographed serial numbers, or should I use original digital serial numbers, without printing and photographing?
Maybe somebody has already experience in that kind of area.
Regarding training tesseract: I have already trained tesseract with some images. Therefore I have printed all characters in different sizes, photographed and labeled them correctly. Example training photo of the character 5
Is this a good/bad training example? Since I only want to recognize single characters without any dependency, I though I don't have to use words for training.
Actual I only have trained with 3 of these images for the characters B 8 6 5 which doesn't result in a better recognition in comparison with the original english (eng) tesseract database.
best regards,
Christoph
I am currently working on a Sikuli application using Tesseract to read text (Strings and numbers) from screenshots. I found that the best way to achieve accuracy was to process the screenshot before performing the OCR on it. However, most of the text I am reading is green text-on black background, making this my preferred solution. I used Scalr's method within BufferedImage to increase the size of the image:
BufferedImage bufImg = Scalr.resize(...)
which instantly yielded more accurate results with black text on gray background. I then used BufferedImage's options BufferedImage.TYPE_BYTE_GRAY and BufferedImage.TYPE_BYTE_BINARY when creating a new BufferedImage to process the Image to grayscale and black/white, respectively.
Following these steps brought Tesseract's accuracy from a 30% to around an 85% when dealing with green text on black background, and a really-close-to-100% accuracy when dealing with normal black text on white background. (sometimes letters within a word are mistaken by numbers i.e. hel10)
I hope this helps!
I have a problem with the training/text recognition process with Tesseract. Here is my trainingdata: http://s11.postimg.org/867aq10ur/dot_dotmatrixfont_exp0.png While training Tesseract ignores the dashes (I've marked them with red boxes, just to make it clear which ones I mean) and if I'm using the trained data for text recognition it also ignores them. Today I've played around with the Tesseract parameters (SetVariable(name, value)) but unfortunately I had no success.
What can I do to teach Tesseract those dashes? Thank you in advance!
Tesserect training is pretty tricky.
Your best chance might be to handle the dashes as a single char.
If your box editor or whatever tools you are using does not see the dashes as all, try running some image processing first, especially threshold or invert. try taking a look at OpenCV. They have some excellent tool for this kind of image processing.
I have to recognize set of digits on something like scoreboard, calculator and similiar devices display.
I tried that image in most popular ocr's, but with no success.
How can i preprocess this image to get it work with ocr frameworks? How to get that digits from there?
First,Based on edge detection, determine the position of the digits. Then, convert the image to a two-value image (white foreground and black background). Last, put it to OCR...