I uploaded my documents (bilingual documents) which include more than 13,000 sentences.
But the training result is... only 60% of sentences have been aligned and 40% of sentences have been used for training. So, only 5000 sentences are used.
When I uploaded, I used one bilingual files for source file and target file.
Is this problem caused by the low number of sentences aligned & used?
these files are actually the result of the translation by human.
How many sentences should be used to success training normally?
If I success the training, normally, how is the quality compared to google machine translation?
Related
I have a few questions regarding the fine-tuning process.
I'm building an app that is able to recognize data from the following documents:
ID Card
Driving license
Passport
Receipts
All of them have different fonts (especially receipts) and it is hard to match exactly the same font and I will have to train the model on a lot of similar fonts.
So my questions are:
Should I train a separate model for each of the document types for better performance and accuracy or it is fine to train a single eng model on a bunch of fonts that are similar to the fonts that are being used on this type of documents?
How many pages of training data should I generate per font? By default, I think tesstrain.sh generates around 4k pages.
Maybe any suggestions on how I can generate training data that is closest to real input data
How many iterations should be used?
For example, if I'm using some font that has a high error rate and I want to target 98% - 99% accuracy rate.
As well maybe some of you had experience working with this type of documents and maybe you know some common fonts that are being used for these documents?
I know that MRZ in passport and id cards is using OCR-B font, but what about the rest of the document?
Thanks in advance!
Ans 1
you can train a single model to achieve the same but if you want to detect different languages then I think you will need different models.
Ans 2
If you are looking for some datasets then have a look at this Mnist Png Dataset which has digits as well as alphabets from various computer-based fonts. Here is a link to some starter code to use the data set implemented in Pytorch.
Ans 3
You can use optuna to find the best set of params for your model, but you will need some of the
using-optuna-to-optimize-pytorch-hyperparameters
Have a look at these
PAN-Card-OCR
document-details-parsing-using-ocr
They are trying to achieve similar task.
Hope it answers your Question...!
I would train a classifier on the 4 different types to classify an ID, license, passport, receipts. Basically so you know that a passport is a passport vs a drivers license ect. Then I would have 4 more models that are used for translating each specific type (passport, drivers license, ID, and receipts). It should be noted that if you are working with multiple languages this will likely mean making 4 models based each specific language meaning that if you have L languages you make need 4*L number of models for translating those.
Likely a lot. I don’t think that font is really an issue. Maybe what you should do is try and define some templates for things like drivers license and then generate based on that template?
This is the least of your problems, just test for this.
Assuming you are referring to a ML data model that might be used to perform ocr using computer vision I'd recommend to:
Setup your taxonomy as required by your application requirements.
This means to categorize the expected font sets per type of scanned document (png,jpg tiff etc.) to include inside the appropriate dataset. Select the fonts closest to the ones being used as well as the type of information you need to gather (Digits only, Alphabetic characters).
Perform data cleanup on your dataset and make sure you have homogenous data for the OCR functionality. For example, all document images should be of png type, with max dimensions of 46x46 to have an appropriate training model. Note that higher resolution images and smaller scale means higher accuracy.
Cater for handwritting as well, if you have damaged or non-visible font images. This might improve character conversion options in cases that fonts on paper are not clearly visible/worn out.
In case you are using keras module with TF on mnist provided datasets, setup a cancellation rule for ML model training when you reach 98%-99% accuracy for more control in case you expect your fonts in images to be error-prone (as stated above). This helps avoid higher margin of errors when you have bad images in your training dataset. For a dataset of 1000+ images, a good setting would be using TF Dense of 256 and 5 epochs.
A sample training dataset can be found here.
If you just need to do some automation with your application or do data entry that requires OCR conversion from images, a good open source solution would be to use information gathering automatically via PSImaging module (Powershell) use the degrees of confidence retrieved (from png) and run them against your current datasets to improve your character match accuracy.
You can find the relevant link here
Using Kofax Capture 10 (SP1, FP2), I have recognition zones set up on some fields on a document. These fields are consistently recognizing I's as 1's. I have tried every combination of settings I can think of that don't obliterate all the characters in the field, to no avail. I have tried Advanced OCR and High Performance OCR, different filters for characters. All kinds of things.
What options can I try to automatically recognize this character? Should I tell the people producing the forms (they're generated by a computer) they need to try using a different font? Convince them that now is the time to consider using Validation?
My current field setup:
Kofax Advanced OCR with no custom settings except Maximize Accuracy in the advanced dialog. This has worked as well as anything else I have tried so far.
The font being used is 8 - 12 pt arial, btw.
Validation is a MUST if OCR is involved, no matter if e-docs or paper docs are processed. For paper docs it is an even bigger must.
Use at least 11pt Arial and render the document as 300 dpi image. This will give you I'd say 99.9% accuracy (that is 1 character in every 1000 missed). Accuracy can drop if you have data where digits and letters are mixed within one word especially 1-I, 0-O, 6-G.
Recognition scripts can be used if you know that you have no such mixed data and OCR still returns mixed digits and letters. You can use the PostRecognition script event to catch the recognition result from the OCR engine and modify it with SBL or VB.NET scripts. But it greatly depends on the documents and data you process.
Image cleanup will not do any good for e-docs.
I'd say your best would be to use validation. At least that will push responsibility to the validation operator.
I'm working on getting the Lincoln font to work in Tesseract, and I'm getting abysmal results, even after going through the wildly complicated training process.
This is what the font looks like, so yeah, it's a bit tricky:
I've carefully made a training image, and then used that to make a box file. The training image is here (25MB!). The image is 300 DPI, and has representative characters nicely spaced out vertically and horizontally.
I made a box file for the training image, and it worked properly. I've verified that it's correct using a box file editor.
I took this box file/tif file, and used it to create training data. I did likewise with the 30 or so other sample images/fonts provided by Tesseract.
I created the unicharset file.
I created a font_properties file. There's no guidance on the site about when fraktur should be used. So I've tried it both this way (fraktur on for Lincoln):
eng.lincoln.box 0 0 0 0 1
And this way (fraktur off):
eng.lincoln.box 0 0 0 0 0
And finally, I've tried this with and without dictionary files. When I used dictionary files, they were the wordmap from my search engine, Sphinx, and they have about 15K common words and about 20K uncommon ones.
In all cases, when I try to OCR the first couple lines of this file (3MB), the quality is abysmal. Rather than getting:
United States Court of Appeals
for the Federal Circuit
I get:
OniteiJ %tates C0urt of QppeaIs
for the jfeI1eraICircuit
Why?
I think you'll need a lot more samples (letters) and better training images (clean background, grayscale, 300 DPI, etc.). And try to train with only one font (for instance, Lincoln) first. You can use jTessBoxEditor tool to generate your training images and edit the box files.
Once you master the training process, you can add other fonts to your training. You can test the success of the resultant language data by using it in performing OCR on the training image itself -- the recognition rates should be high.
The font names in font_properties should be like:
lincoln 0 0 0 0 1
I am not a Tesseract expert but I have evaluated nearly every OCR engine available and my comments are based on my experience over the years of analysing OCR errors.
Just wondering why your image has speckles in the background and not a pure white background. I don't know how Tesseract or the training tool works but the background could be causing some problems.
Just reading the sample page is difficult and requires a large amount of concentration. Characters such as F and I are very similar as are U and N. Tesseract like many OCR engines would be using many different techniques to recognise a character and there is not a whole lot difference between many of these characters in terms of the strokes and curves used in the font.
These characters, especially the uppercase characters would confuse many different matching algorithms just because they are so different to standard Latin / Roman type characters. This shows through in your results ie. All capital letters have an OCR error.
I'm trying to use Tesseract-OCR to detect the text of images with pure text in it but these text has a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any possibility to improve the result or rather to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
Like Andrew Cash mentioned, it'll be very hard to perform OCR for that T letter because of its intersection with a number of next characters.
For results improvement you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, it's a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work # ABBYY and can provide you additional info on our products if necessary. I've sent the image you've attached to our SDK and got this response:
Maximal size: lall (35)
in an OCR application you'd usually find connected components of the image and run you OCR engine on those components to recognise them.
My question is what should one do if your connected components has symbols/shapes that donot exist in your training set.
For example, if we're running digit recognition and the image has a straight-line or a char, say "X" or anything else which is not a digit.
How can you tell that it's not a digit?
Normally OCR engines provides the confidence score for each symbol recognized. If you set an acceptance threshold on this confidence score you can distinguish between digits and non-digit information.
Good luck