Tesseract 3 training new font "Error: traineddata file must contain at least (a unicharset fileand inttemp) OR an lstm file." - ocr

I followed the instructions at https://pretius.com/blog/ocr-tesseract-training-data/ to train Tesseract on a font that it does not recognize well. However, the last step, creating the Tesseract traineddata file, fails with this error:
Error message and my tesseract version:
I've searched online and have not found where the inttemp file is created or how it is created. Thanks in advance!

Related

How should one zip a large folder in Windows 10, upload it to GDrive, then unzip it?

I have a directory consisting of 22 sub-directories. Altogether, the directory is about 750GB in size and I need this data on GDrive so that I can work with it in Google Colab. Obviously uploading this takes an absolute age (particularly with my slow connection) so I would like to zip it, upload it, then unzip it in the cloud.
I am using 7zip and zipping each subdirectory using the zip format and "normal" compression level. (EDIT: Can now confirm that I get the same error for 7z and tar format). Each subdirectory ends up between 14 and 20GB in size. I then upload this and attempt to unzip it in Google Colab using the following code:
from google.colab import drive
drive.mount('/content/gdrive/')
!apt-get install p7zip-full
!7za x "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -o"/content/gdrive/My Drive/unzipped_av_tfrecords/" -aos
This extracts some portion of the zip file before throwing an error. There are a variety of errors and sometimes the code will not even begin unzipping the file before throwing an error. This is the most common error:
Can not open the file as archive
ERROR: Unknown error -2147024891
Archives with Errors: 1
If I then attempt to rerun the !7za command, it may extract one or 2 files more from the zip file before throwing this error:
terminate called after throwing an instance of 'CInBufferException'
It may also complain about particular files within the zip archive:
ERROR: Headers Error : drumming/yt-g0fi0iLRJCE_23.tfrecords
I have also tried using:
!unzip -n "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -d "/content/gdrive/My Drive/unzipped_av_tfrecords/"
But that just begins throwing errors:
file #254: bad zipfile offset (lseek): 8137146368
file #255: bad zipfile offset (lseek): 8168710144
file #256: bad zipfile offset (lseek): 8207515648
Although I would prefer a solution in Colab, I have also tried using an app available in GDrive named "Zip Extractor". But that too throws an error and has a data quota.
This has now happened across 4 zip files, and each time I try something new it takes a long time to test because of the upload speeds. Any explanation of why this is happening and how I can resolve the issue would be greatly appreciated. I also understand there are probably alternatives to what I am trying to do, and they would be appreciated too, even if they do not directly answer the question. Thank you!
I had the same problem.
I solved it with:
new ProcessBuilder(new String[] {"7z", "x", fPath, "-o" + dir})
Pass the command as an array of arguments, not as a single command-line string!
Good luck!
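The same idea carries over to Python, for example in a Colab cell. The following is only a minimal sketch: the helper names `build_7z_extract_args` and `extract` are hypothetical, and it assumes a `7z` (or `7za`) executable is on the PATH. Each argument is a separate list element, so paths containing spaces need no extra quoting.

```python
import shutil
import subprocess


def build_7z_extract_args(archive_path, output_dir, exe="7z"):
    """Return the argument list for extracting archive_path into output_dir."""
    # One list element per argument: paths with spaces (such as "My Drive")
    # are passed through unchanged, with no shell quoting involved.
    # 7-Zip expects the output directory attached directly to the -o switch.
    return [exe, "x", archive_path, "-o" + output_dir]


def extract(archive_path, output_dir, exe="7z"):
    """Run the extraction; raises CalledProcessError on a non-zero exit code."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} not found on PATH")
    # With a list and the default shell=False, no shell parses the command.
    subprocess.run(build_7z_extract_args(archive_path, output_dir, exe), check=True)
```

Calling `extract(path_to_archive, target_dir, exe="7za")` would use the `7za` binary from the question instead of `7z`.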
Why does this command behave differently depending on whether it's called from terminal.app or a scala program?

Tesseract 3.05 : Failed loading language `eng` when training Chinese

I am trying to use tesseract to train a Chinese model, here's my script:
# --tessdata_dir points at the root directory of tesseract
./tesstrain.sh \
--lang chi_sim \
--langdata_dir ../../langdata \
--tessdata_dir ../ \
--output_dir ../../output
At first everything works fine, but at phase E (Extracting features) something goes wrong:
Failed loading language 'eng'
Tesseract couldn't load any languages!
Couldn't initialize tesseract
I don't understand: I am trying to train a Chinese model, so why does it look for the eng language, and how do I resolve this problem? Thanks!

How to create a uzn file for tesseract

I need to build an OCR application that scans passports, so I have chosen Tesseract as a starting point. From what I have read, there should be a .uzn file that I define, but I can't find any documentation on it. How can I create such a template for Tesseract to use?
You can either use a uzn file or let Tesseract do the segmentation itself.
In any case, check out the following link if you need more information about the uzn file format:
https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format
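As a rough sketch of what creating such a file could look like, the snippet below writes a uzn file next to an image. It assumes the layout described on the linked page: one zone per line, given as left, top, width and height in pixels followed by a label. It also assumes Tesseract looks for a uzn file that shares the image's base name. The helper names and the coordinates in the usage note are hypothetical.

```python
from pathlib import Path


def format_uzn(zones):
    """Render zones as uzn text, one 'left top width height label' line each."""
    return "".join(f"{x} {y} {w} {h} {label}\n" for x, y, w, h, label in zones)


def write_uzn_for(image_path, zones):
    """Write the zones to a .uzn file that shares the image's base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(format_uzn(zones), encoding="utf-8")
    return uzn_path
```

For example, `write_uzn_for("passport.png", [(40, 520, 900, 60, "MRZ")])` would create `passport.uzn` with a single zone; the real coordinates have to be measured on your own scans.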

Including unicharambigs in the [lang].traineddata file (Tesseract)

I'm facing a problem training Tesseract OCR for Kannada fonts (Lohit Kannada and Kedage) when it comes to numerals.
For example, 0 is getting recognized as 8 (and ನ as ವ).
I need help including the unicharambigs file (the documentation on GitHub only describes the format). My output.txt file has not changed, despite including the unicharambigs file.
Suppose [lang] corresponds to kan, will the following command include the unicharambigs file in the kan.traineddata file?
combine_tessdata kan.
In case it doesn't, I'd appreciate any help on how to proceed.
It is difficult to answer without knowing which versions of Tesseract and kan.traineddata you are using.
You can unpack kan.traineddata to see the version of kan.unicharambigs included in it, then recombine it after editing the file.
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc for the command syntax.
Use the -u option to unpack:
-u .traineddata PATHPREFIX Unpacks the .traineddata using the provided prefix.
Use the -o option to overwrite unicharambigs:
-o .traineddata FILE…: Overwrites the specified components of the .traineddata file with those provided on the command line.
Please note that https://github.com/tesseract-ocr/langdata/blob/master/kan/kan.unicharambigs seems to be a copy of eng.unicharambigs
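The unpack, edit and overwrite cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, `combine_tessdata` is assumed to be on the PATH, and the file names assume the `kan` language from the question, with the working directory containing kan.traineddata.

```python
import subprocess


def unpack_args(traineddata, prefix):
    """Arguments for: combine_tessdata -u <traineddata> <path prefix>."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_args(traineddata, *component_files):
    """Arguments for: combine_tessdata -o <traineddata> <component files>."""
    return ["combine_tessdata", "-o", traineddata, *component_files]


def run(args):
    """Run one command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(args, check=True)


if __name__ == "__main__":
    # 1. Unpack every component as kan.<component>, e.g. kan.unicharambigs.
    run(unpack_args("kan.traineddata", "kan."))
    # 2. Edit kan.unicharambigs by hand at this point.
    # 3. Write the edited component back into kan.traineddata.
    run(overwrite_args("kan.traineddata", "kan.unicharambigs"))
```

Working on a backup copy of kan.traineddata is a sensible precaution, since the -o step modifies the file in place.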

Error when loading shape files into Bluemix dashDB

I am running into the following error when loading my shape files through the dashDB console:
My shape files are the following:
Has anyone working with dashDB run into a similar problem?
UPDATE:
I downloaded a separate dataset with the following files, and I am still running into the same error:
Please find the following sample files https://www.dropbox.com/s/bkrac971g9uc02x/deng.zip?dl=0
I brought the Shapefile into QGIS easily, so I knew the format was OK. I unzipped the Shapefile, changed the file names to lower-case and re-zipped it up. Then I was able to get further in the dashDB upload UI. I got to a message saying the SRS was unknown. I then used QGIS to convert the SRS (spatial reference system) into a known one -- EPSG:4269, NAD83, and I was then able to upload it into dashDB. Here's the version of your file that works:
https://dl.dropboxusercontent.com/u/8196680/dc.zip