Limit space size in Tesseract - ocr

I write in Python, using pytesseract or direct Popen calls if needed.
I am trying to OCR a document with an irregular structure, a letter looking like this:
The problem is that in the .hocr file generated by Tesseract I get lines in which the left and right columns are glued together, like "Recipient: Sender:"
What I'd like to achieve is output from the left and right column separated. Using third party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these. There is probably a right combination of values, but there are 57 options with "space" in the name... any insight on these would be much appreciated.
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
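One way to approach the .hocr editing idea is to split each line wherever the horizontal gap between neighbouring word boxes is much larger than the text height. The following is only a minimal sketch under stated assumptions: the word boxes have already been read from the `bbox x0 y0 x1 y1` values of the `ocrx_word` elements, and `split_line_on_gaps` and its `gap_factor` threshold are hypothetical names and values that would need tuning on real letters.

```python
from statistics import median


def split_line_on_gaps(words, gap_factor=2.0):
    """Split one OCR line into segments at unusually wide horizontal gaps.

    words: list of (x0, y0, x1, y1, text) tuples, one per word box.
    gap_factor: hypothetical tuning knob; a gap wider than
    gap_factor * median word height starts a new segment.
    """
    if not words:
        return []
    # Process the boxes left to right.
    words = sorted(words, key=lambda w: w[0])
    line_height = median(w[3] - w[1] for w in words)
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        gap = cur[0] - prev[2]  # distance from previous right edge to current left edge
        if gap > gap_factor * line_height:
            segments.append([])  # wide gap: assume a column boundary
        segments[-1].append(cur)
    return [" ".join(w[4] for w in seg) for seg in segments]


if __name__ == "__main__":
    glued = [(10, 10, 110, 30, "Recipient:"), (400, 10, 480, 30, "Sender:")]
    print(split_line_on_gaps(glued))
```

Because the threshold is relative to the text height, ordinary lines with normal word spacing should come back as a single segment, so the same rule can be applied to every letter without first detecting whether it has columns.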

Related

How to create language-dictionary database from text file?

I have a large text file, which is an Italian-English dictionary. A typical line is:
Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles, and also to make fish to stirre. Also a kind of meate or custard in some parts of Italie made with milke and egges.
(Yes, it's a 17th-century dictionary.)
I'm looking for the best/easiest way to turn this into a searchable database.
The search would need to ignore diacritics, and everything up to the first comma would count as the 'entry'. There are some cross-references, e.g.: Mefíte, as Mephíte.
My first thought is simply to turn it into HTML, with anchor tags for the word/phrase up to the first comma. That should be easy enough with a bit of Grep. I could also add links to the crossrefs in the same way (using BBEdit to confirm each change). It would then be easy to query just using a browser's search field.
However, ideally, I'd like something that returned only (all) the matching results. XML/HTML Tagging is the easy bit: the problem is the front-end to access/query it.
I'm on MacOS. (I'm also investigating Apple's Dictionary format...)
Any ideas on how to proceed would be welcome. Thanks.
This is a huge question, with many choices to make in many areas.
A small start:
For a searchable database, look at https://solr.apache.org/
Use PHP to handle the front-end interaction with Solr and to serve your HTML search form and results.
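Before committing to a full search server, the two requirements in the question (ignore diacritics, treat everything up to the first comma as the entry) can be prototyped in a few lines. This is only a sketch, not a replacement for Solr; the function names `fold`, `build_index` and `lookup` are made up for illustration, and it assumes one dictionary entry per line.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip combining marks (accents) and case-fold, e.g. 'Mefíte' -> 'mefite'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def build_index(lines):
    """Map each folded headword to a list of (headword, definition) pairs."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Everything up to the first comma is the entry.
        headword, _, definition = line.partition(",")
        index[fold(headword.strip())].append((headword.strip(), definition.strip()))
    return index


def lookup(index, query):
    """Return all entries whose headword matches the query, ignoring diacritics and case."""
    return index.get(fold(query.strip()), [])


if __name__ == "__main__":
    sample = [
        "Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles.",
        "Mefíte, as Mephíte.",
    ]
    print(lookup(build_index(sample), "mefite"))
```

The same folding step could be applied to a field at index time in whatever database ends up being used, so that queries typed without accents still match.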

tesseract unable to detect characters in simple two-word image

I'm having trouble getting tesseract to recognize any characters in the following image:
When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought that the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) as well as augmenting the existing dictionary with these additional words (LAO and CAUD). Neither of those approaches worked. I have tried tesseract versions 3, 4, and have built version 5 from source on a Mac computer. All have given the same result.
Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:
The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?
Thanks in advance.
It's a common problem. You will probably need to preprocess the image with rescaling, filters, etc.
Here are some references on how to do that:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://docparser.com/blog/improve-ocr-accuracy/
The solution was to use the right page segmentation method (PSM). In my case, PSM 6, which is for a single block of text, did the trick.
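For reference, selecting the page segmentation mode from Python might look like the sketch below. It assumes the `tesseract` binary is on the PATH and is version 4 or later, where the flag is spelled `--psm` (version 3 used `-psm`); the helper names and the file name `words.png` are placeholders.

```python
import subprocess


def tesseract_command(image_path, psm=6, lang="eng"):
    """Build the argument list for a tesseract run that prints text to stdout."""
    # "stdout" as the output base makes tesseract print the result instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr(image_path, psm=6):
    """Run tesseract and return the recognized text (requires tesseract to be installed)."""
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder file name; PSM 6 treats the image as a single uniform block of text.
    print(ocr("words.png", psm=6))
```

With pytesseract the equivalent is to pass the same flag through the `config` argument, e.g. `pytesseract.image_to_string(image, config="--psm 6")`.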

How to embed table within text and produce pdf output using Perl

I have a requirement to produce letters to send to customers which will contain a report within the letter text. The idea is that the user can create letter paragraphs which can be saved in a database for later use, can be sequenced and can appear either before or after a report. The report will be in table form.
I've looked at using PDF::Table and PDF::API2 (both of which are good at what they do); however, both place 'items' on the page in fixed positions and do not create a free-flowing document.
Unless I've missed something, there is no way to add a table immediately after a paragraph of text or vice versa as page positions are required.
I have thought about using HTML::Template to create the basic letter, then HTML::HTMLDoc to convert to PDF, but would need the ability to insert a page break on change of customer.
What is my best option to achieve the above result please?
Many Thanks
There are only two ways that I've had any success with.
The first is the Apache XML-FOP project. This is a huge, sprawling Java library and specification for turning XML documents into nicely formatted PDFs. I was never good enough with XML stylesheets and transformations to get to grips with this.
The second is to generate openoffice/libreoffice documents and then use a copy of libreoffice in headless mode to convert them to PDFs. This is what I generally end up doing. You may want a minimal X11 installation for fonts etc with Xvfb as a fake display.
For editing the documents I've had success with the OpenOffice-OODoc distribution. HTH.
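The headless conversion step is a single command-line call, so it can be driven from any language. The sketch below shows it in Python purely for illustration; it assumes the LibreOffice binary is available as `soffice` (on some systems it is `libreoffice`), and the helper names and `letter.odt` are placeholders.

```python
import subprocess
from pathlib import Path


def convert_command(document, outdir, binary="soffice"):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        binary,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(outdir),
        str(document),
    ]


def convert_to_pdf(document, outdir="."):
    """Run the conversion and return the expected path of the resulting PDF."""
    subprocess.run(convert_command(document, outdir), check=True)
    return Path(outdir) / (Path(document).stem + ".pdf")


if __name__ == "__main__":
    # Placeholder input file; requires LibreOffice to be installed.
    print(convert_to_pdf("letter.odt", "out"))
```

From Perl the same argument list can be passed to `system` in list form.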

Strip HTML in excel and populate different cells

Can someone help me strip down HTML code and populate different columns in excel?
For eg.
If my HTML code is:
<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!<p></p>10-16-2013 13:19:46<br />this has time stamps too!<p></p>10-21-2013 11:55<br />This is a test<br />
How can I output it as separate columns in Excel like this?
Column A            | Column B
10-16-2013 22:35    | I love pizza! Ordering was a breeze!
10-16-2013 13:19:46 | this has time stamps too!
10-21-2013 11:55    | This is a test
Will be extremely grateful if someone can help me out!
There are three different options you might try for parsing the html:
Combine InStr, Mid and/or Replace as mehow suggests.
Use VBScript's RegExp library. You would need to include it into your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft VBScript Regular Expressions 5.5". Regular Expressions are a very powerful text parsing tool, but it does take some time to get used to the syntax. I found that this pattern allowed me to get the dates/comments as submatches: <p></p>([^<]*)<br />([^<]*). I assume you are pulling that example out of a full webpage, so you would need to tweak that pattern to match exactly the parts of it that you are looking for. This site has a good tutorial on using the VBScript RegExp library.
Use a higher level HTML parser. I suggest the MSHTML library, which you can add to your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft HTML object library". This parser is aware of constructs like HTML paragraphs, breaks and tables.
In my opinion, if you're willing to take the time to learn it, Regular Expressions would be your best bet. The InStr/Replace method may not be able to account for the variability in the webpage content and the HTML method would probably be overkill, especially given the lack of formatting in the example HTML.
Once you've parsed it, you can tackle the second part of the question using Excel Worksheet and Range objects. As mehow noted, if you can put together some code it will be easier to help you.
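To show what the regular-expression approach extracts, here is the same idea sketched in Python rather than VBA. One deliberate difference from the pattern quoted above: the sample HTML contains both `<br/>` and `<br />`, so this variant uses `<br\s*/?>` to accept either spelling. The names `SAMPLE`, `ENTRY` and `parse_entries` are illustrative only.

```python
import re

# The example HTML from the question.
SAMPLE = (
    "<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!"
    "<p></p>10-16-2013 13:19:46<br />this has time stamps too!"
    "<p></p>10-21-2013 11:55<br />This is a test<br />"
)

# Group 1: text between <p></p> and the line break (the timestamp).
# Group 2: text after the line break up to the next tag (the comment).
ENTRY = re.compile(r"<p></p>([^<]*)<br\s*/?>([^<]*)")


def parse_entries(html):
    """Return a list of (timestamp, comment) pairs, one per entry."""
    return [(stamp.strip(), comment.strip()) for stamp, comment in ENTRY.findall(html)]


if __name__ == "__main__":
    for stamp, comment in parse_entries(SAMPLE):
        # Tab-separated output pastes into two Excel columns.
        print(f"{stamp}\t{comment}")
```

In VBA the two submatches of each match would be written to columns A and B of the same row.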

Converting formula/equation in docx to html using docx4j

I'm trying to convert docx files containing equations to HTML on Android. I came across docx4j, which is great, and tested the following sample (HtmlExporterNonXSLT):
https://github.com/plutext/docx4j/blob/android/src/main/java/org/docx4j/convert/out/html/HtmlExporterNonXSLT.java
However, I noticed that it doesn't handle equations well: if a symbol or number has a power and/or indices, they are always positioned in the middle, e.g.
k_{n+1}^2 (latex format)
is displayed as:
kn+12 (with 'n+12' having correct smaller font but they are both vertically aligned)
Is there any way to adjust the CSS to handle powers and indices? (Full formula conversion would be better, but I guess it is not so easy.) I'm new to docx4j, but it looks like the handlePPr() method in the HtmlExporterNonXSLT example will need to be modified. Before diving into it, I thought I'd ask whether it is even possible to accomplish this (is there any way to obtain the offset property of a run?).
Disclosure: I'm docx4j project lead
You're welcome to modify HtmlExporterNonXSLT in order to fix your particular example, but as you say, full formula conversion would be better.
Here are links to three prior posts on that subject (newest first):
math-equations-and-docx-to-html-conversion-not-working
need-to-handle-latex-equation
math-expression-issue