Converting formula/equation in docx to html using docx4j - html

I'm trying to convert docx files containing equations to HTML on Android. I came across docx4j, which is great, and tested the following sample (HtmlExporterNonXSLT):
https://github.com/plutext/docx4j/blob/android/src/main/java/org/docx4j/convert/out/html/HtmlExporterNonXSLT.java
However, I noticed that it doesn't handle equations well: if a symbol or number has a power and/or an index, these are always positioned in the middle of the line, e.g.
k_{n+1}^2 (latex format)
is displayed as:
kn+12 (with 'n+12' in the correct smaller font, but with neither the index nor the power shifted vertically)
Is there any way to adjust the CSS to handle powers and indices? (Full formula conversion would be better, but I guess it is not so easy.) I'm new to docx4j, but it looks like the handlePPr() method in the HtmlExporterNonXSLT example will need to be modified. Before diving into it, I thought I'd ask whether this is even possible (is there any way to obtain the offset property of a run?).

Disclosure: I'm docx4j project lead
You're welcome to modify HtmlExporterNonXSLT in order to fix your particular example, but as you say, full formula conversion would be better.
Here are links to three prior posts on that subject (newest first):
math-equations-and-docx-to-html-conversion-not-working
need-to-handle-latex-equation
math-expression-issue
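To give an idea of what handling powers and indices involves, here is a minimal sketch, written in Python rather than against the docx4j API, so it is an illustration and not a patch for HtmlExporterNonXSLT. It assumes the equation is stored as Office Math (OMML) with m:sSub, m:sSup and m:sSubSup elements, and it maps them to HTML sub and sup tags. The function name omml_to_html and the sample fragment are made up for this example; everything other than scripts is simply flattened to text.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace of Office Math (OMML) elements inside a docx document.
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def q(tag):
    """Return the namespace-qualified ElementTree name for an OMML tag."""
    return "{%s}%s" % (M, tag)


def omml_to_html(node):
    """Hypothetical helper: render a small subset of OMML as HTML."""
    tag = node.tag
    if tag == q("t"):
        # Text run content.
        return escape(node.text or "")
    if tag in (q("sSub"), q("sSup"), q("sSubSup")):
        base = node.find(q("e"))
        sub = node.find(q("sub"))
        sup = node.find(q("sup"))
        out = omml_to_html(base) if base is not None else ""
        if sub is not None:
            out += "<sub>" + omml_to_html(sub) + "</sub>"
        if sup is not None:
            out += "<sup>" + omml_to_html(sup) + "</sup>"
        return out
    if tag.endswith("Pr"):
        # Property elements (rPr, sSubSupPr, ctrlPr, ...) carry no text.
        return ""
    # Anything else: concatenate whatever the children produce.
    return "".join(omml_to_html(child) for child in node)


# Hand-written fragment for k_{n+1}^2 (an assumption, not taken from a real file).
SAMPLE = (
    '<m:oMath xmlns:m="%s">'
    "<m:sSubSup>"
    "<m:e><m:r><m:t>k</m:t></m:r></m:e>"
    "<m:sub><m:r><m:t>n+1</m:t></m:r></m:sub>"
    "<m:sup><m:r><m:t>2</m:t></m:r></m:sup>"
    "</m:sSubSup>"
    "</m:oMath>"
) % M

print(omml_to_html(ET.fromstring(SAMPLE)))
```

In a browser the sub and sup elements appear one after the other rather than stacked, which is usually acceptable for simple expressions; fractions, radicals and matrices would need a real OMML to MathML conversion.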

Related

Store arbitrary characters in Semantic MediaWiki

I'm trying to store text containing html tags in properties, which doesn't work. I created a form for a property with the data type 'text' and a template. Saving the form writes the text into the template, but it can't be displayed, I guess because it contains illegal characters.
What I'm trying to do:
I need a form to enter data containing html tags and special characters.
I'd like to be able to use a query to find all those pages and show that text using a template I provide to the ask query.
I also tried to use the free text option, but then I can't retrieve it using the ask query.
What would be the best, or at least a working solution to this?
Thanks a lot
Storing text with html tags is a bit tricky in SemanticMediaWiki.
The reason is the invention of the StripMarkers UNIQ/QINU by the MediaWiki developers.
When a page with html tags in it is parsed, the parsing of those tags is sort of "postponed". This technical detail unfortunately makes it hard for extension developers like the SMW developers to solve the issue of handling such content. It also makes it hard for lay people to follow the discussion on how to solve the problem.
Here are two examples of SMW Issues that are marked as "closed". This state of affairs means that by following the configuration hints in the issue your problem should be solved. If not please ask a question on the SMW issue list or even initiate the reopening of the issues.
https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/794
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3707
On my wiki we ran into this and resolved it by replacing special characters (we had issues with [ ] =, but the same problem happens with < > tags too) with alternate Unicode characters, using the regex extension and a template, before setting the property with {{#set:}}. If you want to display the formatted text on the wiki directly, then call that parameter separately without replacing the characters.
When you want to display the property, you can then run the reverse replacement with regex before displaying your now intact code (using the template result format to allow you to perform the operation on the output of the query).
To switch to special characters you can create this template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/=/|꞊}}|/\[/|[}}|/\]/|]}}|/>/|≽}}|/</|≼}}
And to switch back you can use this as a template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/꞊/|=}}|/[/|[}}|/]/|]}}|/≽/|>}}|/≼/|<}}
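The round trip performed by these two templates can be sketched outside the wiki as follows. This is only an illustration of the idea in Python, not wikitext. The replacements for =, > and < are the ones used in the templates above; the fullwidth brackets used for [ and ] are an assumption made for this example, and the function names are hypothetical.

```python
# Map each problematic character to a look-alike that the parser leaves alone.
# The bracket replacements (fullwidth forms) are an assumption for illustration.
ENCODE_MAP = {
    "=": "꞊",
    "[": "［",
    "]": "］",
    ">": "≽",
    "<": "≼",
}
DECODE_MAP = {substitute: original for original, substitute in ENCODE_MAP.items()}

ENCODE_TABLE = str.maketrans(ENCODE_MAP)
DECODE_TABLE = str.maketrans(DECODE_MAP)


def encode_for_property(text):
    """Replace the special characters before the value is stored."""
    return text.translate(ENCODE_TABLE)


def decode_for_display(text):
    """Restore the original characters when the stored value is displayed."""
    return text.translate(DECODE_TABLE)


snippet = '<b class="x">[1]</b>'
stored = encode_for_property(snippet)
print(stored)
print(decode_for_display(stored) == snippet)
```

The round trip is only lossless if the original text never contains one of the substitute characters itself, so it is worth picking substitutes that do not occur in your content.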

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed.
I try to OCR a document with irregular structure, a letter looking like this:
The problem is that in the .hocr file generated by Tesseract I get lines consisting of the left and right columns glued together, like "Recipient: Sender:"
What I'd like to achieve is output from the left and right column separated. Using third party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these, and there is probably a right combination of values, but there are 57 options with "space" in the name... any insight on these would be much appreciated.
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
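One possible grouping rule for the last idea, sketched under stated assumptions: within a single hOCR line, start a new group whenever the horizontal gap between two neighbouring word boxes is much larger than the typical word height on that line. The threshold gap_factor=3.0 is a guess that would need tuning on real letters, and the function names are hypothetical. The bbox values are read from the title attribute of each word element, which hOCR writes in the form "bbox x0 y0 x1 y1".

```python
import re

# hOCR stores word coordinates in the title attribute, e.g.
# title="bbox 100 50 260 80; x_wconf 93"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError("no bbox in title: %r" % title)
    return tuple(int(value) for value in match.groups())


def split_line_on_gaps(words, gap_factor=3.0):
    """Split one text line into column fragments.

    words is a list of (text, (x0, y0, x1, y1)) sorted left to right.
    A new fragment starts when the gap to the previous word exceeds
    gap_factor times the median word height (gap_factor is an assumption).
    """
    if not words:
        return []
    heights = sorted(box[3] - box[1] for _, box in words)
    typical_height = heights[len(heights) // 2]
    groups = [[words[0]]]
    for previous, current in zip(words, words[1:]):
        gap = current[1][0] - previous[1][2]
        if gap > gap_factor * typical_height:
            groups.append([current])
        else:
            groups[-1].append(current)
    return [" ".join(text for text, _ in group) for group in groups]


header = [
    ("Recipient:", parse_bbox("bbox 100 50 260 80; x_wconf 93")),
    ("Sender:", parse_bbox("bbox 900 50 1010 80; x_wconf 95")),
]
print(split_line_on_gaps(header))
```

Because the threshold is relative to the text height, ordinary lines with normal word spacing come back as a single fragment, so the same rule can be applied to every line without first detecting which letters have the two-column layout.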

How to embed table within text and produce pdf output using Perl

I have a requirement to produce letters to send to customers which will contain a report within the letter text. The idea is that the user can create letter paragraphs which can be saved in a database for later use, can be sequenced and can appear either before or after a report. The report will be in table form.
I've looked at using PDF::Table and PDF::API2 (both of which are good at what they do); however, both place 'items' on the page in fixed positions and do not create a free-flowing document.
Unless I've missed something, there is no way to add a table immediately after a paragraph of text or vice versa as page positions are required.
I have thought about using HTML::Template to create the basic letter, then HTML::HTMLDoc to convert to PDF, but would need the ability to insert a page break on change of customer.
What is my best option to achieve the above result please?
Many Thanks
There are only two ways that I've had any success with.
The first is the Apache XML-FOP project. This is a huge, sprawling Java library and specification for turning XML documents into nicely formatted PDFs. I was never good enough with XML stylesheets and transformations to get to grips with this.
The second is to generate OpenOffice/LibreOffice documents and then use a copy of LibreOffice in headless mode to convert them to PDFs. This is what I generally end up doing. You may want a minimal X11 installation for fonts etc., with Xvfb as a fake display.
For editing the documents I've had success with the OpenOffice-OODoc distribution. HTH.
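The headless conversion step can be sketched like this. The example is in Python rather than Perl, and it assumes the LibreOffice binary is reachable on the PATH as soffice, which depends on the installation. The function names are made up for this example; the command-line options --headless, --convert-to and --outdir are standard LibreOffice options.

```python
import subprocess
from pathlib import Path


def build_convert_command(document, outdir, soffice="soffice"):
    """Build the LibreOffice command line that converts one document to PDF.

    The binary name "soffice" is an assumption; adjust it to your install.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(document),
    ]


def convert_to_pdf(document, outdir, soffice="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(document, outdir, soffice), check=True)
    return Path(outdir) / (Path(document).stem + ".pdf")


print(" ".join(build_convert_command("letter.odt", "out")))
```

Calling convert_to_pdf("letter.odt", "out") would then be expected to leave out/letter.pdf behind, with the text and table flowing as they do in the source document.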

Strip HTML in excel and populate different cells

Can someone help me strip down HTML code and populate different columns in excel?
For example, if my HTML code is:
<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!<p></p>10-16-2013 13:19:46<br />this has time stamps too!<p></p>10-21-2013 11:55<br />This is a test<br />
How can I output it as separate columns in Excel like this?
Column A               Column B
10-16-2013 22:35       I love pizza! Ordering was a breeze!
10-16-2013 13:19:46    this has time stamps too!
10-21-2013 11:55       This is a test
Will be extremely grateful if someone can help me out!
There are three different options you might try for parsing the html:
Combine InStr, Mid and/or Replace as mehow suggests.
Use VBScript's RegExp library. You would need to include it into your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft VBScript Regular Expressions 5.5". Regular Expressions are a very powerful text parsing tool, but it does take some time to get used to the syntax. I found that this pattern allowed me to get the dates/comments as submatches: <p></p>([^<]*)<br />([^<]*). I assume you are pulling that example out of a full webpage, so you would need to tweak that pattern to match exactly the parts of it that you are looking for. This site has a good tutorial on using the VBScript RegExp library.
Use a higher level HTML parser. I suggest the MSHTML library, which you can add to your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft HTML object library". This parser is aware of constructs like HTML paragraphs, breaks and tables.
In my opinion, if you're willing to take the time to learn it, Regular Expressions would be your best bet. The InStr/Replace method may not be able to account for the variability in the webpage content and the HTML method would probably be overkill, especially given the lack of formatting in the example HTML.
Once you've parsed it, you can tackle the second part of the question using Excel Worksheet and Range objects. Like mehow noted, if you can put together some code it will be easier to help you.
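To show what the regular expression approach extracts, here is a minimal sketch in Python rather than VBA. It uses the pattern from option 2 with one change: the line break is matched as <br\s*/?> because the sample HTML contains both <br/> and <br />. The function name extract_rows is hypothetical.

```python
import re

# Date text after an empty paragraph, then a line break, then the comment.
# The break is matched loosely so that <br>, <br/> and <br /> all work.
ENTRY_RE = re.compile(r"<p></p>([^<]*)<br\s*/?>([^<]*)")


def extract_rows(html):
    """Return a list of (date, comment) tuples, one per entry."""
    return [(date.strip(), comment.strip()) for date, comment in ENTRY_RE.findall(html)]


SAMPLE = (
    "<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!"
    "<p></p>10-16-2013 13:19:46<br />this has time stamps too!"
    "<p></p>10-21-2013 11:55<br />This is a test<br />"
)

for date, comment in extract_rows(SAMPLE):
    print(date, "|", comment)
```

Each tuple corresponds to one worksheet row, with the first element going to column A and the second to column B. The same two capture groups are available as submatches in the VBScript RegExp library.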

Recognizing superscript characters using OCR

I've started a simple project that takes an image containing text with superscripts and uses OCR (currently Tesseract) to recognize both the superscript characters and the normal ones.
For example, we have a chemical formula such as Cl², but when I use Tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First of all, make sure you are not overlooking functionality that your OCR system may already have. Look at your result text not in plain TXT format but in a viewer capable of rich text. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you the correct characters, your viewer could have converted them for display. If you are accessing the text result programmatically, that is less of an issue, because you should get the proper superscript/subscript character value when accessing it directly. Just note that viewers must support it for you to actually see it. If you have eliminated this possible post-processing conversion and made sure that no superscript/subscript is returned from the OCR, then it probably does not support it.
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR engines treat a subscript like any other normal character, if they can see it at all. The OCR you use needs the technical capability to actually produce superscripts/subscripts. Many do, but, not surprisingly, they tend to be commercial OCR systems.
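A quick way to check programmatically whether an OCR result contains dedicated superscript or subscript characters is to look at their Unicode decomposition. This is a minimal sketch using only Python's standard unicodedata module; the function names are made up for this example. It only detects characters that have their own superscript/subscript code points and cannot see formatting stored in a rich text file such as .DOC.

```python
import unicodedata


def script_position(ch):
    """Classify one character as 'super', 'sub' or 'normal'."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<super>"):
        return "super"
    if decomposition.startswith("<sub>"):
        return "sub"
    return "normal"


def find_scripts(text):
    """Return (index, character, position) for every super/subscript character."""
    return [
        (index, ch, script_position(ch))
        for index, ch in enumerate(text)
        if script_position(ch) != "normal"
    ]


print(find_scripts("Cl²"))
print(find_scripts("Cl2"))
# Compatibility normalization flattens the superscript, which is one way a
# viewer or a post-processing step can silently turn Cl² into Cl2.
print(unicodedata.normalize("NFKC", "Cl²"))
```

If find_scripts returns an empty list for raw OCR output that should contain superscripts, the engine (or a normalization step after it) is returning plain digits.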
I made a small test case before answering this question: I generated an image with a few superscript/subscript examples (of course EMC2 was the first example that came to mind :) ).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay particular attention to your image quality, more than you would with typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters because they have too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.