Alignment with NN translator - microsoft-translator

I use TranslateArray2 to get alignment information. Is there a way to get the alignments back when using category=generalnn? I only get them with the default category (and the older models).

Related

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed.
I am trying to OCR a document with an irregular structure, a letter looking like this:
The problem is that in the .hocr file generated by Tesseract, I get lines consisting of the left and right columns glued together, like "Recipient: Sender:".
What I'd like to achieve is output from the left and right column separated. Using third party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these. There is probably a right combination of values, but there are 57 options with "space" in the name... any insight on these would be much appreciated.
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
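
A rough sketch of one possible grouping rule for the .hocr word boxes, not a tested solution: split a line only where the gap between two neighbouring words is much wider than the average character width on that line. It assumes the words and their bbox x-coordinates have already been extracted from the .hocr file; the function name split_columns and the threshold min_gap_chars=4.0 are made up for illustration and would need tuning on real letters.

```python
def split_columns(words, min_gap_chars=4.0):
    """Split one OCR line into left/right parts at an unusually wide gap.

    words: list of (text, x0, x1) tuples, where x0 and x1 are the left and
    right pixel coordinates of the word box (assumed to come from the hOCR
    bbox values). Returns a list holding one or two word lists.
    """
    words = sorted(words, key=lambda w: w[1])
    total_chars = sum(len(text) for text, _, _ in words)
    if len(words) < 2 or total_chars == 0:
        return [words]

    # Estimate the average character width from the word boxes themselves.
    char_width = sum(x1 - x0 for _, x0, x1 in words) / total_chars

    # Horizontal gap between each word and the next one.
    gaps = [words[i + 1][1] - words[i][2] for i in range(len(words) - 1)]
    widest = max(range(len(gaps)), key=lambda i: gaps[i])

    # Split only if the widest gap is far larger than a normal word space.
    if gaps[widest] >= min_gap_chars * char_width:
        return [words[: widest + 1], words[widest + 1:]]
    return [words]


glued = [("Recipient:", 100, 300), ("Sender:", 900, 1040)]
normal = [("Dear", 100, 180), ("Mr", 200, 240), ("Smith", 260, 360)]
print(split_columns(glued))
print(split_columns(normal))
```

Because the threshold is relative to the character width of the line itself, ordinary text lines with normal word spacing should be left alone, which is the part that matters for letters without the two-column layout. The sketch handles only a single split per line.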

Do I need a database to handle my website content?

So I'm building a website that contains information about a bunch of different animal species. I will have a list of 500 items, that should be able to be filtered and sorted by different criteria. For example, I will have a 'country selection' option. If Brazil is selected, the Capuchin monkey among other animals (living in Brazil) should be added to the list.
I could see myself making a list with 50 species with no problem, as the HTML would be manageable. But would having 500 items in a list with filterability even be possible without using some sort of database?
I was thinking of just pairing animal items from the list with certain filter criteria. For example, Capuchin monkey with "Brazil", "Mammal", "Omnivore", etc.
And when e.g. "Mammal" is selected in the filter, all animals paired with that property (all mammals of the list) are added to the list, and those not paired with the property are removed from it.
As you probably can tell, I'm really uneducated on how to go about creating this filterable list. Down the road I might even look into adding a search function.
After plugging in all content, I would never need to change anything. I've read that databases should only be used if you have dynamic content.
I wouldn't list all 500 items on the same page, as that would make it very slow. I would have 10 items per page.
I don't need a solution per se. I just wish to be pushed in the right direction.
Should I look into MySQL? Can a filterable list of 500 items be possible with just HTML/CSS/Javascript? I am somewhat familiar with javascript, and have read that JSON might be able to provide the things I need.
Sorry if my question is vague or if I'm in the wrong anywhere (this is my first post). Please ask for any clarification and any advice or suggestion is greatly appreciated.
Thanks,
Manne
No you don't need a database. Have a look at this very robust jQuery plugin that will easily allow you to sort/filter/search 500 items in JavaScript alone:
https://datatables.net/
There are examples that are powered from JSON alone so I would suggest you simply store your data in a JSON file until you grow large enough that you need to change that (if you ever do).
Here is an example where the data is pulled from a .txt file:
https://datatables.net/examples/data_sources/ajax.html
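
To illustrate why no database is needed for 500 items, here is a minimal sketch of the "pair each animal with properties" idea from the question. It is written in Python only to keep it short; the same logic maps directly onto JavaScript arrays. The JSON layout, the "tags" field, the function name filter_animals and the Jaguar and Emu entries are made up for the example; only the Capuchin monkey tags come from the question.

```python
import json

# Hypothetical contents of a small JSON data file.
ANIMALS_JSON = """
[
  {"name": "Capuchin monkey", "tags": ["Brazil", "Mammal", "Omnivore"]},
  {"name": "Jaguar", "tags": ["Brazil", "Mammal", "Carnivore"]},
  {"name": "Emu", "tags": ["Australia", "Bird", "Omnivore"]}
]
"""


def filter_animals(animals, selected, page=1, per_page=10):
    """Return one page of animals that carry every selected tag, sorted by name."""
    wanted = set(selected)
    matches = sorted(
        (a for a in animals if wanted <= set(a["tags"])),
        key=lambda a: a["name"],
    )
    start = (page - 1) * per_page
    return matches[start:start + per_page]


animals = json.loads(ANIMALS_JSON)
print([a["name"] for a in filter_animals(animals, ["Brazil", "Mammal"])])
```

With an empty selection every animal matches, so the same function also covers the unfiltered, paginated list.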

How to embed table within text and produce pdf output using Perl

I have a requirement to produce letters to send to customers which will contain a report within the letter text. The idea is that the user can create letter paragraphs which can be saved in a database for later use, can be sequenced and can appear either before or after a report. The report will be in table form.
I've looked at using PDF::Table and PDF::API2 (both of which are good at what they do); however, both place 'items' on the page in fixed positions and do not create a free-flowing document.
Unless I've missed something, there is no way to add a table immediately after a paragraph of text or vice versa as page positions are required.
I have thought about using HTML::Template to create the basic letter, then HTML::HTMLDoc to convert to PDF, but would need the ability to insert a page break on change of customer.
What is my best option to achieve the above result please?
Many Thanks
There are only two ways that I've had any success with.
The first is the Apache XML-FOP project. This is a huge, sprawling Java library and specification for turning XML documents into nicely formatted PDFs. I was never good enough with XML stylesheets and transformations to get to grips with this.
The second is to generate openoffice/libreoffice documents and then use a copy of libreoffice in headless mode to convert them to PDFs. This is what I generally end up doing. You may want a minimal X11 installation for fonts etc with Xvfb as a fake display.
For editing the documents I've had success with the OpenOffice-OODoc distribution. HTH.
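
For the headless conversion step, a minimal sketch of the call, shown in Python for brevity (the same command line can be run from Perl with system). It assumes the LibreOffice binary is on the PATH under the name soffice; the function names and the file names in the usage line are made up for illustration.

```python
import subprocess


def libreoffice_pdf_command(doc_path, out_dir, binary="soffice"):
    """Build the command line that asks LibreOffice to convert a document to PDF."""
    return [binary, "--headless", "--convert-to", "pdf", "--outdir", out_dir, doc_path]


def convert_to_pdf(doc_path, out_dir, binary="soffice"):
    """Run the conversion; raises CalledProcessError if LibreOffice exits with an error."""
    subprocess.run(libreoffice_pdf_command(doc_path, out_dir, binary), check=True)


# Print the command instead of running it, so the sketch works without LibreOffice.
print(" ".join(libreoffice_pdf_command("letter.odt", "out")))
```

Generating one document per customer and converting each separately is one way to sidestep the page-break-on-change-of-customer requirement.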

Looking for examples of a UI element that allows to select N elements from a set plus define a default

Imagine my application has a list of supported languages. I'm looking for a UI element which allows to select a subset of the supported languages plus make one of them the default.
At first, I thought to use a list with two checkbox columns but the user will be surprised when she activates one in the "default" column because that will deselect the current default. I could use radio buttons but that also feels clunky (and a waste of screen space).
The next idea was to have two lists, one with the available languages and one with the active ones. But how would the user select the default in this case?
Our current solution works with two lists:
Active          Available
* English       Italian
  French   <=>  Greek
  German
You can drag and drop elements between the lists to make a language active or not. The first element of the left list is the "default". In the UI, we give it a special style, so users can easily recognize "this language is special." A tooltip (and the documentation) reveals "this is the default language."
To select a different default language, just drag one of the elements on the left side to the top of the list.

Converting formula/equation in docx to html using docx4j

I'm trying to convert docx containing equations to HTML on Android. I came across docx4j, which is great, and tested the following sample (HtmlExporterNonXSLT):
https://github.com/plutext/docx4j/blob/android/src/main/java/org/docx4j/convert/out/html/HtmlExporterNonXSLT.java
However, I noticed that it doesn't handle equations well: if some symbol or number has a power and/or indices, their position is always in the middle, e.g.
k_{n+1}^2 (latex format)
is displayed as:
kn+12 (with 'n+12' having correct smaller font but they are both vertically aligned)
Is there any way to adjust the CSS to handle powers and indices? (Full formula conversion would be better, but I guess it is not so easy.) I'm new to docx4j, but it looks like the handlePPr() method in the HtmlExporterNonXSLT example will need to be modified. Before diving into it, I thought I'd ask whether this is even possible (is there any way to obtain the offset property of a run?).
Disclosure: I'm docx4j project lead
You're welcome to modify HtmlExporterNonXSLT in order to fix your particular example, but as you say, full formula conversion would be better.
Here are links to three prior posts on that subject (newest first):
math-equations-and-docx-to-html-conversion-not-working
need-to-handle-latex-equation
math-expression-issue