Unclear note in the October 2018 release of the Custom Translator User Guide - microsoft-translator

Can anyone clarify what the following note exactly means?
NOTE: There must not be any new line characters; “\n” or “\r” at the end of sentences. If there are then the alignment of sentences will be corrupted and the training will not be effective.
The note appears on page 5, section 2.1.2.1 Parallel documents.
Does this apply to any document formats? It does not make much sense (at least to me), for instance for .align documents...

Thank you for bringing this to our attention. We will update the documentation as this statement is inaccurate. It should read
"NOTE: There must not be any new line characters; “\n” or “\r” within a sentence. If there are then the alignment of sentences will be corrupted and the training will not be effective."
The issue we want to address here is that parallel documents should not break a single sentence across multiple lines as it makes sentence alignment much less effective.
Regarding your question about .align files: we do not sentence-align these files, so you could break sentences across multiple lines as long as you did it consistently. That is, if a sentence is broken into three lines on the source side, it should be broken into three lines on the target side. Since the sentence aligner is not used, even one unmatched split would misalign all the following sentences. There is no advantage to splitting sentences, so I strongly urge you not to do that.
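To illustrate why an unmatched split matters for pre-aligned files, here is a minimal Python sketch that pairs lines purely by position, as the answer describes for .align files. The function name pair_by_position is hypothetical and is not part of Custom Translator; it only shows the kind of sanity check one could run before uploading.

```python
def pair_by_position(source_text: str, target_text: str) -> list[tuple[str, str]]:
    """Pair line i of the source with line i of the target.

    Assumption: both sides are meant to be aligned line by line, so a
    difference in line count signals an unmatched split somewhere.
    """
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source "
            f"vs {len(target_lines)} target"
        )
    return list(zip(source_lines, target_lines))


if __name__ == "__main__":
    # The source breaks one sentence across two lines; the target does not.
    source = "Hello.\nHow are\nyou?\nGoodbye."
    target = "Hallo.\nWie geht es dir?\nTschüss."
    try:
        pair_by_position(source, target)
    except ValueError as error:
        print(error)
```

A line count check only catches splits that change the total. Two unmatched splits that cancel each other out would still pass, so it is a first filter rather than a guarantee.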

Related

Training Tesseract: how to handle multiple whitespace characters in training images

I am finetuning tesseract. I am using tesstrain. When creating the ground truth files, how should I handle multiple whitespace characters?
a.tiff
There are two whitespaces between "(f)" and "any". I have no idea how many are between "30" and "(f)". Should I:
Attempt to provide the correct number of whitespaces? (in a.gt.txt)
Just use 1 whitespace between words always?
It doesn't make a difference.
I've seen on other tangentially related questions that when doing the inference, there is an option to preserve interword space. Maybe that makes a difference:

How to translate text/HTML that has stylistic line breaks?

The general question here is how do you mark text up for translation on an HTML page when the position of the line breaks has to look eye pleasing (as opposed to the line break always happening after a specific word)?
I have a web page I want to translate into 5 different languages. In some places, I have text like "Enjoyed by 10,000 happy users" under a small icon that needs to be displayed in an eye pleasing way. This looks good as the noun phrase is on its own line and each line has about the same number of letters:
<icon>
Enjoyed by
10,000 happy users
Do I send this text to be translated as this?
Enjoyed by <br> 10,000 happy users
Problems:
By adding markup to the text it makes it unlikely I can reuse the string elsewhere but I can't see any other options.
How do I cope with where I place the <br> in the translated text, given the translated text will have a different number of letters (e.g. "Genossen von 10.000 glückliche Benutzer" in German)? Just review how each one renders on the page manually and adjust the <br> myself after the translations come back?
I can't see any clean way to do this. I could remove the markup and try to write some server code that will add the break in a nice place, but I can't see how it's possible to automate (e.g. putting noun phrases on their own line if possible when the previous line has enough letters). CSS has even fewer options to do this.
Your question is somewhat subjective, but I think your choices are to either trust your translators to format the HTML, or trust them to come up with copy that fits your design. Trying to engineer your way to a "clean" solution with server code sounds like it will achieve the exact opposite.
Make sure your design is good enough to cope with a reasonable range of word lengths. If your layout lives and dies by the text being exactly X characters long, then it isn't well designed. You can always ask your translators to try and write a translation in less than a maximum number of characters. This is why we still have human translators - they are also copywriters :)

Automatic Way of Adding in <BR> Tags at the end of a Sentence

I work for a company that sells items online. We're constantly listing items on our website via spreadsheet upload and are using Magento 1.4.
Our products have long descriptions, in which we're currently manually adding line breaks at the end of each sentence (we're doing this in Excel; each paragraph is around 15 lines).
One semi-automated method we tried was using a macro-recording program, GhostMouse. This half worked but proved difficult, as it takes a while to perfect and still takes a long time.
I've really no idea if this is at all possible - but if anybody has any suggestions or even opinions on whether they think this is possible or not, I'd be massively grateful.
Thanks For Reading, Dylan.
When you write your descriptions in excel, if you put TWO spaces between each sentence, you could write a formula like this: (assumes description is in A1)
=SUBSTITUTE(A1, ".  ", ".<BR>")
The reason I specify TWO spaces is because you might choose to use a period within your sentence that you do not want to break with a line break.
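If the descriptions are processed outside Excel, the same idea can be sketched in Python. This is only an illustration of the two-space convention from the answer; the function name add_sentence_breaks is hypothetical and nothing here is specific to Magento.

```python
def add_sentence_breaks(description: str) -> str:
    """Replace a period followed by TWO spaces with a period and <BR>.

    A period followed by a single space (as in an abbreviation) is left
    alone, which is the reason for the two-space convention.
    """
    return description.replace(".  ", ".<BR>")


if __name__ == "__main__":
    text = "Fits most models.  Weighs approx. 2 kg.  Ships fast."
    print(add_sentence_breaks(text))
    # prints: Fits most models.<BR>Weighs approx. 2 kg.<BR>Ships fast.
```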

Hyphenating arbitrary text automatically

What kinds of challenges are there facing automatic hyphenation? It seems that you could just draw word by word, breaking when the length of the line exceeds the length of the viewport (or whatever we're wrapping our text in), placing hyphens after as many characters as can fit (provided at least two characters fit and the word is at least four characters), skipping words that already contain a hyphen (there's no requirement that words have to be hyphenated).
But I note how Firefox and IE need a dictionary to be able to hyphenate with CSS's hyphens. This seems to imply that there are further issues regarding where we can place hyphens.
What kinds of issues are these? Do any exist in the English language or do they only exist in other languages?
You have these issues in all languages. You can only place a hyphen where meaningful tokens result from the split, as has already been pointed out. You don't want to, for example, split a word like "wr-ong".
The unit you split on may or may not be a syllable, although in most languages (including English) it is. But the main point is that you cannot pin it down as easily just with some simple rules. You would need to consider a lot of phonology to get a highly accurate result, and these rules vary from language to language.
With this background, I can see why one would take a dictionary instead, and frankly, being a computational linguist myself, this is also what I would probably opt for.
If you DO want to go for an automatic solution, I would recommend doing some research in English phonology of syllables, or the so-called syllabification. You might want to start with this article on Wikipedia:
Wikipedia - Syllabification
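To make the problem concrete, here is a rough Python sketch of the naive approach described in the question: fill each line, and when a word does not fit, cut it wherever the line runs out. The function name naive_wrap and the exact rules (at least two characters before the hyphen, only words of four or more characters, skip words that already contain a hyphen) are one reading of the question, not an established algorithm.

```python
def naive_wrap(text: str, width: int) -> list[str]:
    """Wrap text to `width` columns, hyphenating wherever the line runs out."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        while True:
            prefix = current + " " if current else ""
            if len(prefix) + len(word) <= width:
                current = prefix + word
                break
            # Columns left for word characters, reserving one for the hyphen.
            room = width - len(prefix) - 1
            if len(word) >= 4 and "-" not in word and room >= 2:
                lines.append(prefix + word[:room] + "-")
                current = ""
                word = word[room:]  # keep placing the remainder
            elif current:
                lines.append(current)  # start a fresh line and retry
                current = ""
            else:
                lines.append(word)  # cannot split: let the word overflow
                break
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    print(naive_wrap("this is wrong", 11))
    # prints: ['this is wr-', 'ong']
```

The output reproduces exactly the "wr-ong" split the answer warns about, which is why real implementations rely on a dictionary or on language-specific syllable rules.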

What are the disadvantages of having tons of &nbsp; entities?

I've been writing a source-to-display converter for a small project. Basically, it takes an input and transforms the input into an output that is displayable by the browser (think Wikipedia-like).
The idea is there, but it isn't like the MediaWiki style, nor is it like the Markdown style. It has a few innovations of its own. For example, when the user types in a chain of spaces, I would presume he wants the spaces preserved. Since HTML ignores spaces by default, I was thinking of converting these chains of spaces into respective &nbsp;s (for example 3 spaces in a row converted to 1 )
So what happens is that I can foresee a possibility of a ton of &nbsp; tags per post (and a single page may have multiple posts).
I've been hearing a lot of anti-&nbsp; sentiment on the web, but most of it boils down to readability headaches (in this case, the input is supplied by the user; if he decides to make his post unreadable, he can do so with any of the other formatting actions supplied) or maintenance headaches (which do not apply here, since it's a converted output).
I'm wondering what the disadvantages are of having tons of &nbsp; tags on a webpage?
You are rendering every space as &nbsp;?
Besides wasting so much bandwidth, this will not allow dynamic line breaking as "nbsp" means "*n*on *b*reaking *sp*ace". This will most probably cause much trouble.
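One possible compromise, sketched here in Python, is to convert only runs of two or more spaces and to keep one ordinary space in each run so the browser can still break the line there. This is an illustration and not something proposed in the thread; the function name preserve_space_runs is hypothetical, and the sketch assumes the text has already been HTML-escaped.

```python
import re


def preserve_space_runs(text: str) -> str:
    """Replace each run of N >= 2 spaces with N-1 &nbsp; entities plus one space.

    Single spaces are untouched, and every run keeps one ordinary space
    where the browser may still wrap the line.
    """
    def replace_run(match: re.Match) -> str:
        run_length = len(match.group(0))
        return "&nbsp;" * (run_length - 1) + " "

    return re.sub(r" {2,}", replace_run, text)


if __name__ == "__main__":
    print(preserve_space_runs("a   b c"))
    # prints: a&nbsp;&nbsp; b c
```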
If it's just being dumped to a client, it's just a matter of size, and if it's gzipped, it barely matters in terms of network traffic.
It'll slow down rendering, I'm sure, and take up DOM space, but whether or not that matters depends on stuff I don't know about your use case(s). You might be able to achieve the same result in other ways, too; not sure.
&nbsp;s aren't tags, but are character entities like &copy;, &lt;, &gt;, etc.
I'd say that the disadvantages would be readability. When I see a word, I expect the spacing to be constant (unless it is in a block of justified text).
Can you show me a case where you'd need &nbsp;s?
Have you considered trying to figure out what the user, by inserting those spaces, is really trying to achieve? Rather than the how (they want to insert the spaces), the what (if the spaces are at the beginning of a line, they want to indent the text in question).
An example of this is many programming sites convert 4 spaces at the start of a line to a pre+code block.
For your purposes, maybe it should be a <block> block.
The end goal being that of converting the spaces not to what the user (with their limited resources) intended to show up there but, rather, what they meant to convey with it.