I'm using the Microsoft Translator Text API to translate parts of a webpage. The platform we use inserts &nbsp; in the HTML to render empty lines, so a part of the webpage can be:
<p>
<span>This is a dummy text</span>
</p>
<p>
<span>&nbsp;</span>
</p>
When I send this to the Microsoft Translator Text API, it returns the following HTML:
<p>
<span>Il s’agit d’un texte factice</span>
</p>
<p>
<span></span>
</p>
I've set the content type to text/html and escape the HTML characters so that I can send the markup to the API (so &nbsp; is replaced with &amp;nbsp;). But the text returned by the API has completely lost the &nbsp;.
How can I prevent the API from removing the &nbsp; instances in the HTML? Or is this a bug in the API?
A notranslate span may help to prevent translation. You would have to try it to see whether it does indeed preserve the &nbsp; entity.
See the answer to "Microsoft Translator API - notranslate trimming leading space?" from Chris Wendt (Microsoft):
Translator trims leading and trailing space, and compresses any other white space to a single space. This is by design. Translator needs to move the words around freely to form the newly composed sentence, and wouldn't know what to do with the extra white space. A workaround would be to trim in your code before translation, and then restore the trimmed off pieces afterwards, depending on the context.
Line breaks and non-breaking spaces tend to be used for specific line layout based on the particular source text that would need to be laid out differently in another language in any case because of different word lengths and arrangements of the significant words.
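The trim-and-restore workaround quoted above could be sketched roughly as follows. This is only an illustration: translate_preserving_edges is a hypothetical helper name, the translate argument stands in for whatever function actually calls the API, and the sketch assumes you are handling the decoded text of one element at a time (so &nbsp; has already become U+00A0).

```python
import re
from typing import Callable


def translate_preserving_edges(text: str, translate: Callable[[str], str]) -> str:
    """Translate the core of `text`, keeping leading/trailing whitespace intact."""
    # For str patterns, \s also matches the non-breaking space (U+00A0).
    match = re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL)
    leading, core, trailing = match.groups()
    if not core:
        # Whitespace only (e.g. a lone non-breaking space): skip the API call
        # and hand the input back unchanged.
        return text
    # `translate` is an assumption: any callable that sends `core` to the
    # translation service and returns the translated string.
    return leading + translate(core) + trailing
```

Whitespace in the middle of the text is still compressed by the service; only the edges are protected this way.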
I'm using the Google Cloud Translator API to translate some HTML texts. I set the format to HTML, and the translation quality is pretty good (it keeps all the tags untranslated and only translates the text between the tags). However, it often removes all the line breaks in the HTML text. For example, I selected the English-German option, and
<p><a class="selfLink" id="notes" href="#notes" rel="help"><strong>Notes</strong></a>
<ul>
<li><a class="selfLink" id="disclaimer" href="#disclaimer" rel="help">DISCLAIMER OF LIABILITY</a>
...
becomes
<p><a class="selfLink" id="notes" href="#notes" rel="help"><strong>Anmerkungen</strong></a><ul><li> <a class="selfLink" id="disclaimer" href="#disclaimer" rel="help">...
It's very difficult to read the translated text since it's all in one line. I know that I can set the translator mode to treat the input text as "text" to preserve the line breaks, but in text mode, the translator is not able to identify HTML entities and determine whether a piece of text should be translated or not. Manually adding the line breaks is not a desirable approach. What can I do to improve the readability of the HTML translation?
Disappearing newlines are one of the features of the HTML mode; another is that some Unicode characters turn into HTML entities. You will run into it sooner or later :-)
The solution is to replace all newlines with <br/> before sending the text to the Google Translate API, and after getting the translation, to replace <br/> with newlines and HTML-decode the result.
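A minimal sketch of that round trip, assuming the translated text comes back as a plain string. The function names protect_newlines and restore_newlines are made up for illustration, and the call to the translation service itself is left out.

```python
import html
import re


def protect_newlines(source: str) -> str:
    """Replace newlines with <br/> so HTML mode keeps a marker for them."""
    return source.replace("\r\n", "\n").replace("\n", "<br/>")


def restore_newlines(translated: str) -> str:
    """Turn <br> markers back into newlines and decode HTML entities."""
    # The tag may come back as <br>, <br/> or <br />, so accept all three.
    with_newlines = re.sub(r"<br\s*/?>", "\n", translated, flags=re.IGNORECASE)
    # Caveat: this also decodes entities that were intentional in the source.
    return html.unescape(with_newlines)
```

Note that any <br/> tags already present in the source will also come back as newlines, so this is only safe when the input does not rely on them.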
I am sending text to a web service. The web service reads the text and generates an HTML report. The text is multiline. The web service joins all the lines into a single line and then wraps it in a quoted string. I want the lines to appear separately in the HTML report. I cannot change or view the web service's code.
I tried adding <br/> at the end of each message line before sending, but it didn't work. The browser treats <br/> as normal text, and it appears literally in the report: line1<br/>line2.
I'm looking for a trick to get rid of the quotes and let the browser interpret HTML tags like <br/>.
Instead of sending
text = '************ <br/> SomeText <br/>**********';
Use
html='************ <br/> SomeText <br/> **********';
I don't know whether it satisfies your requirement
I'm using the Microsoft Translator Text API to translate some sentences. My sentences contain some parts of text that must not be translated.
To achieve this, I wrap the non-translatable text in <span class="notranslate"></span>. It works well in most cases, but in some cases the MT API breaks these spans.
Examples (Input -> Output):
some <span class="notranslate">1</span> text -> деякий 1 текст
some <span class="notranslate">1</span> another text -> деякий <span class="notranslate">1 інший </span> текст
Good Example:
some <span class="notranslate">1</span> text -> деякий <span class="notranslate">1</span> текст
I don't see any pattern; it seems to happen randomly. Am I missing something?
UPD:
I tried sending the headers Content-Type: text/xml and Content-Type: text/html, with the same result in both cases: the engine breaks some spans.
I found the solution.
The Microsoft Translator API 3.0 documentation recommends using <div class="notranslate"></div> instead of <span class="notranslate"></span>.
I use version 2 of the API, but after changing the wrapper to <div>, the MT API seems to have stopped breaking my notranslate wrappers.
With version 3.0, it's not enough to use <div>. Also, as Denis Kurochkin warned, it reduces the effectiveness of the translation (by ending the sentence prematurely).
To achieve this, use <span class="notranslate">Text won't translate here</span> or <span translate="no">Text</span>, plus include the textType=html query parameter to ensure it is working correctly:
/translate?api-version=3.0&to=zh&textType=html
Without it (regardless of span/div), the API will not translate the text inside the tags, but it will translate the attributes within the tag.
That is, if you have other attributes inside the <span> tag, they will be modified, something like this:
<span data-type=""mention"" class=""mention"" data-id=""39dcf29b-fce0-4a26-90ef-6342e017c1b8"" data-label=""My name has words inside it | Super cool company"" class=""notranslate"">My name has words inside it | Super cool company
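For illustration, here is one way the request with textType=html might be assembled in Python. This is a sketch under assumptions, not the only way to call the service: the helper name build_translate_request is made up, the global endpoint shown may differ for your resource, and some resources also require an Ocp-Apim-Subscription-Region header. The request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the global Translator v3 endpoint; yours may be region-specific.
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(html_text: str, to_lang: str, key: str) -> urllib.request.Request:
    """Build (but do not send) a translate request that uses HTML mode."""
    query = urllib.parse.urlencode(
        {"api-version": "3.0", "to": to_lang, "textType": "html"}
    )
    # The v3 API takes a JSON array of objects with a "Text" field.
    body = json.dumps([{"Text": html_text}]).encode("utf-8")
    return urllib.request.Request(
        f"{ENDPOINT}?{query}",
        data=body,
        headers={
            "Ocp-Apim-Subscription-Key": key,  # placeholder key supplied by the caller
            "Content-Type": "application/json; charset=UTF-8",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it; the text to protect goes inside <span class="notranslate">...</span> in html_text.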
I need to show paragraph marks, spaces and other formatting marks in a contenteditable div, as you can in MS Word by pressing the Formatting Marks button.
Is there a simple way to achieve this?
<html>
<head>
<style>
span::after{
color:black;
content:"\00b6";
}
p::after{
color:black;
content:"\00b6";
}
</style>
</head>
<body>
<h3>
<span class="label">This is the main label</span>
<span class="secondary-label">secondary label</span>
</h3>
<p>Quote me</p>
</body>
</html>
Creating a font which draws spaces as dots and newlines as paragraph marks should solve your problem.
In code it will look like
.editable-div {
font-family: "Your custom font with spaces as dots and stuff", "Actual character font";
}
Here's an article which elaborates on this approach http://www.sitepoint.com/joy-of-subsets-web-fonts/
(I don't have access to Word, but I'm assuming it's the exact same functionality present in most text editors, or InDesign's 'show hidden characters' option &c.)
No, there definitely isn't a simple way to do this, because it's a fairly complex feature.
Your best bet if you really want to do this is to capture the input within the div as a user enters text. Something like Bacon that can easily capture keyed user input as a stream (and allow you to map across the stream) would simplify the process somewhat.
You'll then need to replace* (in realtime) every space/paragraph mark/&c with a relevant marker for the user. The actual input still needs to be either saved as typed, or parsed again before saving to strip the new, pretend characters. And though you can use unicode entities for many of the markers (pilcrows, maybe?), a space (for example) will still show as whitespace (or as the entity code if escaped), so you would need to use a representative icon - essentially, the majority of the hidden characters will each need to have their own specific, defined rendering rules.
This is all fairly nightmarish. It's doable if you can ensure the max amount of text can be kept small, and if you can control what users can enter. For large amounts of text, I can see it becoming horrific: not sure what the JS overhead would be in terms of performance, but I can't imagine it would be particularly good.
* or append - for example newlines/carriage returns etc need to be both displayed as a marker, and actually occur within the contenteditable element.
Edit: What you could do in addition to the above is to edit a font, replacing/adding unicode points for hidden characters instead/as well as visible ones - you would still need to capture input, but this would remove a few headaches. It would deal with spaces quite nicely, for example. Still a bit of a nightmare, but hey.
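To make the replace-then-strip round trip described above concrete, here is a rough sketch of just the string handling. It is written in Python for brevity; in the browser the same logic would live in JavaScript alongside the input capture. The marker characters and the names add_marks and strip_marks are choices made for this example only.

```python
SPACE_MARK = "\u00b7"      # middle dot, shown in place of each space
PARAGRAPH_MARK = "\u00b6"  # pilcrow, shown before each newline


def add_marks(text: str) -> str:
    """Return a display version of `text` with visible formatting marks."""
    # Newlines are kept (the marker is appended before them) so the text still wraps.
    return text.replace(" ", SPACE_MARK).replace("\n", PARAGRAPH_MARK + "\n")


def strip_marks(marked: str) -> str:
    """Undo add_marks before saving the user's input."""
    # Caveat: a middle dot or pilcrow typed by the user would be lost here.
    return marked.replace(PARAGRAPH_MARK + "\n", "\n").replace(SPACE_MARK, " ")
```

The lossy case in the comment is one reason the font-based approach is attractive: it changes only how characters are drawn, not the text itself.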
How can I match this kind of line
<p><span class="font7" style="font-weight:bold;">text text text text </span></p>\r\n<p>
and at the same time avoid this kind of line
<p><span class="font7" style="font-weight:bold;">text text text text </span><span class="font7"> text text text <br/> text text text </span></p>\r\n<p>
The problem is that the span tag appears twice in the same line, and I want to avoid that.
I only want a match if </span> appears once in the line.
I have tried this regex
<p><span class="font7" style="font-weight:bold;">.+?(?:(?!.+?</span>.+?$)){2}</p>\r\n<p>
Please help me, if possible in the .NET, Perl or Ruby flavor.
Greetings
Do not try to parse HTML with regular expressions. You can't do it reliably. Regular expressions are not up to the task.
You need a proper HTML parser. It will be an HTML parser that has been well-tested and used by many people, as opposed to whatever regexes you try to cobble together.
Here are some options for Perl HTML parsers. Start there.
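As an illustration of the parser-based approach (shown in Python with the standard library's html.parser rather than Perl, purely as a sketch), the following collects the text of every <p> that contains exactly one <span>. The class and function names are made up for this example, and it assumes paragraphs are not nested.

```python
from html.parser import HTMLParser


class SingleSpanParagraphs(HTMLParser):
    """Collect the text of <p> elements that contain exactly one <span>."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._in_p = False
        self._span_count = 0
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Start tracking a new paragraph.
            self._in_p = True
            self._span_count = 0
            self._text = []
        elif tag == "span" and self._in_p:
            self._span_count += 1

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            if self._span_count == 1:
                self.matches.append("".join(self._text).strip())
            self._in_p = False


def paragraphs_with_one_span(markup: str) -> list:
    parser = SingleSpanParagraphs()
    parser.feed(markup)
    parser.close()
    return parser.matches
```

A trailing unclosed <p>, as in the question's sample, is simply ignored because it never reaches its end tag.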