l20n with HTML markup?

How would I use l20n if I wanted to create something like this:
About <strong>Firefox</strong>
I want to translate the phrase as a whole but I also want the markup. I don't want to have to do this:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserStrong "About <strong>{{ browserBrandShortName }}</strong>">
...as the translation itself is now duplicated.
I understand that this might not be in the scope of l20n, but it is probably a common enough case in the real world. Is there some kind of established way to go about this?

Sometimes duplicating the translation is the best thing you can do. Redundancy is good in localization: it allows you to make fewer assumptions about translations into other languages. One of the core principles of L20n is that only the localizer will know what they really need.
Your solution is actually okay
The solution that you proposed is actually quite good. It's entirely possible that the emphasis you're trying to express with <strong> will have implications in some language that we're not aware of. For instance, some languages might use declensions or postpositions to mean "about something", in which case you, as a developer, shouldn't make too many assumptions about the exact position of the <strong> element. It might be that the entire translation will be a single word surrounded by <strong>.
Here's your code again, formatted using L20n's multiline string literals:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserEmphasized """
About <strong>{{ browserBrandShortName }}</strong>
""">
Note that for this to work as expected, you'll need to add a data-l10n-overlay attribute to the DOM node with data-l10n-id=aboutBrowserEmphasized. Otherwise, < and > will be escaped.
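For instance, a sketch of the markup (the element and the entity id are just the ones from the example above; treat the exact attribute semantics as an assumption to verify against the L20n docs):
<!-- data-l10n-overlay lets the <strong> in the translation through unescaped -->
<p data-l10n-id="aboutBrowserEmphasized" data-l10n-overlay></p>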
Making few assumptions matters
Let me digress quickly and bring up Bug 859035 — Do not use the same "unknown" entity for Size & Author when installing a WebApp from Firefox OS. The English-speaking developer assumed that they could use the "unknown" adjective to qualify both the size and the author in the installation dialog. However, in certain languages, like French or Polish, the adjective must agree with the noun in gender and number. So even though English can make do with a single string:
<unknown "Unknown">
…other languages might require two separate strings for each of the contexts they're used in. In French, you'd say "auteur inconnu" (unknown author, masculine) but "taille inconnue" (unknown size, feminine):
<unknownSize "inconnue">
<unknownAuthor "inconnu">
In English, this means some redundancy:
<unknownSize "Unknown">
<unknownAuthor "Unknown">
…but that's OK, because in the end the quality of localization is improved. It is generally a good practice to use unique strings everywhere and reuse sparingly. Ideally, you'd allow different translations for all strings. Even something as simple and common as "Yes" and "No" can be tricky if you consider languages like Welsh:
Welsh doesn't have a single word to use every time for yes and no questions. The word used depends on the form of the question. You must generally answer using the relevant form of the verb used in the question, or, in questions where the verb is not the first element, you use either 'ie' or 'nage'.

Related

What is the "form" attribute in XML schemas actually good for?

I have been brushing up on my XML schema skills the past few days, and I spent a whole day trying to understand the intricacies of namespaces with respect to schemas. What struck me most was the seeming uselessness of the form="qualified|unqualified" attribute on non-global <element> and <attribute> elements.
My question is: does the form attribute actually add expressivity to XML schemas / XML documents, or does it just make the notation of certain XML documents easier/different?
I understand that XML documents that need to conform to a certain schema are generally easier to write when all elements are to be qualified with a namespace (one xmlns="xyz" attribute on the document element is all you need), but is that all? Why would anyone bother with unqualified non-global elements at all?
First, on the formal question: does the form attribute actually add to the expressivity of XSD? I think so: with the form attribute I can write types whose sets of valid instances cannot (as far as I can see) be matched by any type written without using the form attribute.
For example (in a schema which defines a top-level element named a of type duration):
<choice maxOccurs="unbounded">
  <element form="qualified" name="a" type="integer"/>
  <element form="unqualified" name="a" type="gYear"/>
</choice>
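For illustration, here is a hypothetical instance fragment (the parent element name and the namespace URI are made up; assume the parent's type contains the choice above). The same local name is validated against different types depending on whether it is qualified:
<tns:parent xmlns:tns="http://example.com/tns">
  <tns:a>42</tns:a>   <!-- qualified: validated as integer -->
  <a>2001</a>         <!-- unqualified: validated as gYear -->
</tns:parent>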
Then, on the less formal question: why would anyone bother?
The short answer is: because for each possible way of choosing among qualified and unqualified names for local elements and attributes, some people think that that is the right way to declare local elements and/or attributes.
A longer answer will take a little time. Sit down, get yourself a cup of coffee.
There are two schools of thought on local elements; both were represented in the working group that designed XSD.
One school of thought believes that if element P is in namespace N, and element C is local to the type of element N:P, then it is only natural that the child element should be named N:C. It is, after all, part of the same vocabulary as P, and identifying the vocabulary is what namespaces are all about. From your final question, I guess you lean toward this way of seeing things.
The other school of thought reasons that local elements are like local attributes. An attribute local to (the type of) element N:P is named A, not N:A -- the name N:A denotes, by definition, an attribute global to namespace N, not local to element N:P. By analogy, local children should also use unqualified names, so that attributes and child elements are treated in similar ways.
The presence of the form attribute on XSD element and attribute declarations might suggest the possible existence of a third school of thought, characterized by a desire to treat the choice between qualified and unqualified names as a design choice to be taken individually for each local element or attribute, and not necessarily in a single vocabulary-wide edict. For what it's worth, this third school of thought doesn't actually seem to exist. At least, I don't think I've ever encountered a member. No one ever seems to set out to write complex types of the kind exhibited above with mixtures of qualified and unqualified local names. The essential function of the form attribute is not to allow different local elements to be qualified or unqualified, but to have its default set by the elementFormDefault and attributeFormDefault attributes on the enclosing schema element, thus ensuring that even if schema authors are somehow stuck with the wrong values for those attributes, they can still get the effect they desire.
I have also never (that I know of) encountered any member of either of the first two schools of thought who could feel any sympathy at all for the reasoning of the other school. That anyone could think the way the members of the other school of thought think has pretty much always come as an unwelcome surprise.
With a little effort, smart people of good will find it possible to accept the existence of the other school of thought, and even (with a little more effort) to accept that the members of that school of thought are arguing in good faith, and are not just trying to gum up the works. The variety of views has anthropological interest at best (look at the very odd things people can claim to believe even when otherwise they seem mostly like more or less rational beings! Funny old world, isn't it?).
After some months of deadlock most members of the working group were forced to admit to themselves that they just were not going to be able to persuade the other guys to see the error of their ways.
It then became clear that pretty much everyone in the WG had the same ranking for the three possibilities we could think of:
1. Define things the way I think is right.
2. Define things so that the schema author must make the choice.
3. Define things the way the other guys think is right.
Everyone liked the first choice (for suitable values of "I"), but for different WG members "the way I think is right" turned out to denote different things.
No one much liked the second choice, since it makes the schema author's life harder and leads to less consistency in the universe of schemas.
But everyone hated the third choice (being forced to do things the way the other guys wanted) so much that they were willing to accept compromise rather than risk utter defeat.

What's a good Lucene analyzer for text and source code?

What would be a good Lucene analyzer to use for documents that are a mix of text and diverse source code?
For example, I want "C" and "C++" to be considered different words, and I want Charset.forName("utf-8") to be split between the class name and method name, and for the parameter to be considered either one or two words.
A good example dataset for what I'd like to look at is StackOverflow itself. I believe that StackOverflow uses Lucene.NET for search; does it use a stock analyzer, or has it been heavily customized?
You're probably best off using the WhitespaceTokenizer and customizing it to strip off punctuation. For example, we strip off all punctuation except '+' and '-', so that words such as C++ are kept, while opening and closing quotes, brackets, etc. are removed. In reality, though, for something like this you might have to add the document twice using different tokenizers to catch the different parts of the document: once with the StandardTokenizer and once with a WhitespaceTokenizer. The StandardTokenizer will split all your code, e.g. between class and method names, while the whitespace one will pick up words such as C++. Obviously it also depends somewhat on the language, as e.g. Scala allows some punctuation characters in method names.
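To make the two passes concrete, here is a rough Python sketch of the tokenization behaviours described above (not actual Lucene analyzers, which are Java; the function names are mine):

import re
import string

# Characters to trim from token edges: all punctuation except '+' and '-',
# so that terms like "C++" survive intact.
STRIP = "".join(c for c in string.punctuation if c not in "+-")

def whitespace_tokens(text):
    # Whitespace-style pass: split on spaces, trim surrounding punctuation.
    return [t for t in (raw.strip(STRIP) for raw in text.split()) if t]

def standard_like_tokens(text):
    # Standard-style pass: split on every non-word character, which breaks
    # "Charset.forName" into class and method names but collapses "C++" to "C".
    return [t for t in re.split(r"\W+", text) if t]

text = 'I prefer C++ over C; see Charset.forName("utf-8")'
print(whitespace_tokens(text))     # keeps 'C++' and 'C' distinct
print(standard_like_tokens(text))  # yields 'Charset', 'forName', 'utf', '8', ...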

Regular Expressions vs XPath when parsing HTML text

I want to parse an HTML document and find particular parts, for example the text in the 3rd div of the 1st row and 2nd column of a table. I have two options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each?
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution — the very one shown above, in fact. To go off writing some massively overengineered behemoth that does a full parse and sets up the entire parse tree just to do the simple edit shown above is quite simply wrong. It's also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of wacky things you aren't planning for just cries out for leveraging someone else's hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn't try to embed a full HTML grammar inside your pattern. Honest, you really shouldn't. Speaking as someone who actually can and has done this, I have, unlike 99.9999% of the responders here, the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don't want you to try it at home unsupervised. I can't be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, "heroic" pattern matching, including recursion, here, here, here, here, here, and here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is OK, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With regular expressions, it will be up to you to handle the different ways of writing a tag (multiple spaces, double quotes, single quotes, no quotes, one line, multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parsers work if they're not using regexp?
XPath is less likely to break if the web developer makes any minor changes. That would be my choice.
Here is the canonical Stack Overflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.
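To make the comparison concrete, here is a minimal sketch using Python with lxml (my choice of library; the question doesn't name a language), addressing the kind of table cell the question asks about:

from lxml import html

snippet = """
<table>
  <tr><td>one</td><td>two</td></tr>
  <tr><td>three</td><td>four</td></tr>
</table>
"""

tree = html.fromstring(snippet)
# XPath addresses the structure directly; quoting styles, attribute order
# and whitespace inside the tags never enter the picture.
print(tree.xpath("//table//tr[1]/td[2]/text()"))  # ['two']

A regex for the same job would have to anticipate every quoting and whitespace variation by hand, which is exactly the trap described above.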

Writing XSS Filter for (X)HTML Based on White List

I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white list of attributes for these tags. For example, typical HTML input can consist of <b> and <i> tags, and an <a> tag with href. But a straightforward implementation is not good enough, because even a simple allowed link may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
There are many other examples like this. So I also thought about creating a white list of prefixes for attributes like href/src -- so I always need to check that the value starts with (https?|ftp)://
Questions:
Are these assumptions good enough for most purposes? That is, if I do not allow any style options and check src/href against a white list of prefixes, does that solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML that would help in writing a simple parser to clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET, though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1: A bit extra:
You can potentially come up with a very strict white listing. It should be well structured, pretty tight, and not very flexible. When you combine flexibility, many tags and attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the img tag); then you need to whitelist the attribute values, as you stated: http|https|ftp, or style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also you need to think about character whitelisting or UTF-8 normalization, because different encodings can cause awkward issues, such as new lines in attributes or invalid UTF-8 sequences.
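For illustration only, here is a minimal Python sketch of the whitelist structure described above (the real filter would be C++, and these white lists are toy examples, not recommendations):

from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "a"}                       # toy tag white list
ALLOWED_ATTRS = {"a": {"href"}}                      # per-tag attribute white list
ALLOWED_PREFIXES = ("http://", "https://", "ftp://")

class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop forbidden tags such as <script> entirely
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # URL-bearing attributes must start with a whitelisted prefix
            if name in ("href", "src") and not (value or "").startswith(ALLOWED_PREFIXES):
                continue
            kept.append(' %s="%s"' % (name, escape(value or "")))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))  # everything else becomes plain text

def sanitize(markup):
    f = WhitelistFilter()
    f.feed(markup)
    f.close()
    return "".join(f.out)

print(sanitize('<a href="javascript:alert(1)">Click On Me</a>'))
# -> <a>Click On Me</a>  (the javascript: URL is dropped)

Note that this toy version keeps the (escaped) text inside dropped tags such as <script>; a real filter would drop that content too, and deal with the encoding issues mentioned above.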
All details of HTML parsing are specified in HTML 5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly, with all its corner cases. At worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into doing a conversion of one of the established versions to C++, as opposed to trying to write your own from scratch. There are so many tricks to worry about, that I think you'd be better off standing on the shoulders of others that have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ won't be able to duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it would definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained; there's a comparison document where the author discusses some differences between his approach and others', which is probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!

How do you handle translation of text with markup?

I'm developing multi-language support for our web app. We're using Django's helpers around the gettext library. Everything has been surprisingly easy, except for the question of how to handle sentences that include significant HTML markup. Here's a simple example:
Please <a href="/login/">log in</a> to continue.
Here are the approaches I can think of:
Change the link to include the whole sentence. Regardless of whether the change is a good idea in this case, the problem with this solution is that the UI becomes dependent on the needs of i18n when the two are ideally independent.
Mark the whole string above for translation (formatting included). The translation strings would then include the HTML directly. The problem with this is that changing the HTML formatting requires changing all the translations.
Tightly couple multiple translations, then use string interpolation to combine them. For the example, the phrases "Please %s to continue" and "log in" could be marked separately for translation, then combined. The "log in" is localized, then wrapped in the href, then inserted into the translated phrase, which keeps the %s in the translation to mark where the link should go. This approach complicates the code and breaks the independence of translation strings.
Are there any other options? How have others solved this problem?
Solution 2 is what you want. Send them the whole sentence, with the HTML markup embedded.
Reasons:
The predominant translation tool, Trados, can protect the markup from inadvertent corruption by a translator.
Trados can also auto-translate text that it has seen before, even if the content of the tags has changed (as long as the number of tags and their positions in the sentence are the same). At the very least, the translator will give you a good discount.
Styling is locale-specific. In some cases, bold will be inappropriate in Chinese or Japanese, and italics are less commonly used in East Asian languages, for example. The translator should have the freedom to either keep or remove the styles.
Word order is language-specific. If you were to segment the above sentence into fragments, it might work for English and French, but in Chinese or Japanese the word order would not be correct when you concatenate. For this reason, it is best i18n practice to externalize entire sentences, not sentence fragments.
2, with a potential twist.
You certainly could localize the whole string, like:
loginLink=Please <a href="/login/">log in</a> to continue
However, depending on your tooling and your localization group, they might prefer for you to do something like:
// tokens in this string add html links
loginLink=Please {0}log in{1} to continue
That would be my preferred method. You could use a different substitution pattern if you have localization tooling that ignores certain characters. E.g.
loginLink=Please %startlink%log in%endlink% to continue
Then perform the substitution in your JSP, servlet, or equivalent for whatever language you're using ...
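The substitution itself is trivial in any language; here is a sketch in Python of the token style above (the URL and the function name are placeholders of mine):

CATALOG = {
    # what the translator sees and edits
    "loginLink": "Please %startlink%log in%endlink% to continue",
}

def render_login_link(translated, url="/login/"):
    # swap the opaque tokens for real markup after translation
    return (translated
            .replace("%startlink%", '<a href="%s">' % url)
            .replace("%endlink%", "</a>"))

print(render_login_link(CATALOG["loginLink"]))
# Please <a href="/login/">log in</a> to continue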
Disclaimer: I am not experienced in internationalization of software myself.
Regarding option 1: I don't think this would be good in any case - it just introduces too much coupling …
Regarding option 2: As long as you keep formatting sparse in the parts which need to be translated, this could be okay. Giving translators the possibility to give special words importance (by either making them a link or perhaps using <strong/> emphasis) sounds like a good idea. However, those translations containing (X)HTML possibly cannot be reused easily elsewhere.
Regarding option 3: This sounds like unnecessary work to me …
If it were me, I think I would go with the second approach, but I would put the URI into a formatting parameter, so that this can be changed without having to change all those translations.
Please <a href="%(url)s">log in</a> to continue.
You should keep in mind that you may need to teach your translators a basic knowledge of (X)HTML if you go with this approach, so that they do not screw up your markup and so that they know what to expect from the text they write. Anyhow, this additional knowledge might lead to better semantic markup, because, as mentioned above, texts could be translated and annotated with (X)HTML to reflect local writing style.
Whatever you do, keep the whole sentence as one string. You need to understand the whole sentence to translate it correctly.
Not all words should be translated in all languages: e.g. in Norwegian one doesn't use "please" (we can say "vær så snill", literally "be so kind", but when used as a command it sounds too forceful), so the correct Norwegian would be:
"Logg inn for å fortsette" lit.: "Log in to continue" or
"Fortsett ved å logge inn" lit.: "Continue by to log in" etc.
You must allow completely changing the order, e.g. in a fictional demo language:
"Für kontinuer Loggen bitte ins" (if it was real) lit.: "To continue log please in"
Some languages may even have one single word for (most of) this sentence...
I'd recommend solution 1, or possibly "Please %{startlink}log in%{endlink} to continue"; this way the translator can make the whole sentence a link if that's more natural, and the sentence can be completely restructured.
Interesting question; I'll be having this problem very soon. I think I'll go for 2, without any kind of tricky stuff. HTML markup is simple, URLs won't move anytime soon, and if anything is changed a new entry will be created in django.po, so we get a chance to review the translation (e.g. a script could check for empty translations after makemessages).
So, in template :
{% load i18n %}
{% trans 'hello world' %}
... then, after python manage.py makemessages I get in my django.po
#: templates/out.html:3
msgid "hello world"
msgstr ""
I change it to my needs
#: templates/out.html:3
msgid "hello world"
msgstr "bonjour monde"
... and in the simple yet frequent cases I'll encounter, it won't be worth any further trouble. The other solutions here seem quite smart, but I don't think the solution to markup problems is more markup. Plus, I want to avoid too much confusing stuff inside templates.
Your templates should be quite stable after a while, I guess, but I don't know what other trouble you expect. If the content changes over and over, perhaps that content's place is not inside the template but inside a model.
Edit: I just checked the documentation: if you ever need variables inside a translation, there is blocktrans.
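For reference, a sketch of what that looks like in a template (login_url is a hypothetical context variable; the extracted msgid then contains the %(login_url)s placeholder form shown earlier):

{% load i18n %}
{% blocktrans %}Please <a href="{{ login_url }}">log in</a> to continue{% endblocktrans %}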
Makes no sense: how would you translate "log in"?
I don't think many translators have experience with HTML (the regular non-HTML-aware translators would be cheaper)
I would go with option 3, or use "Please %slog in%s to continue" and replace the %s with parts of the link.