A webmaster-tools utility complains that my page is missing meta language information.
What should I do if my page contains a mixture of text in multiple languages (all Western European languages)?
Most “webmaster-tools” complaints are best ignored (or not read at all). To get help with specific messages, please identify the tool, the message, and your URL.
A mixture of languages is a problem in itself, and best avoided by putting different language versions into different pages. If you need a mix of languages, use the lang attribute to specify the language of each part.
There is very little you can usefully do with meta tags when a page mixes languages.
Have you looked at the "Multi-regional and multilingual sites" article? It provides some good suggestions.
Some sites use `` for code formatting, while some others that I use have [c][/c] for code, or [b][/b] for bold etc. Then, some other sites like YouTube use things like ** for bold and __ for italics (I think that WhatsApp uses this same convention).
What are these called? Is `` a form of Markdown and [c][/c] is a form of HTML? What is the type used by YouTube and WhatsApp called?
And what, in general, do we call this class of formatting?
The general term for all of these types of languages would be Lightweight Markup Languages, although that term is not actually used much. Generally, "Markup Language" will suffice.
The [b][/b] syntax is most likely BBCode (although there is no [c] that I am aware of; I don't know what that would be).
Many markup languages (including Markdown) make use of backticks (`), underscores (_) and asterisks (*), and that use does not always mean the same thing. For an example of various languages, see this table on Wikipedia.
Note that some of those languages predate Markdown and may have even contributed to Markdown adopting the same behavior. Many others are newer than Markdown and could credit Markdown as their inspiration. That said, many of these markup languages which are "inspired" by Markdown are subtly different in various ways and are not actually Markdown. Slack and WhatsApp are two recent examples. Note that despite their similarity to Markdown, Wikipedia lists each of them as their own separate markup language due to those differences.
Finally, most of these lightweight markup languages are subsets of HTML. That is, they represent the small portion of HTML which is more commonly used in prose and are generally converted to HTML before being displayed. For example, StackOverflow converts Markdown questions, answers and comments to HTML and serves that HTML to your browser for display. HTML (HyperText Markup Language) itself uses angle brackets (<em>italics</em>), but so does XML (Extensible Markup Language) and various other languages. These would not be considered "lightweight."
Some languages have an accepted transliteration to Latin characters, such as Hindi, Russian or Japanese. For example, the Hindi for 'The man is eating' written in Devanagari script is 'आदमी खा रहा है।'. Transliterated, it would be 'Aadmi kha raha hai.' (or something similar; this approach is often used online, especially if people don't have access to a Hindi keyboard.)
In this case, we're using the Latin script but still writing Hindi, so it would be acceptable to mark up either variation using the lang attribute:
<span lang="hi">आदमी खा रहा है।</span> or <span lang="hi">Aadmi kha raha hai.</span>
My question then is about languages that are normally written in the Latin alphabet themselves, but might have phonetic guides for non-speakers/learners — either IPA or ad hoc pronunciation — is there any best practice in terms of giving it semantic meaning?
For example, in Irish if I were to say "The man is eating", I would say "Tá an fear ag ithe." I can mark this up as:
<span lang="ga">Tá an fear ag ithe.</span>
If I were to give a pronunciation guide for non-speakers, I might say "Taw on far eg ih-he". The sentence isn't meaningless (like 'lorem ipsum' text), but neither is it in English or Irish.
What is the correct use of language related attributes in HTML in this case, or is this use case just not covered currently by the specification?
Short version: if you want to specifically say it's written in the Latin alphabet, go for "hi-Latn" or "ga-Latn" for the examples you gave.
Long version:
The W3C spec for the lang attribute doesn't specifically mention this - it suggests some uses of this that depend on orthography (such as using it in order to render high-quality versions of the characters used), but some that don't (such as for search engines).
RFC 1766, which specifies the format for the language tags, suggests that specialisations of tags may be used to represent "script variations, such as az-arabic and az-cyrillic". There's more about the script subtag in this article on the W3C site, and a bit extra in the later RFC 5646. That one points to an ISO standard list of script names, and in that list the script you'd want is "Latn", as your examples are romanised forms of other scripts.
(This doesn't cover things like specifying how you did the transliteration, though, for languages which may have more than one standard e.g. Chinese in Latin script using Wade-Giles versus pinyin.)
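To make the shape of tags such as "hi-Latn" concrete, here is a minimal Python sketch that splits a tag into its language, script, and region subtags. The function name `split_language_tag` is hypothetical, and the sketch assumes only the simple `language[-script][-region]` form; variants, extensions, and private-use subtags from RFC 5646 are deliberately ignored.

```python
def split_language_tag(tag):
    """Split a simple language tag into (language, script, region).

    Simplified sketch: handles only language[-script][-region] and
    ignores variants, extensions, and private-use subtags.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    region = None
    for subtag in subtags[1:]:
        if script is None and region is None and len(subtag) == 4 and subtag.isalpha():
            # Script subtags are four letters, conventionally title-cased.
            script = subtag.title()
        elif region is None and (
            (len(subtag) == 2 and subtag.isalpha())
            or (len(subtag) == 3 and subtag.isdigit())
        ):
            # Region subtags are two letters or three digits.
            region = subtag.upper()
    return language, script, region


print(split_language_tag("hi-Latn"))
print(split_language_tag("ga"))
```

A missing subtag comes back as `None`, so a plain tag like "ga" simply has no script or region recorded.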
For most practical purposes, it does not matter, since browsers, search engines, and other relevant programs generally ignore lang attributes. The attributes may affect the choice of font, but only when the page itself does not suggest fonts (which is rare). Some speech browsers recognize a few values for lang and adapt their functionality accordingly. And if you open an HTML document in MS Word, it recognizes the lang markup and applies language-specific spelling tools. But all this is rather limited and rarely matters much. Moreover, in these cases, only the simplest types of language codes are recognized.
In principle, it is possible to indicate the writing system (“script”), such as Latin vs. Devanagari, and the transliteration or transcription system that has been used. This has been described in BCP 47. But for the most part, these are guidelines for implementors, not something you could use here and now.
For example, you can write <span lang="hi-Latn">Aadmi kha raha hai.</span> to indicate that the content is in Hindi but written in Latin letters. And there is, in principle at least, a way to indicate which of the competing romanization systems has been used. I don’t think any web-related software recognizes lang="hi-Latn"; programs might fail to recognize it even if they recognize lang="hi".
So you can use detailed values for lang, but it’s not of much use. Using simple markup like lang="hi" for any major fragment in another language (say, a sentence or more) is good practice, though not much more. Before spending too much time on it, consider what practical benefits you could expect. For example, if you consider using a client-side hyphenator like hyphenate.js, then lang markup becomes essential; but then you need to check out the expectations of that software, rather than just general specifications.
A word of warning: I have seen odd results when using lang="ru" for Russian written in Latin letters. The reason is that browsers may switch to their idea of “font for Russian”, causing a mix of fonts. But the simple remedy is to make some consistent font settings for all of your texts, overriding browser defaults, in cases like this.
Strings like “Taw on far eg ih-he” cannot be meaningfully classified as being in some language. If you use language markup, use lang="" (with empty string as value), since this is the defined way of explicitly indicating that the language is not indicated!
You might want to look into marking it up as <ruby>.
For example:
<ruby lang="hi">आदमी<rt>Aadmi</rt> खा<rt>kha</rt> रहा<rt>raha</rt> है।<rt>hai</rt></ruby>
I know some languages emphasise words differently to English, e.g. via changing word endings rather than stressing words with inflection of the voice.
If you are localising a site, would you trust that <strong> and <em> tags (and their placement) will have the same meaning in other languages — would you maintain this emphasis, check with your translator or leave them out?
What I'm wondering is how this translates (excuse the pun) into the semantics of the web? — Strong and em tags carry semantic meaning that is used within SEO, screen-readers etc. So should they be left in place so this isn't lost, or dropped to better conform with the target language?
Mark-up is there to convey meaning as a whole, and so long as the meaning is conveyed, you have succeeded in your mark-up. So in a language where stress emphasis is conveyed in the text, using tags to signal the emphasis is redundant and optional.
Inline level markup, much more so than block level markup, may need to be radically different in different languages. In a good translation, the text should be marked up from scratch in each language.
For individual words, I would leave the markup tags in the translation strings for the reasons outlined in comments above. If emphasising blocks of text, whole sentences, numerals, etc., I'd put it in the template if possible, as it's not really something that would need to be messed with by the translation.
A good idea might be to flag in the template that you have done this, using comments. You will also need some reliable process for getting all the translation files changed if you ever decide to alter the emphasis (which you inevitably will). This is a pain, so I tend to avoid adding emphasis to individual words wherever possible :)
Interesting question! The only thing I can add is that strong and em tags are only useful for SEO if the search engines connect those tags with the content on your site.
I'd recommend using these tags only if there's an actual reason (for emphasis, say) rather than hoping to gain SEO benefit from bolding or italicizing keywords.
For screen readers, it comes down to which language you are talking about. For JAWS, for example, you can download voice files. If the language isn't listed, users have to choose another language or find alternative means. The key thing is for you to set the lang attribute correctly.
How would you customise an HTML page so that it accepts multiple languages?
I will cite the W3C Internationalization Quick Tips for the Web:
Encoding. Use Unicode wherever possible for content, databases, etc. Always declare the encoding of content.
Escapes. Use characters rather than escapes (e.g. &aacute; &#xE1; or &#225;) whenever you can.
Language. Declare the language of documents and indicate internal language changes.
Presentation vs. content. Use style sheets for presentational information. Restrict markup to semantics.
Images, animations & examples. Check for translatability and inappropriate cultural bias.
Forms. Use an appropriate encoding on both form and server. Support local formats of names/addresses, times/dates, etc.
Text authoring. Use simple, concise text. Use care when composing sentences from multiple strings.
Navigation. On each page include clearly visible navigation to localized pages or sites, using the target language.
Right-to-left text. For XHTML, add dir="rtl" to the html tag. Only re-use it to change the base direction.
Check your work. Validate! Use techniques, tutorials, and articles at http://www.w3.org/International/
For more information, follow the W3C recommendations: http://www.w3.org/International/
One way to do this would be to use a decent server-side web technology (there are many to choose from) that includes support for internationalization. Essentially it comes down to specifying the different pieces of text that the site needs to display, assigning a label to each message, creating different versions of each label in separate language files, and, in the server-side code, referencing the label name and a language or country code to display the text in the appropriate language.
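The label-based lookup described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `MESSAGES` table stands in for the separate language files, and `translate` and `DEFAULT_LANGUAGE` are hypothetical names, not part of any particular framework.

```python
# Stand-in for per-language message files: language code -> label -> text.
MESSAGES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour"},
}

# Language used when a label has no translation in the requested language.
DEFAULT_LANGUAGE = "en"


def translate(label, language):
    """Return the text for a label in the given language.

    Falls back to the default language, and finally to the label
    itself, so a missing translation never raises an error.
    """
    catalog = MESSAGES.get(language, {})
    if label in catalog:
        return catalog[label]
    return MESSAGES[DEFAULT_LANGUAGE].get(label, label)


print(translate("greeting", "fr"))
print(translate("farewell", "fr"))
```

Falling back to the default language is one possible design choice; a real framework may instead log the missing label or fall back through a chain of related locales.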
The first step is to determine your requirements, your hosting environment and then figure out what options are available to you. If you can provide some more information we might be able to steer you in a better direction.
If I make a bunch of assumptions about what you are trying to achieve:
Serve the document as UTF-8
Browsers will tend to then return a UTF-8 response to the server when any forms are submitted (forms being the only way that a page is going to "accept" anything), and UTF-8 can handle the characters used in just about every language.
I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters
written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white
list of options for these tags. For example, typical HTML input can consist of
<b> and <i> tags and an <a> tag with href. But a straightforward implementation is not
good enough, because even simple allowed links may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
Many other examples can be found there. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src, so that I always check whether the value starts with (https?|ftp)://
Questions:
Are these assumptions good enough for most purposes? That is, if I do not allow any style options for tags
and check src/href against a white list of prefixes, does that solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for the formal grammar of HTML/XHTML, so that I can write a simple
parser that cleans up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which tries to accomplish the same thing. It's Java and .NET, though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, A bit extra :
You can potentially come up with a very strict white list. It should be well structured, tight, and not very flexible. When you combine flexibility, many tags and attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.), then strict attribute support based on the tag (for example, href is only valid on the a tag), and then whitelisting of the attribute values, as you stated: http|https|ftp, or style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also, you need to think about some character whitelisting or UTF-8 normalization, because different encodings can cause awkward issues, such as new lines in attributes or invalid UTF-8 sequences.
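As a rough illustration of the strict whitelist described in this answer, here is a minimal sketch. It is written in Python for brevity rather than C++, and it assumes the input has already been parsed into tag names, attribute names, and decoded attribute values; it does no HTML parsing or encoding normalization itself. The names `ALLOWED_ATTRIBUTES`, `URL_ATTRIBUTES`, `SAFE_URL`, and `is_allowed` are hypothetical.

```python
import re

# Allowed tags, each mapped to the attributes permitted on that tag.
ALLOWED_ATTRIBUTES = {
    "b": set(),
    "i": set(),
    "a": {"href"},
}

# Attributes whose values are URLs and need a scheme check.
URL_ATTRIBUTES = {"href", "src"}

# Whitelist of URL prefixes, taken from the question: (https?|ftp)://
SAFE_URL = re.compile(r"(https?|ftp)://", re.IGNORECASE)


def is_allowed(tag, attr=None, value=""):
    """Check a tag, or a tag/attribute/value triple, against the whitelist."""
    tag = tag.lower()
    if tag not in ALLOWED_ATTRIBUTES:
        return False
    if attr is None:
        return True
    attr = attr.lower()
    if attr not in ALLOWED_ATTRIBUTES[tag]:
        return False
    if attr in URL_ATTRIBUTES:
        # re.match anchors at the start, so the value must begin with
        # an allowed scheme; "javascript:" links are rejected.
        return SAFE_URL.match(value) is not None
    return True


print(is_allowed("a", "href", "http://example.com/"))
print(is_allowed("a", "href", "javascript:alert('XSS')"))
```

Anything not explicitly listed is rejected, including the style attribute, so the `<x style=...>` example above fails both the tag check and the attribute check.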
All details of HTML parsing are specified in HTML 5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly, with all corner cases. At worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex of an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into doing a conversion of one of the established versions to C++, as opposed to trying to write your own from scratch. There are so many tricks to worry about, that I think you'd be better off standing on the shoulders of others that have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ can't duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it would definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained. There's a comparison document where the author discusses some differences between his approach and others'; it is probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!