Two encoding in one HTML document - html

My problem is this:
I am copying a set of HTMLs from a machine to other one, and I am adding more information into the target HTMLs as a element. The problem I have is that the source documents are encoded in a lot of different encodings [UTF8, 8859-1, GB1232, etc.] and the meta information is stored as UTF-8, so, when I "dummily" merge my meta info with the original document, my meta info [that contains international characters] looks weird.
So, is there a way of use the HTML encoding defined in the <META> and in the !DOCTYPE tags in all an HTML document except in a TABLE or in a DIV Section that will use another encoding specified there?
thanks in advance,
Ernesto

No, there isn't.
I suggest you use DOM parsers to read the various HTML bits into memory, and then construct a combined document in UTF-8. Once these HTML fragments are in memory (after parsing) they'll be in some sort of Unicode representation (depending on the programming language), and so no information should get lost along the way.

No, you need to use a character encoding that is a union of the encodings that are used. So in your case I suggest you to use UTF-8 for all of your documents. Or you use character references instead of the plain character itself, if they can not be encoded with the encoding that is used in the document.

Related

Entity codes and the lang attribute: should I use both?

I am writing a markup document in Finnish.
I'm using the lang="fi-fi" attribute. Am I supposed to use the markup entities (ä for ä etc.) in conjunction with the language attribute, or is using the language attribute alone sufficient? How do the entities and language attribute affect each other?
The "problem" comes from the fact that the markup is written without entities and I have a script that's supposed to replace the scandic letters with entities by using regular expressions -- after defining the lang attribute the script doesn't appear to work anymore (which it supposedly did before adding the lang attribute).
My main concern is that the markup renders correctly regardless of the browser, although a "modern" browser can be assumed.
The lang attribute and entities do completely different jobs.
The lang attribute tells the parser what human language the document is written in. This allows, for example, search engines to tell if it is a good document to present to Finish speakers and screen reader software to select the correct pronunciation library.
Entities just let you represent characters that you couldn't otherwise represent. e.g.
Because you can't type the character of your keyboard
Because the character encoding the document is saved in (e.g. ASCII) doesn't include the character. This century you should be using UTF-8 just about everywhere and shouldn't need to worry about that.
Because the character would otherwise have special meaning in HTML (e.g. <).
Always use a lang attribute if you know what language the text of the document will be written in
Always use entities for characters with special meaning in HTML
Use literal characters if you can be reasonably certain the character encoding won't be mangled (which you can be most of the time) as they use fewer bytes and are easier to read in source code.
The root of my problem was actually character encoding. Although all of the documents were defined with UTF-8, the script somehow didn't recognize it. By telling the script that the input files (that were supposed be fixed with entities) are UTF-8 encoded the script functions correctly again.
As an answer to the question in the heading: to be absolutely sure that the documents are compatible with the server -- yes, I am supposed to use entity encoding (though I understand that assuming that the server allows UTF-8 is pretty safe assumption in general as implied by Quentin). Due to other reasons (related to automatic content generating), I'm also supposed to use the lang attribute.

HTML encoding of Japanese text

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it returns me the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?
I think it's bad for compatibility and should be replaced by proper
HTML entities.
Quite the opposite actually, your preference should be to not use html entities but rather correctly declare document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it since it's a well- and widely supported standard?
Some of those points have been summarised previously:
UTF-8 encodings are easier to read and edit for those who understand
what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings
for those who don't understand them, but they have the advantage of
rendering as special characters rather than hard to understand decimal
or hex encodings.
[For example] Wikipedia... actually go through articles and convert
character entities to their corresponding real characters for the sake
of user-friendliness and searchability.
As long as you mark your web-page as UTF-8, either in the http headers or the meta tags, having foreign characters in your web-pages should be a non-issue. Alternately you could encode/decode these strings using encodeURI/decodeURI functions in JavaScript
encodeURI('ウェブサイトのメンテナンスの下で')
//returns"%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
//returns ウェブサイトのメンテナンスの下で
If you are looking for a tool to convert a bunch of static strings to unicode characters, you could simply use encodeURI/decodeURI functions from a web-page developer console (firebug for mozilla/firefox). Hope this helps!
HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no specification for how to represent "€". If you want to use that character in an ASCII encoded HTML document, you have to encode it as € or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
See http://kunststube.net/frontback for some more information.

Why can some HTML documents display special chars written plainly (e.g. as ä) without the need for codes (e.g. ä)

I'm making a little website with german and french content. Some of the documents display text correctly, even though all umlauts are written as äöü and not with codes. Other docs need the codes but I can't find the difference between the documents.
When trying to google for an answer, I can only find tons of code references but no explanation why some docs don't need them.
Any HTML document (or any text document for that matter) is encoded to a certain encoding - this is a mapping between the characters and the values representing them. Different encodings mean different characters.
Many pages use UTF-8 a Unicode encoding and they state so either in the HTTP header or in a Meta tag (Content-Type) on the page itself - such pages can use most characters directly.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
1) charset-declaration in the html-code (meta)
2) the encoding of your documents.
For example... if you're working with UTF-8 and there is ONE document (for example a js-file) in ISO 8859-1 then some browsers will show you the site in ISO 8859-1 wich destroys your äöüß, ...
Because, per the HTML specification:
Authoring tools (e.g., text editors) may encode HTML documents in the character encoding of their choice
Some documents use an encoding (such as iso‑8859‑1, or Windows‑1252, or utf‑8) that can represent the character ä directly; others use an encoding (such as us‑ascii) that cannot, and therefore need to use the character entity reference ä.

Necessary to encode characters in HTML links?

Should I be encoding characters contained within a url?
Example:
Some link using &
or
Some link using &
Yes.
In HTML (including XHTML and HTML5, as far as I know), all attribute values and tag content should be encoded:
Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.
There are two different kinds of encoding which are needed for different purposes in web programming, and it is easy to get confused.
Special characters in text which is to be displayed as HTML need to be encoded as HTML entities. This is particularly characters such as '<' which are part of HTML markup, but it may also be useful for other special characters if there is any doubt about the character encoding to be used.
Special characters in a URL need to be URL-encoded (replaced by %nn codes).
There is no harm in putting an HTML entity into a URL if it is going to be treated as HTML text by whatever receives it; but if it is part of an instruction to a program (such as the & used to separate arguments in a CGI query string) you should not encode it.
Depends how your files are being served up and identified.
For XHTML, yes and it's required.
For HTML, no and it's incorrect to do it.

What to do with umlauts (äöü) in metatags?

Declaring them as &xuml; etc. didn't work, just writing them as they are leads to display errors.
What to do?
If your page is encoded as UTF-8, you should be able to use special characters directly (i.e. without converting them into their HTML entity counterparts) without problems. Note that if you declare the encoding in a content-type meta tag, you should put that tag to the very beginning of the head section.
Use an encoding which can encode the characters. I'd recommend UTF-8, which is generally the preferred solution for western languages.
Keep in mind that HTTP headers have precedence over <meta http-equiv=...>, but you should set both to ensure using the correct encoding when loading the document from non-HTTP sources (eg when saving the file locally).
You should never have to use HTML entities for those characters, since they have no special meaning in HTML. Just make sure the character encoding of the text you're outputting matches your charset header.