Does HTML5 define a default charset?

I'm authoring HTML5 documents and was a little surprised that the text encoding (with no HTTP header or meta element setting it) defaults to windows-1252 in the browsers I have tested (Safari, Chrome, Firefox - recent versions as of Feb 2023, macOS).
In particular, I'm using the <!DOCTYPE html> but forgot to add the <meta charset="utf-8"> element. If I open the file locally, browsers perform auto-detection and use UTF-8 when non-ASCII characters are present - but not if the files are served through a web server.
I understand that browsers can't simply default to UTF-8 for all HTML files because of legacy content, and that auto-detection for HTTP-served content is hard (the reasoning is described here: https://hsivonen.fi/utf-8-detection/).
What I don't understand, however, is why a modern HTML5 document in standards mode (with the doctype set) does not default to UTF-8.
Edit: The similar question "Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?" asks why one needs to set the encoding if one (wrongly) assumes UTF-8 as the default, not what the default actually is (or how it is selected).

Through this question (thanks exa.byte and Rob!) and the HTML spec I believe I was able to piece together an answer.
Short answer: No, HTML5 has no default character encoding (but read on).
Long answer: Obviously browsers will use some encoding to display the page. When none is specified, the algorithm may first try auto-detection. In my testing, browsers do this for local files (URLs starting with file://), and some may even do it for remote files, but the standard discourages scanning beyond the first 1024 bytes of a remote file (which is also why the meta charset tag has to appear within the first 1024 bytes). The limit exists so parsing isn't stalled for too long. Browsers may also skip the auto-detection step entirely (which I believe is what Firefox does for remote files).
Side note: "no encoding specified" above means no BOM, no Content-Type charset parameter, no meta tag, no encoding inherited from a parent iframe, and no XML declaration (yes, that is consulted for text/html too).
So, if auto-detection doesn't settle on an encoding (for example because several are possible, or the browser didn't have enough data at the time), the browser selects an implementation-defined default. This is browser-dependent, but HTML5 suggests UTF-8 for controlled environments and a locale-based default otherwise (#9 here).
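To make the order concrete, here is a rough Python sketch of that fallback chain (my own simplification, not the spec's full encoding-sniffing algorithm; the function name and the windows-1252 default parameter are placeholders, while the BOM patterns and the 1024-byte prescan limit do come from the spec):
import re

def guess_encoding(first_kb: bytes, locale_default: str = "windows-1252") -> str:
    # 1. A byte order mark wins over everything else.
    if first_kb.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if first_kb.startswith(b"\xff\xfe") or first_kb.startswith(b"\xfe\xff"):
        return "utf-16"
    # 2. Prescan the first 1024 bytes for a <meta charset="..."> declaration.
    m = re.search(rb'<meta\s+charset=["\']?([-\w]+)', first_kb[:1024], re.IGNORECASE)
    if m:
        return m.group(1).decode("ascii")
    # 3. Auto-detection would go here; if it is skipped or inconclusive,
    #    fall back to an implementation-defined (often locale-based) default.
    return locale_default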
Finally, to explain the behavior I saw: I got windows-1252 because a) auto-detection failed (the non-ASCII characters were near the end of the page) and b) the browsers I use picked windows-1252 based on my preferred/selected locale.

Related

Browser support: png files with jpg extension

While using PrestaShop 1.6, even if you set it to store all images as PNG (and they in fact are PNG), it always adds a .jpg extension instead of the correct .png one. However, it works anyway (at least in Chrome).
Do all common browsers treat images according to their file header? Or is there some major browser for which I would need to patch the core (which I would really like to avoid) to use correct extensions?
Thanks
Browsers don't care about the file extension at all, but they do care about the content type in the HTTP header. The server generally uses the file extension to determine what MIME type to put in the HTTP header, so it may end up sending the images with the wrong MIME type.
However, once a browser has determined that the MIME type is an image, it doesn't tend to be picky about the image format. There may be some special cases, but both PNG and JPEG files have an easily recognisable signature at the beginning of the file, so the browser can easily see what the format actually is.
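For illustration, here is a small Python sketch of that kind of signature check (the helper name and the fallback MIME type are just for the example; the byte patterns are the standard PNG signature and the JPEG start-of-image marker):
def sniff_image_format(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(b"\x89PNG\r\n\x1a\n"):   # PNG signature
        return "image/png"
    if header.startswith(b"\xff\xd8\xff"):        # JPEG SOI marker
        return "image/jpeg"
    return "application/octet-stream"             # unknown / not handled here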

File's encoding doesn't work

In Eclipse, I've created a couple of files, added some text, and displayed them on the local server. My problem is that instead of UTF-8 characters like "ć" and "ś" I get garbage like "Ä". All files have a .php extension, though that doesn't matter.
What's strange is that in Opera some files display those characters properly while others don't; in Firefox all files show garbage.
I've tried Project -> Properties -> Text file encoding -> Other (UTF-8). It doesn't work.
What's wrong?
It's like that both on localhost and on external servers.
You need to tell the browser what the encoding of the file is. Add a charset tag to the head:
<meta charset="utf-8" />
Without being told what the encoding is, the browser has to guess. Different browsers will make different guesses; some guesses work better for some kinds of files and worse for others.
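Here's a quick Python illustration of why a wrong guess produces exactly that kind of garbage (assuming your files are saved as UTF-8 and the browser guesses a Latin-1-family encoding such as windows-1252):
raw = "ćś".encode("utf-8")          # what is actually in the file: b'\xc4\x87\xc5\x9b'
print(raw.decode("utf-8"))          # ćś   (correct interpretation)
print(raw.decode("windows-1252"))   # Ä‡Å›  (the "trash" a guessing browser may show)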

getting namespaced attributes in Chrome

Oh, what frustration. The supposedly XHTML-compliant CKEditor can't actually be served as application/xhtml+xml, so I have to switch to text/html. Suddenly my pages start breaking all over the place.
I serve a well-formed HTML5 document that uses namespaces, in particular the "example" namespace. Some elements have the "example:fooBar" attribute, but I see now that Chrome, when reading a document as text/html, converts all attribute names to lowercase.
So I change the attribute to "example:foobar" and try element.getAttributeNS("http://example.com/ns", "foobar"). No luck. So I investigate the DOM, and Chrome 17 shows a "localName" of example:foobar. Ack! How hard can namespaces be? Shouldn't Chrome be using a local name of foobar? That is, after all, the local name; example is the namespace prefix!
Is this a Chrome bug? Do all browsers do screwy things like this?

How browsers use the STRING defined in the <img src="STRING" /> to load picture file

I have a very strange problem:
I use XSL to show an HTML picture where the source is defined in the XML file like this:
<pic src="..\_images\gallery\smallPictures\2009-03-11 אפריקה ושחור לבן\020.jpg" width="150" height="120" />
[the funny chars are Hebrew- ;) ]
Now comes the strange part:
When testing the file locally it works in Firefox and Safari but NOT in IE and Opera (file://c:/file.xml).
Next I send the file to the host through FTP (nothing more).
Then it suddenly works with all browsers when calling the page from the host (http://www.host/file.xml).
The question is: how can the server send the XML file to my browser in a way that my browser can read, while the same browser cannot read the same file stored locally?!
I always thought that both the HTML (XML) and the pictures are sent to the client, which is responsible for loading the page - so how come the same files work from my web host but not for me?
And what makes it totally strange is that IE is not alone - Opera shows the same behavior.
Any ideas?
Thanks alot
Asaf
When you open the file locally, there is no server to serve up HTTP headers. That's a big difference at least. Try examining which encoding the browser thinks the page is in, both when it's opened from disk and when it's served over HTTP.
If the headers are set correctly by either your script or the server, then that is likely why it works there.
This is most likely an encoding problem. Try to specify the encoding explicitly in the generated HTML page by including the following META element in the head of the page (assuming that your XSLT is set to generate UTF-8):
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
...
</head>
...
This tells the browser to use UTF-8 encoding when rendering the page (you can actually see the encoding used in Internet Explorer's Page -> Encoding menu).
The reason this works when the page is served by your web server is that the server already tells the browser what encoding the response has via the Content-Type HTTP header.
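To illustrate what the web server is doing for you here, a minimal Python sketch (standard library only; the host, port, and page content are placeholders) that declares the encoding in the Content-Type header - exactly the piece that is missing when the file is opened straight from disk:
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = "<html><head></head><body>שלום עולם</body></html>".encode("utf-8")
        self.send_response(200)
        # The charset parameter below is what a locally opened file never gets.
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8000), Handler).serve_forever()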
To get a basic understanding what encoding means I recommend you to read the following article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
..\_images\gallery\smallPictures\2009-03-11 אפריקה ושחור לבן\020.jpg
That's a Windows file path and not anything like a valid URI. You need to:
replace the \ backslashes with /;
presumably, remove the .., if you're expecting the file to be in the root directory;
replace the spaces (and any other URL-unfriendly punctuation) with URL-encoded versions;
for compatibility with browsers that don't properly support IRI (and to avoid page encoding problems) non-ASCII characters like the Hebrew have to be UTF-8-and-URL-encoded.
You should end up with:
<img src="_images/gallery/smallPictures/2009-03-11%20%D7%90%D7%A4%D7%A8%D7%99%D7%A7%D7%94%20%D7%95%D7%A9%D7%97%D7%95%D7%A8%20%D7%9C%D7%91%D7%9F/020.jpg"/>
There's no practical way to convert a file path to a URI in XSLT alone. You will need some scripting language on the server; for example, in Python you'd use nturl2path.pathname2url().
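For illustration, a rough Python sketch of that conversion (using urllib.parse.quote rather than nturl2path; the path literal is the one from the XML above, and the printed output is trimmed):
from urllib.parse import quote

win_path = "..\\_images\\gallery\\smallPictures\\2009-03-11 אפריקה ושחור לבן\\020.jpg"

# Swap backslashes for forward slashes, then percent-encode each path segment;
# quote() encodes non-ASCII characters as UTF-8 by default.
url_path = "/".join(quote(segment) for segment in win_path.replace("\\", "/").split("/"))
print(url_path)   # ../_images/gallery/smallPictures/2009-03-11%20%D7%90...%D7%9F/020.jpg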
It's generally better to keep the file reference in URL form in the XML source.
Asaf, I believe Svend is right. HTTP headers specify the content type, content encoding, and other things, and encoding is the likely reason for the weird behavior. In the absence of header information specifying the encoding, different browsers will guess it using different methods.
Try right-clicking on the page in the browser and choosing "Show page info". Depending on your browser, the content encoding will likely differ between the copy served from the server and the one coming straight from your hard drive.

Is there any benefit to adding accept-charset="UTF-8" to HTML forms, if the page is already in UTF-8?

For pages already specified (either by HTTP header or by meta tag) to have a Content-Type with a UTF-8 charset, is there a benefit to adding accept-charset="UTF-8" to HTML forms?
(I understand the accept-charset attribute is broken in IE for ISO-8859-1, but I haven't heard of a problem with IE and UTF-8. I'm just asking if there's a benefit to adding it with UTF-8, to help prevent invalid byte sequences from being entered.)
If the page is already interpreted by the browser as being UTF-8, setting accept-charset="utf-8" does nothing.
If you set the encoding of the page to UTF-8 in a <meta> and/or HTTP header, it will be interpreted as UTF-8, unless the user deliberately goes to the View->Encoding menu and selects a different encoding, overriding the one you specified.
In that case, accept-charset would have the effect of setting the submission encoding back to UTF-8 in the face of the user messing about with the page encoding. However, this still won't work in IE, due to the previously discussed problems with accept-charset in that browser.
So it's IMO doubtful whether it's worth including accept-charset to fix the case where a non-IE user has deliberately sabotaged the page encoding (possibly messing up more on your page than just the form).
Personally, I don't bother.
I have not encountered any problems using UTF-8 with IE (6+) or any other major browser. You need to make sure that a UTF-8 meta tag is set (IE needs this) and that all your files are UTF-8 encoded (and that the web server sends matching UTF-8 headers). Then there should not be any problem if you omit accept-charset.