Strange validation behaviour of the charset with two html of the same template - html

I have got two HTML documents based on the same template. I built both exactly the same and then changed the contents inside the divs. I'm using the DTD HTML 4.01 Transitional and the charset ISO-8859-15 (for Spanish language, you know accents and so on) in a meta tag inside the head.
And when it comes to validation, one parses and the other doesn't, and I can't figure out why.
It complains about some accents in one of the documents that are also present in the other document which gets no complaints.
I find it funny, but there must be a reason.

I think I found the problem. I opened the file that was giving me trouble in plain Notepad, and there I could see strange characters like “ or ‰ in my code. I removed them and retyped the contents properly and, of course, it passed the parser. I could not see those characters with the file open in Notepad++, which is why the parser error I was getting seemed so strange to me.
I didn't set the encoding in my Notepad++ to ANSI and maybe that was the reason I couldn't see those odd characters.

Related

Words showing differently in the browsers

I am working on a site which has some Norwegian words. When I use "På" inside a <span>, it shows as "PÃ¥" in the browser. This happens only on one particular page; on the others it works fine. I have tried copy-pasting from the working pages, but it had no effect: it still shows "PÃ¥" instead of "På". Why is this happening?
You need to use &aring; instead of å.
See this link for HTML codes:
http://www.ascii.cl/htmlcodes.htm
Try converting your special characters to equivalent HTML entities using this converter
The character encoding of the page is wrong: the real encoding differs from the declared encoding. Using entity references for all non-Ascii characters would hide the symptoms (with the pertaining risk that later on, when someone inserts an “å”, things go wrong again). But the solution is to remove the conflict.
Check out the tutorial Declaring character encodings in HTML. If you need further help with this, posting the URL (not just copy of all code) is essential.
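One rough way to spot a conflict between the real and the declared encoding is to test whether the raw bytes are valid UTF-8 and compare that with the declaration. The sketch below is a heuristic under stated assumptions, not a full detector: the function name check_declared_encoding is made up for illustration, and it only distinguishes "UTF-8" from "not UTF-8".

```python
import codecs


def check_declared_encoding(data: bytes, declared: str) -> str:
    """Compare a page's raw bytes with the charset it declares (rough heuristic)."""
    if all(byte < 0x80 for byte in data):
        return "ascii-only: nothing to conflict with"
    try:
        data.decode("utf-8")
        looks_utf8 = True
    except UnicodeDecodeError:
        looks_utf8 = False
    # codecs.lookup normalises aliases such as "UTF8" or "utf_8" to "utf-8".
    declared_utf8 = codecs.lookup(declared).name == "utf-8"
    if looks_utf8 and declared_utf8:
        return "ok"
    if looks_utf8:
        return f"mismatch: bytes look like UTF-8 but the page declares {declared}"
    if declared_utf8:
        return "mismatch: page declares UTF-8 but the bytes are not valid UTF-8"
    return "undetermined: bytes are in some non-UTF-8 encoding"


page = "<span>På</span>".encode("utf-8")
print(check_declared_encoding(page, "iso-8859-1"))
```

Non-ASCII text that happens to be valid UTF-8 is very unlikely to be anything else, which is why the first mismatch case is a fairly strong signal; the last case cannot tell one single-byte encoding from another.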

Browser is HTML Encoding a character before sending it?

I can't believe what I'm seeing here! I have a normal, basic HTML form (I haven't changed the enctype). If someone puts a strange Japanese character in the field and posts the form, my database ends up with an HTML-encoded version of the character. I am not processing the string at all except with a Trim(). I'm using classic ASP (not out of choice, I might add!). I have a feeling this might have something to do with UTF-8/encoding, but I've tried messing around with the meta tag and content type and have been unable to get the character to come through properly. To make things harder, I don't seem to be able to get classic ASP debugging working in VS Express 2010. Any comments appreciated :)
As you can see in this demo and read in the standard (4.10.22.6.4.2), characters that are not supported by the selected encoding (such as Japanese ones in an ISO8859-* or cp1252 encoding) are encoded as HTML entities.
If you are fine with incorrectly handling user input that contains HTML entities in the clear, you can replace all numeric HTML entities in the user input with the corresponding Unicode characters. However, doing so in ASP is hard, since there is no inverse function to Server.HTMLEncode and Unicode support is pretty much nonexistent in the first place.
As an alternative, use UTF-8 (and/or a web development platform from this millennium) and all these problems go away. However, since that may not be an option, you may want to unescape the HTML entities in a different program, for example with HttpUtility.HtmlDecode in C#, html_entity_decode in PHP, or HTMLParser.unescape in Python (html.unescape in Python 3.4 and later).
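The behaviour described above can be approximated in Python with the xmlcharrefreplace error handler, and reversed with html.unescape. This is a sketch, not the browser's exact algorithm; the helper name browser_style_encode and the choice of cp1252 as the form's encoding are assumptions for illustration.

```python
import html


def browser_style_encode(text: str, form_charset: str = "cp1252") -> bytes:
    """Encode form text, turning unsupported characters into numeric references."""
    return text.encode(form_charset, errors="xmlcharrefreplace")


submitted = browser_style_encode("日本 café")
print(submitted)  # b'&#26085;&#26412; caf\xe9'

# Server side: decode with the form's charset, then unescape the references.
restored = html.unescape(submitted.decode("cp1252"))
print(restored)  # 日本 café
```

As the answer warns, this cannot tell a reference produced by the browser from one the user typed literally, so input such as "&#26085;" entered by hand is converted as well.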

Validation error: "Byte-Order Mark found in UTF-8 File"

I'm working on a website and, while displaying it on Firefox is fine, on Internet Explorer I've got a lot of problems. I used the W3C validator and I got a lot of strange errors.
Here's the link to the website: http://misenplacecatering.it/
The first validation error, which I think is the most relevant, is this:
Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.
and
Line 1, Column 1: Non-space characters found without seeing a doctype first. Expected <!DOCTYPE html>.
<!DOCTYPE HTML>
I've read other questions about this issue, so I tried to open the file with different editors (I always use Vim, anyway), but I don't see any space or anything else before the doctype definition. I even used Notepad++ and used an option to remove the BOM, but nothing.
How can I fix it?
If using Notepad++, use Convert to UTF-8 without BOM.
If you are using PHP, make sure that any included/required file is either in ASCII or in UTF-8 without a BOM, as PHP doesn't handle non-ASCII files very well (this one gave me a headache once).
You could try converting your files to ASCII, if you don't need UTF characters.
In your <meta charset> attribute, try writing the value within quotes.
The free text editor PSPad has a hex editing mode which is very handy for seeing exactly what you really have in your text files.
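If no editor at hand shows the BOM, the first three bytes of the file can be checked directly. A minimal Python sketch, assuming the file is UTF-8; the function name strip_utf8_bom and the file name in the usage note are placeholders.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte-order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbf<!DOCTYPE HTML>"))  # b'<!DOCTYPE HTML>'
```

To clean a file on disk, read it in binary mode, pass the bytes through the function and write them back, for example with pathlib.Path("index.html").read_bytes() and write_bytes(). When several files are concatenated on the server, each included file needs the same check, since a BOM in any of them ends up in the output.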

How do I paste Chinese text into an html snippet with no UTF-8 meta tag?

I have some pages I'm copying Chinese text into from a Word doc. I have 2 kind of HTML documents. Some are parent pages, with a full head tag, and meta tag, where I have spec'd:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I also have some snippets that just start with <div> because they are include files.
When I paste Chinese into the docs with the meta tag, it pastes fine and I can see the Chinese.
When I paste Chinese into the HTML snippets that start with a <div>, I just get boxes.
How do I paste Chinese into a snippet so that I can see the characters?
In Dreamweaver, I can flip over to visual mode and paste them in, but when I flip back to code mode, It shows me that all the characters have been converted to URL encoded equivalents. On the UTF-8 page, I can paste the Chinese and read it in code mode as Chinese characters.
HOWEVER
The weird thing is I have several include files. One I opened up has no meta tag on it, and it already had Chinese in it from development I did a few weeks back. I can still paste new Chinese into it and it's fine.
So basically, I have regular old HTML files with no meta tag about UTF-8; some allow me to paste Chinese into them in code mode and it works fine, and others don't. The structure of these various HTML snippets is nearly identical.
Could this be a DreamWeaver bug or is there some trick / setting?
You cannot do so reliably without some sort of metadata; in HTML, that is what the <meta> tag is for. (Otherwise the browser will have to guess, and uses the default that the user has chosen. Which, in most cases, is anything but what one expects.)
Dreamweaver also has to determine the encoding of the files it opens, and in the absence of metadata it will presumably use the same techniques as a web browser (such as detecting byte sequences / character distributions that are only likely to appear in a specific encoding) to come up with a best guess.
This is the likely reason why one snippet opened with the correct encoding and another did not. Once open, Dreamweaver "knows" that a file is encoded with UTF-8, so it can continue to use UTF-8 for that file (and may even add a BOM when saving, to ensure it opens correctly in future) which is probably why saving over the bad one with the good one fixed the issue.
A good editor will let you specify the encoding to use when opening (and saving) a file. You can do this in Dreamweaver via Modify > Page Properties (see: UTF-8 Without BOM?).

HTML Encoding Charset Problem I think?

I've been asked to add a testimonial to this page...
http://www.orchardkitchens.com/Showroom/testimonials.html
As you will see there are funny characters showing up all over the place, and it has thrown the structure of the page out.
I've since reloaded the backup and the funny chars are still appearing. Any ideas what I need to do??
Please ask if you need more info from me about the problem in hand.
Many thanks,
ETFairfax.
It looks to me as though some of the text was encoded as UTF-8, loaded as if it were in an ANSI charset, and then had an HTML encode run over it, resulting in these extra characters. You will need to find the source text and rebuild the HTML, ensuring that whatever reads the source text understands that it is in UTF-8.
Valid HTML might be a start; an HTML document shouldn't start with a meta tag directly. Also, it seems that the charset problem is not in your web page but rather in the backend code. Look at the source: there are numerous things such as
&acirc;&euro;&oelig;
appearing which are HTML character entities for things that UTF-8 encoding yields when interpreted as Latin 1. So you should probably fix your code instead of the HTML (well, that too).
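The chain described in these answers (UTF-8 bytes read with a single-byte charset, then HTML-encoded) can be undone step by step. A hedged Python sketch: the function name repair_entity_mojibake is made up, and cp1252 is assumed as the wrongly applied charset because it is the Windows superset of Latin 1 that contains € and œ.

```python
import html


def repair_entity_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo 'UTF-8 read as `wrong`, then HTML-encoded' for a piece of text."""
    decoded = html.unescape(text)  # &acirc;&euro;&oelig; becomes “
    try:
        # Recover the original bytes, then decode them with the right codec.
        return decoded.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not damaged in this particular way; only unescape it.
        return decoded


print(repair_entity_mojibake("&acirc;&euro;&oelig;Hello"))  # “Hello
```

This only patches text that is already damaged; the fix recommended in the answers is still to make the backend read the source as UTF-8 in the first place.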
Your HTML is syntactically invalid. The <!doctype> is missing, the <html> tag is missing, the <head> tag is missing, the meta information cannot be parsed reasonably by the webbrowser.
Fix your HTML first and then retry.
As to the character encoding story, just ensure that you're using one and the same character encoding everywhere: in the datastore, in the source files, in the response headers, etcetera. You may find the introductory text of this article useful to learn a bit more about character encodings. If you actually know/use Java, then you may find the proposed solutions useful as well.