HTML Issue, strange characters replacing HREF quotes

I am new to HTML coding. I'm taking an intro web design course this semester and I'm having a difficult time with my HREF segment. I have a table of contents page that references all of my projects over the semester.
This includes direct links to my projects where I should be able to embed my index.html file with the links to my new projects. However, whenever I try to update the HREF segments with quotes linking to my new project it spits out odd characters where the quotes would be.
Ã¢â‚¬Å“ example of what the error shows below.
**The requested URL /“http://userid.myweb.usf.edu/project1/index.html“ was not found on this server.**
<li>This link goes to <a href=“http://userid.myweb.usf.edu/project1/index.html“>Project1</a></li>
I see a lot of references to it being a UNICODE8 issue but I have no idea what that means. If anyone could help I would greatly appreciate it, as my professor is not the best at getting back to us.

Your <a> tag is using “ quote characters (Unicode codepoint U+201C LEFT DOUBLE QUOTATION MARK). HTML requires " quote characters instead (codepoint U+0022 QUOTATION MARK).
<li>This link goes to <a href="http://userid.myweb.usf.edu/project1/index.html">Project1</a></li>
Some editors, particularly word processors that were designed for editing documents and not HTML, will use “ instead of " when you type " on the keyboard or copy/paste text from other apps, so watch out for that. Use a text editor that is specifically designed for editing HTML, or at least a plain vanilla text editor, like Notepad or Notepad++, which doesn't reinterpret entered characters.
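If the curly quotes are already in the file, a small script can swap them back. This is only a rough sketch in Python (the helper name straighten_quotes is made up for illustration), and it assumes you are happy to replace every typographic quote in the text, including ones in visible prose:

```python
# Map typographic ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = {
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
}
_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text: str) -> str:
    """Return text with smart quotes replaced by ASCII quotes (hypothetical helper)."""
    return text.translate(_TABLE)


# The broken link from the question, written with escapes so this script
# does not depend on its own file encoding.
broken = "<a href=\u201chttp://userid.myweb.usf.edu/project1/index.html\u201c>Project1</a>"
fixed = straighten_quotes(broken)
print(fixed)
```

Retyping the quotes by hand in a plain text editor works just as well for a single link.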
Here is a breakdown of what Ã¢â‚¬Å“ means:
The Unicode “ (U+201C) character, which you are entering in your HTML, is encoded in UTF-8 as bytes E2 80 9C.
When those same bytes are interpreted in the Windows-1252 charset (the default charset used by most Windows systems in Western countries), byte E2 is Unicode codepoint U+00E2 (â), byte 80 is codepoint U+20AC (€), and byte 9C is codepoint U+0153 (œ).
When encoded in UTF-8, codepoint U+00E2 is bytes C3 A2, codepoint U+20AC is bytes E2 82 AC, and codepoint U+0153 is bytes C5 93.
Interpreted as Windows-1252, bytes C3 A2 E2 82 AC C5 93 are the characters Ã¢â‚¬Å“.
Look familiar?
You have a charset mismatch between the encoding your HTML file is saved in and the encoding used to interpret it. Your HTML is saved as UTF-8, but it is decoded as Windows-1252 instead of UTF-8, re-encoded as UTF-8, and then displayed as Windows-1252 again.
If you are serving your HTML file over HTTP, make sure the HTTP server is reporting the correct charset=UTF-8 attribute in the Content-Type HTTP header.
You can (and should) also add a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag (if using HTML4) or <meta charset="UTF-8"> tag (if using HTML5) to your HTML itself (when served over HTTP, web browsers are required to give the actual Content-Type HTTP header higher priority, though).
Make sure the reported charset in all cases matches the actual charset that you are saving your HTML file as.
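One way to check that the declared charset and the saved bytes agree is sketched below in Python. The function names and the regular expression are my own simplification (browsers use a more elaborate prescan), and it only looks at the meta tag, not at the HTTP header:

```python
import re
from typing import Optional

# Simplified pattern: finds charset=... inside a <meta> tag, in either the
# HTML4 (http-equiv) or the HTML5 (charset attribute) form.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(html_bytes: bytes) -> Optional[str]:
    """Return the charset named in the first 1024 bytes, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_cleanly(html_bytes: bytes, charset: str) -> bool:
    """Return True if the bytes are valid in the given charset."""
    try:
        html_bytes.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


page_text = '<meta charset="UTF-8"><p>caf\u00e9</p>'
saved_as_latin1 = page_text.encode("iso-8859-1")  # declaration does not match
saved_as_utf8 = page_text.encode("utf-8")         # declaration matches

print(declared_charset(saved_as_latin1), decodes_cleanly(saved_as_latin1, "utf-8"))
print(declared_charset(saved_as_utf8), decodes_cleanly(saved_as_utf8, "utf-8"))
```

Note that this check is mainly useful for UTF-8: single-byte charsets such as ISO-8859-1 accept any byte sequence, so they never fail to decode.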

Related

Basics on encoding

Today I've started my first HTML page. Where is the page encoding stored exactly?
At first, é turned into Ã©. Then I used my text editor to save the file with an encoding. "UTF-8" didn't work. Then I used "ISO 8859-1", which did work. How did my browser know it was encoded with "ISO 8859-1"?
I can't see it anywhere in my file, so I'm very curious about where the info is stored.
A plain text file does not record its own encoding; at most it starts with an optional byte order mark (BOM). Notepad++ and similar programs detect the encoding and usually provide a number of options to view and change it.
Additionally, you can provide a value by using the meta tag:
<meta charset="UTF-8"> (HTML5)
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> (HTML4)
Browsers use those tags when parsing your file. However, they do not define the encoding of the file itself (and that's what seems to be happening in your case: your file has encoding A, and the browser is trying to read encoding B), and browsers can ignore those declarations.
The default encoding can also be defined (and overwritten) by your server. A sample .htaccess encoding configuration:
AddDefaultCharset utf-8
AddType 'text/html; charset=utf-8' .html .htm .shtml
UTF-8 is the recommended encoding standard for the web.
The UTF-8 encoding for é is the two hex bytes C3A9.
C3 A9, when interpreted as ISO 8859-1, is two characters: Ã©.
Browsers tend to guess correctly at the encoding. Or you can explicitly tell it how to interpret the bytes. Try that out -- you will probably see the text change between é and Ã©.
A third case is when "double encoding" occurs. That is, somehow, the Ã© is itself treated as text and encoded as UTF-8 again, giving hex C3 83 C2 A9.
So, to really be sure of what is going on, you need to get the HEX.
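If you have Python at hand, getting the hex is a one-liner. The sketch below (hex_bytes is just a name I picked, and bytes.hex with a separator needs Python 3.8 or later) prints the three cases described above:

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of text in the given encoding as spaced hex."""
    return text.encode(encoding).hex(" ")


e_acute = "\u00e9"  # the character é, written as an escape

latin1 = hex_bytes(e_acute, "iso-8859-1")  # single byte
utf8 = hex_bytes(e_acute, "utf-8")         # two bytes

# Double encoding: the UTF-8 bytes are misread as ISO 8859-1 text,
# and that text is encoded as UTF-8 a second time.
misread = e_acute.encode("utf-8").decode("iso-8859-1")
double = hex_bytes(misread, "utf-8")

print(latin1, "|", utf8, "|", double)
```

For a real file, read it in binary mode (open(path, "rb")) and call .hex(" ") on the first few dozen bytes.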

Include Unicode Signature (BOM) in HTML files or not?

In Dreamweaver I have the option "Include Unicode Signature (BOM)".
If I check this box and save the HTML file, it looks good when viewed in the web browser. If not, it gives me strange symbols for Swedish letters like åäö.
If I serve this HTML file with strange letters using the response header "Content-Type: text/html; charset=utf-8", it still gives me strange symbols.
Q1) Does that mean that it's not a UTF-8 encoded file (the one without BOM that shows strange symbols)?
Q2) What makes a file UTF-8 encoded, is it just the Unicode signature (BOM)?
Q3) Should I or should I not add the Include Unicode Signature (BOM) in my files (HTML, Javascript, CSS, PHP)?
I know that I can add <meta charset="UTF-8"> in the HTML code or type AddDefaultCharset UTF-8 in my .htaccess. I just figure the optimal solution would be to have a response header that says "it's a UTF-8 encoded file" and then also actually serve a UTF-8 encoded file. Nothing else.
Q4) I thought HTML files were plain text-files. What other information is hidden in those files and how can I read this information?
The BOM is entirely optional for UTF-8. The Unicode consortium points out that it can create problems while offering no real advantage; the W3C says that it can be a substitute for other forms of declaring the encodings and should work on all modern browsers.
The BOM is only there to clarify the endianness of the encoding. Since UTF-8 only has one kind of endianness it is superfluous. It's only useful for UTF-16 and other encodings. A UTF-8 encoded file is UTF-8 encoded regardless of the presence of the BOM.
HTML files do not "hide" any other information, they're plain text.
My recommendation would be:
encode as UTF-8 without BOM
add the HTTP Content-Type header to denote the encoding of the file
also add the <meta> tag into the HTML itself as a fallback, should the file be interpreted outside of an HTTP context (meaning where no HTTP header exists because the file is not read over HTTP)
This gives you the best compatibility with the least potential for issues. If your characters are still appearing funny, then your file is not actually UTF-8 encoded or the HTTP header is not being set correctly.
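To check whether a file actually starts with a UTF-8 BOM, and to drop it, something like the following Python sketch should do. The helper name strip_utf8_bom is hypothetical; codecs.BOM_UTF8 and the utf-8-sig codec are part of the standard library:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


swedish = "\u00e5\u00e4\u00f6"  # the letters åäö, written as escapes
with_bom = codecs.BOM_UTF8 + swedish.encode("utf-8")
without_bom = strip_utf8_bom(with_bom)

# Both variants are UTF-8; the BOM is only a three-byte prefix.
print(with_bom.hex(" "))
print(without_bom.hex(" "))

# The "utf-8-sig" codec decodes either variant and discards the BOM.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```

If the bytes after the BOM do not decode as UTF-8 at all, the file was saved in some other encoding and removing the BOM will not help.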

utf-8 / utf-16 conversion

When I design a html page in Dreamweaver CS6 I use its validation tool (it sends the code to w3c) and I get no errors. However, when I validate the same page in UltraEdit 21 (it uses HTML Tidy) I get the warning:
"Specified input encoding (utf-8) does not match actual input encoding (utf-16)"
The page is set as html5 (with <!doctype html>), as utf-8 (with <meta charset="utf-8">) and contains greek text.
Well, the question is:
Does that problem affect the appearance of the page? I mean, when I publish it, will a user in China, Germany, or ...Tierra del Fuego see the greek text?
If yes, the rest are less important, but I'll ask them:
What makes HTML Tidy to define the document as utf-16? Is there a character, word or visible string of any kind that I can remove/delete to correct the problem?
If I use <meta charset="utf-16"> will browsers parse the code correctly (ending to greek text for the global user)?
The actual file encoding will be set in Dreamweaver properties for the file.
Dreamweaver Help / Set title and encoding properties for a page:
The Title/Encoding Page Properties options let you specify the document encoding type that is specific to the language used to author your web pages as well as specify which Unicode Normalization Form to use with that encoding type.
Select Modify > Page Properties, or click the Page Properties button in the text Property inspector.
Choose the Title/Encoding category and set the options.
...
Encoding
Specifies the encoding used for characters in the document.
If you select Unicode (UTF‑8) as the document encoding, entity encoding is not necessary because UTF‑8 can safely represent all characters. If you select another document encoding, entity encoding may be necessary to represent certain characters. For more information on character entities, see www.w3.org/TR/REC-html40/sgml/entities.html.
...
Include Unicode Signature (BOM)
Includes a Byte Order Mark (BOM) in the document. A BOM is 2 to 4 bytes at the beginning of a text file that identifies a file as Unicode, and if so, the byte order of the following bytes. Because UTF‑8 has no byte order, adding a UTF‑8 BOM is optional. For UTF‑16 and UTF‑32, it is required.
Choose UTF-8 without BOM.
UltraEdit automatically detects encoding of a file on opening and displays it at bottom in status bar. See in UltraEdit Advanced - Configuration - File Handling - Unicode/UTF-8 Detection and press button Help for some more details.
UTF-16 is displayed for a file encoded in UTF-16 Little Endian with or without BOM on using standard status bar since UE v19.00. Clicking on this list box in status bar and selecting Unicode - UTF-8 results in converting the file from UTF-16 LE to UTF-8 which then matches with the character set declaration in head of your HTML5 file.
When using basic status bar in UE v19.00 or any later version or using any UltraEdit version prior v19.00, the status bar field right to the field with line, column and clipboard number starts with U- for a file with UTF-16 LE encoding.
The UltraEdit help page about the Status Bar contains more information about information shown in standard and basic status bar in UltraEdit.
Conversion to UTF-8 can be done with UltraEdit also with command UNICODE/UTF-8 to UTF-8 (Unicode Editing) in submenu Conversions in menu File.
There are 2 configuration settings at Advanced - Configuration - File Handling - Save which define saving a UTF-8 encoded file with or without byte order mark (BOM):
Write UTF-8 BOM header to all UTF-8 files when saved
Write UTF-8 BOM on new files created within this program (if above is not set)
As UTF-8 encoded HTML files should be always without BOM, it is better to have both UTF-8 BOM settings unchecked when using UltraEdit mainly for editing HTML files.
Another possibility to convert a file with UltraEdit is using command Save As from menu File and selecting the appropriate Encoding / Format setting. UTF-8 in the Save As dialog means saving the file as UTF-8 with BOM, and UTF-8 - NO BOM means saving it without BOM, independent of the two configuration settings for standard Save.
For converting all files in a single folder, a folder tree, opened in UltraEdit, etc. to UTF-8 using UltraEdit, there is an UltraEdit scripting solution, see How to convert all files in a folder to UTF-8?
Unfortunately UE v21.30.0.1024 still does not recognize the short character set declaration <meta charset="utf-8"> as defined in HTML5 standard. See Short utf-8 charset declaration in HTML5 header with details about this limitation and how it can be worked around. This limitation does not matter if within first 64 KB at least one UTF-8 encoded character is found as it will be the case for your HTML5 files with Greek text.
HTML Tidy installed with UltraEdit v21.30.0.1024 is of version 25 March 2009. I'm not sure if HTML Tidy really supports short charset declaration of HTML5. But it looks so because otherwise you would not see the warning on validating the HTML5 file with HTML Tidy.
It might be useful for you to read UltraEdit power tip Unicode text and Unicode files in UltraEdit/UEStudio, as it explains what encoding and character set really mean and why it is important for applications that the declaration in the HTML5 file matches the encoding actually used.
Now, after all that general UltraEdit information, here are the answers to your questions.
Does that problem affect the appearance of the page?
Although the file declares that its contents are encoded in UTF-8 while it is really encoded in UTF-16 Little Endian, browsers display the contents correctly. UTF-16 is very easy to detect, especially with a BOM present, so browsers ignore the wrong declaration and interpret the bytes of the HTML file as UTF-16 encoded text right from the beginning.
However, it would be much better to convert the UTF-16 encoded HTML files to UTF-8 without BOM. UTF-8 without BOM is most commonly used for HTML files worldwide and then the character set declaration in head of your HTML file would also match with really used encoding.
What makes HTML Tidy to define the document as utf-16?
The encoding actually used for your HTML file is UTF-16 Little Endian, and UltraEdit, HTML Tidy and the browsers detect that after reading just the first 2 bytes of the file: the byte order mark. That's why HTML Tidy suggests correcting the declaration in the head of the HTML file to utf-16, the encoding the file really uses.
If I use <meta charset="utf-16"> will browsers parse the code correctly?
If you keep the file encoded in UTF-16 LE (2 bytes per character, or 4 for characters outside the Basic Multilingual Plane), it would be better to declare the character set correctly with <meta charset="utf-16">. But no Unicode-aware text editor or browser has any problem automatically detecting UTF-16 Little Endian encoding with a byte order mark.
The character set declaration becomes very important mainly for UTF-8 encoded files (1, 2, 3 or even 4 bytes per character) or files with single-byte coded characters using a code page like Windows-1252 / ISO 8859-1 (Latin 1) or Windows-1253 / ISO 8859-7 (Latin/Greek).
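If you would rather script the conversion than click through an editor, a rough Python sketch follows. The function name utf16_to_utf8 is made up, and it assumes the input is either UTF-16 with a BOM or already in the target encoding (it does not try to recognize UTF-32, whose little-endian BOM starts with the same two bytes):

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert BOM-prefixed UTF-16 bytes to UTF-8 without BOM.

    Input that does not start with a UTF-16 BOM is returned unchanged.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16").encode("utf-8")
    return data


greek = "\u03b1\u03b2\u03b3"  # Greek letters alpha, beta, gamma
utf16_file = codecs.BOM_UTF16_LE + greek.encode("utf-16-le")
converted = utf16_to_utf8(utf16_file)

print(utf16_file.hex(" "))  # starts with ff fe, then 2 bytes per letter
print(converted.hex(" "))   # 2 bytes per Greek letter in UTF-8, no BOM
```

After converting, the <meta charset="utf-8"> declaration in the page matches the bytes on disk.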

how to fix Â£ showing on HTML, possible through htaccess?

I've switched hosts, and somehow in all my HTML files that contain the pound sign, it is now shown with an Â in front: Â£. Is there a way to overcome this problem without adding
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
on every HTML page?
You have various alternative ways to overcome the problem, including any one of these:
In .htaccess as you asked: insert the line "AddDefaultCharset utf-8" or follow further advice from W3C or further advice from askapache.com
Insert the HTML 5 doctype "<!DOCTYPE html>" at the beginning of each HTML page, thus causing the browser to interpret the default character encoding to be UTF-8 instead of ISO-8859-1.
Store your HTML using character encoding ISO-8859-1, so that the pound sign is stored as one byte. Currently your HTML would appear to be stored using character encoding UTF-8, so that the pound sign is stored as two bytes. Here is one way to store a copy of a UTF-8 file as ISO-8859-1: iconv --from-code=UTF-8 --to-code=ISO-8859-1 inputfile.html > outputfile.html
Store your HTML using 7-bit (ASCII) characters, with the pound sign encoded as an XML numeric character entity &#163; or (hexadecimal) &#xA3; or the HTML named character entity &pound;
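The byte counts and the numeric entity mentioned in these options can be checked with a short Python sketch (variable names are only for illustration):

```python
pound = "\u00a3"  # the pound sign, written as an escape

utf8 = pound.encode("utf-8")         # two bytes: C2 A3
latin1 = pound.encode("iso-8859-1")  # one byte: A3

# UTF-8 bytes misread as ISO-8859-1 give the stray capital A with circumflex.
misread = utf8.decode("iso-8859-1")

# Pure-ASCII output: non-ASCII characters become numeric character references.
entity = pound.encode("ascii", "xmlcharrefreplace")

print(utf8.hex(" "), latin1.hex(" "), ascii(misread), entity.decode("ascii"))
```

The xmlcharrefreplace error handler produces the decimal form; the hexadecimal and named forms are equivalent in HTML.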

validator.w3.org reports a markup error - detected character encoding "utf-8"

validator.w3.org reports for www.besaltnlight.ca:
Character Encoding Override in effect!
The detected character encoding "utf-8" has been suppressed and "iso-8859-1" used instead.
The php code outputs iso-8859-1 and php sets that as the default characterset.
What is causing this problem? Am I using the wrong doctype?
Oh, and would any of this cause quirks mode in IE?
Thanks for your help.
Gerry
The document is encoded in UTF-8. It has a byte order mark, smart quotes, and an ellipsis, all properly encoded in UTF-8. It begins with two byte order marks, which is invalid. You must remove one, and the validator also says that the presence of a BOM in a UTF-8 document may be confusing, so you may remove them both.
Since you’re outputting UTF-8, you must change the HTTP header to:
Content-type: text/html; charset=utf-8
Since you are missing that header, you force the browser to guess. Additionally, the meta tag must be changed to
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
for the same reason.
Your output starts with a Unicode byte order mark, encoded in UTF-8.
This is likely the first few bytes of your PHP file, or of any PHP file included by your main file. Your editor may not even show them. Interpreted as ISO-8859-1, the start of the output looks like ï»¿ï»¿<!DOCTYPE html, which is in fact two byte order marks, one after the other.
As said by jleedev, either make sure your files are really encoded in Latin-1, or declare the encoding as UTF-8.
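To see how many byte order marks a response or source file starts with, you could use something like this Python sketch. The helper count_leading_boms is hypothetical, and the sample bytes only imitate the situation described above rather than the real page:

```python
import codecs


def count_leading_boms(data: bytes) -> int:
    """Count consecutive UTF-8 BOMs (EF BB BF) at the start of data."""
    count = 0
    while data.startswith(codecs.BOM_UTF8, count * len(codecs.BOM_UTF8)):
        count += 1
    return count


# Imitation of the output: two BOMs followed by the doctype.
output = codecs.BOM_UTF8 * 2 + b"<!DOCTYPE html>"

boms = count_leading_boms(output)
cleaned = output[boms * len(codecs.BOM_UTF8):]

print(boms)
print(ascii(output.decode("iso-8859-1")[:8]))  # what a Latin-1 reader sees
print(cleaned)
```

Running the same count on each PHP source file (read in binary mode) should show which included file contributes the extra BOM.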