Swedish characters are missing when displayed on an HTML form - html

I am trying to fetch Swedish content from another site. I am able to fetch the data, but the Swedish characters (ÅÖÄ) are missing. Swedish content that I have added directly displays without any issue, since I have added the meta tag. The issue only occurs when I try to display the data from the other site. Is it possible to fix this? I do not have any access to the other site.

To take into account Swedish characters, you need to set the charset to UTF-8. An example from MDN is:
<!-- In HTML5 -->
<meta charset="utf-8">
<!-- Defining the charset in HTML4 -->
<!-- Note: This is invalid in HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The meta tag goes in the <head> tag like so:
<html>
<head>
<meta charset="UTF-8">
</head>
</html>
To quote from MDN:
[charset] declares the character encoding used of the page. It can be locally overridden using the lang attribute on any element. This attribute is a literal string and must be one of the preferred MIME names for a character encoding as defined by the IANA. Though the standard doesn't request a specific character encoding, it gives some recommendations:
Authors are encouraged to use UTF-8.
Authors should not use ASCII-incompatible encodings (i.e. those that don't map the 8-bit code points 0x20 to 0x7E to the Unicode 0x0020 to 0x007E code points) as these represent a security risk: browsers not supporting them may interpret benign content as HTML Elements. This is the case of at least the following charsets: JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB, the ISO-2022 family, and the EBCDIC family.
Authors must not use CESU-8, UTF-7, BOCU-1 and SCSU, also falling in that category and not intended to be used on the web. Cross-scripting attacks with some of these encodings have been documented.
Authors should not use UTF-32 because not all HTML5 encoding algorithms can distinguish it from UTF-16.
Here is also a link on UTF-8.
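*Note: if UTF-8 does not work for your characters, the fetched content is probably stored in a different encoding; in that case try charset="ISO-8859-1". The declared charset has to match the encoding the bytes are actually in.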
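Since the content here is fetched from another site, the characters may be getting lost while the fetched bytes are decoded, before the meta tag ever matters. Here is a minimal Python sketch of that idea. The sample bytes and the helper name decode_fetched are made up for illustration, and it assumes the other site serves ISO-8859-1:

```python
# Bytes as a hypothetical remote site might send them: "ÅÖÄ" in ISO-8859-1.
remote_bytes = "ÅÖÄ".encode("iso-8859-1")  # b'\xc5\xd6\xc4'


def decode_fetched(raw: bytes, declared_charset: str) -> str:
    """Decode fetched bytes using the charset the remote site declares."""
    return raw.decode(declared_charset)


# Wrong: treating the bytes as UTF-8 loses the Swedish letters
# (each one becomes the U+FFFD replacement character).
garbled = remote_bytes.decode("utf-8", errors="replace")

# Right: decode with the source's charset, then re-encode as UTF-8
# so the text matches the <meta charset="utf-8"> on your own page.
text = decode_fetched(remote_bytes, "iso-8859-1")
utf8_for_page = text.encode("utf-8")

print(ascii(garbled))  # ascii() keeps the output printable on any console
print(ascii(text))
```

The charset to pass in would normally come from the other site's Content-Type header or its own meta tag.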

Related

Using HTML ASCII

Rookie question.
Would you guys recommend using HTML ASCII codes, or does the browser handle this part? I was reading through W3Schools and I'm just curious whether this is something I should always consider as a good habit.
It's always a good idea to include <meta charset="UTF-8"> in the <head> of your HTML documents. This lets the browser know that your document is encoded with Unicode.
It's perfectly fine to use Unicode characters in an HTML document, but it's better to use HTML entity names or entity numbers.
(See a list of entity names and numbers and learn more on w3schools.)
According to w3schools,
If you use an HTML entity name or a hexadecimal number,
the character will always display correctly.
This is independent of what character set (encoding) your page uses!
This means that entity names and numbers are guaranteed to work, even if you don't put <meta charset="UTF-8"> in the <head> of the document.
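To see why numeric entities are independent of the page encoding, here is a small Python sketch (the sample string is made up). It turns every non-ASCII character into a numeric character reference, so the result is pure ASCII. Note that it does not escape &, < or >; that is a separate step.

```python
import html

text = "Smörgåsbord © 2024"

# Replace every non-ASCII character with a numeric character reference.
# The result is pure ASCII, so it survives any ASCII-compatible encoding.
escaped = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(escaped)  # Sm&#246;rg&#229;sbord &#169; 2024

# A browser (or html.unescape) turns the references back into characters.
restored = html.unescape(escaped)
```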

W3 validation error "content" "charset" [duplicate]

In order to define charset for HTML5 Doctype, which notation should I use?
Short:
<meta charset="utf-8" />
Long:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
In HTML5, they are equivalent. Use the shorter one, as it is easier to remember and type. Browser support is fine since it was designed for backwards compatibility.
Both forms of the meta charset declaration are equivalent and should work the same across browsers. But there are a few things you need to remember when declaring your web files' character set as UTF-8:
Save your file(s) in UTF-8 encoding without the byte-order mark (BOM).
Declare the encoding in your HTML files using meta charset (like above).
Your web server must serve your files, declaring the UTF-8 encoding in the Content-Type HTTP header.
Apache servers are configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:
AddDefaultCharset UTF-8
This will configure Apache to serve your files declaring UTF-8 encoding in the Content-Type response header, but your files must be saved in UTF-8 (without BOM) to begin with.
Older versions of Notepad cannot save your files in UTF-8 without the BOM (recent Windows 10 and 11 versions can). A free editor that can is Notepad++. On the program menu bar, select "Encoding > Encode in UTF-8 without BOM". You can also open files and re-save them in UTF-8 using "Encoding > Convert to UTF-8 without BOM".
More on the Byte Order Mark (BOM) at Wikipedia.
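If you are not sure whether your editor wrote a BOM, you can check the first three bytes of the file. A short Python sketch, assuming the file content is already in memory (strip_utf8_bom is a hypothetical helper name):

```python
import codecs

text = "<p>Åäö</p>"

with_bom = text.encode("utf-8-sig")  # UTF-8 with a leading byte-order mark
without_bom = text.encode("utf-8")   # plain UTF-8, no BOM

print(with_bom[:3] == codecs.BOM_UTF8)          # True: starts with EF BB BF
print(without_bom.startswith(codecs.BOM_UTF8))  # False


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```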
Another reason to go with the short one is that it matches other instances where you might specify a character set in markup. For example:
<script type="text/javascript" charset="UTF-8" src="/script.js"></script>
<p><a charset="UTF-8" href="http://example.com/">Example Site</a></p>
Consistency helps to reduce errors and make code more readable.
Note that the charset attribute is case-insensitive. You can use UTF-8 or utf-8, however UTF-8 is clearer, more readable, more accurate.
Also, there is absolutely no reason at all to use any value other than UTF-8 in the meta charset attribute or page header. UTF-8 is the default encoding for Web documents since HTML4 in 1999 and the only practical way to make modern Web pages.
Also, you should not use HTML entities in UTF-8. Characters like the copyright symbol should be typed directly. The only entities you should use are for the five reserved markup characters: less than, greater than, ampersand, apostrophe (single quote), and double quote.
Entities need an HTML parser, which you may not always want to use going forward. They introduce errors, make your code less readable, increase your file sizes, and sometimes decode incorrectly in various browsers depending on which entities you used. Learn how to type/insert copyright, trademark, open quote, close quote, apostrophe, em dash, en dash, bullet, Euro, and any other characters you encounter in your content, and use those actual characters in your code.
The Mac has a Character Viewer that you can turn on in the Keyboard System Preference, and you can find and then drag and drop the characters you need, or use the matching Keyboard Viewer to see which keys to type. For example, trademark is Option + 2. UTF-8 contains all of the characters and symbols from every written human language.
So there is no excuse for using -- instead of an em dash. It is not a bad idea to learn the rules of punctuation and typography also ... for example, knowing that a period goes inside a close quote, not outside.
"Using a <meta> tag for something like content-type and encoding is highly ironic, since without knowing those things, you couldn't parse the file to get the value of the meta tag."
No, that is not true. The browser starts out parsing the file as the browser's default encoding, either UTF-8 or ISO-8859-1. Since US-ASCII is a subset of both ISO-8859-1 and UTF-8, the browser can read <html><head> just fine either way ... it is the same. When the browser encounters the meta charset tag, if the encoding is different than what the browser is already using, the browser reloads the page in the specified encoding.
That is why we put the meta charset tag at the top, right after the head tag, before anything else, even the title. That way you can use UTF-8 characters in your title.
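Here is a rough Python sketch of that idea: because the start of the document is ASCII-compatible, a parser can scan the raw bytes for the declaration before it knows the encoding. This is a simplification, not the browser's real algorithm. The function name and the 1024-byte limit are assumptions for illustration.

```python
import re

# Matches charset=... inside a <meta> tag, in either the short or long form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes, limit: int = 1024):
    """Look for a meta charset declaration in the first `limit` bytes."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


short_form = b'<html><head><meta charset="utf-8"><title>x</title>'
long_form = (b'<html><head><meta http-equiv="Content-Type" '
             b'content="text/html; charset=ISO-8859-1">')

print(sniff_meta_charset(short_form))  # utf-8
print(sniff_meta_charset(long_form))   # iso-8859-1
print(sniff_meta_charset(b"<html><head><title>x</title>"))  # None
```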
"You must save your file(s) in UTF-8 encoding without BOM"
That is not strictly true. If you only have US-ASCII characters in your document, you can Save it as US-ASCII and serve it as UTF-8, because it is a subset. But if there are Unicode characters, you are correct, you must Save as UTF-8 without BOM.
"If you want a good text editor that will save your files in UTF-8, I recommend Notepad++."
On the Mac, use Bare Bones TextWrangler (free) from Mac App Store, or Bare Bones BBEdit which is at Mac App Store for $39.99 ... very cheap for such a great tool.
In either app, there is a menu at the bottom of the document window where you specify the document encoding and you can easily choose "UTF-8 no BOM". And of course you can set that as the default for new documents in Preferences.
"But if your Webserver serves the encoding in the HTTP header, which is recommended, both [meta tags] are needless."
That is incorrect. You should of course set the encoding in the HTTP header, but you should also set it in the meta charset attribute so that the page can be saved by the user, out of the browser onto local storage and then opened again later, in which case the only indication of the encoding that will be present is the meta charset attribute.
You should also set a base tag for the same reason ... on the server, the base tag is unnecessary, but when opened from local storage, the base tag enables the page to work as if it is on the server, with all the assets in place and so on, no broken links.
AddDefaultCharset UTF-8
Or you can just change the encoding of particular file types like so:
AddType text/html;charset=utf-8 html
A tip for serving both UTF-8 and Latin-1 (ISO-8859-1) files is to give the UTF-8 files a "text" extension and Latin-1 files "txt."
AddType text/plain;charset=iso-8859-1 txt
AddType text/plain;charset=utf-8 text
Finally, consider saving your documents with Unix line endings, not legacy DOS or (classic) Mac line endings, which don't help and may hurt, especially down the line as we get further and further from those legacy systems.
An HTML document with valid HTML5, UTF-8 encoding, and Unix line endings is a job well done. You can share and edit and store and read and recover and rely on that document in many contexts. It's lingua franca. It's digital paper.
<meta charset="utf-8"> was introduced with/for HTML5.
As mentioned in the documentation, both are valid. However, <meta charset="utf-8"> is only for HTML5 (and easier to type/remember).
The old style is bound to become deprecated in due time, so I'd stick to the new <meta charset="utf-8">. There's only one way to go, and that's up. In tech's case, that means phasing out the old (really, REALLY fast).
Documentation: HTML meta charset Attribute—W3Schools
While not contesting the other answers, I think the following is worthy of mentioning.
The “long” (http-equiv) notation and the “short” one are equal. Whichever comes first wins;
Web server headers will override all the <meta> tags;
BOM (byte order mark) will override everything, and in many cases it will affect HTML 4 (and probably other stuff, too);
If you don't declare any encoding, you will probably get your text in the "fallback text encoding" that is defined by your browser. In neither Firefox nor Chrome is it UTF-8;
In absence of other clues the browser will attempt to read your document as if it was in ASCII to get the encoding, so you can't use any weird encodings (UTF-16 with BOM should do, though);
While the specifications say that the encoding declaration must be within the first 1024 bytes of the document, most browsers will try reading more than that.
You can test by running the following command and pointing your browser at localhost:4500:
echo -e 'HTTP/1.1 200 OK\r\nContent-type: text/html; charset=windows-1251\r\n\r\n\xef\xbb\xbf<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta charset="windows-1251"><title>привет</title></head><body>привет</body></html>' | nc -lp 4500
(Of course you will want to change or remove parts. The BOM part is \xef\xbb\xbf. The -e flag makes bash's echo interpret the backslash escapes; some shells do that by default. Be wary of the encoding of your shell.)
Please mind that it's very important that you explicitly declare the encoding. Letting browsers guess can lead to security issues.
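The precedence listed above (BOM, then server header, then meta tag, then the browser's fallback) can be sketched in Python as follows. This is only an illustration of the ordering, not how any browser is implemented. The function name, the parameters, and the windows-1252 fallback are assumptions.

```python
import codecs


def pick_encoding(raw, http_charset=None, meta_charset=None,
                  fallback="windows-1252"):
    """Pick an encoding: BOM first, then HTTP header, then <meta>."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if http_charset:
        return http_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    return fallback  # assumed browser default; in practice it varies by locale


# Same situation as the test page: BOM present, header and meta disagree.
body = codecs.BOM_UTF8 + "привет".encode("utf-8")
print(pick_encoding(body, "windows-1251", "windows-1251"))  # utf-8
print(pick_encoding(b"<html>", "windows-1251", "utf-8"))    # windows-1251
print(pick_encoding(b"<html>", meta_charset="UTF-8"))       # utf-8
print(pick_encoding(b"<html>"))                             # windows-1252
```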
Use <meta charset="utf-8" /> for web browsers when using HTML5.
Use <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> when using HTML4 or XHTML, or for outdated DOM parsers, like DOMDocument in PHP 5.3.
To embed a signature in an email, I would use the long version:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The reason is that not many email readers use HTML5, so it's always better to use old HTML styles. Actually, it's better to use tables than divs + CSS as well.
According to documentation from the Mozilla Foundation and SitePoint:
Do not use this value (http-equiv=content-type) as it is obsolete.
Prefer the charset attribute on the <meta> element.

How can I use unicode characters in HTML keywords?

The meta section of HTML documents can contain a keyword section.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="description" content="under construction" />
<meta name="keywords"
content="..." />
Can one use Unicode characters in this section (e.g., \u00B0)? If yes, how?
All the characters you put into an HTML document, whether in attribute values or elsewhere, are Unicode characters. If the character encoding of your document is UTF-8, as your example declares (but it had better be UTF-8 encoded then!), you can enter any characters, such as the degree sign (°), directly there. How you do that depends on your authoring environment. You can alternatively use a character reference (like &#176;) or, for some characters, an entity reference (like &deg;).
But \u00B0 is not an HTML notation. It is just a sequence of six characters. It has a special meaning in JavaScript, but not in HTML. The corresponding HTML notation is &#xB0;.
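As a quick check of the difference between the JavaScript-style escape and the HTML notations, here is a small Python sketch (to_char_ref is a hypothetical helper, not part of any library):

```python
import html

degree = "\u00B0"  # Python/JavaScript-style escape for the degree sign


def to_char_ref(ch: str) -> str:
    """Build the HTML hexadecimal character reference for one character."""
    return f"&#x{ord(ch):X};"


print(to_char_ref(degree))  # &#xB0;

# All three HTML notations decode to the same character.
decoded = [html.unescape(ref) for ref in ("&#xB0;", "&#176;", "&deg;")]

# The literal six characters \u00B0 mean nothing special to an HTML parser.
untouched = html.unescape(r"\u00B0")
```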
Search engines will probably ignore special characters like the degree sign in keywords. But not necessarily; Google has been observed to be sensitive to them in some special situations. (Not for the degree sign at the moment, it seems.)
In <meta name=description ...> tags, special characters may be relevant if search engines use their content when constructing the page description for search result lists. Such things still happen, though less frequently than they used to.
Because non-English websites that use Unicode for their body content will also use Unicode for their metadata, it is reasonable to assume that the important tools that process HTML metadata will be able to cope with this in UTF-8.
Also bear in mind that (at least historically) the keywords meta tag was meant to contain terms that people might search for. Your example \u00B0 is the degree sign; in this case it seems more likely that people will search for the word "degrees" than for the symbol °. Because of wide-scale abuse of keyword metadata, many search engines (including Google) ignore it for search ranking.
So, in summary, I think it is safe to use Unicode keyword metadata. But it probably won't improve your site's search ranking for those terms.

HTML and character encoding vs HTML Entity

When writing an HTML document, is it acceptable to use the special character directly as regular text, such as the capital letter C with a cedilla underneath (Ç), or should one use the HTML entity name of this character, &Ccedil;?
I have seen both being used in practice, but surely there are rules governing the appropriate usage of this, as well as advantages to one way over another. For instance, this website maintains the raw-form of this character, but other websites may end up rendering it as a square block.
Real characters:
Are easier to type if your system is set up for a language that uses those characters
Produce more readable code
Save bytes
HTML entities:
Let you more or less forget about character encoding
Obviously, characters with special meaning in HTML (<, &, etc) still need to be represented by entities.
If you're using UTF-8 character encoding, then most entity characters (other than &, > and <) become redundant.
If you're not using UTF-8, then you need entities for every character that your chosen encoding cannot represent.
It all depends on the character encoding of the document. If you're unsure whether you should use the regular text or the encoded version, you could run your page through the W3C Validator.
Consider this code:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Stuff</title>
</head>
<body>
<p>©</p>
<p>&copy;</p>
</body>
</html>
The document encoding is set to UTF-8 and when it's validated, it returns an error:
Sorry, I am unable to validate this document because on line 7 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
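The validator's complaint can be reproduced in a few lines of Python. The sketch assumes the file was saved as ISO-8859-1 while declaring UTF-8, which is one common way to get this error. The helper name is made up.

```python
copyright_sign = "©"

latin1_bytes = copyright_sign.encode("iso-8859-1")  # b'\xa9'
utf8_bytes = copyright_sign.encode("utf-8")         # b'\xc2\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(latin1_bytes))  # False: a lone 0xA9 is not valid UTF-8
print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(b"&copy;"))     # True: the entity is plain ASCII
```

So the raw character is only a problem when the bytes on disk do not match the declared encoding; the entity works either way.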

Special characters are not supported

I am having a problem displaying special characters like ’ and é in Firefox and IE, although these characters display correctly on my local server.
I have used the following
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Can anyone suggest what the problem might be? Thanks in advance.
You've set the charset to iso-8859-1 - are you sure that's how they're encoded in your HTML?
In Firefox, try changing the charset using View -> Character Encoding (for your page it should have "Western (ISO-8859-1)" selected), and see if it works with another character encoding. If it does, consider either re-encoding your HTML into UTF-8, or changing the charset in your meta tag.
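To illustrate what a mismatch looks like, here is a short Python sketch using the two characters from the question. It assumes the file was saved as UTF-8 but is being read as Windows-1252, which is how browsers treat a declared iso-8859-1. The helper name can_encode is hypothetical.

```python
text = "’é"

# Saved as UTF-8 but read as Windows-1252: each character turns into
# two or three unrelated ones.
mojibake = text.encode("utf-8").decode("windows-1252")
assert mojibake == "â€™Ã©"


def can_encode(chars: str, charset: str) -> bool:
    """Return True if every character exists in the given charset."""
    try:
        chars.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("é", "iso-8859-1"))    # True
print(can_encode("’", "iso-8859-1"))    # False: no curly quote in Latin-1
print(can_encode("’", "windows-1252"))  # True
print(can_encode("’é", "utf-8"))        # True
```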
As Dominic says, checking that you're encoding your HTML with the charset declared in your meta tag would be the first step. There's info on charsets and encoding here. Whether you need to change the charset meta tag depends on the language the page is in. If your page is in English but just has the odd character that needs accents etc., the easiest way is to use the character code; for example, the character code for é is &eacute;. One of the many lists of character entities available online can be found here.
Alternatively, if your page is basically in English but has small sections in another language, HTML has a lang attribute that CSS2 selectors can use to style text in other languages appropriately. There's more info about the four different ways to apply language styles here. You can use the :lang() pseudo-class selector, the [lang |= "..."] selector that matches the beginning of the value of a language attribute, the [lang = "..."] selector that exactly matches the value of a language attribute, or a generic class or id selector.
If a small portion of your site was in another language such as Hebrew, you could also use CSS and a span to signify a change in the reading direction of the text, for example:
<p style="direction: rtl; unicode-bidi: embed;">
This is a paragraph written right-to-left.
</p>
or
<p>
This paragraph is written left-to-right except for <span style="direction: rtl; unicode-bidi: bidi-override;">these words</span> which were written right-to-left.
</p>
These examples (taken from here) show the style being applied inline, but you could also set the styles up in an external stylesheet.
You've set the charset in the document's meta tag, which works when you're viewing it as a file, but if the web server is providing a charset value, that takes priority. Check the HTTP headers that the web server is providing; one way is with the Firefox extension Live HTTP Headers. If it's something different, you have to tell the web server what you're doing or else reencode the document to match.
How to set the encoding varies between web servers. Apache, for example, lets you specify the charset globally, per-file in .htaccess, or by renaming the file to example.html.latin1.
Use HTML entities like &aacute; or &#225; and the browser should sort it out.
Here is a list:
http://www.utexas.edu/learn/html/spchar.html
Change your encoding meta tag to the following (and make sure the file itself is saved as UTF-8):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />