How does UTF-8 identify characters from different languages? - html

I am really amazed by the magic of UTF-8, but I can't understand the logic behind it. I have gone through several documents and am still confused, since I only know the basics.
Please take a look at the first example. It converts characters from a language to UTF-8. There are two text boxes: enter the characters in the first one, click the button, and the UTF-8 values appear in the second one.
Please take a look at the second example. I took the UTF-8 values from example 1 and put them in the HTML, and here I really do not understand how they get translated back. I tested three languages: Chinese, Hindi and Russian.
I used Google Translate to translate from English into each language:
Hello = 您好(chinese)
Hello = नमस्ते (Hindi)
Hello = привет (Russian)
How does a web page identify which language's character to show on the basis of UTF-8? Is it possible that different computers will show different characters?

The "magic" behind UTF-8 is called Unicode. UTF-8 is one of several encodings of that standard.
Unicode does have character ranges (blocks) that correspond to scripts, and many characters are specifically associated with a language.
I suggest reading this - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

UTF-8 is a variable-length byte encoding of Unicode, the character numbering system for all languages.
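To make "variable-length" concrete, here is a small sketch in Python (my choice of language for illustration; it is not what the question's examples use). It takes one character from each greeting in the question, plus a plain Latin letter, and shows the Unicode code point and the UTF-8 bytes that represent it. The helper name `describe` is made up for this example.

```python
# One sample character per script; the last three come from the greetings in the question.
samples = {"Latin": "H", "Russian": "п", "Hindi": "न", "Chinese": "您"}

def describe(ch):
    """Return (code point as U+XXXX, UTF-8 bytes as hex, number of bytes)."""
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    return f"U+{ord(ch):04X}", hex_bytes, len(encoded)

for script, ch in samples.items():
    code_point, hex_bytes, size = describe(ch)
    print(f"{script:8} {ch}  {code_point}  {hex_bytes}  ({size} byte(s))")
```

The code point is the same on every computer; a browser only needs to know the bytes are UTF-8 to map them back to it. Whether the character is then drawn correctly depends on the machine having a font that contains it.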
Web pages historically defaulted to ISO-8859-1, the so-called Latin-1. Another charset can be declared in two ways:
In the HTTP response headers, the lines of text that precede an empty line and then the HTML content.
One such header line is:
Content-Type: text/html; charset=UTF-8
On a Java EE server, a servlet sets it like this:
response.setContentType("text/html; charset=UTF-8");
In the HTML head, with a meta tag:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
...

Related

using european characters in html

I started learning HTML + CSS a week or two ago, and I'm facing a problem. I'm European, so I need to use special characters like á, ã, ç, etc. a lot. Is there any way I can do that without using the corresponding code for each letter every time I need one? Something like a line I can put at the beginning of the HTML document that would make all the special characters accepted.
Decide which encoding you want to use for your site; if you don't have any preference, use UTF-8.
Save the .html file in that encoding in your text editor. Consult your text editor's help to find out how to choose the encoding the file gets saved in.
Add <meta charset="utf-8"> to your <head> to instruct the browser to treat the page as UTF-8 encoded.
Preferably also configure your web server to output a Content-Type: text/html; charset=utf-8 HTTP header, since that takes precedence if present. Consult your web server's manual to find out how to do that.
Write literally any character you can input directly as is into your document and enjoy.
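If you generate the file from a script instead of a text editor, the "save it as UTF-8" step becomes an explicit argument. Below is a minimal sketch in Python; the file name `index.html`, the page content and the helper `save_utf8` are assumptions made up for the example.

```python
import os
import tempfile

# A tiny page that declares UTF-8 and uses the characters literally.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Olá</title>
</head>
<body><p>á, ã, ç</p></body>
</html>
"""

def save_utf8(path, text):
    """Write text to path as UTF-8 and return the raw bytes that ended up on disk."""
    # newline="\n" keeps line endings unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
    with open(path, "rb") as f:
        return f.read()

path = os.path.join(tempfile.mkdtemp(), "index.html")  # hypothetical location
raw = save_utf8(path, PAGE)
print(raw.decode("utf-8") == PAGE)
```

The point is that the declared encoding (the meta tag) and the actual encoding (the `encoding="utf-8"` argument) have to agree; the meta tag alone does not convert anything.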
Further reading:
https://www.w3.org/International/tutorials/tutorial-char-enc/
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
UTF-8 all the way through

How to make a .htm page accept letters of languages other than english?

Currently I am working on an application which converts a .msg file to PDF. I am using a converter which turns HTML into a PDF file, so I convert the email to HTML and then use the tool to convert it to PDF. Everything was working fine until I tried to convert a French email. When I open the .htm file for the French email with Notepad++, it displays the accented French letters (é, à, ù, ê, ë, ...) fine, but when I open it in a browser, the accented letters are changed to strange symbols. When I added the <meta http-equiv="content-type" content="text/html;charset=utf-8"> tag to the HTML, it started showing the French letters correctly.
So, will this meta tag make the HTML work for all possible French letters, or only for some of them?
Also, is there any tag which can make the HTML accept letters from any language?
Thanks in advance.
Computers deal in binary data. Under the hood, all the characters (letters, numbers, punctuation, etc) in an HTML (or other kind of text) document are just groups of 1s and 0s as far as the computer is concerned.
Which characters those groups of 1s and 0s represent depends on the choice of character encoding.
Unicode encodings, including UTF-8, can represent just about any human language.
If the document is actually encoded in UTF-8 and you tell the browser that it is encoded in UTF-8, then you are highly unlikely to run into characters that you can't represent.
For further reading, start with Character encodings: Essential concepts
UTF-8 (Unicode) covers almost all of the characters and symbols in the world.
To display an HTML page correctly, a web browser must know the character set used in the page.
This is specified in the <meta> tag:
For HTML4:
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
For HTML5: <meta charset="UTF-8">
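Note: If a browser detects ISO-8859-1 in a web page, it treats the page as Windows-1252 (often loosely called ANSI). Windows-1252 is identical to ISO-8859-1 except in the 32 positions from 0x80 to 0x9F, where ISO-8859-1 has control codes and Windows-1252 assigns printable characters such as the euro sign and curly quotes.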
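The difference between ISO-8859-1 and Windows-1252 (what the note above calls ANSI) is easy to see by decoding the same byte both ways. This is only an illustrative sketch in Python; `decode_both` is a made-up helper, and `cp1252` is Python's codec name for Windows-1252.

```python
def decode_both(byte_value):
    """Decode a single byte as ISO-8859-1 and as Windows-1252."""
    raw = bytes([byte_value])
    return raw.decode("iso-8859-1"), raw.decode("cp1252")

# 0x41 and 0xE9 decode identically; 0x80 and 0x99 fall in the range where the two differ.
for b in (0x41, 0xE9, 0x80, 0x99):
    latin1, cp1252 = decode_both(b)
    print(hex(b), repr(latin1), repr(cp1252))
```

For 0x41 and 0xE9 both columns agree ('A' and 'é'). For 0x80 and 0x99, ISO-8859-1 yields invisible control characters while Windows-1252 yields '€' and '™'.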
You can get more info here.

Foreign characters in website

I found a website that contains the string "donâ€™t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "donâ€™t"). A Google search yielded nothing (except lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?
In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "â€™". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
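You can reproduce the effect outside a browser. The following Python sketch (function names are made up for the example) encodes the right single quote as UTF-8, then deliberately decodes those bytes as windows-1252, which is what the browser does when the page is mislabelled.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode("cp1252")

def repair(garbled):
    """Undo the mistake: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")

original = "don\u2019t"              # U+2019 RIGHT SINGLE QUOTATION MARK
print("\u2019".encode("utf-8"))      # the three octets named above
garbled = misread_as_cp1252(original)
print(garbled)
print(repair(garbled) == original)
```

The repair step only works when every byte survives the wrong decoding, as it does here; it is a diagnostic trick, not a substitute for declaring the right charset.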
According to the Wikipedia article "Character encodings in HTML":
HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII two goals are worth considering: the information's integrity, and universal browser display.
I suppose the site you checked isn't implemented with this in mind.
This has all got to do with encoding. Take a look back at the source: is there a tag at the top specifying the charset? My guess is it'll be UTF-8, although it could be something completely different.
This thread explains it all: a combination of a typographic apostrophe encoded as UTF-8 (probably originating from a Word document) and a server that probably reports the encoding as something other than UTF-8, despite the page containing UTF-8 characters (and possibly even correctly declaring its own encoding).

HTML character entities and character encoding set

When including HTML entities in an HTML document, do the entities need to be from the same character encoding set that the document is specified to be using?
For example, if I am going to use the copyright sign in an HTML document that is specified as UTF-8, is it necessary to use the Unicode HTML entity (&#xA9;) or is it okay to use other entities, such as the ASCII HTML entity (&#169;)?
Please explain your answer. I am aware that it will "work", but is there a case where it will not work?
Thanks!
&#169; and &#xA9; specify the same character: 169 is equivalent to hexadecimal A9. These both specify a copyright symbol. Character entities in HTML always refer to Unicode code points; this is covered in the HTML 4 Standard. Thus, even if your character set changes, your entities still refer to the same characters.
This also means that you can encode characters that don't actually appear within your character set of choice. I just created a document in the ISO-8859-1 character set, but it includes a Greek lambda. Also, ASCII is not able to directly encode a copyright symbol, but it can through character entities.
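A quick way to convince yourself of both points is to decode the references with a standards-aware parser. Here is a short sketch using Python's standard `html` module (Python is just my choice for the demonstration; the helper name is made up).

```python
import html

# Named, decimal and hexadecimal references for the copyright sign.
forms = ["&copy;", "&#169;", "&#xA9;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

def fits_in(charset, ch):
    """Return True if ch can be encoded directly in the given charset."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

lambda_char = html.unescape("&#955;")           # Greek small letter lambda, U+03BB
print(fits_in("iso-8859-1", lambda_char))       # not directly representable
print(fits_in("ascii", "&#955;"))               # but the reference itself is plain ASCII
```

All three forms decode to the same code point, U+00A9, regardless of the document's charset. The lambda cannot be stored directly in ISO-8859-1, yet its character reference consists only of ASCII characters, so it can be.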
Edit: Reading the comments on the other answer, I want to clarify this a bit. If you are using UTF-8 as the character encoding for your document, you can, within the raw HTML source, write a copyright symbol just as-is. (You need to find some way to input it, of course: copy-paste being the usual.) UTF-8 will allow you to directly encode any symbol you want. ISO-8859-1 is much more limited, and ASCII even more so. For example, within my HTML, if my document is a UTF-8 document, I can do:
<p>Hi there. This document is ©2010. Good day!</p>
or:
<p>Hi there. This document is &copy;2010. Good day!</p>
or:
<p>Hi there. This document is &#169;2010. Good day!</p>
The first is only valid if the character set supports "©". The other two are always valid, but less readable. Whatever text editor you're using, if it is worth its weight, should be able to tell you what character set it is encoding the document in.
If you do this, you need to make sure your web server informs the client of the correct character set, or that your document declares it with something like:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I've used UTF-8 there as an example. XHTML should have the character set in the opening <?xml ... ?> tag.
The beauty of the UTF-8 encoding is that you can actually just include the binary character. You don't need to encode it as an entity at all. Thusly: ©
Oh, you just want to know the difference between the two entities? There is none. One gives the code point in hexadecimal and the other in decimal.

HTML - Arabic Support

I have a website in which I have to put some lines in Arabic. How do I do it?
Where do I get the Arabic text characters, and how do I make the page support Arabic?
I have to put one line per page and there are a lot of pages, so I can't go around making images and inserting them.
This is the complete answer; the other answers each cover only one part of it.
Step 1 - Save the document as UTF-8. Some editors offer a "Unicode" option that is really UTF-16, which is not what you want here.
Advanced editors don't always make this simple, so go low level: use Notepad to save the document as meName.html and change the encoding type to UTF-8.
Step 2 - Mention in your html page that you are going to use such characters by
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Step 3 - When you put in some characters make sure your container tags have the following 2 properties set
dir='rtl'
lang='ar'
Step 4 - Get the characters from a specific tool/editor or an online editor, like I did with Arabic-Keyboard.org
example
<p dir="rtl" lang="ar" style="color:#e0e0e0;font-size:20px;">رَبٍّ زِدْنٍي عِلمًا</p>
NOTE: font type, font family and font face settings will have no effect on special characters if the chosen font lacks glyphs for them; the browser then falls back to another font.
The W3C has a good introduction.
In short:
HTML is a text markup language. Text means any characters, not just ones in ASCII.
Save your text using a character encoding that includes the characters you want (UTF-8 is a good bet). This will probably require configuring your editor in a way that is specific to the particular editor you are using. (Obviously it also requires that you have a way to input the characters you want)
Make sure your server sends the correct character encoding in the headers (how you do this depends on the server software you use)
If the document you serve over HTTP specifies its encoding internally, then make sure that is correct too
If anything happens to the document between you saving it and it being served up (e.g. being put in a database, being munged by a server side script, etc) then make sure that the encoding isn't mucked about with on the way.
You can also represent any unicode character with ASCII
You not only have to add the meta tag saying that the page is UTF-8, you also have to actually make the document UTF-8. You can do that with a good editor (like Notepad++) by converting the file to "unicode" or "UTF-8 without BOM". Then you can simply use Arabic characters.
As this page is UTF-8, here are some examples (I hope I don't write anything rude here): شغف
If you use a server side scripting language make sure that it does not output the page in a different encoding. In PHP e.g. you can set it like this:
header('Content-Type: text/html; charset=utf-8');
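The same idea in another server-side language, as a rough sketch: a minimal WSGI application in Python that encodes the body as UTF-8 and says so in the Content-Type header. The names `app` and `call_app` and the page content are made up for illustration, and `call_app` invokes the application directly instead of opening a socket.

```python
from wsgiref.util import setup_testing_defaults

BODY = (
    '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
    '<body><p dir="rtl" lang="ar">شغف</p></body></html>'
)

def app(environ, start_response):
    """WSGI app: the bytes sent and the charset declared in the header must match."""
    payload = BODY.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]

def call_app():
    """Call the app once with a fake request; return (status, headers, body bytes)."""
    environ = {}
    setup_testing_defaults(environ)   # fills in a minimal GET request
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body

status, headers, body = call_app()
print(status, headers["Content-Type"])
```

To actually serve it you could pass `app` to `wsgiref.simple_server.make_server`, but any WSGI server would do.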
If you don't even know where to get Arabic characters, but you want to display them, then you're doing something wrong.
Save files containing Arabic characters with encoding UTF-8. A good editor allows you to set the character encoding.
In the HTML page, place the following after <head>:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
If you're using XHTML:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
That's it.
An alternative way (without messing with the encoding of a file) is to use HTML escape sequences. This website does that job for you: http://www.htmlescape.net/
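If you would rather not depend on a website for this, the conversion is a one-liner in most languages. Here is a sketch in Python using the built-in `xmlcharrefreplace` error handler; the function name `to_ascii_html` is made up, and note that it does not escape `<`, `>` or `&`, it only replaces non-ASCII characters.

```python
import html

def to_ascii_html(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

escaped = to_ascii_html("شغف")
print(escaped)
print(html.unescape(escaped) == "شغف")   # a browser decodes the references the same way
```

The result is pure ASCII, so it survives being saved or served in almost any encoding, at the cost of an unreadable source.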
Won't you also need to ensure the area where you display the Arabic is oriented right-to-left?
e.g.
<p dir="rtl">
I edited the HTML page with Notepad++, set the encoding to UTF-8, and it works.
As mentioned above, some text editors do not use UTF-8 as the default encoding for documents.
However, most editors will allow you to change that in the settings, even for each specific document.
Check that you have <meta charset="utf-8"> inside the head block.