HTML special character - html

I need some help, i entered a special character in form like "ä" and it was save in a database, but when I call it using a javascript , a suggestion type, it does not display correctly, it display like this "ä".
Is it possible from the form "ä" it will be save as "&auml" in the database? how is it?

This is caused by misinterpreting a UTF-8 string as ISO-8859-1. The character "ä" is represented in UTF-8 by the two bytes 0xC3 0xA4, but in ISO-8859-1, the byte 0xC3 represents the character "Ã", and the byte 0xA4 represents the character "¤".
Most likely, your program is sending the page as UTF-8 but not telling the browser what encoding was used, so the browser is assuming ISO-8859-1 (the default). You need to send a Content-Type HTTP header that specifies the encoding, such as text/html; charset=UTF-8.

If you haven't already, you should define the character set used by your page.
Add this inside your <head> tags:
<meta charset="utf-8">
There are other character sets as well, but UTF-8, as stated by Google back in 2008, has become one of the most popular today.

This is due to your character set, but there are two character sets that you need to keep in mind. The first is the character set in the database, and the second is the character set in your HTML. They need to match.
Even if you do something like $mysqli->set_charset("utf8"); in your connection script, the database needs to have utf8 also for the data to display properly.

Related

What's the difference between typing the Encoding of a Unicode character or just copying the character?

For example, if I want the bullet point character in my HTML page, I could either type out • or just copy paste •. What's the real difference?
≺ is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), eight (8), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, ≺ uses only ASCII characters, and so the file they are in can be encoded using pure ASCII, or any compatible encoding. Since ASCII is the de-facto basis of virtually all common encodings, this means you don't need to worry much about the file encoding and you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, ≺ is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character entity reference such as • works indepedently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared with the document. It takes up less octets in the source (here: 3, assuming UTF-8).

File enconding (UTF-8 not working properly)

In my webpage, there is a form with multiple inputs. However, the input chars behave differently from the input "label" chars. I tried setting the file encoding to UTF-8 and UTF-8 +BOM (I'm using EditPlus).
Using UTF-8:
Using UTF-8 + BOM:
The input chars come from a mysql database where the collation is utf8_unicode_ci (using phpmyadmin) so i don't know if that's the problem's source. Any ideas?
This means both pieces of data are not in the same encoding. If the file is interpreted as Latin-1 (or a similar encoding), you get the first result in which the data in the input field is valid (meaning it's Latin-1 encoded) but the label is wrong (meaning it's not Latin-1 encoded). When the file is interpreted as UTF-8, the label is correct (meaning it's UTF-8 encoded) but the data in the input field is wrong (meaning it's not UTF-8 encoded). If data shows up as the � UNICODE REPLACEMENT CHARACTER, it's a sure sign the document is being interpreted as a Unicode encoding (e.g. UTF-8), but the byte sequence is invalid.
I'll guess that the label is hardcoded in the file but the data in the input field comes from a database. In this case you need to set the connection encoding for the database to return UTF-8.
As to why the file is interpreted in Latin-1 without BOM and in UTF-8 with BOM: because the browser recognizes the BOM as signifying UTF-8, without it it defaults to Latin-1. You need to set the correct HTTP header to tell the browser what encoding the file is in, and get rid of the BOM.
Read these resources:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
solved it: Just changed the file enconding to "Western European (Windows) 1252" (using EditPlus) and now every character is correctly shown.

How to Encode and Decode "Acute accented characters" using Perl

I am working in a web based educational website, where we are using Perl, MySQL 5, Apache and Template Toolkit. we are planning to introduce the support for multiple\ Language in our website.
What we have done in
IF we have a Tab name like Courses Main Page<\h1> in our template file, we have converted that to
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]<\h1>
where $language is getting the value which user selects when he logs in.
We have a table to maintain this data in our Mysql DB:
CREATE TABLE translation ( english varchar(255) NOT NULL,
language varchar(255) NOT NULL, translation varchar(2000) NOT
NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of
Element text to a foreign language'
IN the connect function of MySQL, I am providing 'SET character_set_results=NULL'.
I tried with utf8, but the issue which is limited to some tabs got increased to many sections.
So as soon as the user logins into the system, we fetch all the translation and store it in a PERL hash and Cache it. we pass this hash to template file which will replace the value.
Problem: Acute accented characters like á and é etc are getting replaced with some different character set symbols.
For ex: in Front end we are seeing "Cursos Página Principal" for Cursos Página Principal.
It is very similar to the solution given in htmlentities and é (e acute)
Can any one tell me how to achieve the same in Perl.
Denoting the charset
For ex: in Front end we are seeing "Cursos Página Principal" for Cursos Página Principal.
This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So I suggest the easiest way to fix this is making sure that your HTML page gets shipped to the client with a proper mime type, i.e.
Content-Type: text/html; charset=utf-8
If that information is present in the HTML header, the value there will override any setting in the HTML document itself. So make sure that either you set the HTML header, or that your HTML header specifies no charset at all, so that the browser will have a look at the meta setting.
In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
Actually encoding and decoding
There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:
use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);
Whether or not the decode step is neccessary depends on how you talk to your database. If your database connection is capable of supporting unicode, then you'll already have unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
One example on what these functions do:
$byteString = "Cursos P\xc3\xa1gina Principal."; # two bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # one unicode character
$htmlString = "Cursos Página Principal."; # html character reference

UTF-8 is an Encoding or a Document Character Set?

According with W3C Recommendation says that every aplicattion requires its document character set (Not be confused with Character Encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When i save a file in Windows notepad im guessing that this are the "Document Character Sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Simple 3 questions:
I want to know if those are the "document character sets". And if they are,
Why is UTF-8 on the list? UTF-8 is not supposed to be an encoding?
If im not wrong with all this stuff:
Are there another Document Character Sets that Windows do not allow you to define?
How to define another document character sets?
In my understanding:
ANSI is both a character set and an encoding of that character set.
Unicode is a character set; the the encoding in question is probably UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
Also, see Joel on Software's mandatory article on the subject.
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in notepad.
There are many other character sets, though defining your own is not useful, so not really possible.
You can't define other character sets to save with notepad. Try using a programmers editor such as notepad++ that will give you more character sets to use.

what encoding stackoverflow used in mysql?

I can not save the character 𝑴 in my mysql which encoding is utf8, but i found stackoverflow can save it and display it.
I made a mistake. stackoverflow also can not save 𝑴 .
If you can't store the character, you are encoding or decoding it incorrectly, or converting it to a character set that doesn't support the character.
The UTF-8 encoding can handle almost any character that exists in any language, so it's quite unlikely that it's a limitation of that encoding.
You have to use the Unicode character set or some Unicode encoding (UTF-7, UTF-8, UTF-16, UTF-32) for all steps of the process. If you convert the text to some other character set and then back, you can only support the characters of that specific character set.
Stackoverflow is trying to display the character as &#119924. So maybe that character value is being saved in the database (certainly some character value is being saved in the database), but the reason why we can't see that character is because of the font which is being used to display the HTML: perhaps it's the font being used, not the database, that doesn't support that character value.