Form and accents - html

I have a form that doesn't work properly when I input something with an accent.
If I input "bâtiment", for instance, in the form, I'm sent to
formation.php?search=b%E2timent, instead of formation.php?search=bâtiment
What could cause that?
EDIT
I have another form that sends me correctly to something.php?search=bâtiment, with the accent in the URL...

%E2 is how you represent â in a URL.
It will be decoded automatically in $_GET['search']

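You can convert it back on the receiving end using $search = urldecode($_REQUEST['search']); although PHP has already decoded the values in $_GET and $_REQUEST, so this is normally unnecessary. URL specs say accented characters are not valid URI characters, so they are URL-encoded on the fly for you.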

Check the page encoding.
Are you sure which encoding you are using?
%E2 is the Latin-1 encoding of â, and Windows cp1252, which a lot of code uses, encodes â as the same byte.
Try %C3%A2 (the UTF-8 encoding of â) instead.
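To see which percent-encoding each charset gives for â, here is a minimal Python sketch. The helper name percent_encode is made up for illustration; it is not anything from the asker's code.

```python
from urllib.parse import quote

def percent_encode(text: str, encoding: str) -> str:
    """Turn text into bytes using `encoding`, then percent-encode those bytes."""
    # safe="" so that nothing except unreserved characters is left as-is
    return quote(text, safe="", encoding=encoding)

if __name__ == "__main__":
    for enc in ("latin-1", "cp1252", "utf-8"):
        print(enc, percent_encode("bâtiment", enc))
```

If the form page is served as Latin-1 or cp1252 you would expect the first form in the URL, and the second form if it is served as UTF-8.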

Related

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial where a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so:
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that there is no character represented by A0 in UTF-8. The equivalent nbsp character in UTF-8 would be represented by the value C2A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
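The difference can be reproduced in Python as a rough sketch. The helper decode_as is a made-up name, and this only assumes the site decodes with UTF-8 and substitutes U+FFFD for invalid bytes, which is what the � in the question suggests.

```python
from urllib.parse import unquote

def decode_as(fragment: str, encoding: str) -> str:
    """Percent-decode `fragment`, interpreting the resulting bytes in `encoding`."""
    # errors="replace" turns undecodable bytes into U+FFFD (the black-box question mark)
    return unquote(fragment, encoding=encoding, errors="replace")

if __name__ == "__main__":
    print(repr(decode_as("1'%A0||%A0'1", "utf-8")))
    print(repr(decode_as("1'%A0||%A0'1", "iso-8859-1")))
    print(repr(decode_as("1'%C2%A0||%C2%A0'1", "utf-8")))
```

Only the first call produces replacement characters; the other two give a real non-breaking space (U+00A0).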
Independently of why there is an encoding error, try %20 as a replacement for whitespace!
Later on you can str_replace the whitespace with a &nbsp;
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
outputs the single raw byte 0xA0 (the ISO-8859-1 non-breaking space).

Could html form specify the encoding of its target page?

I have a normal HTML form whose action is http://another-site.com.
My website (http://my-site.com) is encoded in UTF-8, but another-site is encoded in GBK.
The problem is that when I submit my form from my-site.com and the browser goes to another-site.com, which is encoded in GBK as I mentioned, the page's characters are completely garbled.
Is the problem on my side? How do I tell the browser to use GBK on another-site.com?
NOTE: Both another-site.com and my-site.com set Content-Type with their encoding.

HTML Character Encoding

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)
%AE is not valid for HTML safe ASCII,
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with Windows Word encoding (windows-1252?? something like that) it really will NOT convert to html safe, unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation of the registered trademark sign. If you don't see anything, then apparently the URL decoder is using another charset to URL-decode it. In UTF-8, for example, this byte on its own does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
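As a rough sketch of the second option (converting existing data to UTF-8 URL encoding), something like the following would work in Python. The function name reencode_latin1_as_utf8 is hypothetical, and it assumes the stored data really was percent-encoded from ISO-8859-1 bytes.

```python
from urllib.parse import quote, unquote_to_bytes

def reencode_latin1_as_utf8(fragment: str) -> str:
    """Re-express an ISO-8859-1 percent-encoded string as UTF-8 percent-encoding."""
    raw = unquote_to_bytes(fragment)    # "%AE" -> b"\xae"
    text = raw.decode("iso-8859-1")     # b"\xae" -> "®"
    return quote(text, safe="", encoding="utf-8")

if __name__ == "__main__":
    print(reencode_latin1_as_utf8("Acme%AE%20Widgets"))
```

Plain ASCII escapes such as %20 come out unchanged, which matches the observation that %20 already works while %AE does not.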
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities, i.e. use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
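For the en-masse conversion to numeric entities, here is a minimal Python sketch. The function name to_numeric_entities is made up; it relies on the standard xmlcharrefreplace error handler and leaves ASCII untouched, so it does not escape &, < or >.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # The "xmlcharrefreplace" handler emits &#NNN; for anything ASCII cannot encode
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

if __name__ == "__main__":
    print(to_numeric_entities("Acme® Widgets"))
```

The result is pure ASCII, so it displays the same whatever charset the page is served in.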
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.
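The + caveat is easy to demonstrate. Python has the same pair of behaviours as PHP's urldecode() and rawurldecode(): unquote_plus treats + as a space, unquote leaves it alone. A small sketch, with a made-up sample string:

```python
from urllib.parse import unquote, unquote_plus

sample = "C%2B%2B+rocks"  # hypothetical query-string value

# Leaves a literal "+" untouched (comparable to PHP's rawurldecode)
plain = unquote(sample)

# Treats "+" as an encoded space, as form submissions do (comparable to PHP's urldecode)
form_style = unquote_plus(sample)

if __name__ == "__main__":
    print(repr(plain))
    print(repr(form_style))
```

Which one is right depends on where the string came from: form data uses + for spaces, whereas a + in other parts of a URL is just a plus sign.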

How can I show special characters like "e" with accent acute over it in HTML page?

I need to put the names of some universities on my web page. I have typed them as they are, but in some browsers, or maybe on some computers, they appear differently. For example, "Universite de Moncton" should have the 2nd "e" in Universite with an acute accent over it. Could you please help with this?
If you’re using a character set that contains that character, you can use an appropriate character encoding and use it literally:
Universit‌é de Moncton
Don’t forget to specify the character set/encoding properly.
If not, you can use an HTML character reference, either a numeric character reference that denotes the code point of the character in the Universal Character Set (UCS):
Universit&#xE9; de Moncton
Universit&#233; de Moncton
Or using an entity reference:
Universit&eacute; de Moncton
But this entity is just a named representation of the numeric character reference (see the list of entity references that are defined in HTML 4):
<!ENTITY eacute CDATA "&#233;" -- latin small letter e with acute,
U+00E9 ISOlat1 -->
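A quick way to convince yourself that the hexadecimal reference, the decimal reference and the named entity all denote the same character is to decode them, for example with Python's html module (just a sanity check, not something you need on the page itself):

```python
import html

forms = [
    "Universit&#xE9; de Moncton",    # hexadecimal numeric character reference
    "Universit&#233; de Moncton",    # decimal numeric character reference
    "Universit&eacute; de Moncton",  # named entity reference
]

# html.unescape resolves all three kinds of reference to the actual character
decoded = [html.unescape(form) for form in forms]

if __name__ == "__main__":
    for form, text in zip(forms, decoded):
        print(form, "->", text)
```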
You can use HTML entities:
&egrave; è
&eacute; é
&ecirc; ê
&euml; ë
Here's a handy search page for the UTF-8 Character Map
I think from the mention that 'in some computers or browsers they appear differently' that the problem you have is with the page or server encoding. You must
encode the file correctly (how to do this depends on your text editor)
assign the correct encoding in your webpage, done with a meta tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
force the server encoding with, for example, PHP's header() function:
header('Content-Type: text/html; charset=ISO-8859-1');
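To illustrate why the file encoding and the declared encoding have to agree, here is a small Python sketch of what a browser ends up showing when they don't. The function render is a made-up stand-in for "save the file in one encoding, have the browser read it in another".

```python
def render(text: str, saved_as: str, declared_as: str) -> str:
    """Encode text the way the file was saved, then decode it the way the page declares."""
    # errors="replace" mimics a browser showing U+FFFD for bytes it cannot decode
    return text.encode(saved_as).decode(declared_as, errors="replace")

if __name__ == "__main__":
    print(render("Université", "utf-8", "utf-8"))        # encodings agree
    print(render("Université", "utf-8", "iso-8859-1"))   # file is UTF-8, page says Latin-1
    print(render("Université", "iso-8859-1", "utf-8"))   # file is Latin-1, page says UTF-8
```

Only the first combination shows the é intact, which would explain the name looking right on some setups and wrong on others.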
Or, yes, as everyone has pointed out, use the html entities for those characters, which is always safe, but might make a mess when you try to find-replace in code.
There are two methods. One is by using "HTML entities." You need to enter them as, for example, &eacute;. Here is a comprehensive reference of named entities; you can also reference the Unicode code point of a given character, using its decimal form as &#1234; or its hex form as &#x04D2;.
Perhaps more common now (ten years after this answer was originally entered) is simply using Unicode characters directly. Rất dễ dàng, phải không? This is more acceptable and universal because most pages now use UTF-8 as their character encoding.
运气!
By typing it in to your HTML code. é <--You can copy and paste this one if you want.
Microsoft Windows has a tool for accessing characters that are not on your keyboard; it's called Character Map.
http://www.starr.net/is/type/htmlcodes.html
This site shows you the HTML markup for all of those characters that you will need :)

IE munging pound (£) symbol

I have an HTML form which goes off to do all sorts of strange back-end things. This works fine in Firefox, and in most cases it works fine in IE.
However the (pound sterling) £ sign causes problems, and seems to get munged in the submit.
The form is something like this
<form action="*MyFormAction*" accept-charset="UTF-8" method="post">
I think I have seen this problem before but can't remember the solution.
edit, the euro symbol € works fine
edit 2,
In fact if I put the € symbol with a £ symbol it also works fine. Looking at the problem, if I use characters which are not in the extended part of ISO-8859-1 it works OK. If I use extended characters from ISO-8859-1 they get munged. So how do I make IE use the character set that the accept-charset says it should?
accept-charset="UTF-8"
Does not do what you think it does (or what the standard says it does) in IE. Instead, IE uses the value (‘UTF-8’) as a list of fallback encodings to try when a field can't be encoded using the usual default encoding (which is the same as the page's own encoding).
So if you add this attribute and your page isn't already in UTF-8, you can be getting characters submitted as either the page encoding or UTF-8, and there is no way for your form-submission-reading script to know!
For this reason you should never use accept-charset; instead you should always ensure that the page containing the form is correctly served as “Content-Type: text/html;charset=utf-8” (by HTTP header and/or <meta>).
In fact if I put the € symbol with a £ symbol it also works fine.
Yes, that's because ‘€’ cannot be encoded in the page's default encoding (presumably ISO-8859-1). So IE resorts to sending the field encoded as UTF-8, which is what you wanted all along.
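The behaviour described above can be modelled roughly like this in Python. It is only a simplified sketch of the description in this answer, not IE's actual code: the function name encode_submission and its defaults are made up, and whether the fallback applies per field or to the whole submission is not modelled.

```python
def encode_submission(value: str,
                      page_encoding: str = "iso-8859-1",
                      fallback: str = "utf-8") -> tuple[str, bytes]:
    """Use the page encoding when it can represent the value, else the accept-charset fallback."""
    try:
        return page_encoding, value.encode(page_encoding)
    except UnicodeEncodeError:
        # Something (such as the euro sign) does not exist in the page encoding
        return fallback, value.encode(fallback)

if __name__ == "__main__":
    for value in ("£", "€", "£€"):
        print(repr(value), encode_submission(value))
```

A lone £ goes out as the single ISO-8859-1 byte A3, while anything containing € switches to UTF-8, so the receiving script sees two different encodings from the same form.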
I think bobince has the ideal answer, which is "serve the page in UTF-8"; however, as I can't do this, I am posting my workaround for posterity.
Adding a hidden field, unmunge, containing an extended character that is not in ISO-8859-1 (the encoding our pages are served in) forces the submission into UTF-8
so
<input type="hidden" name="unmunge" value="&#8364;" />
fixes the encoding (the entity is the euro symbol).
How is the £ submitted? If it's in an input box for a price don't submit it, only allow numbers to be submitted and add the £ when you display the price again. Or add the currency symbol in the backend script.
I am not sure if this will help (read the entire article at http://fyneworks.blogspot.com/2008/06/british-pound-sign-encoding-revisited.html)
Excerpt:
THE PROBLEM
If you look at the UTF-8/Latin-1 (AKA ISO-8859-1) Character Table you will find that the decimal code for the British pound sterling sign is 163 - and the hexadecimal code is A3.
£ = %A3
However, this is not the case in (all) encoding/decoding functions in Javascript...
encodeURI/encodeURIComponent: Encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, or three escape sequences representing the UTF-8 encoding of the character
Which means, in order to encode our beloved pound sign, Javascript uses 2 characters. This is where the annoying "Â" comes in...
£ = %C2%A3
Hope it helps.
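The two encodings of the pound sign from the excerpt, and where the stray "Â" comes from, can be checked with a few lines of Python (used here purely as a calculator; the variable names are made up):

```python
from urllib.parse import quote

# UTF-8 percent-encoding, which is what encodeURIComponent produces
utf8_form = quote("£", safe="", encoding="utf-8")

# ISO-8859-1 percent-encoding, matching the character table in the excerpt
latin1_form = quote("£", safe="", encoding="iso-8859-1")

# What a receiver sees if it reads the two UTF-8 bytes as ISO-8859-1
misread = "£".encode("utf-8").decode("iso-8859-1")

if __name__ == "__main__":
    print(utf8_form, latin1_form, misread)
```

The UTF-8 form is two bytes, C2 and A3; read as ISO-8859-1, the first byte shows up as "Â" in front of the pound sign.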