Why doesn't nbsp display as nbsp in the URL - html

I am following a tutorial where a web application written in PHP, blacklists spaces from the input(The 'id' parameter). The task is to add other characters, which essentially bypasses this blacklist, but still gets interpreted by the MySQL database in the back end. What works is a URL constructed like so -
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.

I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that there is no character represented by A0 in UTF-8. The equivalent nbsp character in UTF-8 would be represented by the value C2A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.

Independently from why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a
echo str_replace(" ", " ", $_GET["id"]);

Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
outputs:

Related

Proper Way to Escape the | Character Using HTML Entities

To escape the ampersand character in HTML, I use the & HTML entity, for example:
Link
If I have the following code in my HTML, how would I escape the | character?
Link
HTML Tidy is complaining, claiming an illegal character was found in my HTML.
I tried using ¦ and several other HTML entities, but Tidy says "malformed URI reference."
You wouldn't.
The problem (as the message says) is that the character is illegal in URLs. It is perfectly fine in HTML.
You need to apply encoding for URLs which would be %7C.
I don't know why tidy is complaining about it, but this character is not problematic in HTML nor in URL. | is not a reserved character and can be used in URL as is. You can percent-encode every character, but there is really no need for it.
What I would presume Tidy might be complaining is =. You have got two of them, the second being an invalid one.
There is no need to encode this character in HTML entities. It has no special meaning in HTML.

Special characters representation issue in JSP

In JSP file, the source code is
|1€3|<%="\u0031\u0080\u0033" %>|
The result on the page is:
|1€3|13|
Why is the Euro symbol represented differently ?
The HTML numerical character references in the range 0x80–0x9F don't actually correspond to the characters U+0080–U+009F. Instead, they refer to the characters mapped into the bytes 0x80–0x9F from the windows-1252 encoding.
This is a weird historical artefact from the days before browsers did Unicode. HTML5 sort-of standardises it, in that although it's invalid parsers are required to parse it this way. This does not happen in XML/XHTML.
So \u0080 gives you the actual character U+0080, which you can't see because it's an invisible control character, but € gives you code page 1252 byte 0x80, which is U+20AC Euro Sign.

£ getting converted to ? by HTML Tidy, EncodingType?

I am cleaning a HTML file using HTML Tidy, well the .NET version called TidyManaged, and my "£" symbols are being converted to "?"
ie:
Income (£)
becomes:
Income (�)
I believe it is to do with encoding types. In TidyManaged, one can specify the input encoding type and output encoding type, including such things as Latin1, utf8, utf16, win1252.
The XHTML doc will ultimately gets converted into a DOC which uses win1252.
So what should my input and output encoding be to preserve £ symbols?
Many thanks.
Well, when I've used other char-sets it's always different. I'm not fluent in them but I do know that to create symbols, punctuation you need to use a 'code' rather than their literal. Never seen win1252 but google says it's 0x00A3.
Try putting that somewhere in your document.
I know in html I would put £ for a pound sign. So Html:
<p>£0.00</p>
Where I got the code

HTML Character Encoding

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, ® as I do not have control over the input editor's generated markup)
%AE is not valid for HTML safe ASCII,
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with Windows Word encoding (windows-1252?? something like that) it really will NOT convert to html safe, unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation for the registered trademark. If you don't see anything, then apparently the URL decoder is using other charset to URL-decode it. In for example UTF-8, this byte does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
That said, you should not confuse HTML(XML) encoding like ® with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en-masse, since you can use numeric entities; you don't have to use the named entities -- ie use ® rather than &#reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.

HTML encoding issues - "Â" character showing up instead of " "

I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.
The process works like this:
Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
Replace the tokens with real data
Tidy the HTML with a simple regex function that property formats HTML tag attribute values (ensures quotation marks, etc, since ActivePDF's rendering engine hates anything but single quotes around attribute values)
Send off the HTML to a web service that creates the PDF.
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (FireFox). ActivePDF pukes on these non-UTF8 characters.
My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it turns it all into gobbledegook doesn't change anything.
Private Shared Function ConvertToUTF8(ByVal html As String) As String
Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
Dim source As Byte() = isoEncoding.GetBytes(html)
Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function
Any ideas?
EDIT:
I'm getting by with this for now, though it hardly seems like a good solution:
Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character
That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as " ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:
for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
for HTML5: <meta charset="utf-8">
If you've done that, then any remaining problem is ActivePDF's fault.
If any one had the same problem as me and the charset was already correct, simply do this:
Copy all the code inside the .html file.
Open notepad (or any basic text editor) and paste the code.
Go "File -> Save As"
Enter you file name "example.html" (Select "Save as type: All Files (.)")
Select Encoding as UTF-8
Hit Save and you can now delete your old .html file and the encoding should be fixed
Problem:
Even I was facing the problem where we were sending '£' with some string in POST request to CRM System, but when we were doing the GET call from CRM , it was returning '£' with some string content. So what we have analysed is that '£' was getting converted to '£'.
Analysis:
The glitch which we have found after doing research is that in POST call we have set HttpWebRequest ContentType as "text/xml" while in GET Call it was "text/xml; charset:utf-8".
Solution:
So as the part of solution we have included the charset:utf-8 in POST request and it works.
In my case this (a with caret) occurred in code I generated from visual studio using my own tool for generating code. It was easy to solve:
Select single spaces ( ) in the document. You should be able to see lots of single spaces that are looking different from the other single spaces, they are not selected. Select these other single spaces - they are the ones responsible for the unwanted characters in the browser. Go to Find and Replace with single space ( ). Done.
PS: It's easier to see all similar characters when you place the cursor on one or if you select it in VS2017+; I hope other IDEs may have similar features
In my case I was getting latin cross sign instead of nbsp, even that a page was correctly encoded into the UTF-8. Nothing of above helped in resolving the issue and I tried all.
In the end changing font for IE (with browser specific css) helped, I was using Helvetica-Nue as a body font changing to the Arial resolved the issue .
I was having the same sort of problem. Apparently it's simply because PHP doesn't recognise utf-8.
I was tearing my hair out at first when a '£' sign kept showing up as '£', despite it appearing ok in DreamWeaver. Eventually I remembered I had been having problems with links relative to the index file, when the pages, if viewed directly would work with slideshows, but not when used with an include (but that's beside the point. Anyway I wondered if this might be a similar problem, so instead of putting into the page that I was having problems with, I simply put it into the index.php file - problem fixed throughout.
The reason for this is PHP doesn't recognise utf-8.
Here you can check it for all Special Characters in HTML
http://www.degraeve.com/reference/specialcharacters.php
Well I got this Issue too in my few websites and all i need to do is customize the content fetler for HTML entites. before that more i delete them more i got, so just change you html fiter or parsing function for the page and it worked. Its mainly due to HTML editors in most of CMSs. the way they store parse the data caused this issue (In My case). May this would Help in your case too