Invalid XML character § - html

I am having an issue with characters in my XML when I view it on a website. The character I want is § but what comes out is Â§. My XML declaration is <?xml version="1.0" encoding="UTF-8"?>. Any suggestions? Thanks!!

If you see “§” as “Â§”, then the reason is usually that the data contains “§” SECTION SIGN U+00A7 as UTF-8 encoded, as bytes 0xC2 0xA7, but it is being misinterpreted as being in an 8-bit encoding like windows-1252 or ISO-8859-1. Alternatively, an incorrect character code conversion (“double UTF-8 encoding”) has taken place.
Check out the HTTP headers of the web page. If they declare an encoding other than UTF-8, they may override the in-document declaration.
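To make the byte-level explanation concrete, here is a minimal Python sketch of both failure modes described above. The variable names are only illustrative, and `cp1252` is Python's codec name for windows-1252.

```python
# The section sign as the author intended it.
original = "§"  # U+00A7 SECTION SIGN

# Correct UTF-8 serialisation: two bytes, 0xC2 0xA7.
utf8_bytes = original.encode("utf-8")

# A consumer that wrongly assumes windows-1252 sees two characters.
misread = utf8_bytes.decode("cp1252")

# "Double UTF-8 encoding": the misread text gets encoded as UTF-8 again.
double_encoded = misread.encode("utf-8")

# The damage is reversible as long as no bytes were dropped on the way.
repaired = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "))      # c2 a7
print(misread)                  # Â§
print(double_encoded.hex(" "))  # c3 82 c2 a7
print(repaired)                 # §
```

If the stored bytes turn out to be `c3 82 c2 a7`, the data itself is double-encoded; if they are `c2 a7`, the data is fine and only the declared encoding is wrong.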

Instead of the character § you can use its HTML character reference, which is either &#167; or &sect;.
Have a look at an ASCII/HTML code table: most symbols have a dedicated HTML code that can be used instead of the symbol,
like the non-breaking space, which I am sure you are familiar with: &nbsp;
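One caveat worth checking before relying on the named form: XML predefines only lt, gt, amp, apos and quot, so a named reference such as the one for the section sign is an HTML feature unless a DTD declares it. The sketch below uses Python's standard XML parser as a stand-in for whatever parser the site uses; the `<note>` element is made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Numeric reference: understood by both HTML and XML parsers.
numeric_doc = "<note>&#167; 12</note>"
numeric_text = ET.fromstring(numeric_doc).text

# Named reference: fine in HTML, but undeclared in plain XML.
named_doc = "<note>&sect; 12</note>"
try:
    ET.fromstring(named_doc)
    named_ok = True
except ET.ParseError:
    named_ok = False

# An HTML-aware decoder resolves both spellings to the same character.
html_numeric = html.unescape("&#167;")
html_named = html.unescape("&sect;")

print(repr(numeric_text), named_ok, html_numeric == html_named)
```

For an XML document served as XML, the numeric form is therefore the safer choice.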

Related

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial where a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so:
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply this: if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1?
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that the single byte A0 is not a valid character on its own in UTF-8. The equivalent nbsp character in UTF-8 is represented by the two bytes C2 A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
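The difference can be reproduced offline. This is a small sketch using Python's `urllib.parse.unquote`, which takes the charset as a parameter; it only decodes the query value from the question and does not contact the lab server.

```python
from urllib.parse import unquote

NBSP = "\u00a0"        # NO-BREAK SPACE
REPLACEMENT = "\ufffd"  # what decoders show for undecodable bytes

# Decoding the bytes as ISO-8859-1 yields the non-breaking spaces.
latin1 = unquote("1'%A0||%A0'1", encoding="latin-1")

# unquote defaults to UTF-8 with errors="replace", so the lone A0 bytes
# turn into U+FFFD, the "question mark in a black box".
utf8_bad = unquote("1'%A0||%A0'1")

# The UTF-8 spelling of NBSP is the two-byte sequence C2 A0.
utf8_good = unquote("1'%C2%A0||%C2%A0'1")

print(latin1 == f"1'{NBSP}||{NBSP}'1")                  # True
print(utf8_bad == f"1'{REPLACEMENT}||{REPLACEMENT}'1")  # True
print(utf8_good == latin1)                              # True
```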
Independently from why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a &nbsp;
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
outputs:

Special characters representation issue in JSP

In JSP file, the source code is
|1&#128;3|<%="\u0031\u0080\u0033" %>|
The result on the page is:
|1€3|13|
Why is the Euro symbol represented differently ?
The HTML numerical character references in the range 0x80–0x9F don't actually correspond to the characters U+0080–U+009F. Instead, they refer to the characters mapped into the bytes 0x80–0x9F from the windows-1252 encoding.
This is a weird historical artefact from the days before browsers did Unicode. HTML5 sort-of standardises it, in that although it's invalid, parsers are required to parse it this way. This does not happen in XML/XHTML.
So \u0080 gives you the actual character U+0080, which you can't see because it's an invisible control character, but &#128; gives you code page 1252 byte 0x80, which is U+20AC Euro Sign.
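The same quirk can be observed outside a browser. As a rough sketch, Python's `html.unescape` follows the HTML5 rules for numeric character references, so it can stand in for the browser's parser here; the variable names are illustrative.

```python
import html
import unicodedata

# HTML numeric reference 128, resolved the HTML5 way.
from_reference = html.unescape("&#128;")

# The Java/JSP string escape \u0080 is the real code point U+0080.
from_escape = "\u0080"

# Byte 0x80 interpreted as windows-1252.
from_cp1252 = bytes([0x80]).decode("cp1252")

print(unicodedata.name(from_reference))   # EURO SIGN
print(unicodedata.category(from_escape))  # Cc (a control character)
print(from_reference == from_cp1252)      # True
```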

£ getting converted to ? by HTML Tidy, EncodingType?

I am cleaning an HTML file using HTML Tidy (well, the .NET version called TidyManaged), and my "£" symbols are being converted to "?".
ie:
Income (£)
becomes:
Income (�)
I believe it is to do with encoding types. In TidyManaged, one can specify the input encoding type and output encoding type, including such things as Latin1, utf8, utf16, win1252.
The XHTML doc will ultimately get converted into a DOC which uses win1252.
So what should my input and output encoding be to preserve £ symbols?
Many thanks.
Well, when I've used other char-sets it's always different. I'm not fluent in them, but I do know that to create symbols and punctuation you need to use a 'code' rather than the literal character. I've never used win1252, but Google says the pound sign is 0x00A3.
Try putting that somewhere in your document.
I know in HTML I would put &pound; for a pound sign. So HTML:
<p>&pound;0.00</p>
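For what it's worth, a "£" turning into "?" or "�" usually means the input bytes were read with a different encoding than the one the file was saved in. The sketch below does not use TidyManaged at all; it only shows, at the byte level, why the input encoding setting has to match the file and why the output can then be any encoding that contains the character.

```python
pound = "£"  # U+00A3 POUND SIGN

# The same character has different byte representations.
as_cp1252 = pound.encode("cp1252")  # one byte
as_utf8 = pound.encode("utf-8")     # two bytes

# Reading a windows-1252 file as UTF-8 loses the character.
wrong = as_cp1252.decode("utf-8", errors="replace")

# Reading with the matching encoding and writing win1252 preserves it.
roundtrip = as_utf8.decode("utf-8").encode("cp1252")

print(as_cp1252, as_utf8)      # b'\xa3' b'\xc2\xa3'
print(wrong == "\ufffd")       # True
print(roundtrip == as_cp1252)  # True
```

So the first step is to find out which encoding the source HTML file is actually saved in, and declare that as the input encoding.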

meta tag to correct ®

I'm having some trouble getting a special character properly encoded.
Â® keeps coming through instead of the registered trademark symbol. I've tried changing the meta tag to UTF-8 and Windows-1252, but it still comes through in the encoded format. Can I add a meta tag to fix this?
Make sure to save your file with the proper encoding.
Here is an example; on the left side, the file is saved with Windows-1252 encoding.
On the right side, it's saved with UTF-8 encoding.
HTML options
For such characters, encoding with ISO-8859-1 might do it too, but UTF-8 is greatly encouraged.
Make sure your DOCTYPE is clearly defined : <!DOCTYPE HTML>.
Make sure your meta tag is written properly: <meta charset="UTF-8">.
PHP options
If you use PHP within your page, add the following at the beginning of the page:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
If the content is output from a database and is stored as ISO-8859-1, you might want to use utf8_encode() to convert it to UTF-8. Note that it handles only that one source encoding; per the PHP manual:
utf8_encode(): Encodes an ISO-8859-1 string to UTF-8
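To show what such a conversion does to the bytes, and why it must only be applied to data that really is ISO-8859-1, here is a Python sketch of the equivalent operation. `latin1_to_utf8` is a hypothetical helper name, not a PHP or library function.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text and serialise them as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


# ISO-8859-1 input: the single byte 0xAE becomes the pair 0xC2 0xAE.
converted = latin1_to_utf8(b"Acme\xae")

# Input that was already UTF-8 gets double-encoded, which is exactly
# the kind of damage that shows up as "Â®" on the page.
double = latin1_to_utf8(b"Acme\xc2\xae")

print(converted)               # b'Acme\xc2\xae'
print(double.decode("utf-8"))  # AcmeÂ®
```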
The information about encoding should correspond to the actual encoding. So instead of making guesses and trial and error, find out what the encoding really is. It seems to be UTF-8, and if declaring UTF-8 in a meta tag does not help, the probable culprit is an HTTP header that the server sends and that declares a different encoding, trumping the meta tag. Use e.g. an HTTP header viewer to check out the situation.
If the server announces iso-8859-1 or windows-1252 and if you cannot change this, then you just have to use that encoding instead of UTF-8. Then save the page in your authoring program as windows-1252 encoded.
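As a rough illustration of the precedence described above, the following sketch extracts the charset from a Content-Type header value and lets it win over the meta tag. The header strings are made-up examples, `declared_charset` and `effective_charset` are hypothetical helpers, and real browsers apply a few more rules (a byte order mark, for instance) that are ignored here.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None


def effective_charset(content_type: str, meta_charset: str) -> str:
    """Simplified rule: the HTTP header trumps the in-document meta tag."""
    return declared_charset(content_type) or meta_charset.lower()


with_header = effective_charset("text/html; charset=ISO-8859-1", "UTF-8")
without_header = effective_charset("text/html", "UTF-8")

print(with_header)     # iso-8859-1
print(without_header)  # utf-8
```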

HTML Character Encoding

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)
%AE is not valid HTML-safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Microsoft Word encoding (windows-1252 or something like that); it will NOT convert to HTML-safe text unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation of the registered trademark sign. If you don't see anything, then apparently the URL decoder is using another charset to URL-decode it. In UTF-8, for example, this byte on its own does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
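Both fixes can be sketched with Python's `urllib.parse`; the sample string is invented for the example, and the server-side language in the question may of course offer different functions.

```python
from urllib.parse import quote, unquote

legacy = "Acme%AE%20Widgets"  # URL-encoded from ISO-8859-1 bytes

# Option 1: URL-decode using the charset the data was encoded with.
text = unquote(legacy, encoding="iso-8859-1")

# Option 2: convert the stored data so it is URL-encoded from UTF-8.
reencoded = quote(text)  # quote() uses UTF-8 by default

print(text)                       # Acme® Widgets
print(reencoded)                  # Acme%C2%AE%20Widgets
print(unquote(reencoded) == text)  # True
```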
That said, you should not confuse HTML(XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities, i.e. use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
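A bulk conversion to numeric entities needs no lookup table at all. As a minimal sketch in Python (the sample text is made up), the built-in `xmlcharrefreplace` error handler replaces every non-ASCII character with its decimal reference:

```python
import html

text = "Acme® Widgets, §3, £5"

# Every character outside ASCII becomes &#NNN; and the rest is untouched.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(ascii_html)                         # Acme&#174; Widgets, &#167;3, &#163;5
print(html.unescape(ascii_html) == text)  # True
```

The result is pure ASCII, so it displays correctly whichever of the common encodings the page is served with.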
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.
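The warning about + characters comes from form-style decoding, where + stands for a space; PHP's urldecode() decodes plus signs that way. Python separates the two behaviours into two functions, which makes the difference easy to see in a small sketch (the input string is invented):

```python
from urllib.parse import unquote, unquote_plus

raw = "rock+roll%2Bjazz%20blues"

# Plain percent-decoding leaves a literal "+" alone.
plain = unquote(raw)

# Form-style decoding turns "+" into a space first, like PHP's urldecode().
form_style = unquote_plus(raw)

print(plain)       # rock+roll+jazz blues
print(form_style)  # rock roll+jazz blues
```

A literal plus sign therefore only survives form-style decoding if it was sent as %2B.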