Special characters representation issue in JSP - html

In JSP file, the source code is
|1&#x80;3|<%="\u0031\u0080\u0033" %>|
The result on the page is:
|1€3|13|
Why is the Euro symbol represented differently?

The HTML numerical character references in the range 0x80–0x9F don't actually correspond to the characters U+0080–U+009F. Instead, they refer to the characters mapped into the bytes 0x80–0x9F from the windows-1252 encoding.
This is a weird historical artefact from the days before browsers did Unicode. HTML5 sort-of standardises it: although it's invalid, parsers are required to parse it this way. This does not happen in XML/XHTML.
So \u0080 gives you the actual character U+0080, which you can't see because it's an invisible control character, but &#x80; gives you code page 1252 byte 0x80, which is U+20AC Euro Sign.
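If it helps to see that remapping concretely, here is a minimal Java sketch (not part of the original JSP, just an illustration; it assumes the JRE exposes the windows-1252 charset, as standard JDKs do) contrasting the raw character U+0080 with the character you get by treating 0x80 as a windows-1252 byte:

import java.nio.charset.Charset;

public class Win1252Reference {
    public static void main(String[] args) {
        // U+0080 itself is an invisible C1 control character.
        char rawControl = '\u0080';

        // What browsers do with &#x80;: treat 0x80 as a windows-1252 byte and
        // map it to the character that byte encodes, U+20AC (the Euro sign).
        byte[] cp1252Byte = { (byte) 0x80 };
        String euro = new String(cp1252Byte, Charset.forName("windows-1252"));

        System.out.printf("U+%04X -> invisible control%n", (int) rawControl);
        System.out.printf("windows-1252 byte 0x80 -> %s (U+%04X)%n",
                euro, (int) euro.charAt(0));
    }
}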

Related

What kind of encoding is this html encoding?

I am doing a project that involves searching words in the Arabic script on Wiktionary, and when I do a GET request on certain word pages, I get something like this for example:
title="\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9">\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9</a></li>\n<li><a href="/wiki/%D8%B1%D8%A3%D8%B3%D9%8A"
This corresponds to the following URL: https://en.wiktionary.org/wiki/%D8%B1%D8%A3%D8%B3%D9%8A.
Does anyone know what the \xd8 or %D8 encodings are called? I want to say they are hex codes, but I have already looked up hex codes for the Arabic script and they certainly are not these.
The percentages you see in the URL are used to substitute characters that aren't allowed in URLs, such as special characters like "/", ":" and "&", and non-ASCII characters. This is called percent-encoding - https://en.m.wikipedia.org/wiki/Percent-encoding
The "\xd.." prefixed represent hexadecimal character codes, since arabic characters fall outside of UTF-8 thats how that have to be represented. Thats assuming that HTML you showed used UTF-8 encoding.

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial where a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that essentially bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so:
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that the single byte A0 does not represent any character in UTF-8. The nbsp character in UTF-8 is represented by the two-byte sequence C2 A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
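To see the difference directly, here is a small Java sketch (illustrative only; it assumes Java 10+ for the Charset overload of URLDecoder.decode) decoding %A0 and %C2%A0 under the two encodings:

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class NbspDecode {
    public static void main(String[] args) {
        // In ISO-8859-1 the single byte A0 is a non-breaking space (U+00A0).
        String latin1 = URLDecoder.decode("%A0", StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0));

        // As UTF-8, a lone A0 byte is malformed, so decoding yields U+FFFD,
        // the replacement character shown as the question mark in a box.
        String broken = URLDecoder.decode("%A0", StandardCharsets.UTF_8);
        System.out.printf("UTF-8:      U+%04X%n", (int) broken.charAt(0));

        // The correct UTF-8 percent-encoding of nbsp is the two bytes C2 A0.
        String nbsp = URLDecoder.decode("%C2%A0", StandardCharsets.UTF_8);
        System.out.printf("UTF-8:      U+%04X%n", (int) nbsp.charAt(0));
    }
}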
Independently of why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a &nbsp;:
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work fine.
echo urldecode( '%A0' );
outputs a non-breaking space, which looks blank on the page.

utf-8 / utf-16 conversion

When I design an HTML page in Dreamweaver CS6 I use its validation tool (it sends the code to the W3C) and I get no errors. However, when I validate the same page in UltraEdit 21 (it uses HTML Tidy) I get the warning:
"Specified input encoding (utf-8) does not match actual input encoding (utf-16)"
The page is set as html5 (with <!doctype html>), as utf-8 (with <meta charset="utf-8">) and contains Greek text.
Well, the question is:
Does that problem affect the appearance of the page? I mean, when I publish it, will a user in China, Germany, or ...Tierra del Fuego see the Greek text?
If yes, the rest are less important, but I'll ask them:
What makes HTML Tidy define the document as utf-16? Is there a character, word or visible string of any kind that I can remove/delete to correct the problem?
If I use <meta charset="utf-16"> will browsers parse the code correctly (ending up with Greek text for the global user)?
The actual file encoding will be set in Dreamweaver properties for the file.
Dreamweaver Help / Set title and encoding properties for a page:
The Title/Encoding Page Properties options let you specify the document encoding type that is specific to the language used to author your web pages as well as specify which Unicode Normalization Form to use with that encoding type.
Select Modify > Page Properties, or click the Page Properties button in the text Property inspector.
Choose the Title/Encoding category and set the options.
...
Encoding
Specifies the encoding used for characters in the document.
If you select Unicode (UTF‑8) as the document encoding, entity encoding is not necessary because UTF‑8 can safely represent all characters. If you select another document encoding, entity encoding may be necessary to represent certain characters. For more information on character entities, see www.w3.org/TR/REC-html40/sgml/entities.html.
...
Include Unicode Signature (BOM)
Includes a Byte Order Mark (BOM) in the document. A BOM is 2 to 4 bytes at the beginning of a text file that identifies a file as Unicode, and if so, the byte order of the following bytes. Because UTF‑8 has no byte order, adding a UTF‑8 BOM is optional. For UTF‑16 and UTF‑32, it is required.
Choose UTF-8 without BOM.
UltraEdit automatically detects the encoding of a file on opening and displays it at the bottom in the status bar. See Advanced - Configuration - File Handling - Unicode/UTF-8 Detection in UltraEdit and press the Help button for more details.
UTF-16 is displayed in the standard status bar (since UE v19.00) for a file encoded in UTF-16 Little Endian, with or without a BOM. Clicking on this list box in the status bar and selecting Unicode - UTF-8 converts the file from UTF-16 LE to UTF-8, which then matches the character set declaration in the head of your HTML5 file.
When using the basic status bar in UE v19.00 or any later version, or any UltraEdit version prior to v19.00, the status bar field to the right of the field with line, column and clipboard number starts with U- for a file with UTF-16 LE encoding.
The UltraEdit help page about the Status Bar contains more information about what is shown in the standard and basic status bars.
Conversion to UTF-8 can also be done in UltraEdit with the command UNICODE/UTF-8 to UTF-8 (Unicode Editing) in the Conversions submenu of the File menu.
There are 2 configuration settings at Advanced - Configuration - File Handling - Save which define saving a UTF-8 encoded file with or without byte order mark (BOM):
Write UTF-8 BOM header to all UTF-8 files when saved
Write UTF-8 BOM on new files created within this program (if above is not set)
As UTF-8 encoded HTML files should always be saved without a BOM, it is better to have both UTF-8 BOM settings unchecked when using UltraEdit mainly for editing HTML files.
Another possibility to convert a file with UltraEdit is to use the command Save As in the File menu with the appropriate Encoding / Format setting. UTF-8 in the Save As dialog means saving the file as a UTF-8 encoded file with a BOM, while UTF-8 - NO BOM saves it without one, independent of the two configuration settings for a standard Save.
For converting all files in a single folder, a folder tree, opened in UltraEdit, etc. to UTF-8 using UltraEdit, there is an UltraEdit scripting solution, see How to convert all files in a folder to UTF-8?
Unfortunately, UE v21.30.0.1024 still does not recognize the short character set declaration <meta charset="utf-8"> as defined in the HTML5 standard. See Short utf-8 charset declaration in HTML5 header for details about this limitation and how it can be worked around. This limitation does not matter if at least one UTF-8 encoded character is found within the first 64 KB, as will be the case for your HTML5 files with Greek text.
The HTML Tidy installed with UltraEdit v21.30.0.1024 is the version dated 25 March 2009. I'm not sure whether that HTML Tidy really supports the short charset declaration of HTML5, but it looks like it does, because otherwise you would not see the warning when validating the HTML5 file with HTML Tidy.
It might be useful for you to read the UltraEdit power tip Unicode text and Unicode files in UltraEdit/UEStudio, as it looks like you are not entirely sure what encoding and character set really mean and why it is important that the declaration in the HTML5 file matches the encoding actually used.
Now, after all that general UltraEdit information, I'll answer your questions.
Does that problem affect the appearance of the page?
Although the file declares that its contents are encoded with UTF-8 while it is actually encoded with UTF-16 Little Endian, browsers display the contents correctly. UTF-16 detection is very easy, especially with a BOM present, so browsers ignore the wrong declaration and interpret the bytes of the HTML file correctly from the beginning as UTF-16 encoded text.
However, it would be much better to convert the UTF-16 encoded HTML files to UTF-8 without a BOM. UTF-8 without a BOM is the most commonly used encoding for HTML files worldwide, and the character set declaration in the head of your HTML file would then also match the encoding actually used.
What makes HTML Tidy to define the document as utf-16?
The encoding actually used in your HTML file is UTF-16 Little Endian, and UltraEdit, HTML Tidy and the browsers detect that after reading just the first 2 bytes of the text file - the byte order mark. That is why HTML Tidy suggests declaring the encoding in the head of the HTML file as utf-16, matching the encoding the file is really encoded with.
If I use <meta charset="utf-16"> will browsers parse the code correctly?
If you keep the file encoded in UTF-16 LE (2 bytes per character for text like yours), it would be better to declare the character set correctly with <meta charset="utf-16">. But no Unicode-aware text editor or browser has a problem automatically detecting UTF-16 Little Endian encoding with a byte order mark.
The character set declaration becomes very important mainly for UTF-8 encoded files (1, 2, 3 or even 4 bytes per character) or files with single-byte coded characters using a code page like Windows-1252 / ISO 8859-1 (Latin 1) or Windows-1253 / ISO 8859-7 (Latin/Greek).
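If you ever need to do the conversion outside UltraEdit, a minimal Java sketch (file names are placeholders, Java 11+ assumed) that reads a BOM-marked UTF-16 file and rewrites it as UTF-8 without a BOM could look like this:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws Exception {
        Path in = Path.of("page-utf16.html");   // hypothetical file names
        Path out = Path.of("page-utf8.html");

        // The UTF-16 decoder reads the BOM, picks the byte order from it,
        // and does not copy the BOM into the decoded string.
        String html = Files.readString(in, StandardCharsets.UTF_16);

        // Java writes UTF-8 without a BOM, which matches <meta charset="utf-8">.
        Files.writeString(out, html, StandardCharsets.UTF_8);
    }
}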

£ getting converted to ? by HTML Tidy, EncodingType?

I am cleaning an HTML file using HTML Tidy (well, the .NET version called TidyManaged), and my "£" symbols are being converted to "?".
i.e.:
Income (£)
becomes:
Income (�)
I believe it is to do with encoding types. In TidyManaged, one can specify the input encoding type and output encoding type, including such things as Latin1, utf8, utf16, win1252.
The XHTML doc will ultimately get converted into a DOC, which uses win1252.
So what should my input and output encoding be to preserve £ symbols?
Many thanks.
Well, when I've used other char-sets it's always been different. I'm not fluent in them, but I do know that to create symbols and punctuation you need to use a 'code' rather than the literal character. I've never used win1252, but Google says the pound sign is 0x00A3.
Try putting that somewhere in your document.
I know in HTML I would put &pound; for a pound sign. So HTML:
<p>&pound;0.00</p>
Where I got the code
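The underlying mismatch can also be reproduced outside Tidy. A small Java sketch (illustrative only, not TidyManaged code) showing how the windows-1252 byte for £ turns into the replacement character when it is read back as UTF-8:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PoundSign {
    public static void main(String[] args) {
        String text = "Income (\u00A3)";   // £ is U+00A3

        // In windows-1252 the pound sign is the single byte 0xA3 ...
        byte[] win1252Bytes = text.getBytes(Charset.forName("windows-1252"));

        // ... so reading those bytes back as if they were UTF-8 gives U+FFFD,
        // the "Income (?)" / "Income (�)" symptom above.
        System.out.println(new String(win1252Bytes, StandardCharsets.UTF_8));

        // Decoding with the encoding that was actually used keeps the £.
        System.out.println(new String(win1252Bytes, Charset.forName("windows-1252")));
    }
}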

Detect Multibyte and Chinese Characters in rtf markup

I'm trying to parse an RTF formatted message (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out).
Take the RTF code for the string a基bমূcΟιd pasted straight into Wordpad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you've not had much to do with RTF. So here's the bit I'm looking at
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice the 基 (u+57FA) is \'8a\'ee but the মূ, which is actually two characters ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498? which is fine, but the Οι which is two separate characters Ο and ι is \'cf\'e9.
Is there a way to determine if I'm looking at something that should be one character such as 基 = \'bb\'f9 or two characters Ο and ι = \'cf\'e9?
I was thinking that maybe the \lang was it, but that isn't the case at all because the \lang does not change from when it's first set. I am already accounting for the different codepages from different charset values in the fonts, but that doesn't seem to tell me anything about whether I should treat two references next to each other as a single double-byte character or not.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?
\'xx escapes represent bytes and should be interpreted using the fcharset encoding. (Or potentially cchs. Falling back to the ansicpg if not present.)
You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only a part of a multi-byte character; typically you will be consuming each section of text as a unit before converting that byte string into a Unicode string using whatever library or OS interface you have available, to avoid having to write byte-by-byte parsers for every code page supported by RTF.
\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (? is the fallback character for when the receiver can't cope with the Unicode.)
So:
The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).
The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).
The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
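To make the first two points concrete, here is a minimal Java sketch (not from the original answer; the code page names are Java's, with cp932 exposed as windows-31j) that decodes the byte escapes from the sample RTF as whole runs and lets the charset decoder decide how many bytes form a character:

import java.nio.charset.Charset;

public class RtfByteEscapes {
    public static void main(String[] args) {
        // fcharset values map to Windows code pages (small subset relevant here):
        //   0 = ANSI (here cp1252, per \ansicpg1252), 128 = Shift-JIS (cp932), 161 = Greek (cp1253).
        // Collect the consecutive \'xx bytes of a run and decode them together;
        // the charset decoder decides how many bytes make up each character.

        byte[] shiftJis = { (byte) 0x8A, (byte) 0xEE };   // \'8a\'ee under fcharset128
        System.out.println(new String(shiftJis, Charset.forName("windows-31j"))); // one char: 基

        byte[] greek = { (byte) 0xCF, (byte) 0xE9 };      // \'cf\'e9 under fcharset161
        System.out.println(new String(greek, Charset.forName("windows-1253")));   // two chars: Οι
    }
}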
RTF has tags for specifying the codepage/encoding used to encode characters as bytes. The actual hex codes for the characters are the byte octets of the specified encoding - in this case, \ansicpg1252 for ANSI codepage 1252.