&nbsp; is getting replaced with á on labels - html

I have the following code
<td colspan="#missingGridColumnCount">**&nbsp;<span translate="MissingItems">.MissingInstruments</span>&nbsp;**</td>
This prints correctly through the browser but when I print to my Zebra printer, I get the following on the label:
**_áMissing Items_á**
I have looked through the Zebra label documentation but cannot find a way to convert the &nbsp; or to make the labels accept it.

This is a character encoding issue.
The probable chain of events is this:
The browser is rendering the &nbsp; entity into the Unicode code point "U+00A0 NO-BREAK SPACE".
This is being encoded in UTF-8, as the sequence of bytes C2 A0.
These bytes are being interpreted by the Zebra printer according to code page 850, where C2 is mapped to "┬" (U+252C BOX DRAWINGS LIGHT DOWN AND HORIZONTAL) and A0 to "á" (U+00E1 LATIN SMALL LETTER A WITH ACUTE).
In code page 850, a non-breaking space is represented by the byte FF.
You may be able to tell whatever is interpreting the HTML to use code page 850 instead of UTF-8, so that it sends the byte sequences the printer is expecting. You will need to make sure your input doesn't contain any literal UTF-8: escape all non-ASCII characters as HTML entities.
Otherwise, you will need to substitute byte-wise before sending to the printer, or encode in some other way.
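The chain above can be reproduced in a few lines of Python. This is only a sketch: the helper name encode_for_cp850_printer is made up for illustration, and it assumes the printer really is reading its input as code page 850.

```python
# Reproduce the mismatch: NBSP encoded as UTF-8, then read as code page 850.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")      # the two bytes C2 A0
misread = utf8_bytes.decode("cp850")   # two characters; the second is "á"


def encode_for_cp850_printer(text: str) -> bytes:
    """Hypothetical helper: encode label text the way a CP850 device expects.

    Characters that have no CP850 equivalent are replaced with "?"
    instead of raising an error.
    """
    return text.encode("cp850", errors="replace")


label = "\u00a0Missing Items\u00a0"
print(utf8_bytes.hex(), repr(misread))
print(encode_for_cp850_printer(label))  # each NBSP becomes the single byte FF
```

Encoding once, at the point where the bytes leave for the printer, is usually easier to reason about than patching individual byte sequences afterwards.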

Related

HTML Issue, strange characters replacing HREF quotes

I am new to HTML coding. I'm taking an intro web design course this semester and I'm having a difficult time with my HREF segment. I have a table of contents page that references all of my projects over the semester.
This includes direct links to my projects, where I should be able to embed my index.html file with the links to my new projects. However, whenever I try to update the HREF segments with quotes linking to my new project, it spits out odd characters where the quotes would be.
The error shows garbled characters such as Ã¢â‚¬Å“ where the quotes should be; an example is below.
**The requested URL /“http://userid.myweb.usf.edu/project1/index.html“ was not found on this server.**
<li>This link goes to <a href=“http://userid.myweb.usf.edu/project1/index.html“>Project1</a></li>
I see a lot of references to it being a UNICODE8 issue but I have no idea what that means. If anyone could help I would greatly appreciate it, as my professor is not the best at getting back to us.
Your <a> tag is using “ quote characters (Unicode codepoint U+201C LEFT DOUBLE QUOTATION MARK). HTML requires " quote characters instead (codepoint U+0022 QUOTATION MARK).
<li>This link goes to <a href="http://userid.myweb.usf.edu/project1/index.html">Project1</a></li>
Some editors, particularly word processors that were designed for editing documents and not HTML, will use “ instead of " when you type " on the keyboard or copy/paste text from other apps, so watch out for that. Use a text editor that is specifically designed for editing HTML, or at least a plain text editor such as Notepad or Notepad++, which doesn't reinterpret entered characters.
Here is a breakdown of what Ã¢â‚¬Å“ means:
The Unicode “ (U+201C) character, which you are entering in your HTML, is encoded in UTF-8 as bytes E2 80 9C.
When those same bytes are interpreted in the Windows-1252 charset (the default charset used by most Windows systems in Western countries), byte E2 is Unicode codepoint U+00E2 (â), byte 80 is codepoint U+20AC (€), and byte 9C is codepoint U+0153 (œ).
When encoded in UTF-8, codepoint U+00E2 is bytes C3 A2, codepoint U+20AC is bytes E2 82 AC, and codepoint U+0153 is bytes C5 93.
In Windows-1252, bytes C3 A2 E2 82 AC C5 93 are the characters Ã¢â‚¬Å“.
Look familiar?
You have a charset mismatch between the encoding you save your HTML file in and the encoding your web browser uses to interpret it. Your HTML is being saved as UTF-8, but its bytes are being misinterpreted as Windows-1252 when decoded to Unicode; the result is then re-encoded as UTF-8 and displayed as Windows-1252 again.
If you are serving your HTML file over HTTP, make sure the HTTP server is reporting the correct charset=UTF-8 attribute in the Content-Type HTTP header.
You can (and should) also add a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag (if using HTML4) or <meta charset="UTF-8"> tag (if using HTML5) to your HTML itself (when served over HTTP, web browsers are required to give the actual Content-Type HTTP header higher priority, though).
Make sure the reported charset in all cases matches the actual charset that you are saving your HTML file as.
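The byte-level breakdown above can be checked with a short Python sketch. It simply replays the two rounds of "encode as UTF-8, decode as Windows-1252" described in this answer; the variable names are illustrative only.

```python
# Replay the double mis-decoding of U+201C LEFT DOUBLE QUOTATION MARK.
quote = "\u201c"

first_bytes = quote.encode("utf-8")            # E2 80 9C
first_misread = first_bytes.decode("cp1252")   # three characters: â € œ

second_bytes = first_misread.encode("utf-8")   # C3 A2 E2 82 AC C5 93
second_misread = second_bytes.decode("cp1252") # seven characters of mojibake

print(first_bytes.hex(), first_misread)
print(second_bytes.hex(), second_misread)

# Reversing the same steps recovers the original character.
recovered = (
    second_misread.encode("cp1252").decode("utf-8")
    .encode("cp1252").decode("utf-8")
)
print(recovered == quote)
```

The reversal at the end is only useful for diagnosing what happened; the real fix is to make the declared charset match the saved charset.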

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial in which a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so:
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that the single byte A0 is not a valid character on its own in UTF-8. The equivalent nbsp character in UTF-8 is represented by the two bytes C2 A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
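The difference can be seen with Python's standard URL decoder, used here only as a stand-in for whatever the decoding site does internally (that part is an assumption).

```python
from urllib.parse import unquote

# %A0 read as ISO-8859-1: a real NO-BREAK SPACE.
as_latin1 = unquote("%A0", encoding="latin-1")

# %A0 read as UTF-8 (the default): a lone A0 byte is invalid,
# so the decoder substitutes U+FFFD REPLACEMENT CHARACTER.
as_utf8 = unquote("%A0")

# %C2%A0 read as UTF-8: the proper two-byte encoding of NO-BREAK SPACE.
proper_utf8 = unquote("%C2%A0")

print(repr(as_latin1), repr(as_utf8), repr(proper_utf8))
```

U+FFFD is the question mark inside a black diamond that the decoder page displayed.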
Independently from why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a &nbsp;
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
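outputs the single byte A0, which shows up as a blank (a non-breaking space) when the page is read as ISO-8859-1.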

Special characters representation issue in JSP

In JSP file, the source code is
|1&#128;3|<%="\u0031\u0080\u0033" %>|
The result on the page is:
|1€3|13|
Why is the Euro symbol represented differently?
The HTML numerical character references in the range 0x80–0x9F don't actually correspond to the characters U+0080–U+009F. Instead, they refer to the characters mapped into the bytes 0x80–0x9F from the windows-1252 encoding.
This is a weird historical artefact from the days before browsers did Unicode. HTML5 sort-of standardises it, in that although it's invalid, parsers are required to parse it this way. This does not happen in XML/XHTML.
So \u0080 gives you the actual character U+0080, which you can't see because it's an invisible control character, but &#128; gives you code page 1252 byte 0x80, which is U+20AC Euro Sign.
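Python's html module follows the HTML5 parsing rules for character references, so it can serve as a quick illustration of the behaviour described here. It stands in for the browser; it is not what the JSP page itself runs.

```python
import html
import unicodedata

# The numeric character reference 128 is remapped through windows-1252.
from_reference = html.unescape("&#128;")

# Byte 0x80 in windows-1252 is the same character.
from_cp1252 = b"\x80".decode("cp1252")

# The Java/JSP escape \u0080 is the real code point U+0080, a control character.
from_escape = "\u0080"

print(from_reference, from_cp1252, unicodedata.category(from_escape))
```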

Detect Multibyte and Chinese Characters in rtf markup

I'm trying to parse an RTF-formatted message (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out).
Take the RTF code for the string a基bমূcΟιd pasted straight into Wordpad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you've not had much to do with RTF. So here's the bit I'm looking at
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice the 基 (u+57FA) is \'8a\'ee but the মূ, which is actually two characters ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498? which is fine, but the Οι which is two separate characters Ο and ι is \'cf\'e9.
Is there a way to determine if I'm looking at something that should be one character such as 基 = \'8a\'ee or two characters Ο and ι = \'cf\'e9?
I was thinking that maybe the \lang was it, but that isn't the case at all because the \lang does not change from when it's first set. I am already accounting for the different code pages from the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two escapes next to each other as one double-byte character or not.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?
\'xx escapes represent bytes and should be interpreted using the encoding given by the font's fcharset (or potentially cchs), falling back to ansicpg if neither is present.
You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only a part of a multi-byte character; typically you will be consuming each section of text as a unit before converting that byte string into a Unicode string using whatever library or OS interface you have available, to avoid having to write byte-by-byte parsers for every code page supported by RTF.
\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (? is the fallback character for when the receiver can't cope with the Unicode.)
So:
The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).
The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).
The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
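A minimal sketch of the "consume the run as a unit, then decode" approach, in Python. The function names and the small charset table are illustrative assumptions: only the three charsets from the question are listed, and a real parser would also have to track font switches and the \uc skip count.

```python
import re

# Partial, illustrative mapping from RTF \fcharsetN to a Python codec name.
CHARSET_TO_CODEC = {0: "cp1252", 128: "cp932", 161: "cp1253"}


def decode_byte_run(fragment: str, fcharset: int) -> str:
    """Collect every \\'xx escape in the fragment and decode the bytes together."""
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", fragment))
    return raw.decode(CHARSET_TO_CODEC[fcharset])


def decode_unicode_run(fragment: str) -> str:
    """Decode \\uN? escapes as UTF-16 code units (RTF may write them as signed)."""
    units = [int(n) % 65536 for n in re.findall(r"\\u(-?\d+)\??", fragment)]
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le")


print(decode_byte_run(r"\'8a\'ee", 128))      # one character from two bytes
print(decode_byte_run(r"\'cf\'e9", 161))      # two characters from two bytes
print(decode_unicode_run(r"\u2478?\u2498?"))  # two Bengali code points
```

Letting the codec decide where character boundaries fall avoids hand-writing lead-byte tables for each code page.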
RTF has tags for specifying the codepage/encoding used to encode Unicode characters. The actual hex codes for the characters are the byte octets used by the specified encoding. In this case, \ansicpg1252 for Ansi codepage 1252.

IE munging pound (£) symbol

I have an HTML form which goes off to do all sorts of strange back-end things. This works fine in Firefox, and in most cases it works fine in IE.
However, the (pound sterling) £ sign causes problems, and seems to get munged in the submit.
The form is something like this
<form action="*MyFormAction*" accept-charset="UTF-8" method="post">
I think I have seen this problem before but can't remember the solution.
edit, the euro symbol € works fine
edit 2,
In fact, if I put the € symbol together with a £ symbol, it also works fine. Looking at the problem, if I use characters which are not in the extended part of ISO-8859-1 it works OK. If I use extended characters from ISO-8859-1 they get munged. So how do I make IE use the character set that the accept-charset says it should?
accept-charset="UTF-8"
This does not do what you think it does (or what the standard says it does) in IE. Instead, IE treats the value (‘UTF-8’) as a list of alternative encodings to fall back on when a field can't be encoded using the usual default encoding (which is the same as the page's own encoding).
So if you add this attribute and your page isn't already in UTF-8, you can be getting characters submitted as either the page encoding or UTF-8, and there is no way for your form-submission-reading script to know!
For this reason you should never use accept-charset; instead you should always ensure that the page containing the form is correctly served as “Content-Type: text/html;charset=utf-8” (by HTTP header and/or <meta>).
In fact if I put the € symbol with a £ symbol it also works fine.
Yes, that's because ‘€’ cannot be encoded in the page's default encoding (presumably ISO-8859-1). So IE resorts to sending the field encoded as UTF-8, which is what you wanted all along.
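The behaviour described above can be modelled roughly as follows. This is a simplified model for illustration, not IE's actual logic: the function name is hypothetical, and it assumes the whole submission switches to the fallback as soon as any one field cannot be encoded in the page encoding.

```python
def choose_submit_encoding(fields, page_encoding="iso-8859-1", fallback="utf-8"):
    """Hypothetical model: use the page encoding unless some field can't be encoded."""
    try:
        for value in fields.values():
            value.encode(page_encoding)
    except UnicodeEncodeError:
        return fallback
    return page_encoding


# The pound sign fits in ISO-8859-1, so it is sent as the single byte A3 ...
print(choose_submit_encoding({"price": "\u00a35"}))
# ... while a euro sign anywhere in the form pushes the submission to UTF-8,
# where the pound sign becomes the two bytes C2 A3.
print(choose_submit_encoding({"price": "\u00a35", "note": "\u20ac"}))
```

A server that always decodes as UTF-8 therefore sees a stray A3 byte in the first case and a correct pound sign in the second.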
I think bobince has the ideal answer, which is "serve the page in UTF-8"; however, as I can't do this, I am posting my workaround for posterity.
Adding a hidden field unmunge with a non ISO-8859-1 (what our pages are served in) extended character forces the submission into UTF8
so
<input type="hidden" name="unmunge" value="€" />
fixes the encoding (the entity is the euro symbol).
How is the £ submitted? If it's in an input box for a price don't submit it, only allow numbers to be submitted and add the £ when you display the price again. Or add the currency symbol in the backend script.
I am not sure if this will help (read the entire article at http://fyneworks.blogspot.com/2008/06/british-pound-sign-encoding-revisited.html)
Excerpt:
THE PROBLEM
If you look at the UTF-8/Latin-1 (AKA ISO-8859-1) Character Table you will find that the decimal code for the British pound sterling sign is 163 - and the hexadecimal code is A3.
£ = %A3
However, this is not the case in (all) encoding/decoding functions in Javascript...
encodeURI/encodeURIComponent: Encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, or three escape sequences representing the UTF-8 encoding of the character
Which means, in order to encode our beloved pound sign, Javascript uses 2 characters. This is where the annoying "Â" comes in...
£ = %C2%A3
Hope it helps.
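The two percent-encodings in the excerpt, and the origin of the stray "Â", can be reproduced in Python. urllib.parse.quote is used here as a stand-in for JavaScript's encodeURIComponent; both percent-encode the UTF-8 bytes by default.

```python
from urllib.parse import quote

pound = "\u00a3"

as_utf8 = quote(pound)                        # UTF-8 is the default encoding
as_latin1 = quote(pound, encoding="latin-1")  # the single-byte form

# Reading the UTF-8 bytes C2 A3 as Latin-1 produces the extra "Â".
stray = pound.encode("utf-8").decode("latin-1")

print(as_utf8, as_latin1, stray)
```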