Swift 3 - Apostrophes being converted to â on screen-scrape - html

In my Swift 3 application, I am getting HTML text from a web API to render inside a UIWebView. However, apostrophes specifically, and maybe other special characters, are rendering as accented letters instead of their real value. For example, the text for “Transportation’s” displays as “Transportationâs”.
The code is simple.
var myHTMLString = try String(contentsOf: myURL, encoding: .ascii)
//myHTMLString = "some bâd string"
webview.loadHTMLString(myHTMLString, baseURL: nil);
The values are correct on the API. Why is this happening when grabbing?

Change the encoding to .utf8
I'm not sure why the .ascii isn't working, though I suspect it has to do with HTML entity encoding. If I find a reason, I'll update this answer...
Update: This w3schools.com page explains that UTF-8 is now the standard encoding for HTML. UTF-8 can handle a larger set of characters. Apparently the webview browser understands HTML entities (for example, &#39; representing an apostrophe) in UTF-8, but not ASCII.
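A likely mechanism, sketched in Python rather than Swift purely for illustration: the curly apostrophe ’ (U+2019) takes three bytes in UTF-8, and decoding those bytes one at a time with a single-byte encoding yields â followed by two invisible control characters. Latin-1 is used below as a stand-in for a lenient single-byte decoder; that substitution is an assumption, not a statement about how Foundation implements .ascii.

```python
# The word as the API presumably sends it: UTF-8 bytes with a curly apostrophe (U+2019).
raw = "Transportation’s".encode("utf-8")

# U+2019 occupies three bytes in UTF-8: E2 80 99.
apostrophe_bytes = "’".encode("utf-8")

# Decoding byte-by-byte (Latin-1 as a stand-in for a single-byte decoder)
# turns 0xE2 into 'â' and leaves 0x80 and 0x99 as invisible control characters.
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

If this is what happens, the entity explanation is not needed: the bytes were fine all along and only the decoding step was wrong.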

Related

How to display accented characters using NSMutableAttributedString?

In my NSString, I've some accented characters and html tags, such as <i></i>. I want to display them in NSTextField with text attributes to them.
When I build the attrStr, I use:
NSMutableAttributedString *attrStr = [[NSMutableAttributedString alloc] initWithHTML:[text dataUsingEncoding:NSUTF8StringEncoding]
                                                                             options:@{NSWebPreferencesDocumentOption : webPreferences}
                                                                  documentAttributes:NULL];
I display the text as follows:
someTextField.attributedStringValue = attrStr;
But this requires that I know the encoding of the NSString, and the problem is that I do not know it. So how do I build an attributed string without knowing the encoding?
You don't need to know the encoding of your NSString – it is an Objective-C string. The definition of dataUsingEncoding: states, emphasis added:
Returns an NSData object containing a representation of the receiver encoded using a given encoding.
This method converts from whatever encoding NSString uses internally into the specified encoding, so the result in your case is an NSData containing the bytes of a UTF-8 string.
If you are seeing invalid results maybe the issue is when you create the original NSString?
HTH
Addendum I now see you asked this question first (I added an answer – basically you need to know or guess). Why are you reading the file into an NSString in the first place? Just read it directly into an NSData and then let the HTML parser sort it out (it will either read the HTML encoding tag or default to an encoding).
Eventually, I found that if I use NSUnicodeStringEncoding on the above, the accented chars will be displayed correctly.

utf-8 on web page translating some characters

I seem to have encountered a weird behaviour with my HTML page, where it displays only some Chinese characters correctly and shows the others as ??
I have already changed the HTTP header to utf-8
Content-Type: "text/html; charset=utf-8"
�?��?好
Question is why only some Chinese characters can be shown and some not??
Edit :-
I have dug deeper into the issue and there are two parts in the code.
There is a function in the code to encode strings to cp1252 encoding on this string
before encoding :-
<pre>
<font size=\"2\">\x{e6}\x{99}\x{9a}\x{e5}\x{ae}\x{89} </font>
</pre>
after encoding :-
Then I tested by changing the encoding to iso-8859-1 and everything shows fine. My question now is: why is that? I'm assuming cp1252 is older and does not support some UTF-8 byte values?
Thank You.
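One plausible reading of the behaviour above, sketched in Python under the assumption that the string holds one character per UTF-8 byte (as the \x{e6}\x{99}\x{9a}... dump suggests). Encoding such a string to iso-8859-1 maps every code point from U+0000 to U+00FF back to the identical byte, so the UTF-8 bytes survive. cp1252 assigns different characters to the 0x80-0x9F range, so code points such as U+0099 have no cp1252 byte and get replaced with "?".

```python
# The two characters behind \x{e6}\x{99}\x{9a}\x{e5}\x{ae}\x{89}.
original = "晚安"
utf8_bytes = original.encode("utf-8")

# Assumption: the program holds the UTF-8 bytes as one character per byte.
as_chars = utf8_bytes.decode("latin-1")

# iso-8859-1 maps U+0000..U+00FF straight back to the same byte values.
latin1_out = as_chars.encode("latin-1")

# cp1252 has no byte for U+0099, U+009A or U+0089, so they become '?'.
cp1252_out = as_chars.encode("cp1252", errors="replace")

print(latin1_out == utf8_bytes)
print(cp1252_out)
```

Characters whose UTF-8 bytes all fall outside 0x80-0x9F would pass through cp1252 unharmed, which would explain why only some characters break.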

Why doesn't nbsp display as nbsp in the URL

I am following a tutorial where a web application written in PHP blacklists spaces from the input (the 'id' parameter). The task is to add other characters that bypass this blacklist but still get interpreted by the MySQL database in the back end. What works is a URL constructed like so -
http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1
Now, my question is simply that if '%A0' indicates an NBSP, then why is it that when I go to a site like http://www.url-encode-decode.com, and try to decode the URL http://192.168.2.15/sqli-labs/Less-26/?id=1'%A0||%A0'1, it gets decoded as http://192.168.2.15/sqli-labs/Less-26/?id=1'�||�'1.
Instead of the question mark inside a black box, I was expecting to see a blank space.
I suspect that this is due to differences between character encodings.
The value A0 represents nbsp in the ISO-8859-1 encoding (and probably in other extended-ASCII encodings too). The page at http://www.url-encode-decode.com appears to use the UTF-8 encoding.
Your problem is that a lone A0 byte is not a valid character in UTF-8. The equivalent nbsp character (U+00A0) is represented in UTF-8 by the two-byte sequence C2 A0.
Decoding http://192.168.2.15/sqli-labs/Less-26/?id=1'%C2%A0||%C2%A0'1 will produce the nbsp characters that you expected.
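The difference can be reproduced with Python's standard urllib.parse.unquote, which takes the character encoding as a parameter. This is only an illustration of the decoding step; which encoding the site above actually uses is an assumption.

```python
from urllib.parse import unquote

# Decoded as UTF-8 (the default), a lone A0 byte is invalid and is
# replaced with U+FFFD, the "question mark in a black box".
as_utf8 = unquote("%A0")

# Decoded as ISO-8859-1, the same byte is the non-breaking space U+00A0.
as_latin1 = unquote("%A0", encoding="latin-1")

# The UTF-8 form of the non-breaking space is the two bytes C2 A0.
two_bytes = unquote("%C2%A0")

print(repr(as_utf8), repr(as_latin1), repr(two_bytes))
```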
Independently from why there is an encoding error, try %20 as a replacement for a whitespace!
Later on you can str_replace the whitespace with a &nbsp;
echo str_replace(" ", "&nbsp;", $_GET["id"]);
Maybe the script on this site does not work properly. If you use it in your PHP code it should work properly.
echo urldecode( '%A0' );
outputs:

What other characters beside ampersand (&) should be encoded in HTML href/src attributes?

Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation fail.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semi-colons, since &lt; is the entity for < and &gt; is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
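The effect can be imitated with Python's standard html module. Note the assumption: html.unescape applies the HTML5 rules for text content, which still decode legacy references such as &lt without a semicolon, so it stands in here for the lenient old-browser behaviour described above rather than for how a current browser parses attribute values.

```python
import html

raw_url = "http://query.com/?q=foo&lt=bar&gt=baz"

# A lenient parser treats "&lt" and "&gt" as entities even without semicolons.
misread = html.unescape(raw_url)

# Escaping the ampersands before writing the URL into the markup avoids this.
escaped = html.escape(raw_url, quote=True)

# Decoding the escaped form gives back the URL that was intended.
round_trip = html.unescape(escaped)

print(misread)
print(escaped)
print(round_trip)
```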
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
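As a small illustration of encoding only the values, here is a sketch using Python's urllib.parse. The parameter names mirror the example above and the value "fish & chips" is made up to show an ampersand inside a value.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical values; note the ampersand inside the first one.
params = {"q": "fish & chips", "lang": "en"}

# urlencode percent-encodes each value and joins the pairs with '&'.
query = urlencode(params)

# The recipient splits on the unencoded '&' and decodes each value.
parsed = parse_qs(query)

print(query)
print(parsed)
```

The ampersand inside the value becomes %26, so only the separator between the two parameters remains a literal &.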
Edit: The common problems page from the W3C you linked to is talking about edge cases when urls are rendered in html and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the url:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the link works correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug you can HTML encode any of your URLS you render to the page and you should be fine. If you're using ASP.NET the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as non-recognized character reference and value of the attribute should be used as it is: http://domain.com/search?q=whatever&lang=en
For the reference: added question to HTML5 WG: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) have &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to encase your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) for URIs, I'd use JavaScript's encodeURIComponent() (on a utf-8 page when encoding for use on a utf-8 page).
If I understand the question correctly, I believe this is what you want.

HTML Character Encoding

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)
%AE is not valid HTML-safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Windows/Word encoding (windows-1252, or something like that). It will NOT convert to HTML-safe text unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation for the registered trademark. If you don't see anything, then apparently the URL decoder is using other charset to URL-decode it. In for example UTF-8, this byte does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
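The second option, converting existing data so that it is URL-encoded as UTF-8, could look roughly like this in Python. The helper name reencode_as_utf8 is hypothetical, and ISO-8859-1 as the source encoding is an assumption taken from the answer above.

```python
from urllib.parse import quote, unquote

def reencode_as_utf8(percent_encoded: str, source_encoding: str = "latin-1") -> str:
    """Decode percent-escapes in the assumed source encoding, then re-escape as UTF-8."""
    text = unquote(percent_encoded, encoding=source_encoding)
    return quote(text, safe="", encoding="utf-8")

converted = reencode_as_utf8("Acme%AE")
print(converted)
print(unquote(converted))
```

After conversion, a decoder that assumes UTF-8 recovers the registered trademark sign instead of an invalid byte.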
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities, i.e. use &#174; rather than &reg;.
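A bulk conversion to numeric entities can be sketched in Python with the built-in xmlcharrefreplace error handler, which replaces every non-ASCII character with its decimal character reference. The helper name to_numeric_entities is hypothetical.

```python
import html

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference such as &#174;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

encoded = to_numeric_entities("Acme®")
print(encoded)

# A browser (or html.unescape) turns the reference back into the character.
print(html.unescape(encoded))
```

The result is pure ASCII, so it displays correctly regardless of the charset the page is served with.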
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.
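The caution about + exists because PHP's urldecode() turns a plus sign into a space, as form encoding does, while rawurldecode() leaves it alone. Python's urllib.parse has the same split between unquote_plus and unquote, shown here only to illustrate the difference.

```python
from urllib.parse import unquote, unquote_plus

sample = "a+b%20c"

# Plain percent-decoding leaves '+' untouched.
plain = unquote(sample)

# Form-style decoding also converts '+' to a space.
form_style = unquote_plus(sample)

# A literal plus sign survives form-style decoding only if it was sent as %2B.
literal_plus = unquote_plus("a%2Bb")

print(repr(plain), repr(form_style), repr(literal_plus))
```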