How to display accented characters using NSMutableAttributedString? - html

My NSString contains some accented characters and HTML tags, such as <i></i>. I want to display it in an NSTextField with the corresponding text attributes applied.
When I build the attrStr, I use:
NSMutableAttributedString *attrStr = [[NSMutableAttributedString alloc] initWithHTML:[text dataUsingEncoding:NSUTF8StringEncoding]
options:@{NSWebPreferencesDocumentOption : webPreferences}
documentAttributes:NULL];
I display the text as follows:
someTextField.attributedStringValue = attrStr;
But this requires that I know the encoding of the NSString, and I do not. So how do I build an attributed string without knowing the encoding?

You don't need to know the encoding of your NSString – it is an Objective-C string. The definition of dataUsingEncoding: states, emphasis added:
Returns an NSData object containing a representation of the receiver encoded using a given encoding.
This method converts from whatever encoding NSString uses internally into the specified encoding – so the result in your case is an NSData containing the bytes of a UTF-8 string.
If you are seeing invalid results, maybe the issue is in how you create the original NSString?
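The same idea can be illustrated outside Cocoa. The sketch below uses Python rather than Objective-C, and the sample text is made up: a string has no encoding of its own, and an encoding only comes into play when the string is turned into bytes.

```python
# A Python str, like an NSString, is a sequence of Unicode characters.
# An encoding is chosen only at the moment of converting to bytes.
text = "<i>café</i>"  # hypothetical sample text

utf8_bytes = text.encode("utf-8")    # 'é' becomes the two bytes C3 A9
utf16_bytes = text.encode("utf-16")  # begins with a byte-order mark

# Decoding with the matching encoding recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16") == text
```

Whichever encoding is requested, the bytes describe the same characters, so the receiver only has to decode them with that same encoding.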
HTH
Addendum I now see you asked this question first (I added an answer – basically you need to know or guess). Why are you reading the file into an NSString in the first place? Just read it directly into an NSData and then let the HTML parser sort it out (it will either read the HTML encoding tag or default to an encoding).

Eventually, I found that if I use NSUnicodeStringEncoding on the above, the accented chars will be displayed correctly.

Related

Swift 3 - Apostrophes being converted to â on screen-scrape

In my Swift 3 application, I am getting HTML text from a web api to render inside of a UIWebView. However, apostrophes specifically and maybe other special characters are rendering as accent letters instead of their real value. For example, the text for “Transportation’s” displays as “Transportationâs”.
The code is simple.
var myHTMLString = try String(contentsOf: myURL, encoding: .ascii)
//myHTMLString = "some bâd string"
webview.loadHTMLString(myHTMLString, baseURL: nil);
The values are correct on the API. Why is this happening when grabbing?
Change the encoding to .utf8
I'm not sure why the .ascii isn't working, though I suspect it has to do with HTML entity encoding. If I find a reason, I'll update this answer...
Update: This w3schools.com page explains that UTF-8 is now the standard encoding for HTML and can represent a far larger set of characters than ASCII. The API most likely sends the curly apostrophe (’) as a multi-byte UTF-8 sequence, which cannot be decoded correctly as ASCII.
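The symptom can be reproduced in a few lines. This is a Python sketch rather than Swift, and it assumes that the API sends UTF-8 and that the mismatched decoder behaves like a one-byte-per-character decoder such as Latin-1:

```python
# U+2019 is the curly apostrophe seen in "Transportation’s".
original = "Transportation\u2019s"

raw = original.encode("utf-8")   # the apostrophe becomes the bytes E2 80 99
wrong = raw.decode("latin-1")    # one character per byte: 'â' plus two invisible controls
right = raw.decode("utf-8")      # the three bytes are read back as one character

print(repr(wrong))
print(repr(right))
```

The first byte of the sequence, E2, is what shows up on screen as "â"; the other two bytes map to control characters that are not displayed.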

How to convert NSAttributedString (with image) into HTML?

I know how to convert the common NSAttributedString (which has no image) into HTML. I set the Images as NSTextAttachment to NSAttributedString (these attributed strings are set as attributeText of a UITextView). How could I convert the whole attributed string into HTML?
I've found a related answer, which mentions:
Simple idea for image: encode it with base64 and put it directly in an <img> tag with the right frame.
But how could I implement that?
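The base64 idea itself is independent of UIKit. As a minimal sketch in Python (the function name and the placeholder bytes are hypothetical, and getting the image data out of each NSTextAttachment is not shown), the tag could be built like this:

```python
import base64


def image_to_img_tag(image_bytes: bytes, mime_type: str, width: int, height: int) -> str:
    """Build an <img> tag that embeds the image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return (
        f'<img src="data:{mime_type};base64,{encoded}" '
        f'width="{width}" height="{height}">'
    )


# Placeholder bytes standing in for real PNG data.
tag = image_to_img_tag(b"\x89PNG", "image/png", 16, 16)
print(tag)
```

Each attachment in the attributed string would then be replaced by such a tag when the HTML is generated, with the width and height taken from the attachment's frame.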

HTML files with no http-equiv meta tag and the charset may be other than UTF-8

We are using jsoup, which is excellent, thanks.
We may get HTML files with no http-equiv meta tag, and the charset may be something other than UTF-8.
What is the best way to handle this? We can keep a list of encodings and try them, but I am not sure how to tell programmatically whether something is wrong. Would jsoup throw an IOException?
Jsoup will try to determine the encoding from the content type header or the http-equiv tag. If you have neither, it will use UTF-8. I'm not sure whether jsoup can do more for you here.
But you can try another approach:
Implement a class that reads the files for you. There you can take care of all encoding issues. As a result, such a class should give you a properly encoded string, or at least the encoding that is used for your input.
(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)
Jsoup can now parse that input with a known encoding.
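The "try a list of encodings" idea from the question could be the core of such an encoding class. A rough sketch follows, in Python rather than Java; the function name and the candidate list are assumptions, not part of jsoup:

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "windows-1252", "iso-8859-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# 'é' encoded as the single byte E9 is not valid UTF-8, so the first candidate fails.
text, used = decode_with_fallback("café".encode("iso-8859-1"))
print(text, used)
```

A clean decode does not prove that the guess is right: single-byte encodings such as ISO-8859-1 accept any byte sequence, so they belong at the end of the list, and a real detector would add statistical checks on top.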
I guess changing the side that generates the HTML is not possible, is it?
Some further readings:
http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_autodetect
Character Encoding Detection Algorithm
What is the most accurate encoding detector? (includes a list of implementation)
Java Text File Encoding
Detect (or best guess of) incoming string encoding in Java

Why is &#8220; not showing up as a quote on my web page?

Other ASCII codes are doing the same thing.
Just to give you some background, these codes are part of the HTML that I'm reading from WordPress blog posts. I'm porting them over to BlogEngine.NET using a little C# WinForm app I wrote. Do I need to do some kind of conversion as I port them over to BlogEngine.NET (as XML files)?
It'd sure be nice if they just displayed properly without any intervention on my part.
Here's a code fragment from one of the WordPress source pages:
<link rel="alternate" type="application/rss+xml" title="INRIX® Traffic » Taking the “E” out of your “ETA” Comments Feed" href="http://www.inrixtraffic.com/blog/2012/taking-the-e-out-of-your-eta/feed/" />
Here's the corresponding chunk of XML that's in the XML file I output during the conversion:
<title>Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;</title>
UPDATE.
Tried this, but still no dice.
writer.WriteElementString("title", string.Format("<![CDATA[{0}]]>", post.Title));
...outputs this:
<title><![CDATA[Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;]]></title>
Since the data you are getting from Wordpress is already encoded you can decode it to a regular string and then let the XMLWriter encode it properly for XML.
string input = "Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;";
string decoded = System.Net.WebUtility.HtmlDecode(input);
//decoded = Taking the "E" out of your "ETA"
This may not be very efficient, but since this sounds like a one-time conversion, I don't think it will be an issue.
A similar question was asked here: How can I decode HTML characters in C#?
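The decode-then-re-encode flow can be sketched in Python as well, with html.unescape and ElementTree standing in for WebUtility.HtmlDecode and XmlWriter; this illustrates the approach and is not the C# code itself:

```python
import html
import xml.etree.ElementTree as ET

# The title as it arrives from WordPress, with numeric character references.
encoded_title = "Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;"

# Step 1: turn the references back into real characters.
decoded_title = html.unescape(encoded_title)

# Step 2: let the XML library do any escaping that XML itself requires.
title = ET.Element("title")
title.text = decoded_title
xml_fragment = ET.tostring(title, encoding="unicode")

print(xml_fragment)
```

Because the ampersands are gone before the XML is written, there is nothing left for the writer to escape a second time.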
As I pointed out in my comment above: your problem is that the & in &#8220; gets encoded again, which gives &amp;#8220;. When you output this in the browser, it displays as the literal text &#8220;.
I don't know how your porting works, but to fix this issue, you need to make sure that the & in the codes doesn't get encoded to &amp;
Any chance CDATA tags solve the issue? Just make sure the text is correct in the source XML file. You don't need the ampersand magic (in the source) if you use CDATA tags.
<some_tag><![CDATA[Taking the “ out of your ...]]></some_tag>

HTML Character Encoding

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)
%AE is not valid HTML-safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Windows encoding (windows-1252 or something like that). It will NOT convert to HTML-safe text unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation of the registered trademark sign. If you don't see anything, then apparently the URL decoder is using another charset to URL-decode it. In UTF-8, for example, this byte on its own does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
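Both options can be tried with Python's urllib.parse. This is only an illustration of the charset difference, not of any particular server-side stack:

```python
from urllib.parse import quote, unquote

# Decoding %AE as ISO-8859-1 yields the registered trademark sign.
as_latin1 = unquote("%AE", encoding="iso-8859-1")

# Decoding it as UTF-8 fails: a lone AE byte is invalid, so the decoder
# substitutes the replacement character U+FFFD.
as_utf8 = unquote("%AE", encoding="utf-8")

# URL-encoding the same character with UTF-8 produces two bytes.
utf8_form = quote("\u00ae")

print(repr(as_latin1), repr(as_utf8), utf8_form)
```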
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities -- i.e. use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
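For the en-masse conversion, numeric entities can be generated mechanically, with no lookup table. A small Python sketch with a made-up sample string:

```python
# Every non-ASCII character is replaced by its decimal character reference.
text = "Acme\u00ae Widgets"  # hypothetical sample containing the registered sign

ascii_html = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(ascii_html)
```

This only rewrites non-ASCII characters; markup characters such as & and < are left untouched, so content that is already HTML passes through unchanged apart from the entities.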
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.