Other ASCII codes are doing the same thing.
Just to give you some background, these codes are part of the HTML that I'm reading from WordPress blog posts. I'm porting them over to BlogEngine.NET using a little C# WinForm app I wrote. Do I need to do some kind of conversion as I port them over to BlogEngine.NET (as XML files)?
It'd sure be nice if they just displayed properly without any intervention on my part.
Here's a code fragment from one of the WordPress source pages:
<link rel="alternate" type="application/rss+xml" title="INRIX® Traffic » Taking the “E” out of your “ETA” Comments Feed" href="http://www.inrixtraffic.com/blog/2012/taking-the-e-out-of-your-eta/feed/" />
Here's the corresponding chunk of XML that's in the XML file I output during the conversion:
<title>Taking the “E” out of your “ETA”</title>
UPDATE.
Tried this, but still no dice.
writer.WriteElementString("title", string.Format("<![CDATA[{0}]]>", post.Title));
...outputs this:
<title><![CDATA[Taking the “E” out of your “ETA”]]></title>
Since the data you are getting from Wordpress is already encoded you can decode it to a regular string and then let the XMLWriter encode it properly for XML.
string input = "Taking the “E” out of your “ETA”";
string decoded = System.Net.WebUtility.HtmlDecode(input);
//decoded = Taking the "E" out of your "ETA"
This may not be very efficient, but since this sounds like a one time conversion I don' think it will be an issue.
A similar question was asked here: How can I decode HTML characters in C#?
As I pointed out in my comment above: Your problem is that your Ü gets encoded into &8220;. When you output this in the browser it displays as Ü
I don't know how your porting works, but to fix this issue, you need to make sure that the & in the ASCII codes doesn't get encoded to &
Any chance CDATA tags solve the issue? Just make sure the text is correct in the source XML file. You don't need the ampersand magic (in the source) if you use CDATA tags.
<some_tag><![CDATA[Taking the “ out of your ...]]></some_tag>
Related
I'm creating a PDF with a large collection of quotes that I've imported into python with docx2python, using html=True so that they have some tags. I've done some processing to them so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF to UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method at the very bottom of the left hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote) #[![a section of quote giving trouble with quotations][2]][2]
The list of quotes that I have going into the pdf all appear with their special characters and the html tags (<u> or <i>) in the debugger, but after the .write_html() step they then show up in the pdf file with mojibake, even before being saved, as seen through debugger. An example being "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italicize tags into the PDF as text. The .write() method does preserve the special characters. I'm not trying to change the font really, but make sure that what's written onto the PDF preserves the special characters instead of mojibake them.
I've also attempted some .encode/.decode action before going into the .write_html(), as well as attempted some methods from the ftfy library. And tried adding '' to the start of each quote to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or if I'm at a dead-end and should just split each quote on '<', use if statements to detect underlines, italicize, etc., and use the .write() method after all.
Extract docx to html works really bad with docx2python. I do this few month ago. I recommend PyDocX. docx2python are good for docx file content extracting, not converting it into a html.
I have an application on Apache. My Apache is configured with default encoding ISO-8859, and I´m not able to change it because Apache suport others applications that need this.
Then, in my application I´m using numerical HTML encoding in special characters, like that: Usu& #225;rio (this is Usuário).
It´s working fine, but in placeholders and title (HTML5 elements), the interface is showing á ; instead to show á.
Any idea?
Thanks
You could rename your .html file to .php and add following line to the first row:
<?php header('Content-Type:text/html; charset=UTF-8'); ?>
This will send a response from server that the content which is sent is encoded in utf-8.
By adding above code nothing will be broken and you wont see any difference exept for correct encoding.
In case you need to move the site from one server to another, you can undo those steps and everything will still work as expected.
It tried to reproduce your issue with the given HTML entity and placeholder encodes the character correctly.
Resolved. I used unicode code point instead numerical HTML encoding. Take a look at UTF-8 encoding table and unicode characters here.
<label for="abc" id="xyz">http://abc.com/player.js</xref>?xyz="foo" </label>
is ignoring
</xref> tag
value in the browser. So, the displayed output is
http://abc.com/player.js?xyz="foo"
but i want the browser to display
http://abc.com/player.js</xref>?xyz="foo"
Please help me how to achieve this.
It isn't being ignored. It is being treated as an end tag (for a non-HTML element that has no start tag). Use < if you want a < character to appear as data instead of as "start of tag".
That said, this is a URL and raw <, > and " characters shouldn't appear in URIs anyway. So encode it as http://abc.com/player.js%3C/xref%3E?xyz=%22foo%22
You should do it like this
"http://abc.com/player.js%3C/xref%3E?xyz=foo"
Url should be encoded properly to work as valid URL
Use encodeURI for encoding URLs for a valid one
var ValidURL = encodeURI("http://abc.com/player.js</xref>?xyz=foo");
See this answer on encodeURI for better knowledge.
I misunderstood the question, I thought the URI was to be used elsewhere within JavaScript. But the question pretty clearly states that the URI is to just be rendered as text.
If the text being displayed is being passed in from a server, then your best bet is to encode it before printing it on the page (or if you're using a template engine, then you can most likely just encode it on the template). Pretty much any web framework/templating engine should have this functionality.
However, if it is just static HTML, just manually encode the the characters. If you don't know the codes off the top of your head, you can just use some online converter to help, such as something like:
HTML Encode/Decode:
http://htmlentities.net/
Old Answer:
Try encoding the URI using the JavaScript function encodeURI before using it:
encodeURI('http://abc.com/player.js</xref>?xyz="foo"');
You can also decode it using decodeURI if need be:
decodeURI(yourEncodedURI);
So ultimately I don't think you'll be able to get the browser to display the </xref> tag as is, but you will be able to preserve it (using encodeURI/decodeURI) and use it in your code, if this is what you need.
Fiddle:
http://jsfiddle.net/rk8nR/3/
More info:
When are you supposed to use escape instead of encodeURI / encodeURIComponent?
I have a Perl program that is reading html tags from a text file. (im pretty sure this is working because when i run the perl program on the command line it prints out the HTML like it should be.)
I then pass that "html" to the web page as the return to an ajax request. I then use innerHTML to stick that string into a div.
Heres the problem:
all the text information is getting to where it needs to be. but the "<" ">" and "/" are getting stripped.
any one know the answer to this?
The question is a bit unclear to me without some code and data examples, but if it is what it vaguely sounds like, you may need to HTML-encode your text (e.g. using HTML::Entities).
I'm kind of surprized that's an issue with inserting into innerHTML, but without specific example, that's the first thing which comes to mind
There could be a mod on the server that is removing special characters. Are you running Apache? (I doubt this is what's happening).
If something is being stripped on the client-side, it is most likely in the response handler portion of the AJAX call. Show your code where you stick the string in the div.
I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.
The process works like this:
Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
Replace the tokens with real data
Tidy the HTML with a simple regex function that property formats HTML tag attribute values (ensures quotation marks, etc, since ActivePDF's rendering engine hates anything but single quotes around attribute values)
Send off the HTML to a web service that creates the PDF.
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (FireFox). ActivePDF pukes on these non-UTF8 characters.
My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it turns it all into gobbledegook doesn't change anything.
Private Shared Function ConvertToUTF8(ByVal html As String) As String
Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
Dim source As Byte() = isoEncoding.GetBytes(html)
Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function
Any ideas?
EDIT:
I'm getting by with this for now, though it hardly seems like a good solution:
Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function
Somewhere in that mess, the non-breaking spaces from the HTML template (the s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character
That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1 comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
What's the regexp, how does the templating work? There would seem to be a proper HTML parser involved somewhere if your strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:
for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
for HTML5: <meta charset="utf-8">
If you've done that, then any remaining problem is ActivePDF's fault.
If any one had the same problem as me and the charset was already correct, simply do this:
Copy all the code inside the .html file.
Open notepad (or any basic text editor) and paste the code.
Go "File -> Save As"
Enter you file name "example.html" (Select "Save as type: All Files (.)")
Select Encoding as UTF-8
Hit Save and you can now delete your old .html file and the encoding should be fixed
Problem:
Even I was facing the problem where we were sending '£' with some string in POST request to CRM System, but when we were doing the GET call from CRM , it was returning '£' with some string content. So what we have analysed is that '£' was getting converted to '£'.
Analysis:
The glitch which we have found after doing research is that in POST call we have set HttpWebRequest ContentType as "text/xml" while in GET Call it was "text/xml; charset:utf-8".
Solution:
So as the part of solution we have included the charset:utf-8 in POST request and it works.
In my case this (a with caret) occurred in code I generated from visual studio using my own tool for generating code. It was easy to solve:
Select single spaces ( ) in the document. You should be able to see lots of single spaces that are looking different from the other single spaces, they are not selected. Select these other single spaces - they are the ones responsible for the unwanted characters in the browser. Go to Find and Replace with single space ( ). Done.
PS: It's easier to see all similar characters when you place the cursor on one or if you select it in VS2017+; I hope other IDEs may have similar features
In my case I was getting latin cross sign instead of nbsp, even that a page was correctly encoded into the UTF-8. Nothing of above helped in resolving the issue and I tried all.
In the end changing font for IE (with browser specific css) helped, I was using Helvetica-Nue as a body font changing to the Arial resolved the issue .
I was having the same sort of problem. Apparently it's simply because PHP doesn't recognise utf-8.
I was tearing my hair out at first when a '£' sign kept showing up as '£', despite it appearing ok in DreamWeaver. Eventually I remembered I had been having problems with links relative to the index file, when the pages, if viewed directly would work with slideshows, but not when used with an include (but that's beside the point. Anyway I wondered if this might be a similar problem, so instead of putting into the page that I was having problems with, I simply put it into the index.php file - problem fixed throughout.
The reason for this is PHP doesn't recognise utf-8.
Here you can check it for all Special Characters in HTML
http://www.degraeve.com/reference/specialcharacters.php
Well I got this Issue too in my few websites and all i need to do is customize the content fetler for HTML entites. before that more i delete them more i got, so just change you html fiter or parsing function for the page and it worked. Its mainly due to HTML editors in most of CMSs. the way they store parse the data caused this issue (In My case). May this would Help in your case too