I'm creating a PDF from a large collection of quotes that I've imported into Python with docx2python, using html=True so that they keep some tags. I've processed them so that they only contain bold, italic, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF as UTF-8. To use .write_html(), I had to create a new class, as shown in their readthedocs under the .write_html() method at the very bottom of the left-hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
    pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote)
The quotes going into the PDF all appear with their special characters and HTML tags (<u> or <i>) intact in the debugger, but after the .write_html() step they show up in the PDF with mojibake, even before the file is saved, as seen through the debugger. An example is "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support it).
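For readers unfamiliar with how this kind of garbling arises, here is a minimal sketch. It assumes the usual cause, UTF-8 bytes being read back as Windows-1252; whether that is exactly what happens inside .write_html() is not confirmed by the question.

```python
# Assumption: the garbling comes from UTF-8 bytes decoded as cp1252.
original = "day\u2019s demands"          # U+2019 is the curled apostrophe

# Encode correctly, then decode with the wrong codec to reproduce mojibake.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the text, provided nothing was lost.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

The round trip only works while every garbled character still maps back to a byte in cp1252, which is why a second round of mis-decoding is much harder to undo by hand.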
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() call, but this doesn't change the current font (or fix the mojibake) in the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italic tags into the PDF as literal text. The .write() method does preserve the special characters. I'm not really trying to change the font, just to make sure that what's written onto the PDF preserves the special characters instead of turning them into mojibake.
I've also attempted some .encode/.decode calls before passing the text to .write_html(), and tried some methods from the ftfy library. I also tried adding '' to the start of each quote, to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or if I'm at a dead-end and should just split each quote on '<', use if statements to detect underlines, italicize, etc., and use the .write() method after all.
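The fallback described in the last paragraph (splitting each quote on its tags and writing the pieces with .write()) could be sketched roughly as below. The function name split_runs and the style-string convention are assumptions made for illustration, and the sketch only understands the four tags mentioned in the question.

```python
import re

# Matches <b>, </b>, <i>, </i>, <u>, </u>, <br>, <br/> (case-insensitive).
TAG = re.compile(r"<(/?)(b|i|u|br)\s*/?>", re.IGNORECASE)

def split_runs(quote):
    """Yield (text, style) pairs; style is "", "B", "I", "U" or a combination."""
    active = set()   # styles currently switched on
    pos = 0
    for m in TAG.finditer(quote):
        if m.start() > pos:
            yield quote[pos:m.start()], "".join(sorted(active))
        closing, name = m.group(1), m.group(2).upper()
        if name == "BR":
            yield "\n", ""          # a break becomes a newline run
        elif closing:
            active.discard(name)
        else:
            active.add(name)
        pos = m.end()
    if pos < len(quote):
        yield quote[pos:], "".join(sorted(active))

for text, style in split_runs("the <i>day\u2019s</i> demands<br>end"):
    print(repr(text), repr(style))
```

Each run could then be written with pdf.set_font('NotoSans', style, 12) followed by pdf.write(text_height, text), assuming bold and italic variants of the font have also been registered with add_font.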
Extracting docx to HTML works really badly with docx2python. I did this a few months ago. I recommend PyDocX. docx2python is good for extracting the content of a docx file, not for converting it into HTML.
I am using magicsuggest as an auto-complete plugin in a web application built with web2py. I define a list variable dt=['张','李'] in model/db.py. The elements of the list are Chinese. However, when I embedded the variable in the HTML as {{=XML(dt)}}, following the magicsuggest manual, the Chinese characters were garbled. After several days of searching, I found that the list's Chinese characters were encoded as hex escapes in the HTML. I know something is wrong with the encoding/decoding. Could someone help me display the correct Chinese characters in the HTML?
XML() is meant to take a string, not a list of strings. If you pass it something other than a string, it will first be converted to a string, so your code is equivalent to {{=XML(str(dt))}}, and you'll notice that in Python, str(['张','李']) yields "['\\xe5\\xbc\\xa0', '\\xe6\\x9d\\x8e']".
Instead, you can do {{=XML(dt[0])}}, and you will see the first character in the list displayed properly.
If you want to display a comma separated list surrounded by brackets, you can do:
{{=json.dumps(dt, encoding="UTF-8", ensure_ascii=False)}}
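The line above is Python 2 syntax. As a hedged sketch of the same idea in Python 3 (where json.dumps no longer accepts an encoding argument), the key part is ensure_ascii=False; how the result is then passed to the web2py template is left as in the answer.

```python
import json

dt = ["张", "李"]

# Default behaviour escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(dt)

# ensure_ascii=False keeps the characters themselves in the output string.
readable = json.dumps(dt, ensure_ascii=False)

print(escaped)
print(readable)
```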
I am trying to create an HTML editor using JEditorPane. I want to read the input from the JEditorPane character by character and store it in a string. For example, if the user types <h, I want to read those two characters and suggest matching tags, in this case <html>, <header>, <head>, etc. (i.e. all tags starting with 'h'). I don't know which function to use to read characters from the JEditorPane as soon as the user types them.
You can use a DocumentListener. Read the section from the Swing tutorial on How to Write a DocumentListener for more information and examples.
If you are creating an editor, which just displays the text, not the actual formatting, then you should use a JTextArea or a JTextPane. A JEditorPane is really only for displaying existing HTML files.
KeyListener worked for me. With a KeyListener we can get the user's keystrokes as they are typed.
I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg; entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise always escape & as &amp; whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
However, it should be noted that, in an attribute value only, named character references from the above list are not processed as such by conforming HTML5 parsers if the next character is an = or an alphanumeric ASCII character.
For the full list of named character references with or without ending semicolons, see here.
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without a trailing semicolon if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as the entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But &reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= then causes a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
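The "double standard" can be sketched in a few lines of Python. This is an illustration of the rule quoted above, not a conforming HTML tokenizer: decode_refs is a hypothetical helper, it ignores numeric references, and it takes the set of semicolon-less names from the keys of the standard library's html.entities.html5 table that lack a trailing ";".

```python
import html
import re
from html.entities import html5

# Legacy names are the html5 keys listed without a trailing semicolon;
# longest first so that the longest matching prefix wins.
LEGACY = sorted((n for n in html5 if not n.endswith(";")), key=len, reverse=True)
_REF = re.compile(r"&([A-Za-z0-9]+)(;?)")

def decode_refs(text, in_attribute):
    """Decode named references; apply the attribute exception if requested."""
    def repl(m):
        name, semi = m.group(1), m.group(2)
        if semi and name + ";" in html5:
            return html5[name + ";"]            # properly terminated reference
        for cand in LEGACY:
            if name.startswith(cand):
                rest = name[len(cand):] + semi
                following = rest[:1] or text[m.end():m.end() + 1]
                if in_attribute and (
                    following == "=" or (following.isascii() and following.isalnum())
                ):
                    return m.group(0)           # "unconsumed": left as typed
                return html5[cand] + rest
        return m.group(0)
    return _REF.sub(repl, text)

url = "http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"
print(decode_refs(url, in_attribute=False))
print(decode_refs(url, in_attribute=True))
```

In text mode the result agrees with html.unescape from the standard library, which applies the text-content behaviour; in attribute mode only &reg_test is still decoded, because the underscore is neither = nor alphanumeric.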
Maybe try replacing your & with &amp;? Ampersands are characters that must be escaped in HTML as well, because they are reserved to be used as parts of entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
In the above example, the & character should be encoded as &amp;, like so:
2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, everything that could be an HTML entity is converted to the corresponding character.
Here is a simple solution and it may not work in all instances.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because &reg, as we know, triggers the special character ®.
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply enough, you need to encode the url format into html format for accurate representation (ideally you would do so with a template engine variable escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in php).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
Inactive code here:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
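The answer names PHP's htmlspecialchars; the same escaping step could look like this in Python, as a minimal sketch using the standard library's html.escape. The URL is the question's test case.

```python
import html

url = "http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"

# Escape the URL before placing it into HTML text or an attribute value.
escaped = html.escape(url)

# A browser (or html.unescape) turns the escaped form back into the URL.
round_trip = html.unescape(escaped)

print(escaped)
print(round_trip == url)
```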
It seems to me that what you have received from Google is not an actual URL but a variable that refers to a URL (a query string). That's why it is parsed as a registered trademark sign when rendered.
I would say you ought to URL-encode it and decode it whenever you process it, like any other variable containing special entities.
To prevent this from happening you should encode URLs, which replaces characters like the ampersand with a % followed by a hexadecimal number.
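A short sketch of that percent-encoding with Python's urllib.parse, assuming the whole query string is being stored as a single value (the variable names are illustrative):

```python
import html
from urllib.parse import quote, unquote

query = "foo=bar&region=catnip"

# safe="" makes quote() encode every reserved character, including & and =.
encoded = quote(query, safe="")

# With no bare ampersand left, HTML entity decoding has nothing to act on.
after_html_parsing = html.unescape(encoded)

print(encoded)
print(unquote(after_html_parsing))
```

Note that this changes the string itself, so it only helps where the consumer decodes it again before use; it is not a substitute for HTML-escaping a URL that must stay usable as a link.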
For HTML input, I want to neutralize all HTML elements that have inline js (onclick="..", onmouseout=".." etc).
I am thinking, isn't it enough to encode the following chars? =,(,)
So onclick="location.href='ggg.com'"
will become
onclick%3D"location.href%3D'ggg.com'"
What am I missing here?
Edit: I do need to accept active HTML (I can't just escape it all or entity-encode it).
There's no simple method to accept HTML, but not scripts.
You have to parse HTML to DOM, remove all unwanted elements and attributes in DOM and generate new HTML.
It can't be done reliably with regular expressions.
on* attributes are not enough. Scripts can be embedded in style, src, href and other attributes.
If you're using PHP, then use HTML Purifier.
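To make the parse, filter, and regenerate idea concrete, here is a rough allow-list sketch built on Python's standard html.parser. It is an illustration only and has not been vetted as a security control; the allowed tags, attributes, and URL schemes are assumptions chosen for the example, and a maintained sanitizer library remains the safer choice.

```python
import re
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"a", "b", "i", "u", "p", "br"}      # assumption: tiny allow-list
ALLOWED_ATTRS = {"a": {"href", "title"}}            # everything else is dropped
SAFE_SCHEMES = {"", "http", "https", "mailto"}      # "" means a relative URL
VOID_TAGS = {"br"}
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")

def _scheme(url):
    # Strip whitespace and control characters first, as browsers ignore some.
    cleaned = re.sub(r"[\x00-\x20]+", "", url)
    m = _SCHEME.match(cleaned)
    return m.group(1).lower() if m else ""

class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue             # drops on* handlers, style, src, ...
            if name == "href" and _scheme(value) not in SAFE_SCHEMES:
                continue             # drops javascript: and similar URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

print(sanitize('<a href="x.html" onclick="alert(1)">hi</a><script>alert(2)</script>'))
```

The design choice is to list what is allowed rather than what is forbidden, so attributes the author did not think of (style, src, new event handlers) are removed by default.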
You probably have a couple of options... the easiest way is to convert quotes, and possibly <> characters, to their HTML-encoded equivalents (&quot; etc.), which will result in the HTML code being displayed literally.
Tell me what server-side language are you using and I can point you towards more language-specific information, if you like. (For example, PHP has htmlspecialchars()[1]).
EDIT: I just actually read your question. Okay, you want to allow HTML through but no JavaScript? Well, for lack of a simple solution jumping to my mind, I suggest just using string replacement (regular expressions if you can, maybe?) to get rid of them entirely.
There is a finite set of event handler attributes in HTML. Couple that with the need for quotation marks and you're probably good.
For proof of concept, in Perl, you'd probably do something like this:
$myInput =~ s/\bon(?:mouseover|mouseout|click|focus|blur|[...])\s*=\s*(?:"[^"]*"|'[^']*')\s*//gi;
So, match the event handler name (only some of which I included), then the equals sign, then a quoted expression using either single or double quotes, allow optional whitespace on the end, and replace the entire thing with nothing (i.e., delete it).
That won't work for something requiring more levels of quotation, though, since eventually you would come back to the original delimiters. Forgive the contrived and completely useless example:
onclick="eval('3+prompt("Enter a number: ")')"
In THAT case, you might want to write a loop that parses the string first by word (i.e., looking for the event handler name), then going character by character, keeping track of the number of quoting levels as you go and keeping track of the current delimiter:
Mark the index of the beginning of the handler name (the "o" in onclick, etc.)
Start with quoting level 0 (or 1 after you've processed the opening quotation delimiter).
If the current delimiter is " and you see ', then increase the quoting level by 1 and switch current delimiter to '.
If the current delimiter is " and you see ", decrease the quoting level by 1 and switch current delimiter to '.
If the current delimiter is ' and you see ", then increase the quoting level by 1 and switch current delimiter to ".
If the current delimiter is ' and you see ', decrease the quoting level by 1 and switch current delimiter to ".
If the quoting level gets back down to 0, then your string has ended. Mark the index of where the string ends.
Use a string manipulation function to cut out the substring from the first index to the last index.
It's a little more time-consuming, but it should theoretically work no matter what, assuming the HTML is well-formed. (That's a horrible assumption, but if it's not well-formed you could just reject the input anyway!)
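For comparison, the single-regex idea from this answer can be ported to Python to see where it holds and where it breaks. This is a sketch with a deliberately short handler list (an assumption, as in the Perl version), and it shows the limitation rather than a recommended filter.

```python
import re

# Event handler name, "=", then one double- or single-quoted value.
HANDLER = re.compile(
    r"""\bon(?:mouseover|mouseout|click|focus|blur)\s*=\s*(?:"[^"]*"|'[^']*')\s*""",
    re.IGNORECASE,
)

simple = '<a href="x" onclick="go()">x</a>'
nested = '''<a onclick="eval('3+prompt("Enter a number: ")')">x</a>'''
unquoted = "<a onclick=go()>x</a>"

print(HANDLER.sub("", simple))     # handler removed
print(HANDLER.sub("", nested))     # stops at the first inner double quote
print(HANDLER.sub("", unquoted))   # unquoted value is not matched at all
```

The second and third cases are the kind of input that makes the pattern-based approach fragile.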
[1] http://us3.php.net/manual/en/function.htmlspecialchars.php