Django templatetag for rendering a subset of html - html

I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reason, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.

There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.

Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.

You can use removetags to specify list of tags to be remove:
{{ data|removetags:"script" }}

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited I want to save the edits on a separate file, so that the template is still available.(Just so you guys have a little bit of background regarding what I need this for).
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML, as it uses the same magic characters for its markup and the same logic where to put the data.
The characters <> are used to separate the markups from the content.
The character & together with an entity code like < is used to encode characters, which would lead to troubles otherwise
elements can contain attributes like <someElement someAttribute="attr value">
elements can contain text or sub elements
The big difference is, that XML is absolutely free how you name your elements and attributes, while HTML relys on dedicated names (like <body>), whereas XML is absolutely strict in structure while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read HTML as XML, but you can easily create XML which is taken by any browser as a valid web page.
Your issue cries for XSLT. This is a method to transform a given XML into a new format. This allows for example, to export your data as XML and create a nice web page from it. Different XSLT will present the same data in different ways.
There are several online tools to test this feature. you might have a look here.
Your statement After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data is not all clear... How you send data (to a web application), and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using Java Script to read, edit and send it back.
XML -> XSLT - HTML is often seen to create (rather static) reports for a web viewer

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separated xml attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my #texts_arr = $page_node->findnodes('text');
my $test_node = $texts_arr[1];
print "node\'s bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML which is the work of the larger parser project Marpa, which parses any language that can be described in BNF, documented on the author's blog which is very interesting but much newer and experimental.
I also see that wildly successful WWW::Mechanize uses HTML::TokeParser, and it uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat, I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you, seems like it might do what you need HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Parsing Random Web Pages

I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content is more(not always) like:
Some Title
Text related to Title
I guess I don't need to extract complete Text but some way to know where the Title/Paragraph and extract the content from there. The content itself may have images/links that I would like to retain.
Thanks!
Please see this answer: RegEx match open tags except XHTML self-contained tags
Use Python. http://www.python.org/
Use Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
You need to use a proper HTML parser, and extract the elements you’re interested in via the parser’s API (or via the DOM).
Since I don’t know what language you’re programming in, it’s rather difficult to recommend a parser, but some well known ones are Jericho for Java, and Beautiful Soup for Python.

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

How do you parse a poorly formatted HTML file?

I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use a XML-like parser.
So far, the best strategy I can think of, is to define a template for each kind of page, like:
Template A:
<html>
...
<tr><td>Table column that is missing a td
<td> Another table column</td></tr>
<tr><td>$data_item_1$</td>
...
</html>
Template B:
<html>
...
<ul><li>Yet another poorly formatted page <li>$data_item_1$</td></tr>
...
</html>
This way I would only need one single parser for all the pages, that would compare each page with its template and retrieving the $data_item_1$, $data_item_2$, etc. Still, it is going to be a lot of work. Can you think of any simpler solution? Any library that can help?
Thanks
You can pass the page's source through tidy to get a valid page. You can find tidy here
. Tidy has bindings for a lot of programming languages. After you've done this, you can use your favorite parser/content extraction technique.
I'd recommend Html Agility Pack. It has the ability to work with poorly structured HTML while giving you Xml like selection using Xpath. You would still have to template items or select using different selections and analyze but it will get you past the poor structure hump.
As mentioned here and on other SO answers before, Beautiful Soup can parse weird HTML.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."
Use HTML5 parser like html5lib.
Unlike HTML Tidy, this will give you error handling very close to what browsers do.
There's a couple C# specific threads on this, like Looking for C# HTML parser.
Depending on what data you need to extract regular expressions might be an option. I know a lot of people will shudder at the thought of using RegExes on structured data but the plain fact is (as you have discovered) that a lot of HTML isn't actually well structured and can be very hard to parse.
I had a similar problem to you, but in my case I only wanted one specific piece of data from the page which was easy to identify without parsing the HTML so a RegEx worked very nicely.