django escaping for html in json - html

I was going over my Django site looking for XSS problems. I figured I had it covered, since Django does auto-escaping. So I put the usual alert('foo'); in the sample data and found a huge hole: where I use AJAX to pull data down as JSON and jQuery's .append() to add it to the page, none of it is HTML-escaped. Oops.
So my question is what is the best way to fix this:
Use my own copy of simplejson that auto escapes based on a param.
Just make sure I always use escape() when creating dicts that are going to be json dumped
Always use .text on the client side
Something I haven't thought of
It seems like this is a pretty easy problem to get yourself into.

Do something that is obvious/transparent/automatic, like Joel suggested here: http://www.joelonsoftware.com/articles/Wrong.html
Still, I don't see how "alert('foo');" can be harmful when injected into HTML. It would only be harmful if it were wrapped in a <script> tag.
As for escaping HTML, you have to decide whether to do it on input or on output. Depending on what you want to achieve (e.g. allowing a subset of HTML tags) and on performance considerations, you might want to escape the input and store the escaped HTML in the database.
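As a rough illustration of the second option from the question (escaping values on the server before they are dumped to JSON), here is a minimal sketch. It uses the standard library's html.escape as a stand-in for Django's own escape(), and the helper names escape_for_html and dumps_escaped are made up for this example.

```python
import json
from html import escape  # stdlib stand-in for Django's escape()


def escape_for_html(value):
    """Recursively HTML-escape every string inside a JSON-serialisable value."""
    if isinstance(value, str):
        return escape(value)
    if isinstance(value, dict):
        return {escape_for_html(k): escape_for_html(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [escape_for_html(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


def dumps_escaped(payload):
    """Hypothetical helper: escape all strings, then serialise to JSON."""
    return json.dumps(escape_for_html(payload))
```

The trade-off is that every consumer of the JSON now receives HTML-escaped text, so a client that inserts it with .text instead of .append would show the entities literally. Escaping in one place only, either here or on the client, avoids double escaping.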

Related

Store arbitrary characters in Semantic MediaWiki

I'm trying to store some text containing HTML tags in properties, which doesn't work. I created a form for a property with the data type 'text', plus a template. Saving the form writes the text into the template, but it can't be displayed, I guess because it contains illegal characters.
What I'm trying to do:
I need a form to enter data containing HTML tags and special characters
I'd like to be able to use a query to find all those pages
and show that text using a template I provide to the ask query.
I also tried to use the free text option, but then I can't retrieve it using the ask query.
What would be the best, or at least a working solution to this?
Thanks a lot
Storing text with HTML tags is a bit tricky in Semantic MediaWiki.
The reason is the StripMarkers (UNIQ/QINU) mechanism invented by the MediaWiki developers.
When the content of a page with HTML tags in it is parsed, the parsing of those tags is in effect "postponed". This technical detail unfortunately makes it hard for extension developers, such as the SMW developers, to handle such content. It also makes it hard for lay people to follow the discussion on how to solve the problem.
Here are two examples of SMW issues that are marked as "closed". That status means that following the configuration hints in the issue should solve your problem. If it doesn't, please ask a question on the SMW issue list, or even ask for the issues to be reopened.
https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/794
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3707
On my wiki we ran into this and resolved it by replacing special characters (we had issues with [ ] =, but the same problem happens with < > tags too) with alternate Unicode characters, using the regex extension and a template, before setting the property with {{#set:}}. If you want to display the formatted text on the wiki directly, call that parameter separately without replacing the special characters.
When you want to display the property, you can then run the reverse replacement with regex before displaying your now intact code (using the template result format to allow you to perform the operation on the output of the query).
To switch to special characters you can create this template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/=/|꞊}}|/\[/|[}}|/\]/|]}}|/>/|≽}}|/</|≼}}
And to switch back you can use this as a template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/꞊/|=}}|/[/|[}}|/]/|]}}|/≽/|>}}|/≼/|<}}
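Outside the wiki, the same round trip can be sketched in Python. This is only an illustration of the substitution idea, not a replacement for the templates. The function names are invented for the example, and the bracket substitutes are left out because the characters used for them did not survive in the templates as posted.

```python
# Substitute characters copied from the templates above. The [ and ]
# substitutes are omitted because they are not recoverable from the post.
TO_SAFE = str.maketrans({"=": "꞊", "<": "≼", ">": "≽"})
FROM_SAFE = str.maketrans({"꞊": "=", "≼": "<", "≽": ">"})


def encode_for_property(text):
    """Swap markup-significant characters for look-alikes before storing."""
    return text.translate(TO_SAFE)


def decode_from_property(text):
    """Reverse the substitution before displaying the stored value."""
    return text.translate(FROM_SAFE)
```

One caveat with this scheme: text that already contains one of the substitute characters will be changed when it is decoded, so the substitutes should be characters that never occur in real content.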

Ignore HTML tags in QC REST API output

Using HP ALM REST API, we get the Memo fields embedded with HTML tags such as <html>, <span>, <body>, etc. Is there a way to suppress the same using any options?
Using the earlier OTA API, we had the option to use tdconnection.IgnoreHtmlFormat=True, which used to suppress these tags, but using REST API, I am unable to find an equivalent one. Any suggestions or should I build a parser myself after reading the output?
I personally don't know of a switch like that.
Alternatively you might try this:
How to Parse Only Text from HTML
On paper this seems quite nice.
This requires an extra step, though. After getting the response you'd have to run it through the proposed library to get the flat text. That shouldn't be more than a line of code, I think.
The downside is that some things might go south because you dump any formatting stored as HTML. Usually that isn't much, though. It depends on the project and the people, of course.
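To give an idea of how small that extra step can be, here is a sketch that flattens a memo field to plain text using only Python's standard html.parser module. It is not the library from the linked question, and memo_to_text is a name made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def memo_to_text(memo_html):
    """Hypothetical helper: return the plain text of an HTML memo field."""
    parser = _TextExtractor()
    parser.feed(memo_html)
    parser.close()
    return parser.text()
```

Because tags are simply dropped, two block elements with no whitespace between them end up with their text run together, which is one of the formatting losses mentioned above.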

How can I take an xml string and display it on my page similiar to how StackOverflow does it with 'insert code'?

I'm using the DataContractSerializer to convert an object returned from a WCF call to XML. The client would like to see that XML string in a web page. If I output the string directly to a label, the browser obviously strips out the angle brackets. My question is: how can I do something similar to StackOverflow? Are they doing a find & replace to replace angle brackets with their HTML entities? I see they put a code tag inside a pre tag and then make spans with the appropriate class. Is there an existing utility I can use to do this instead of writing some kind of parsing routine? I'm sure something free must be out there. If anyone can direct me to the right place, or to some code that can easily accomplish this, I would greatly appreciate it. I apologize if this is more of a meta.stackoverflow question. Thanks for any tips.
The basic answer is that to get HTML displayed as typed, special characters and all, you need to replace the special characters (<, > etc.) with their escaped equivalents (&lt;, &gt; etc.). Beyond that, if you want syntax colouring, you'll have to parse the input to identify the keywords etc.
A full list of the special characters and their escape codes can be found here, but this is just one site of many.
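The replacement step itself is tiny. The question is about .NET, where an HTML-encoding helper would do the same job, but the idea can be sketched in Python with the standard library. The function name xml_to_html_block is made up for this example.

```python
from html import escape


def xml_to_html_block(xml_text):
    """Escape &, < and > and wrap the result so it displays verbatim."""
    # quote=False leaves quotation marks alone; inside element content
    # only &, < and > need escaping.
    return "<pre><code>" + escape(xml_text, quote=False) + "</code></pre>"
```

The ampersand has to be escaped as well as the angle brackets, otherwise entities already present in the XML (such as &amp;) would be rendered as the characters they stand for rather than shown as typed.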
You're talking about "pretty print". If you want to display source code, you could use this link: 16 Free Javascript Code Syntax Highlighters For Better Programming
But if you only want to display XML, there are some functions on the web that could help you with that, like this one: XML PHP Pretty Printer
And don't forget the special characters =)
good luck

Django templatetag for rendering a subset of html

I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reason, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
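The parse, filter and re-serialise approach described above can be sketched with ElementTree for the XHTML case. This is only an illustration under several assumptions: the input is well-formed XML without namespaces or HTML-only entities such as &nbsp;, the whitelist and the sanitize_xhtml name are invented for the example, and disallowed elements are dropped together with their contents. For real user input a dedicated, maintained sanitisation library is the safer choice, and the standard XML parsers are not hardened against maliciously constructed documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"div": set(), "p": set(), "strong": set(), "em": set(), "a": {"href"}}
# Only absolute links with these schemes are kept; relative URLs are dropped too.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


def _clean(src):
    """Return a copy of src holding only whitelisted tags and attributes."""
    dst = ET.Element(src.tag)
    for name, value in src.attrib.items():
        if name not in ALLOWED[src.tag]:
            continue
        if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
            continue
        dst.set(name, value)
    dst.text = src.text
    last = None  # most recently kept child, which receives following text
    for child in src:
        if child.tag in ALLOWED:
            last = _clean(child)
            last.tail = child.tail
            dst.append(last)
        else:
            # Drop the element and its contents, keep the text after it.
            tail = child.tail or ""
            if last is None:
                dst.text = (dst.text or "") + tail
            else:
                last.tail = (last.tail or "") + tail
    return dst


def sanitize_xhtml(fragment):
    """Parse an XHTML fragment, filter it, and serialise it again."""
    root = ET.fromstring("<div>" + fragment + "</div>")  # raises on malformed input
    return ET.tostring(_clean(root), encoding="unicode")
```

Because the output is rebuilt from a parsed tree rather than edited as a string, malformed or oddly spelled tags never reach the page: input that does not parse is rejected outright, and anything not on the whitelist is simply never copied.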
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}

Rails - Escaping HTML using the h() AND excluding specific tags

I have been wondering how to accomplish the following, and so far I have been unable to find any answers online.
Let's say I have a string that contains the following:
my_string = "<strong>Hello</strong>, I am a <i>string</i>."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag in particular can be inserted in many different ways, and detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (it seems better, though I have never tried it).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business. Follow hrnt's advice, and consider that there are probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating Sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord. However, Nokogiri and specifically Loofah seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use Nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.