I'm working on an assistant (written in VB.NET for WinForms) to help with the screening and distribution of incoming e-mails inside my organization. That work is, and will continue to be, done by human employees (I just need to speed up their work). The program converts each message to an HTML document and its attachments to PDF, and stores them in an internal database. This part is already working.
I'm already using HtmlAgilityPack to handle inline images (src="cid:..."), but I'm worried about how to prevent malicious content inside the message from being activated when it is displayed (in a .NET WebBrowser control).
I thought of two things I could do, also with HtmlAgilityPack:
removing every <script> element;
changing every <a href="..."> attribute to "#".
Can anyone who is more experienced with this issue suggest additional steps I should take in this "cleansing" of each message's HTML?
Thank you very much!
As an extra layer of security you can:
check embedded URLs with a URL scanner. I suggest UrlVoid; they have an API too (pointless if you already have a proper virus scanner).
As suggested, you can remove all script blocks and, additionally, all style blocks:
' Requires: Imports System.Linq
Dim doc As New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
' ToList() takes a snapshot first, so nodes are not removed while Descendants() is being enumerated.
doc.DocumentNode.Descendants() _
    .Where(Function(n) n.Name = "script" OrElse n.Name = "style") _
    .ToList() _
    .ForEach(Sub(n) n.Remove())
strip any attributes you don't want to allow on elements, such as onclick and other JavaScript event handlers.
remove other unwanted tags: see "HTML Agility Pack strip tags NOT IN whitelist".
Note: There are a lot of powerful PHP HTML sanitizers/purifiers. You can play around with them to do some quick tests (or even use one to pre-process your content). HTML Purifier is the one most often recommended.
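The attribute filtering suggested above could be driven by a small predicate that decides, attribute by attribute, what to keep. This is only a sketch, written in JavaScript for brevity (the same logic ports to a VB.NET loop over each node's attributes). The name isSafeAttribute, the scheme allow-list and the list of URL-bearing attributes are illustrative assumptions, not part of HtmlAgilityPack, and the value is assumed to be entity-decoded already.

```javascript
// Assumption: schemes considered harmless when displaying a stored e-mail.
const ALLOWED_SCHEMES = new Set(["http", "https", "mailto", "cid"]);
// Assumption: attributes whose value is interpreted as a URL.
const URL_ATTRIBUTES = new Set(["href", "src", "action", "background"]);

// Returns true if the attribute may stay on the element, false if it should be removed.
function isSafeAttribute(name, value) {
  const attr = name.toLowerCase();

  // Event handlers: onclick, onload, onerror, ...
  if (attr.startsWith("on")) return false;

  // Non-URL attributes are left alone by this check.
  if (!URL_ATTRIBUTES.has(attr)) return true;

  // Strip whitespace and control characters, which can be used to
  // disguise a scheme such as "jav\tascript:".
  const compact = value.replace(/[\u0000-\u0020]/g, "");
  const match = /^([a-z][a-z0-9+.-]*):/i.exec(compact);

  // No scheme at all: a relative URL.
  if (!match) return true;

  return ALLOWED_SCHEMES.has(match[1].toLowerCase());
}
```

Note that it does not inspect style attributes, so those would still need separate handling or removal.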
Let's say that I have a website with an input text box where I can write whatever I like. What I write there will be displayed on another website. Please see the following example.
If I write the following into the text box:
<div style="height:50px;width:50px;background-color:red"></div>
It looks like this on the other website:
How can I make it display a red box (the rendered code) instead of a string?
What you're describing is very dangerous. It could expose your website to XSS vulnerabilities.
Setting HTML content dangerously works by setting innerHTML on any DOM element, or innerText on script tags. If you would like to protect against XSS, one way to do it is to set content using textContent instead.
const handleUserInput = (input) => {
  const { body } = document;
  body.innerHTML = input;   // very dangerous
  body.innerText = input;   // dangerous if used on <script>
  body.textContent = input; // inserted as a text node, safe
};
What's happening in your example is that the user input is being sanitized. While this is one way to prevent XSS, it's best to follow other best practices as well, because websites vulnerable to XSS attacks can allow attackers to access user sessions, identifying information, and more.
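The kind of sanitizing that turns the red box back into a visible string is usually plain HTML escaping. A minimal sketch follows, assuming the other site does something equivalent on its side; escapeHtml is a hypothetical helper name, not a built-in.

```javascript
// Replaces the characters that have special meaning in HTML with entities,
// so the browser displays the markup as text instead of rendering it.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const userInput = '<div style="height:50px;width:50px;background-color:red"></div>';
const displayed = escapeHtml(userInput);
```

Assigning to textContent gives the same visible result without building the escaped string by hand.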
There are ways to bypass input sanitizing, depending on the patterns and engines you use. You can view the XSS Filter Evasion Cheat Sheet. Note that using it without the client's permission may be considered hacking under the law; it should only be used for security research and approved testing.
If you are implementing this in order to be able to display HTML content on another web page, you should re-think this, and instead make pre-built components that only store input in memory, and only set text node values on the client side. It can open a world of vulnerabilities to your site.
If you are dead-set on this solution for your website, you must remove the user input HTML sanitization on your end, whether that is happening on the frontend, or backend.
If I use a data URI to construct a src attribute for an HTML element, can it in turn have another data URI inside it?
I know you can't use data URIs for iframes (I'm actually trying to construct an OSDX document and pass it to the browser with an icon encoded in base64, but that's a really niche use case and this is more of a general question), but assuming you could, my use case would look like:
var iframe = document.createElement('iframe');
var icon = document.createElement('image');
var iSrc = '*[REALLY LONG STRING]*/';
iframe.src = 'data:text/html,<html><body><image src="' + iSrc + '" /></body></html>';
document.body.appendChild(iframe);
Basically, what I'm after is: is there anything in a data URI that would break a parent data URI?
Yes you can. I really thought it was impossible, as did everyone I asked.
Example:
Pasting the following into your browser's URL bar should render a gmail logo in an html page that says hello world.
data:text/html,<html><body><p>hello world</p><img src="" /></body></html>
or for a shorter example courtesy of Pumbaa80:
data:text/html,<script src="data:text/javascript,alert('hello world')"></script>
MSDN explicitly supports this:
Data URIs can be nested.
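What can break the parent is the usual URL syntax: characters such as # (start of a fragment), % and " have to be percent-encoded before the inner URI is placed inside the outer one. The sketch below illustrates one way to do that with encodeURIComponent; toDataUri is a hypothetical helper, not a browser API.

```javascript
// Builds a non-base64 data URI, percent-encoding the payload so that
// characters like '#', '%', '"' and spaces cannot terminate or corrupt it.
function toDataUri(mimeType, content) {
  return `data:${mimeType},${encodeURIComponent(content)}`;
}

// Inner URI: a tiny script.
const inner = toDataUri("text/javascript", "alert('hello world')");

// Outer URI: an HTML page that references the inner URI. The inner URI is
// encoded a second time here, which is what keeps the nesting intact.
const outer = toDataUri("text/html", `<script src="${inner}"></script>`);
```

Decoding the outer payload once yields the HTML with the inner URI still in its own encoded form, so each layer is decoded by the consumer it belongs to.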
An old blog entry talks a little bit more about embedding images within CSS using data: :
Neither dataURI spec nor any other mentions if dataURI’es can not be nested. So here’s the testcase where dataURI’ed CSS has dataURI’ed image embedded. IE8b1, Firefox3 and Safari applied the stylesheet and showed the image, Opera9.50 (build 9613) applies the stylesheet but doesn’t show the embedded image! So it seems that Opera9 doesn’t expect to get anything embedded inside of an already embedded resource! :D
But funny thing, as IE8b1 supports expressions and also supports nested data URI’es, it has the same potential security flaw as Firefox does (as described in the section above). See the testcase — embedded CSS has the following code: body { background: expression(a()); } which calls function a() defined in the javascript of the main page, and this function is called every time the expression is reevaluated. Though IE8b1 has limited expressions support (which is going to be explained in a separate post) you can’t use any code as the expression value, but you can only call already defined functions or use direct string values. So in order to exploit this feature we need to have a ready javascript function already located on the page and then we can just call it from the expression embedded in the stylesheet. That’s not very trivial obviously, but if you have a website that allows people to specify their own stylesheets and you want to be on the safe side, you have to either make sure you don’t have a javascript function that can cause any potential harm or filter expressions from people’s stylesheets.
I have graphs in an html page. The graphs are generated by a call to a cgi-bin program in an IMG tag:
<IMG src="http://myserver.com/cgi-bin/StatBarChart.cgi?data=1,2,&data=3,5,1&legend=EC,ER">
Currently, the data for the graphs is passed as GET args (in the URL itself.)
Everything's working OK, but the GET arguments are too long. I want to pass the data via POSTDATA. All the books I have (and discussions on the web that I've found) talk about using POSTDATA in forms that include a Submit button. I just want the graphs to appear as part of the page, without a Submit. Can this be done? Can it be done in HTML4, or does it require JavaScript?
It would require JavaScript, as you would have to fetch the resource yourself and assign it to the img tag. This is not possible in HTML4 alone.
Also, I don't see the problem with a long URL. Your user will never see it (unless he looks at the source code, in which case I no longer consider him a simple "user"), so there is no problem with that either.
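For illustration, fetching the chart with POST and assigning the result to the img tag might look like the sketch below. It assumes a browser with fetch support and that StatBarChart.cgi accepts the same data and legend parameters as form-encoded POST data, which the question does not confirm; buildChartRequest and loadChart are hypothetical helper names.

```javascript
// Builds the fetch options: each data series becomes one "data" field,
// mirroring the repeated data=... arguments of the original GET URL.
function buildChartRequest(dataSeries, legend) {
  const body = new URLSearchParams();
  for (const series of dataSeries) {
    body.append("data", series.join(","));
  }
  body.append("legend", legend.join(","));
  return { method: "POST", body };
}

// Requests the chart and shows it in the given <img> element.
async function loadChart(img, url, dataSeries, legend) {
  const response = await fetch(url, buildChartRequest(dataSeries, legend));
  if (!response.ok) {
    throw new Error(`Chart request failed: ${response.status}`);
  }
  const blob = await response.blob();
  // An object URL lets the img display the downloaded bytes directly.
  img.src = URL.createObjectURL(blob);
}
```

It could be called as loadChart(document.querySelector("img"), "http://myserver.com/cgi-bin/StatBarChart.cgi", [[1, 2], [3, 5, 1]], ["EC", "ER"]) once the page has loaded.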
I've used WindowsHost to host a WebBrowser control, and that has allowed me to access the WebBrowser's Document/DOM directly, to read HTML content via mouse clicks on HTML document elements, and also to invoke submits on forms. I never found a way to do this, even in .NET 3.5, when I was searching at the time. I've found this post http://rhizohm.net/irhetoric/blog/72/default.aspx and it looks like, through some magic casting, you can expose the DOM. My question is: has anyone done this, and is it possible, once you get the DOM, to do invokes that submit content to HTML forms and also to get HTML elements via mouse click events?
Has anyone tried this and been able to do both?
Thanks
I'm using WPF.
add a reference to:
Microsoft.mshtml
then:
var doc = ( mshtml.HTMLDocument )_wbOne.Document;
and this gives you the raw string:
doc.documentElement.innerHTML
In return, if you know how to get information out of the HTML document, I'd appreciate it.
for example get all the s and and the metas and whatever else might be gettable so i can get the information from them? i don't want to dink around with the html, just get the info from them...:-)
I'm attempting to send HTML formatted emails using C# 3 via Outlook.MailItem
Outlook.MailItem objMail = (Outlook.MailItem)olkApp.CreateItem(Outlook.OlItemType.olMailItem);
objMail.To = to;
objMail.Subject = subject;
objMail.HTMLBody = htmlBody;
The email is generated externally by saving from an RTF control (TX Text Control), which yields HTML with links to images stored in a <<FileName>>_files subdirectory. Example:
<img border="0" src="file:///C:/Documents%20and%20Settings/ItsMe/Local%20Settings/Temp/2/zbt4dmvs_Images/zbt4dmvs_1.png" width="94" height="94" alt="[image]">
Sending the email this way generates a mail with broken links.
Using Outlook 2007 as the email client with Word as the email editor, switching to RTF (Options tab, Format tab group) preserves the layout and inlines the images.
Programmatically doing this via:
var oldFormat = objMail.BodyFormat;
objMail.BodyFormat = Outlook.OlBodyFormat.olFormatRichText;
objMail.BodyFormat = oldFormat;
loses the formatting and mangles the images (the image becomes a [image] link marker on screen which is clickable but no longer shows the image). This isn't a surprise given that the documentation for MailItem.BodyFormat Property says "All text formatting will be lost when the BodyFormat property is switched from RTF to HTML and vice-versa".
Sadly, there doesn't seem to be an easy way to change the Type of each Attachment in MailItem.Attachments to OlAttachmentType.olByValue, as it's a read-only property that's set when you create the Attachment.
An approach that comes to mind is to walk the HTML, replacing the <img> tags with markers, and then programmatically walk the MailItem text, inserting an Outlook.Attachment of Type OlAttachmentType.olByValue at each marker.
Another option is to convert the <img> links to use src="cid:uniqueIdN" and add the images as attachments with the referenced identities.
So, to the question... Is there a way to get the linked images converted to embedded images, ideally without getting into third party tools like Redemption? Converting to RTF happens to yield the outcome, but doing it that way is by no means a pre-requisite, and obviously may lose fidelity - I Just Want It to Just Work :D Neither of my existing ideas sound Clean to me.
Since you are using .NET > 2.0, you may want to look into the System.Net.Mail namespace for creating mail messages. I have found that it's quite customizable and was very easy to use for a task similar to yours. The only problems I had were making sure I was using the right encoding and having to use HTML tables for layout (CSS would not work right). Here are some links to show you how this works...
Basic
With multiple views (Plain Text and HTML)
If that's not an option, then I would recommend going the Content ID route and embedding the images as attachments. Your other option is to host the images publicly on a website, and change the image links in the html to the public images.
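The rewriting half of the Content ID route can be sketched as follows. It is shown in JavaScript only to keep the idea compact (the same replace logic ports to C# with Regex.Replace), and it is a rough sketch rather than a robust HTML rewrite: it assumes double-quoted src attributes and file:/// links like the one in the question. rewriteImagesToCid and the imageN naming scheme are hypothetical.

```javascript
// Replaces file:/// image links with cid: references and collects the
// files that must then be attached with the matching Content-ID.
function rewriteImagesToCid(html) {
  const attachments = [];
  const rewritten = html.replace(
    /(<img\b[^>]*?\bsrc=")file:\/\/\/([^"]+)(")/gi,
    (match, prefix, encodedPath, suffix) => {
      const contentId = `image${attachments.length + 1}`;
      // The link is URL-encoded (%20 for spaces); decode it to get the file path.
      attachments.push({ contentId, path: decodeURIComponent(encodedPath) });
      return `${prefix}cid:${contentId}${suffix}`;
    }
  );
  return { html: rewritten, attachments };
}
```

Each entry in attachments still has to be added to the message as an attachment whose Content-ID matches contentId; how that is done depends on the mail API in use.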
Something you should be aware of is that HTML emails can easily look like spam and can be treated as such by email servers and clients. Even ones that are just for in-house use can end up in Outlook's Junk Mail folder (it's happened to me).
DOH!, actually sending the email in Outlook 2007 forces the images to become embedded.
The Sent Item size of 8K is a lot smaller than the draft size of 60K (RTF) I was seeing vs the draft size of 1K (HTML that hadn't been converted to RTF and back again).
So it was Doing What I Mean all the time. Grr.
I'll leave the Q and the A up here in case it helps someone of a similarly confused state of mind.
BTW some useful links I found on my journey:
Sending emails example
General Q&A site with other examples of varying quality