HTML entity encoding in a web application - html

I am looking for suggestions on avoiding a vulnerability through form data in a web application.
Which characters need to be encoded, as part of HTML entity encoding, to avoid injection attacks? Which characters, when injected into our form data, make it prone to HTML injection?
At present we validate the characters \",/,\,:,*,?,<,>,|,;,%,#,~ in user input to the form fields of our web application. We have also implemented encoding that converts {"<",">","\"","&"} to {"&lt;","&gt;","&quot;","&amp;"} wherever we allow these characters from user input into our form fields. Do we need to extend our encoding to any other characters to avoid vulnerabilities?
Please update me with your valuable suggestions ASAP.
Thanks & regards,
Sureshbabu

Basically it's enough to escape <, >, & and " to their corresponding HTML entities, but there are some more complicated attacks that use character patterns to make the browser switch to another encoding, in which the attacker has encoded the attack string.
Since this is complicated, there are libraries that are constantly updated to do this job as well as possible; one of them is HTML Purifier (for PHP).
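As a minimal sketch of the basic escaping described above, assuming Python as the server-side language (the thread does not name one): the standard library's html.escape converts &, < and >, and with quote=True also both quote characters. The function name render_comment is hypothetical.

```python
import html


def render_comment(user_text: str) -> str:
    """Escape user-supplied text before writing it into HTML."""
    # quote=True (the default) also escapes " and ', which matters
    # when the value ends up inside a quoted attribute.
    return html.escape(user_text, quote=True)


payload = '<script>alert("x")</script> & more'
print(render_comment(payload))
# prints: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt; &amp; more
```

This only covers text placed in element content or quoted attribute values. It does nothing about the encoding-switch attacks mentioned above, for which declaring the page's character encoding explicitly is the usual advice.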

Try Apache Commons - Lang:
http://commons.apache.org/lang/api-release/index.html
The StringEscapeUtils class provides methods for this problem.
http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html

Related

Store TinyMCE form data in a MySQL database and return as is

I am currently working on a CMS project and we wish to use TinyMCE as our WYSIWYG editor. We allow users to customize their content (for example, making some sections bold or using different indentation). What I want to know is how to store that form data in a MySQL database, and how to return the data with those styles intact. Are there problems for the database because the content contains HTML tags? This is our first time using a WYSIWYG editor. Thank you for any help.
To MySQL, the HTML generated by TinyMCE is just a string; you can store it as is (after your security filter has validated it).
However, your application must handle the string very carefully, because careless handling will result in cross-site scripting (XSS).
This is more difficult than preventing XSS on non-HTML fields, because escaping on output won't work: it would display the markup as text instead of rendering it.
The most robust way to prevent XSS on an HTML field is to filter the HTML on the server, using either a whitelist or a blacklist.
A blacklist is easier to implement, but may miss patterns we don't know about. A whitelist is more robust but more work: you define the tags and attributes you allow, and HTML containing other tags or attributes is either filtered or rejected. You can implement a whitelist with Jsoup (for better performance, I suggest customizing Jsoup).
You cannot rely on TinyMCE for this, because an attacker can easily bypass client-side validation.
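To illustrate the whitelist idea, here is a rough sketch in Python using only the standard library's html.parser. It is not Jsoup and not a production sanitizer. The tag list, the decision to drop every attribute, and the names AllowlistSanitizer and sanitize_html are all assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: formatting tags only, no attributes.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "u", "ul", "ol", "li", "br"}
VOID_TAGS = {"br"}                       # allowed tags that take no closing tag
DROP_WITH_CONTENT = {"script", "style"}  # removed together with their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            # Re-emit the bare tag; every attribute (onclick, style, ...) is dropped.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS and not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # Text is re-escaped so it can never be parsed as markup.
            self.out.append(escape(data))


def sanitize_html(fragment: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


dirty = '<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'
print(sanitize_html(dirty))  # prints: <p>Hi <b>there</b></p>
```

The sketch does not balance unclosed tags or allow links and images. A maintained library handles those cases and is the safer choice in practice.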

XSS attacks, multiple html sanitization

I'm working on a web service project where I display some data obtained from a database, which in turn is made up of users' inputs. Of course I want to prevent my application from being vulnerable to XSS attacks, so I sanitize HTML special characters in the input. But I have the following problem: data returned from the server is in the form &lt; (in this example, for the '<' sign), and on the front end a second sanitization process occurs, turning it into &amp;lt;, which the web browser does not display as '<'. Is there a simple way to get around this, or should I sanitize inputs in only one place (I presume the server would be the best option)?
Thanks for all answers.
You can't reliably sanitize user input. It's a losing battle. As soon as you think you've filtered out all the "bad" characters, someone will pass in an escape sequence or something else unexpected.
If you're using a database server, make sure all input is handled by pre-compiled stored procedures, and make sure that the user that the web app logs in as, only has EXECUTE perms. This prevents SQL injection and other mischief.
If you're worried about actual characters, make sure you have a "pass through OK characters" filter and not a "remove bad characters" filter. The number of "good characters" is finite, while the number of attack vectors is infinite.
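A "pass through OK characters" filter could look roughly like the following Python sketch. The particular character set and the name is_acceptable are assumptions for illustration; a real policy depends on what the field is meant to hold.

```python
import re

# Hypothetical allowlist: ASCII letters, digits, space, and a few punctuation marks.
_ALLOWED = re.compile(r"[A-Za-z0-9 .,_-]+")


def is_acceptable(value: str) -> bool:
    """Accept the value only if every character is on the allowlist."""
    return _ALLOWED.fullmatch(value) is not None


print(is_acceptable("John Smith-Jones"))           # prints: True
print(is_acceptable("<script>alert(1)</script>"))  # prints: False
```

Rejecting the whole value, rather than silently stripping characters, keeps the stored data identical to what the user was told had been accepted.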
As for your question about "<" characters, if the intended output is for user display, you can run the entire string through HttpServerUtility.HtmlEncode or its equivalent in whatever language you use. This will convert the string into code that will display properly in the browser but not be interpreted.
It doesn't look like you're having a problem escaping it, it looks like you're having a problem deciding if you need to escape it. Pick a standard and stick with it, then convert as necessary. If it normally comes in unescaped, just store it that way, and escape it when you want to display it.
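The double-encoding symptom from the question, and the "store unescaped, escape once on display" convention, can be reproduced in a few lines. This is a sketch assuming Python; html.escape plays the role of HtmlEncode here, and display is a hypothetical helper name.

```python
import html


def display(raw_value: str) -> str:
    """Encode exactly once, at the point where the value is written into HTML."""
    return html.escape(raw_value)


stored = "1 < 2"           # kept unescaped in storage
once = display(stored)     # "1 &lt; 2"      -> browser shows: 1 < 2
twice = html.escape(once)  # "1 &amp;lt; 2"  -> browser shows: 1 &lt; 2

print(once)
print(twice)
```

If both the server and the front end escape, the user sees the entity text instead of the character, so only one layer should own the escaping.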
The best way to sanitize untrusted data that is served back to a user, in the context of XSS with Spring Boot, is to use a template engine that suits your needs (e.g. JSP).
The template engine will generate the HTML you need, escape it properly, and insert the content into the required placeholder (in case an issue with broken encoding occurs for async requests).
Be careful to check whether the chosen engine escapes by default or needs a special directive to do so.

storing code snippets in a database

I want to make a code snippet database web application. Would the best way to store it in the database be to html encode everything to prevent XSS when displaying the snippets on the web page?
Thanks for the help!
The database has nothing to do with this; you simply need to escape the snippets when they are rendered as HTML.
At minimum, you need to encode all & characters as &amp; and all < characters as &lt;.
However, your server-side language already has a built-in HTML encoding function; you should use it instead of re-inventing the wheel. For more details, please tell us what language your server-side code is in.
Based on your previous questions, I assume you're using PHP.
If so, you're looking for the htmlspecialchars or htmlentities functions.
You would either have to escape it when you store it, or escape it when you display it. It'd probably be better to do it on display so that if you need to edit it later on, you don't have to decode it then re-encode it.
Also, you'll want to make sure you escape it properly when you store it in the database, otherwise you'd be leaving yourself open to SQL injection. Parameterized statements would be the best method; with them you shouldn't have to change the raw data at all.
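A sketch of that combination, storing the raw snippet through a parameterized statement and escaping only when rendering, assuming Python with the standard sqlite3 module instead of the PHP setup guessed at above. The table and function names are hypothetical.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snippets (id INTEGER PRIMARY KEY, body TEXT)")


def save_snippet(body: str) -> int:
    # The ? placeholder lets the driver pass the value separately from the SQL,
    # so the snippet is stored byte-for-byte and cannot alter the statement.
    cur = conn.execute("INSERT INTO snippets (body) VALUES (?)", (body,))
    conn.commit()
    return cur.lastrowid


def load_snippet(snippet_id: int) -> str:
    row = conn.execute(
        "SELECT body FROM snippets WHERE id = ?", (snippet_id,)
    ).fetchone()
    return row[0]


def render_snippet(snippet_id: int) -> str:
    # HTML escaping happens only here, at the boundary to the page.
    return "<pre><code>" + html.escape(load_snippet(snippet_id)) + "</code></pre>"


code = "if (a < b && c) return '<b>';"
sid = save_snippet(code)
print(render_snippet(sid))
```

Because the stored value is the original text, an edit form can load it back without any decode step.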
The best thing to do is to not store it in the database. I have seen people store stored procedures in databases as a row. Just because you can doesn't mean you should.
It doesn't matter how you store it; what matters is how you render it in the HTML representation. I'd guess you'll need to do some sort of sanitization before rendering the bytes. Another option might be to convert every character to an HTML entity; this might suffice to prevent any code or tags from actually being interpreted.
As an example, view the source of a Stack Overflow page with some example code, and see how they're representing the code in the HTML.

Should HTML be encoded before being persisted?

Should HTML be encoded before being stored in say, a database? Or is it normal practice to encode on its way out to the browser?
Should all my text based field lengths be quadrupled in the database to allow for extra storage?
Looking for best practice rather than a solid yes or no :-)
Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?
If it's application data, I think it's best to:
represent it in a form that is native to the environment (e.g. unencoded in the database), and
make sure it's properly translated as it crosses representational boundaries (encode when you generate the HTML page).
If you're a fan of MVC, this also helps separate the view/controller from the model (and from the persistent storage format).
Representation
For example, assume someone leaves the comment "I love M&Ms". It's probably easiest to represent it in the code as the plain-text string "I love M&Ms", not as the HTML-encoded string "I love M&amp;Ms". Technically, the data as it exists in the code is not HTML yet, and life is easiest if the data is represented as simply and accurately as possible. This data may later be used in a different view, e.g. a desktop app. It may be stored in a database, a flat file, or an XML file, and perhaps later be shared with another program. It's simplest for the other program to assume the string is in the "native" representation for the format: "I love M&Ms" in a database or flat file and "I love M&amp;Ms" in the XML file. I would cringe to see the HTML-encoded value encoded again in an XML file ("I love M&amp;amp;Ms").
Translation
Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, plain-text file, or XML file), it's important to make sure it is properly translated so it is represented accurately in a format native to that next environment. In short, when you go to display it on an HTML page, make sure it's translated to properly encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variables to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, ensure it's XML-encoded.
Failure to translate properly when crossing representation boundaries is the source of injection attacks such as SQL injection. Be conscious of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, JavaScript, XML, etc.).
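The "translate at each boundary" idea can be sketched by taking one plain value and encoding it separately for each target format. This assumes Python and its standard library; the JSON and URL targets are additions for illustration and are not mentioned in the answer.

```python
import html
import json
from urllib.parse import quote
from xml.sax.saxutils import escape as xml_escape

comment = "I love M&Ms"  # the application keeps this plain form

as_html = html.escape(comment)  # for an HTML page
as_xml = xml_escape(comment)    # for XML element content
as_json = json.dumps(comment)   # for a JSON document
as_url = quote(comment)         # for a URL path or query component

print(as_html)  # prints: I love M&amp;Ms
print(as_xml)   # prints: I love M&amp;Ms
print(as_json)  # prints: "I love M&Ms"
print(as_url)   # prints: I%20love%20M%26Ms
```

Each encoder starts from the same unencoded value, so no target ever receives text that was already encoded for a different one.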
--
On the other hand, if you are really trying to save HTML page fragments to the database, then I am unclear what you mean by "encoded before being stored". If it is strictly valid HTML, all the necessary values should already be encoded (e.g. &amp;, &lt;, etc.).
The practice is to HTML encode before display.
If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.
You should save the original form in your database. This preserves the original, and you may want to do other processing on that rather than on the encoded version.
Database vendor specific escaping on the input, html escaping on the output.
I disagree with everyone who thinks it should be encoded at display time. If it is encoded before it reaches the database, an attack is only possible if a developer purposely decodes it before displaying it. However, if you rely on encoding it when presenting it, there is always a chance that the step is missed by some newbie developer, like a new hire, or by a bad implementation. If it's sitting there unencoded, it's just waiting to pop out on the internet and spread like herpes. Losing the original data shouldn't be a concern: encode + decode should produce the same data every time. Just my two cents.
For security reasons, yes, you should first convert the HTML to entities and then insert it into the database. Attacks such as XSS are initiated when you allow users (or rather bad guys) to use HTML tags and then you process/insert them into the database. XSS is one of the root causes of most security holes. So you definitely need to encode your HTML before storing it.

HTML encode user input when storing or when displaying

Simple question that keeps bugging me.
Should I HTML encode user input right away and store the encoded contents in the database, or should I store the raw values and HTML encode when displaying?
Storing encoded data greatly reduces the risk of a developer forgetting to encode the data when it's being displayed. However, storing the encoded data will make datamining somewhat more cumbersome and it will take up a bit more space, even though that's usually a non-issue.
I'd strongly suggest encoding information on the way out. Storing raw data in the database is useful if you wish to change the way it's viewed at a certain point. The flow should be something similar to:
sanitize user input -> protect against sql injection -> db -> encode for display
Think about a situation where you might want to display the information as an RSS feed instead. Having to undo any HTML-specific encoding before you re-display seems a bit silly. Any development should always follow the "don't trust input" meme, whether that input is from a user or from the database.
Keep in mind that you may need to access the database with something that doesn't understand HTML encoded text (e.g., a reporting tool). I agree that space is a non-issue, but IMHO, putting HTML encoding in the database moves knowledge of your view/front end into the lowest tier in the application, and that is a design mistake.
The encoding should only only only be done in the display. Without exception.
Output.
With HTML you can't simply check the length of a string (&amp; is 1 character, but strlen() will tell you 5), and you can't easily crop it (that could break entities).
You may need to mix strings from database with strings from another source, or read and write them back. Doing this application-wide without missing any escaping and avoiding double escaping is a nightmare.
PHP tried to do a similar thing with magic_quotes and it turned out to be a huge failure. Don't take the magic_entities route! :)
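The length and cropping problem is easy to reproduce. This is a small sketch assuming Python, with len() standing in for PHP's strlen(); crop_for_display is a hypothetical helper.

```python
import html

raw = "Tom & Jerry"
encoded = html.escape(raw)  # "Tom &amp; Jerry"

print(len(raw))      # prints: 11
print(len(encoded))  # prints: 15
print(encoded[:6])   # prints: Tom &a   (entity cut in half)


def crop_for_display(raw_value: str, limit: int) -> str:
    """Crop the raw text first, then encode, so no entity is ever split."""
    return html.escape(raw_value[:limit])


print(crop_for_display(raw, 5))  # prints: Tom &amp;
```

With raw storage, length limits and truncation operate on what the user actually typed, and encoding is applied last.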