Should HTML be encoded before being stored in, say, a database? Or is it normal practice to encode it on its way out to the browser?
Should all my text based field lengths be quadrupled in the database to allow for extra storage?
Looking for best practice rather than a solid yes or no :-)
Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?
If it's application data, I think it's best to:
represent it in a form that is native to the environment (e.g. unencoded in the database), and
make sure it's properly translated as it crosses representational boundaries (encode when you generate the HTML page).
If you're a fan of MVC, this also helps separate the view/controller from the model (and from the persistent storage format).
Representation
For example, assume someone leaves the comment "I love M&Ms". It's probably easiest to represent it in the code as the plain-text String "I love M&Ms", not as the HTML-encoded String "I love M&amp;Ms". Technically, the data as it exists in the code is not HTML yet, and life is easiest if the data is represented as simply and accurately as possible. This data may later be used in a different view, e.g. a desktop app. This data may be stored in a database, a flat file, or an XML file, and perhaps later be shared with another program. It's simplest for the other program to assume the string is in the "native" representation for the format: "I love M&Ms" in a database and flat file and "I love M&amp;Ms" in the XML file. I would cringe to see the HTML-encoded value encoded again in an XML file ("I love M&amp;amp;Ms").
Translation
Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, plain-text file, or XML file), it's important to make sure it is properly translated so it is represented accurately in a format native to that next environment. In short, when you go to display it on an HTML page, make sure it's translated to properly encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variables to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, ensure it's XML-encoded.
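A minimal Python sketch of the translation steps described above. The in-memory SQLite database and the comments table are assumptions made for illustration only. The value stays in plain form in the code and in the database, and is translated only at the moment it crosses into HTML or XML.

```python
import html
import sqlite3
from xml.sax.saxutils import escape as xml_escape

comment = "I love M&Ms"  # plain-text value as the application sees it

# Crossing into SQL: a bound parameter lets the driver handle quoting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")  # hypothetical table
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Crossing into HTML or XML: encode at the moment of output.
html_fragment = "<p>" + html.escape(stored) + "</p>"
xml_fragment = "<comment>" + xml_escape(stored) + "</comment>"

print(stored)         # I love M&Ms
print(html_fragment)  # <p>I love M&amp;Ms</p>
print(xml_fragment)   # <comment>I love M&amp;Ms</comment>
```

The stored value is byte-for-byte what the user typed, so a desktop view or a report can use it without decoding anything first.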
Failure to translate properly when crossing representation boundaries is the source of injection attacks such as SQL-injection attacks. Be conscious of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, JavaScript, XML, etc).
--
On the other hand, if you are really trying to save HTML page fragments to the database, then I am unclear on what you mean by "encoded before being stored". If it is strictly valid HTML, all the necessary values should already be encoded (e.g. &amp;, &lt;, etc).
The practice is to HTML encode before display.
If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.
You should save the original form in your database. This preserves the original, and you may want to do other processing on that rather than on the encoded version.
Database vendor specific escaping on the input, html escaping on the output.
I disagree with everyone who thinks it should only be encoded at display time. If it's encoded before it reaches the database, an attack is only possible if a developer purposely decodes it before displaying it. However, if you rely on encoding it just before presenting it, there is always a chance that the step gets missed by some other newbie developer, like a new hire, or by a bad implementation. If it's sitting there unencoded, it's just waiting to pop out on the internet and spread like herpes. Losing the original data shouldn't be a concern: encode + decode should produce the same data every time. Just my two cents.
For security reasons, yes, you should first convert the HTML to its entities and then insert it into the database. Attacks such as XSS are initiated when you allow users (or rather bad guys) to use HTML tags and then you process/insert them into the database. XSS is one of the root causes of most security holes. So you definitely need to encode your HTML before storing it.
Please, this is my first question here.
I am working on a project for an accounting site to help generate an iXBRL file from account details, which are in JSON and XLSX format.
Has anyone worked with something similar who can walk me through how to go about it?
Welcome #Abiola Aribisala.
An ixbrl file, also known as Inline XBRL or the XHTML syntax of XBRL, requires two things:
The "print friendly" part, in "raw" XHTML, that a human user can look at;
Extra tags within this XHTML (they are in a namespace specific to XBRL), which are the machine-readable part.
Thus, in order to produce Inline XBRL syntax, you first need to have a print friendly version in a format that can be converted to XHTML (like Word, etc), as this cannot be automated just reading from JSON. I imagine that if the Excel file is nicely formatted, it might be possible to convert it to some "raw" XHTML in some way, too.
Second, for the tags, you need a data source with all the contexts, characteristics, etc for each fact value. If your JSON data is in xBRL-JSON format, it should contain this information. Otherwise, it requires extra work.
Finally, a challenge is knowing which tag to put where in the XHTML, i.e. "merging" the print version with the data. In a regular setup, this comes from a common source that generated both the print version and the machine-readable data. That way, this common source can directly generate the Inline XBRL file, which is best for quality and correctness.
If the binding between the print version and the data is not available, one could in theory put all the tags in an ix-hidden section in XHTML, however it defeats the purpose of tagging the data exactly where it is on the XHTML page, i.e., it makes it less interactive.
I'm working on a web service project, where I display some data obtained from a database, which in turn is made of users' inputs. Of course I want to prevent my application from being vulnerable to XSS attacks, so obviously I sanitize the input by escaping HTML special characters. But I have the following problem: data returned from the server is in the form &lt; (in this example, for the '<' sign), and on the front end a second sanitization pass occurs, making it &amp;lt;, which is totally incomprehensible to the web browser. Is there a simple way to get around it, or should I sanitize inputs in only one place (I presume that the server would be the best option)?
Thanks for all answers.
You can't reliably sanitize user input. It's a losing battle. As soon as you think you've filtered out all the "bad" characters, someone will pass in an escape sequence or something else unexpected.
If you're using a database server, make sure all input is handled by pre-compiled stored procedures, and make sure that the user that the web app logs in as, only has EXECUTE perms. This prevents SQL injection and other mischief.
If you're worried about actual characters, make sure you have a "pass through OK characters" filter and not a "remove bad characters" filter. The number of "good characters" is finite, while the number of attack vectors is infinite.
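A small Python sketch of a "pass through OK characters" check. The username rule below (ASCII letters, digits, underscore, and hyphen, 1 to 32 characters) and the function name are hypothetical, chosen only to illustrate the allowlist idea.

```python
import re

# Hypothetical rule: only these characters are allowed, 1 to 32 of them.
ALLOWED_USERNAME = re.compile(r"[A-Za-z0-9_-]{1,32}")


def is_valid_username(value: str) -> bool:
    """Accept the value only if every character is on the allowlist."""
    return ALLOWED_USERNAME.fullmatch(value) is not None


print(is_valid_username("alice_01"))  # True
print(is_valid_username("<script>"))  # False
```

A check like this fits fields with a narrow, known format. Free text such as comments cannot be constrained this way and still needs encoding at output time.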
As for your question about "<" characters, if the intended output is for user display, you can run the entire string through HttpServerUtility.HtmlEncode or its equivalent in whatever language you use. This will convert the string into code that will display properly in the browser but not be interpreted.
It doesn't look like you're having a problem escaping it, it looks like you're having a problem deciding if you need to escape it. Pick a standard and stick with it, then convert as necessary. If it normally comes in unescaped, just store it that way, and escape it when you want to display it.
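The double-encoding symptom from the question can be reproduced in a few lines. This is a sketch using Python's standard html module as a stand-in for whatever encoder the server and front end actually use.

```python
import html

raw = "a < b"

once = html.escape(raw)    # encoded one time: the browser displays  a < b
twice = html.escape(once)  # encoded a second time: the browser displays  a &lt; b

print(once)   # a &lt; b
print(twice)  # a &amp;lt; b
```

Encoding in exactly one place, at the point of display, avoids the second pass.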
The best way to sanitize untrusted data that is served back to a user, in the context of XSS for Spring Boot, is to use a template engine that suits your needs (e.g. JSP).
A template engine will automatically generate the HTML you need, escape it properly and insert the content in the required placeholder (if an issue with broken encoding occurs for async requests).
Be careful and check if a chosen engine does it by default or it needs a special directive to do so.
Has anyone implemented a good system for ensuring that output is properly HTML-encoded where it makes sense? Maybe even something that recognizes when output should be URL-encoded or JSON-encoded instead?
The lazy approach — just encoding all inputs — causes problems when you want to send those inputs to a database, or to a block of JavaScript code. So something a little smarter is needed.
The tedious approach — putting the proper encoding function around each piece of data on the template — works, but it's easy for developers to forget to do it.
Is there a good approach that makes it easy for developers, and ensures that the right encoding is done? I was listening to one of the SO podcasts, and Joel tossed out an idea about using typed data to enforce a difference between HTML-encoded strings and non-encoded strings. Maybe that could be a starting point.
I'm looking more for a strategy than for an implementation in a particular language (although I'd be happy to hear about implementations that already exist and work).
EDIT: Here are some links I've found so far:
A type-based solution to the "strings problem"
String::Smart
Reducing XSS by way of Automatic Context-Aware Escaping in Template Systems
Secure String Interpolation in JS
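One way the typed-data idea might look, as a rough Python sketch. The SafeHtml class and the to_html and render helpers are hypothetical names invented for this illustration, not taken from any of the links above. Plain strings are always encoded, and only values explicitly wrapped as SafeHtml pass through untouched.

```python
import html


class SafeHtml:
    """Wrapper marking a string as already HTML-encoded (hypothetical type)."""

    def __init__(self, markup: str):
        self.markup = markup


def to_html(value) -> str:
    """Encode plain values; pass SafeHtml through unchanged."""
    if isinstance(value, SafeHtml):
        return value.markup
    return html.escape(str(value))


def render(template: str, **values) -> str:
    """Fill a str.format-style template, encoding every value on the way in."""
    return template.format(**{k: to_html(v) for k, v in values.items()})


page = render(
    "<p>{name} says: {body}</p>",
    name="Tom & Jerry",            # plain string, gets encoded
    body=SafeHtml("<em>hi</em>"),  # explicitly marked as markup
)
print(page)  # <p>Tom &amp; Jerry says: <em>hi</em></p>
```

With this design the safe behavior is the default: a developer has to opt out of encoding on purpose rather than remember to opt in.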
Data that goes into your database probably should not have any escaping for HTML, JavaScript, or what have you. If you do include markup, you'll just have to strip it out if you decide to inject this data into a CSV file or PDF, etc...
Instead, whenever you query 'raw' data like this out of the database, escape the data at that time as appropriate to wherever you're injecting it; HTML, a JavaScript string, server-side scripting, etc.
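As a sketch of what "escape as appropriate to wherever you're injecting it" could mean in practice, here is the same raw value encoded for three different targets using only the Python standard library. The sample value is made up.

```python
import html
import json
from urllib.parse import quote

value = 'Tom & "Jerry" <3'  # raw value as it would sit in the database

as_html = html.escape(value)    # for HTML text or attribute values
as_json = json.dumps(value)     # for a JSON document or API response
as_url = quote(value, safe="")  # for a single URL path or query component

print(as_html)  # Tom &amp; &quot;Jerry&quot; &lt;3
print(as_json)  # "Tom & \"Jerry\" <3"
print(as_url)   # Tom%20%26%20%22Jerry%22%20%3C3
```

Each target needs its own encoder, which is why a single encoding applied at storage time cannot serve all of them. Note that json.dumps on its own is not sufficient for embedding a value inside an inline script block in an HTML page.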
I want to make a code snippet database web application. Would the best way to store it in the database be to html encode everything to prevent XSS when displaying the snippets on the web page?
Thanks for the help!
The database has nothing to do with this; you simply need to escape the snippets when they are rendered as HTML.
At minimum, you need to encode all & characters as &amp; and all < characters as &lt;.
However, your server-side language already has a built-in HTML encoding function; you should use it instead of re-inventing the wheel. For more details, please tell us what language your server-side code is in.
Based on your previous questions, I assume you're using PHP.
If so, you're looking for the htmlspecialchars or htmlentities functions.
You would either have to escape it when you store it, or escape it when you display it. It'd probably be better to do it on display so that if you need to edit it later on, you don't have to decode it then re-encode it.
Also, you'll want to make sure you escape it properly when you store it in the database, otherwise you'd be leaving yourself open to SQL injection. Parameterized statements would be the best method, you shouldn't have to change the raw data at all.
The best thing to do is to not store it in the database. I have seen people store stored procedures in databases as a row. Just because you can doesn't mean you should.
It doesn't matter how you store it; what matters is how you render it in the HTML representation. I'd guess you'll need to do some sort of sanitization before rendering the bytes. Another option might be to convert every character to an HTML entity; this might suffice to prevent any code or tags from actually being interpreted.
As an example, view the source of a Stack Overflow page with some example code, and see how they're representing the code in the HTML.
Simple question that keeps bugging me.
Should I HTML encode user input right away and store the encoded contents in the database, or should I store the raw values and HTML encode when displaying?
Storing encoded data greatly reduces the risk of a developer forgetting to encode the data when it's being displayed. However, storing the encoded data will make datamining somewhat more cumbersome and it will take up a bit more space, even though that's usually a non-issue.
i'd strongly suggest encoding information on the way out. storing raw data in the database is useful if you wish to change the way it's viewed at a certain point. the flow should be something similar to:
sanitize user input -> protect against sql injection -> db -> encode for display
think about a situation where you might want to display the information as an RSS feed instead. having to redo any HTML specific encoding before you re-display seems a bit silly. any development should always follow the "don't trust input" meme, whether that input is from a user or from the database.
Keep in mind that you may need to access the database with something that doesn't understand HTML encoded text (e.g., a reporting tool). I agree that space is a non-issue, but IMHO, putting HTML encoding in the database moves knowledge of your view/front end into the lowest tier in the application, and that is a design mistake.
The encoding should only only only be done in the display. Without exception.
Output.
With HTML you can't simply check the length of a string (&amp; is 1 character, but strlen() will tell you 5), and you can't easily crop it (it could break entities).
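The length and cropping problem is easy to see in a short Python sketch, with html.escape standing in for the encoder and len standing in for strlen():

```python
import html

raw = "M&Ms"
encoded = html.escape(raw)  # "M&amp;Ms"

print(len(raw))      # 4
print(len(encoded))  # 8

# Cropping the encoded form can cut an entity in half.
print(encoded[:4])           # M&am
# Cropping the raw form first and encoding afterwards stays valid.
print(html.escape(raw[:2]))  # M&amp;
```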
You may need to mix strings from database with strings from another source, or read and write them back. Doing this application-wide without missing any escaping and avoiding double escaping is a nightmare.
PHP tried to do similar thing with magic_quotes and it turned out to be a huge failure. Don't take magic_entities route! :)