Technique for ensuring HTML- and URL-encoding - html

Has anyone implemented a good system for ensuring that output is properly HTML-encoded where it makes sense? Maybe even something that recognizes when output should be URL-encoded or JSON-encoded instead?
The lazy approach — just encoding all inputs — causes problems when you want to send those inputs to a database, or to a block of JavaScript code. So something a little smarter is needed.
The tedious approach — putting the proper encoding function around each piece of data on the template — works, but it's easy for developers to forget to do it.
Is there a good approach that makes it easy for developers, and ensures that the right encoding is done? I was listening to one of the SO podcasts, and Joel tossed out an idea about using typed data to enforce a difference between HTML-encoded strings and non-encoded strings. Maybe that could be a starting point.
I'm looking more for a strategy than for an implementation in a particular language (although I'd be happy to hear about implementations that already exist and work).
EDIT: Here are some links I've found so far:
A type-based solution to the "strings problem"
String::Smart
Reducing XSS by way of Automatic Context-Aware Escaping in Template Systems
Secure String Interpolation in JS

Data that goes into your database probably should not have any escaping for HTML, JavaScript, or what have you. If you do include markup, you'll just have to strip it out if you decide to inject this data into a CSV file or PDF, etc...
Instead, whenever you query 'raw' data like this out of the database, escape the data at that time as appropriate to wherever you're injecting it; HTML, a JavaScript string, server-side scripting, etc.

Related

How to write a parser for Markup?

I would like to program a parser for a Markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation, embedding images but allowing HTML for more complex formatting.
Has anyone done similar things or has worked closely with those systems and could describe generally how this could be done efficiently?
Although efficiency is not really of concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.
For your general discussion…
You should start with the following blueprint:
you need to iterate charwise over entire data
you need to identify every char by its context, for it may be a tag-opening ('<', '[' etc) or just the char. This may be done by having an escapement flag, triggered by an escape-char (like backslashes in some languages do). if you use that approach, you also need to check for an escaped escapement.
you may also need some flag telling you to be inside a comment or special data section, that may have different escapement rules.
you need to build a tree-like structure or at least some stack for nested tags. This is why regexes are a bad idea: they not only take much to much overhead, they're also of no use if you want to get the correct closing tag for the second x (x=any tag) in the following snipped: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>

Is there anything wrong with YAML format to be joined to the web standards

Well, I think YAML is really fantastic...
It's beautiful, easy to read, clever syntax...compared to any other data serialization format.
As a superset of JSON we could say it's more elaborated, hence its language evolution.
But I see some different opinions out there, such:
YAML is dead,
don't use yaml and so on...
I simply can't understand on what this is based because it seems so nice :)
If we take few well succeeded examples over the web such as Ruby on Rails, we know they use yaml for simple configuration, but one thing that gets me curious is why yaml is not being part of most used formats over web like XML and JSON.
If you take twitter for example...why not offer the data in YAML format from the API as well?
Is there something wrong by doing it?
We can see the evolution on no-sql databases like couchdb, mongo, all json based, even one great project called jsondb which looks very lightweight and it definitely can do the job.
But when writing data structures in json I really can't understand why YAML is not being used instead.
So one of my concerns would be if is there something wrong with YAML?
People can say it's complex, but well, if you pretend to use the same features you would get in json it's definitely not. You will get a more beautiful file for sure tho and with no hassle. It would be indeed more complex if you decide to use more features, but that's how things are, at least you have the possibility to use it if you want to.
The possibility to choose if you want or not to use double-quotes for string is fantastic makes everything cleaner and easier to read....well you see what's my point :)
So my question would be, why YAML is not vastly used in place of JSON?
Why it doesn't seem that it will be used for data structure transfers within the online community?
All I can see is people using it for simple configuration files and nothing else...
Please bear with me since I might be completely wrong and very big projects might be happening and my ignorance on the subject didn't allow me to be a part of it :)
If is there any big project based on yaml out there I would be very happy to know about it
Thanks in advance
It's not that there's something wrong with YAML — it's just that it doesn't offer any compelling benefits in many cases. YAML is basically a superset of JSON. For most purposes, JSON is quite sufficient — people wouldn't be using advanced YAML features even if they had a full YAML parser — and its close ties to JavaScript make it fit in well with the technologies that Web developers are using anyway.
TLDR: People are already using as much YAML as they need. In most cases, that's JSON.
YAML uses more data than non-prettified JSON. It's great for files that humans might want to edit themselves but when all you're doing is passing data around, you're wasting bandwidth if you're using YAML.
If you need an explanation: each space in UTF-16 is two bytes. YAML uses spaces for indentation, and newline characters for nesting.
Take this example:
foo:
bar:
- foo
- bar
This requires 44 characters (including newline characters). The equivalent JSON would be only 29 characters:
{"foo":{"bar":["foo","bar"]}}
Then just imagine what happens if you URL-encode the YAML. It becomes 95 characters:
foo%3A%0A%20%20%20%20bar%3A%0A%20%20%20%20%20%20%20%20-%20foo%0A%20%20%20%20%20%20%20%20-%20bar
Meanwhile the JSON just becomes 64 characters:
%7B%22foo%22%3A%7B%22bar%22%3A%5B%22foo%22%2C%22bar%22%5D%7D%7D
The size increase to YAML from JSON is more than double when it's URL-encoded, in the above example. And I'm sure you can just imagine that the longer your YAML file, the more and more this difference will increase.
Oh, and one other reason not to use YAML: stackoverflow.com does not support YAML syntax highlighting... ! (Of course, I would argue that YAML is so beautiful that it doesn't need syntax highlighting. That's kind of the point of YAML, I think.)
In Ruby many people argue that configuration should be Ruby, rather than YAML. This saves the parsing stage, means you don't have to learn the new syntax, and don't end up with ERB tags everywhere when you are dynamically generating YAML content (Rails fixtures).
Personally I have to agree, and can't see what YAML would offer to network transfers that would make it a worthwhile consideration over JSON.
YAML has an amount of problems, there is a good article
YAML: probably not so great after all on that.
Short summary (in addition to problems already listed in other answers):
Unreadable except for simple and short things
Insecure by default
Has portability issues
Very complex, with amount of surprising behaviors
I considered using YAML few times and never did. The reason always had to do with white spaces for indentation. While I personally love this, even to me it sounded like asking for trouble, because
For sure someone will make a mistake, not expecting that changing white spaces will break the file. Sometimes someone who has no idea about the language / format has to go to the file to change one number or string.
You can't guarantee that everybody everywhere will have it's comparison / merging / SC software configured properly to catch white space or empty lines differences.

storing code snippets in a database

I want to make a code snippet database web application. Would the best way to store it in the database be to html encode everything to prevent XSS when displaying the snippets on the web page?
Thanks for the help!
The database has nothing to do with this; you simply need to escape the snippets when they are rendered as HTML.
At minimum, you need to encode all & as & and all < characters as <.
However, your server-side language already has a built-in HTML encoding function; you should use it instead of re-inventing the wheel. For more details, please tell us what language your server-side code is in.
Based on your previous questions, I assume you're using PHP.
If so, you're looking for the htmlspecialchars or htmlentities functions.
You would either have to escape it when you store it, or escape it when you display it. It'd probably be better to do it on display so that if you need to edit it later on, you don't have to decode it then re-encode it.
Also, you'll want to make sure you escape it properly when you store it in the database, otherwise you'd be leaving yourself open to SQL injection. Parameterized statements would be the best method, you shouldn't have to change the raw data at all.
The best thing to do is to not store it in the database. I have seen people store stored procedures in databases as a row. Just because you can doesn't mean you should.
It doesn't matter how you store it, what matters is how you render it in the HTML representation. I'd guess you'll need to do some sort of sanitation before rendering the bytes. Another option might be to convert every character to an HTML entity; this might suffice to prevent any code or tags from actually being interpreted.
As an example, view the source of a Stack Overflow page with some example code, and see how they're representing the code in the HTML.

Should HTML be encoded before being persisted?

Should HTML be encoded before being stored in say, a database? Or is it normal practice to encode on its way out to the browser?
Should all my text based field lengths be quadrupled in the database to allow for extra storage?
Looking for best practice rather than a solid yes or no :-)
Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?
If it's application data, I think its best to:
represent it in a form that native to the environment (e.g. unencoded in the database), and
make sure its properly translated as it crosses representational boundaries (encode when you generate the HTML page).
If you're a fan of MVC, this also helps separates the view/controller from the model (and from the persistent storage format).
Representation
For example, assume someone leaves the comment "I love M&Ms". Its probably easiest to represent it in the code as the plain-text String "I love M&Ms", not as the HTML-encoded String "I love M&Ms". Technically, the data as it exists in the code is not HTML yet and life is easiest if the data is represented as simply as accurately possible. This data may later be used in a different view, e.g. desktop app. This data may be stored in a database, a flat file, or in an XML file, perhaps later be shared with another program. Its simplest for the other program to assume the string is in "native" representation for the format: "I love M&Ms" in a database and flat file and "I love M&Ms" in the XML file. I would cringe to see the HTML-encoded value encoded in an XML file ("I love &amp;Ms").
Translation
Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, plain-text file, or XML file), then its important to make sure it is properly translated so it is represented accurately in a format native to that next environment. In short, when you go to display it on an HTML page, make sure its translated to properly-encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variable to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, you ensure its XML-encoded.
Failure to translate properly when crossing representation boundaries is the source of injection attacks such SQL-injection attacks. Be conscientious of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, Javascript, XML, etc).
--
On the other hand, if you are really trying to save HTML page fragments to the database, then I am unclear by what you mean by "encoded before being stored". If its is strictly valid HTML, all the necessary values should already be encoded (e.g. &, <, etc).
The practice is to HTML encode before display.
If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.
You should save the original form in your database. This preserved the original and you may want to do other processing on that and not on the encoded version.
Database vendor specific escaping on the input, html escaping on the output.
I disagree with everyone who thinks it should be decoded at display time, the chances of an attack occuring if its encoded before it reaches the database is only possible if a developer purposes decodes it before displaying it. However, if you decode it before presenting it there is always a chance that it could happen by some other newbie developer, like a new hire, or a bad implementation. If its sitting there unencoded its just waiting to pop out on the internet and spread like herpes. Losing the original data shouldnt be a concern. encode + decode should produce the same data every time. Just my two cents.
For security reasons, yes you should first convert the html to their entities and then insert into the database. Attacks such as XSS are initiated when you allow users (or rather bad guys) to use html tags and then you process/insert them in to the databse. XSS is one of the root causes of most security holes. So you definitely need to encode your html before storing it.

HTML encode user input when storing or when displaying

Simple question that keeps bugging me.
Should I HTML encode user input right away and store the encoded contents in the database, or should I store the raw values and HTML encode when displaying?
Storing encoded data greatly reduces the risk of a developer forgetting to encode the data when it's being displayed. However, storing the encoded data will make datamining somewhat more cumbersome and it will take up a bit more space, even though that's usually a non-issue.
i'd strongly suggest encoding information on the way out. storing raw data in the database is useful if you wish to change the way it's viewed at a certain point. the flow should be something similar to:
sanitize user input -> protect against sql injection -> db -> encode for display
think about a situation where you might want to display the information as an RSS feed instead. having to redo any HTML specific encoding before you re-display seems a bit silly. any development should always follow the "don't trust input" meme, whether that input is from a user or from the database.
Keep in mind that you may need to access the database with something that doesn't understand HTML encoded text (e.g., a reporting tool). I agree that space is a non-issue, but IMHO, putting HTML encoding in the database moves knowledge of your view/front end into the lowest tier in the application, and that is a design mistake.
The encoding should only only only be done in the display. Without exception.
Output.
With HTML you can't simply check length of a string (& is 1 character, but strlen() will tell you 5), you can easily crop it (it could break entities).
You may need to mix strings from database with strings from another source, or read and write them back. Doing this application-wide without missing any escaping and avoiding double escaping is a nightmare.
PHP tried to do similar thing with magic_quotes and it turned out to be a huge failure. Don't take magic_entities route! :)