W3C Validating an HTML Page with & in URLs

I have a page in which users submit URLs, some of which contain &, = etc. If I want the page to validate with the W3C validator, I need to write these characters as HTML entities (for example, & as &amp;). How can I do this automatically? Also, should I even bother?

You should encode the URLs on the server side, then. Not knowing what backend language you use, here's a list:
* htmlentities() - PHP
* HttpUtility.UrlEncode() - ASP.NET
* URI.escape() - Ruby
* URLEncodedFormat() - ColdFusion
* urllib.urlencode() - Python (urllib.parse.urlencode() in Python 3)
* java.net.URLEncoder.encode() - Java
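
The functions in the list above do two different jobs: htmlentities() escapes text for HTML output, while the others percent-encode values for use inside a URL. Here is a minimal Python sketch of the difference, using only the standard library. The example URL and the variable names are made up for illustration.

```python
from html import escape
from urllib.parse import quote

# Hypothetical user-submitted URL with a query string.
user_url = "http://example.com/search?q=cats&page=2"

# HTML escaping: the link still points to the same place, but "&" is
# written as "&amp;" in the markup, which is what the validator wants.
href = escape(user_url, quote=True)
link = '<a href="{0}">{0}</a>'.format(href)

# Percent-encoding: for a value that becomes part of a query string,
# where a literal "&" must not be read as a parameter separator.
encoded_value = quote("cats & dogs", safe="")

print(link)
print(encoded_value)
```

For the validation problem in the question, the HTML-escaping step is probably the one that matters. Percent-encoding is only needed when you build the URL yourself from separate pieces.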

Yes, you should bother, and it's quite simple. Saying, "Oh, look how many invalid pages there are" does not excuse your contributions to the problem. Every major language either has this functionality built in (as Can noted for PHP) or lets you implement it trivially.

If users are submitting URLs and you want to help them avoid errors, then I'd validate each URL by calling it. Use the HTTP HEAD method to check that the URL responds.
This will take more programming than statically looking at the URL string. You'll want to think about using a helper process, returning the result asynchronously to the original submit, etc. But that's the sort of stuff which separates the students from the professionals.
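
A rough sketch of the HEAD check in Python, standard library only. The function name url_responds and the 5 second timeout are my own choices, not anything from the answer above. It ignores the asynchronous part and simply reports whether the server answered without an error status.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def url_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to url gets a non-error response."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    # Only probe things that can actually be fetched over HTTP(S).
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP error statuses, DNS failures, timeouts and bad URLs.
        return False
```

Keep in mind that some servers reject HEAD requests even though the page exists, so a False result is a hint to show the user rather than proof that the URL is wrong.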

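You need to use %26 instead of & when the ampersand is data inside a query value. An & that separates parameters should be written as &amp; in the HTML source.
In the general case though, find a URL encoder function in whatever language you're using.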

I'd say don't even bother. See Jeff's post on the subject: HTML Validation: Does It Matter?
On the other hand, if you're a perfectionist, properly escaping query strings should be pretty trivial in any language. For example, you can use htmlspecialchars, htmlentities, urlencode or rawurlencode in PHP.

Related

Is it dangerous to display a parameter from URL without escaping?

Say I'm in a Spring environment and I have a URL http://www.example.com/name=alice
In the controller, I have code like,
mav.addObject("name", request.getParameter("name"));
And in the JSP file, it is rendered like
<div><c:out value="${name}" /></div>
My question is,
If a malicious user appends a bad string, for example, a short script in the URL, like http://www.example.com/name={some bad script}, <c:out> will protect me, is my understanding correct?
What if I cannot use <c:out>? Say, the parameter is "alice&bob", <c:out> will turn it to "alice%26bob", which is not what I want. How can I protect myself in this case?

You will always have to escape and sanitize untrusted input before you send it to the client.
Coding it by hand is painful and error prone. But you could use SafeHtmlBuilder for example. :-)
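
To illustrate what escaping on output does, here is a small Python sketch using html.escape from the standard library. It is not the Spring/JSP mechanism from the question, and render_name is a hypothetical helper. The point is that the escaped form is only the markup: the browser displays "alice&bob" again when it renders it.

```python
from html import escape


def render_name(name: str) -> str:
    """Wrap an untrusted value in a div, escaping HTML metacharacters."""
    return "<div>{}</div>".format(escape(name))


print(render_name("alice&bob"))
print(render_name("<script>alert(1)</script>"))
```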

Passing an HTML parameter in routes (Play framework)

Good day! How are HTML parameters expressed in the routes file? I am trying to pass HTML but I don't know how. All I know is how to pass integers ((id: Integer)) and some other data types. I tried (content: Html) and (content: Html). I also tried javax.swing.text.html.HTML but it says something about QueryStringBindable. Please help me. Thanks a lot.

Remember that everything you pass through a route's params will be included in the URL, so what is the advantage of using HTML in this place? GET params should use only simple data types like numerical types, booleans and strings, so you can pass an HTML fragment as a String (preferably URL-encoded, or even better, Base64-encoded).
A much better option is sending it via POST: your URLs won't be terribly long, you won't hit any URL length limit, and after common serialization it won't break on special HTML characters.
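
The Base64 idea is not specific to Play, so here is a language-neutral illustration in Python. The two helper names are hypothetical. The URL-safe Base64 alphabet avoids "+" and "/", which would otherwise need percent-encoding in a URL.

```python
import base64


def encode_html_param(html: str) -> str:
    """Turn an HTML fragment into a URL-safe Base64 token."""
    return base64.urlsafe_b64encode(html.encode("utf-8")).decode("ascii")


def decode_html_param(token: str) -> str:
    """Recover the original HTML fragment from the token."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


token = encode_html_param('<p class="x">a & b</p>')
print(token)
print(decode_html_param(token))
```

Base64 makes the value about a third longer, which is one more reason POST is the better fit for anything but short fragments.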

How can I send HTML over JSON (in JS / Node)

I'm trying to return an html snippet from a service that can only return valid JSON.
I've tried some things like:
This gets me a bunch of characters like \n\n\n\n\t\t\t\t
return JSON.stringify({html: $('body').html()});
or
return JSON.stringify($('body').html());
On the receiving end, I'd like to be able to parse that HTML via Cheerio, or jQuery or JSDom so I can then run queries like $(".some_selector") on that data.
What is the proper way of doing this? Any special libraries / methods that can handle the escaping for me? I've googled it, but haven't had any clear results...
Thanks.

On the receiving end, you need to simply undo the JSON serialization. That's it!
Your HTML will be in its regular format at obj.html, which you can then parse with whatever DOM parser you want.
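
The same round trip, sketched in Python rather than Node so it can be run on its own. The HTML fragment is made up. The \n and \t sequences you see in the serialized text are just how JSON writes newlines and tabs inside a string; parsing turns them back into the real characters.

```python
import json

# Hypothetical HTML fragment with quotes, a newline and a tab in it.
html = '<div class="note">\n\t<a href="/x?a=1&b=2">link</a>\n</div>'

# Serialize: the encoder escapes quotes and control characters itself.
payload = json.dumps({"html": html})
print(payload)

# Deserialize: the original markup comes back unchanged.
restored = json.loads(payload)["html"]
print(restored == html)
```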

Well, you are probably going to need to worry about quotes in the HTML (like with attributes) because they could interfere with the quotes that delimit your JSON values.
Here is a similar question as well as this web page that explains some of what you need to consider.
Briefly looking at npmjs.org, I didn't see any reputable modules that might help you make HTML JSON compatible, but I think you can probably figure it out fairly easily on your own, given a large enough sample set of HTML. You can always run your JSON through this validator to check it. I suppose you could also simply do a JSON.parse(jsonContainingHtml) on it as well. You'll get an exception if the string is not valid JSON.

How can I safely add user-supplied URLs to my HTML page?

As with any user supplied data, the URLs will need to be escaped and filtered appropriately to avoid all sorts of exploits. I want to be able to
Put user supplied URLs in href attributes. (Bonus points if I don't get screwed if I forget to write the quotes)
...
Forbid malicious URLs such as javascript: stuff or links to evil domain names.
Allow some leeway for the users. I don't want to raise an error just because they forgot to add an http:// or something like that.
Unfortunately, I can't find any "canonical" solution to this sort of problem. The only thing I could find as inspiration is the encodeURI function from Javascript but that doesn't help with my second point since it just does a simple URL parameter encoding but leaving alone special characters such as : and /.

OWASP provides a list of regular expressions for validating user input, one of which is used for validating URLs. This is as close as you're going to get to a language-neutral, canonical solution.
More likely you'll rely on the URL parsing library of the programming language in use. Or, use a URL parsing regex.
The workflow would be something like:
Verify the supplied string is a well-formed URL.
Provide a default protocol such as http: when no protocol is specified.
Maintain a whitelist of acceptable protocols (http:, https:, ftp:, mailto:, etc.)
The whitelist will be application-specific. For an address-book app the mailto: protocol would be indispensable. It's hard to imagine a use case for the javascript: and data: protocols.
Enforce a maximum URL length - ensures cross-browser URLs and prevents attackers from polluting the page with megabyte-length strings. With any luck your URL-parsing library will do this for you.
Encode a URL string for the usage context. (Escaped for HTML output, escaped for use in an SQL query, etc.).
Forbid malicious URLs such as javascript: stuff or links to evil domain names.
You can utilize the Google Safe Browsing API to check a domain for spyware, spam or other "evilness".
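
The workflow above could be sketched roughly like this in Python, using only the standard library. The function name clean_user_url, the 2000 character limit and the exact whitelist are assumptions for illustration. The Safe Browsing lookup is left out.

```python
from html import escape
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}  # application-specific
MAX_URL_LENGTH = 2000  # assumed limit, tune for your application


def clean_user_url(raw):
    """Return an HTML-attribute-safe URL string, or None if it is rejected."""
    url = raw.strip()
    if not url or len(url) > MAX_URL_LENGTH:
        return None
    # Reject embedded whitespace and control characters outright.
    if any(ch.isspace() or ord(ch) < 32 for ch in url):
        return None
    try:
        parts = urlsplit(url)
        if not parts.scheme:
            # Leeway for users who forgot the protocol.
            url = "http://" + url
            parts = urlsplit(url)
    except ValueError:
        return None
    # Whitelist check; urlsplit lowercases the scheme for us.
    if parts.scheme not in ALLOWED_SCHEMES:
        return None
    if parts.scheme != "mailto" and not parts.netloc:
        return None
    # Encode for the usage context: here, an HTML attribute value.
    return escape(url, quote=True)


print(clean_user_url("www.example.com/a?x=1&y=2"))
print(clean_user_url("javascript:alert(1)"))
```

One known limitation of this sketch: input such as example.com:8080/path is parsed as having the scheme "example.com" and is therefore rejected, so a real implementation may want a friendlier rule for host:port input.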

For the first point, regular attribute encoding works just fine: escape characters into HTML entities. Escaping quotes, the ampersand and angle brackets is enough if attributes are guaranteed to be quoted. Escaping all other non-alphanumeric characters as well will keep the attribute safe if it's accidentally left unquoted.
The second point is vague and depends on what you want to do. Just remember to use a whitelist approach instead of a blacklist one. It's possible to use HTML entity encoding and other tricks to get around most simple blacklists.

Is it possible to parse a Google+ (Google Plus) profile page?

If you view the source of a Google+ profile page, it appears rather complex. It seems most of the data is kept in huge JSON-like objects. However, they don't seem to be real JSON, since they aren't recognized when I try to decode them. I am hoping the format is clearer to other people here. How would you go about parsing it? It seems it would be fairly trivial, if you know where to start.
Here is a sample profile, for example: http://plus.google.com/104560124403688998123

Here's a PHP API I'm working on. It can download and parse the data for a profile page and people's public relationships.
https://github.com/jmstriegel/php.googleplusapi
The JSON piece is a bit mangled. To generate valid JSON, you basically have to remove the first 5 characters that prevent XSRF attacks and then add in all the nulls that have been removed. Here's the code specific to handling parsing the weird Google Plus JSON responses:
https://github.com/jmstriegel/php.googleplusapi/blob/master/lib/GooglePlus/GoogleUtil.php
Call GoogleUtil::FetchGoogleJSON( $url ) and you'll get back a giant array that you can then pull data from. Using this, it should be trivial to make a proxy service to translate stuff into valid json(p) for you to use in your own apps.
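
To make the two repair steps concrete, here is a small Python sketch. It is not a port of GoogleUtil.php. The function names are hypothetical, and it assumes the removed nulls show up as empty slots between commas or next to a bracket, which may not match the real responses exactly.

```python
import json


def repair_elided_json(raw, prefix_len=5):
    """Drop the anti-XSRF prefix and put 'null' back into empty slots."""
    out = []
    in_string = False
    escaped = False
    prev = None  # last significant character seen outside a string
    for ch in raw[prefix_len:]:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                prev = '"'
            continue
        if ch == '"':
            in_string = True
        elif not ch.isspace():
            # "[," and ",," and ",]" each hide a missing value.
            if ch == "," and prev in ("[", ","):
                out.append("null")
            elif ch == "]" and prev == ",":
                out.append("null")
            prev = ch
        out.append(ch)
    return "".join(out)


def parse_elided_json(raw):
    return json.loads(repair_elided_json(raw))


sample = ")]}'\n" + '[1,,"a,,b",[,2],[],3,]'
print(parse_elided_json(sample))
```

The scanner tracks whether it is inside a string so that commas in text values are left alone, which a plain search-and-replace on ",," would get wrong.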

I don't have access to Google+ yet, so I'll just answer the general question - that is, how to parse JSON.
JSON is based on JavaScript object-literal syntax, so parsing it takes a single call. Use JSON.parse() rather than eval(): eval() runs whatever code the string contains, and a bare object literal passed to it is read as a block statement, which is a syntax error unless you wrap the text in parentheses.
var obj = JSON.parse('{"JSON":"goes here"}');
Another option is to leverage a console tool. Popular modern browsers pretty much all have them. I recommend Firebug for Firefox in particular.
Using Firefox, log into Google+, then open the Firebug console. You can use the console's dir() command to create a browseable representation of the data. Ex:
console.dir(JSON.parse('{"JSON":"goes here"}'));
Sorry I can't be more specific about how to get a handle on Google+'s JSON in particular; without access to the service, this is about the best I can do blind. Good luck!

Thanks to Jason for the excellent php class which reads a profile page into an array.
I've used this class as a base and then parsed it, based upon Russell Beattie's python code from the original appspot rss feed application.
Code here
A few notes:
I use this to merge G+ and WP feeds, hence writing posts into an intermediate array ($items).
I have a convention of creating a pseudo title in Google Plus posts, by emboldening a line and adding two newlines before writing the post. The function getTitle strips this out as a better formatted title for my website, and getSummary produces the rest of the post without duplicating the title.

It's made up of a number of parts: an object describing your Picasa images, one describing the fields on your profile, and one describing your friends.
Most of the long numbers are the internal IDs of people, posts and photos. For instance, my ID is 105249724614922381234. Other than that, it could be parsed if you needed to.

We Keep Coding
