Dynamically Obfuscate HTML - html

I was wondering whether there is any way to dynamically obfuscate HTML on a live server (not offline), so that as soon as my website is visited, the source is obfuscated rather than served in plain text.

Since the client (browser) will have to parse it into a sensible DOM tree, this is pretty much fruitless. These days it's a lot more common to inspect a site using Firebug/Webkit Inspector, which provides a nicely formatted, navigable tree. Most people won't even notice that the HTML is "obfuscated", much less be stopped by it.
Executable code can be obfuscated by minimizing variable names and such without changing the result. HTML is the result, though: if you change anything about it, the result will change. So "obfuscation" would mostly be limited to creative use of spacing anyway.

The real question you should ask yourself is "why do I need to obfuscate HTML?". If you're hiding sensitive information, then you should be either encrypting that data, or never presenting it to the client.
Most sensitive information or transactions should take place on the server, and the client only receives a token, or encrypted information, or a unique transaction identifier that can be passed back and forth.
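As a rough illustration of the token idea above, here is a minimal sketch in Node.js. The `TransactionStore` class and its method names are hypothetical, and the in-memory `Map` stands in for whatever server-side storage you actually use. The point is only that the client sees an opaque identifier, never the sensitive record.

```javascript
const { randomUUID } = require("crypto");

// Hypothetical server-side store: sensitive data never leaves the server.
class TransactionStore {
  constructor() {
    this.records = new Map();
  }

  // Save the sensitive record and return an opaque token for the client.
  create(record) {
    const token = randomUUID();
    this.records.set(token, record);
    return token;
  }

  // Look the record up again when the client passes the token back.
  // Returns undefined for unknown tokens.
  resolve(token) {
    return this.records.get(token);
  }
}

const store = new TransactionStore();
const token = store.create({ cardNumber: "4111", amount: 25 });
// Only `token` would be embedded in the HTML sent to the browser.
```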

Let me put it this way: There's no way to dynamically obfuscate the HTML on your site such that any reasonably competent person couldn't get it anyway.
You could use JavaScript to attempt to obfuscate it, but you'd have to do it in a way that didn't actually affect the DOM.
You could generate the contents of the page itself with JavaScript, but that is likely to damage accessibility, and once again the DOM will have to be in a condition the browser can use.
You could insert massive amounts of whitespace into the source, but that is easily overcome as well.
All this, and you make it harder and more annoying to manage your site. Minification has its purpose, but obfuscation here is lose-lose.

You could search for and remove all tabs, newlines, extra spaces, and comments.
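A naive sketch of that kind of stripping might look like the following. The function name `minifyHtml` is made up for illustration, and the regular expressions are deliberately simple: they will mangle `<pre>`, `<textarea>`, and inline scripts, and removing whitespace between inline elements can change rendering, so treat this as a demonstration rather than a production minifier.

```javascript
// Naive HTML "minifier": removes comments and collapses whitespace.
// Assumption: the input has no <pre>, <textarea>, or inline <script> content
// whose whitespace matters.
function minifyHtml(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, "") // drop HTML comments
    .replace(/\s+/g, " ")            // collapse tabs, newlines, repeated spaces
    .replace(/>\s+</g, "><")         // drop whitespace between adjacent tags
    .trim();
}

const page = "<div>\n\t<!-- note -->\n  <p>Hello   world</p>\n</div>\n";
console.log(minifyHtml(page));
```

Note that the result is still perfectly readable in any browser inspector, which is the point made in the answers above.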

If you are using PHP, IonCube has a plugin that turns your HTML page into minified JavaScript. It can be found here: http://www.ioncube.com/html_encoder.php

Related

Link: Response Header VS HTML

I am currently working on a function to assist in preparing a Link: HTTP header or a set of <link> tags, and while reading different materials on this, I still have not been able to find an answer to a simple question: when to use the Link: header and when to use <link>.
So far I can only say that if you want to use HTTP/2 server push, it is recommended to use the header. On the other hand, even if I push a stylesheet, it will not be applied unless there is a corresponding tag in the HTML output.
Since I am preparing the library in order to help with some standardization and sanitization, I would like to catch, at least, some "weird" cases like this, if it's possible, but for that I need some set of recommendations or best practices in that regard. Sadly I am unable to find any thus far, so am turning to more knowledgeable people: what best practices or weird cases should I consider catching or should I just allow whatever to be sent regardless of whether it's a header or a tag?
If anyone is interested, the code is present in https://github.com/Simbiat/HTTP20/blob/main/src/Headers.php (links function).
They are supposed to be equivalent, as @Evert states, so in theory you can use either. However, there are some considerations:
Headers are usually set in web server config (at least for static pages) which may not be as easy to update for developers.
However it has the added advantage that you can set these for multiple pages all at once (e.g. preload your core fonts on every .html file, rather than having to remember to set this on all pages, or all page templates if using a CMS).
On the other side with the HTML version it’s often easier to configure it per page (or page template), if you have different needs (e.g. different fonts are used in different pages).
Some also say there are slight performance advantages to doing it in the header, but honestly, as long as the tag is high enough in the <head> element, I really think you’d struggle to notice this.
Perhaps of more importance is whether the header is passed on hop to hop if your web server is hidden behind other infrastructure (e.g. a CDN or other proxy). In theory it should be, for simple headers, but for things like HTTP/2 push that’s not so easy. If it’s in the HTML you don’t need to worry about this (assuming intermediaries are not changing the markup, of course!).
You mentioned the HTTP/2 push use case, and that definitely needs the header (though this is not a defined standard method of setting push, and some servers or CDNs use other methods, many use this). However, given HTTP/2 push’s complexities and the concern that it can cause more problems than it solves, this is maybe a reason to recommend the HTML method, to ensure the resource is never pushed.
All in all I recommend setting this in the HTML. It’s just easier.
This is not the case however with other, similar things, which can be set in HTML and HTTP headers. CSP for example is limited in the HTML version, lacking some features of the HTTP Header version, and is also not recommended as it could be altered with JavaScript whereas the HTTP header cannot. But for simple Link headers these are less of a concern.
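To make the equivalence concrete, here is a small sketch that renders the same link description either as a `Link:` header value or as a `<link>` tag. The function names and the shape of the descriptor object are assumptions for illustration (they are not taken from the library in the question), and no validation or escaping is done.

```javascript
// Render one link descriptor as an HTTP Link header value.
function toLinkHeader({ href, rel, as }) {
  const parts = [`<${href}>`, `rel="${rel}"`];
  if (as) parts.push(`as="${as}"`);
  return parts.join("; ");
}

// Render the same descriptor as an HTML <link> tag.
function toLinkTag({ href, rel, as }) {
  const asAttr = as ? ` as="${as}"` : "";
  return `<link rel="${rel}" href="${href}"${asAttr}>`;
}

const font = { href: "/fonts/core.woff2", rel: "preload", as: "font" };
console.log("Link: " + toLinkHeader(font));
console.log(toLinkTag(font));
```

A library built along these lines could let the caller pick the output per resource, for example the tag form for anything that should never be pushed.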

Is there any way to view a file live in html?

For example, if a file is updated while someone is viewing the page, automatically display this update, instead of requiring a refresh.
I want to make sure before I start writing JavaScript to pull in the file on an interval and append it. Unless I'm mistaken, I don't think this can be done in HTML alone, but I'm checking just in case.
No. HTML is for structuring a document. Without a refresh, there is no way for HTML to change its contents.
HTML is a markup language related to SGML for creating structure and formatting in a web page. To achieve interaction beyond the initial request and response, you must utilize a programming language, not a document markup language. JS, possibly with some AJAX, is your subject matter. Good luck.
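A minimal sketch of the polling approach the question describes could look like this. The names `newlyAppended` and `watchFile`, the URL, and the five-second interval are all placeholders; the sketch assumes the file is plain text that mostly grows by appending.

```javascript
// Returns the text added since the previous poll, or null if nothing changed.
// If the file was rewritten rather than appended to, the whole content is returned.
function newlyAppended(previous, current) {
  if (current === previous) return null;
  return current.startsWith(previous) ? current.slice(previous.length) : current;
}

// Polls `url` every `intervalMs` milliseconds and hands new text to `onUpdate`.
// Returns the timer id so the caller can stop polling with clearInterval().
function watchFile(url, onUpdate, intervalMs = 5000) {
  let lastSeen = "";
  return setInterval(async () => {
    try {
      const response = await fetch(url, { cache: "no-store" });
      if (!response.ok) return;
      const text = await response.text();
      const added = newlyAppended(lastSeen, text);
      if (added !== null) {
        lastSeen = text;
        onUpdate(added);
      }
    } catch (err) {
      console.error("Polling failed:", err);
    }
  }, intervalMs);
}

// In a browser (hypothetical element id and file path):
// watchFile("/log.txt", (chunk) => document.querySelector("#log").append(chunk));
```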

allowing users to add html formatted notes

We want to allow the users of our web application, to leave notes formatted with html.
On the client side we provide them with CKEditor [http://ckeditor.com/], a WYSIWYG editor that generates HTML, which is then submitted to the server via a form.
We then want to display the notes created by the users, with exactly the same formatting as they submitted them
My concerns are:
Putting attacks and bad intentions aside, how can I encapsulate the note when displayed on the site, so that
a. They don't inherit the design from the rest of the page
b. They don't influence the rest of the page, for example by opening and not closing a tag accidentally, or closing without opening.
Malicious code injection attacks
At the moment, the first is much more important, as it's an in-house product for our clients and is not open to the wider public. But security comments are very welcome as well.
Possible solutions that I consider are:
Ideally, I am looking for a way to encapsulate these pieces of user HTML, along the lines of: inside this area I show what you submitted (rendered, not source); you cannot influence, and are not influenced by, the code on other parts of the page.
Specifically, we thought of displaying the notes inside iframes.
Other natural direction is dealing with parsing the inserted contents, and stripping out stuff.
Any inputs are welcome, and mainly:
How can I "encapsulate" the inserted contents, if I can?
Any comments on the iframe direction
Do I have to parse the contents anyway? What do I absolutely have to strip out?
How can I "encapsulate" the inserted contents, if I can?
The truth is, unless you 'fix' their code (via some kind of check) you will get issues (think broken divs, etc.). I don't see how you can encapsulate HTML from HTML. I would, however, only let them put in formatting like bold, italic, center, etc.
Any comments on the iframe direction
Personally I wouldn't go that route, new can of worms for security and not a 'clean' way of doing this.
Do I have to parse the contents anyway? What do I absolutely have to strip out?
Yes, don't be lazy. Some devs always say "well, I don't need it, it's internal", and then it becomes an external thing, and at that point it's so big that ONLY a full rewrite will set it right, and it keeps chugging along until something is broken, then shit hits the fan and the big boss cries out, "why hasn't this been done?" Long story short:
Yes, you have to parse / validate / check all your input, whether internal or external. Anything other than that is just lazy.
In closing I would do it by using an editor like here on SO, which only allows some types of selective formatting. After all a broken <b> will not kill your whole layout, a <div> will...
Markdown formatting
You could use exactly the same type of intermediary solution that this site (StackOverflow) uses in its user-generated content (questions, answers, comments).
It's not a complete solution that could replace WYSIWYG editors, but it covers just what typical user-generated content would require. It even allows you to include images.
For a complete guide:
https://www.markdownguide.org/cheat-sheet

What is the best way to handle user generated html content that will be viewed by the public?

In my web application I allow user generated content to be posted for public consumption similar to Stackoverflow.
What is the best practice for handing this?
My current steps for handling user generated content are:
I use MarkItUp to allow users an easy way to format their HTML.
After a user has submitted their changes, I run the content through an HTML Sanitizer (scroll to the bottom) that uses a white list approach.
If the sanitization process has removed any user-created content, I do not save the content. I then return their modified content with a warning message: "Some illegal content tags were detected and removed; double check your work and try again."
If the content passes through the sanitization process cleanly, I save the raw HTML content to the database.
When rendering to the client, I just pass the raw HTML out of the db to the page.
That's an entirely reasonable approach. For typical applications it will be entirely sufficient.
The trickiest part of white-listing raw HTML is the style attribute and embed/object. There are legitimate reasons why someone might want to put CSS styles into an otherwise untrusted block of formatted text, or say, an embedded YouTube video. This issue comes up most commonly with feeds. You can't trust the arbitrary block of text contained within a feed entry, but you don't want to strip out, e.g., syntax highlighting CSS or flash video, because that would fundamentally change the content and potentially confuse anyone reading it. Because CSS can contain dangerous things like behaviors in IE, you may have to parse the CSS if you decide to allow the style attribute to stay in. And with embed/object you may need to white-list hostnames.
Addenda:
In worst case scenarios, HTML escaping everything in sight can lead to a very poor user experience. It's much better to use something like one of the HTML5 parsers to go through the DOM with your whitelist. This is much more flexible in terms of how you present the sanitized output to your users. You can even do things like:
    <div class="sanitized">
      <div class="notice">
        This was sanitized for security reasons.
      </div>
      <div class="raw"><pre>
    &lt;script&gt;alert("XSS!");&lt;/script&gt;
      </pre></div>
    </div>
Then hide the .raw stuff with CSS, and use jQuery to bind a click handler to the .sanitized div that toggles between .raw and .notice:
CSS:
    .raw {
      display: none;
    }
jQuery:
    $('.sanitized').click(function() {
      // Swap which of the two child blocks is visible
      $(this).find('.notice').toggle();
      $(this).find('.raw').toggle();
    });
The white list is a good move. Any black list solution is prone to letting through more than it should, because you just can't think of everything. I've seen some attempts at using black lists (for example The Code Project), and even if they manage to catch everything, they generally still cause additional problems, like replacing characters in code so that it can't be used without manually restoring it first.
The safest method would be:
HTML encode all the text.
Match a set of allowed tags and attributes and decode those.
Using a regular expression you can even require that each opening tag has a closing tag, so that an unclosed tag can't mess up the page.
You should be able to do this in something like ten lines of code, so the code that you linked to seems overly complicated.
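Here is one way the encode-then-decode idea above might be sketched. The tag list and function names are illustrative assumptions. This version only restores attribute-free tags that appear as a matched opening and closing pair, and it does not guarantee correct nesting between different tags, so it is a demonstration of the technique rather than a vetted sanitizer.

```javascript
// Hypothetical whitelist of attribute-free tags.
const ALLOWED_TAGS = ["b", "i", "em", "strong", "p", "code", "pre"];

// Step 1: HTML-encode everything.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Step 2: decode only whitelisted tags that have both an opening and a closing tag.
function sanitize(input) {
  let html = escapeHtml(input);
  for (const tag of ALLOWED_TAGS) {
    const pair = new RegExp(`&lt;${tag}&gt;([\\s\\S]*?)&lt;/${tag}&gt;`, "gi");
    html = html.replace(pair, `<${tag}>$1</${tag}>`);
  }
  return html;
}

console.log(sanitize('<b>hi</b><script>alert(1)</script>'));
```

Because anything not explicitly decoded stays encoded, unknown tags, attributes such as `onclick`, and unclosed tags are displayed as text instead of being interpreted by the browser.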

Should HTML co-exist with code?

In a web application, is it acceptable to use HTML in your code (non-scripted languages, Java, .NET)?
There are two major sub questions:
Should you use code to print HTML, or otherwise directly create HTML that is displayed?
Should you mix code within your HTML pages?
Generally, it's better to keep presentation (HTML) separate from logic ("back-end" code). Your code is decoupled and easier to maintain this way.
As long as your HTML-writing code is separate from your application logic, and the HTML is guaranteed to be well-formed somehow, you should be okay.
The only code that should be mixed in markup-based pages (i.e, those that contain literal HTML) is the code used for formatting the HTML (e.g., a loop for writing out a list).
There are trade-offs whether you put the code in with the HTML or you use pure code to write the HTML out using quoted string literals.
No, if you want to build good and maintainable software, and to achieve loose coupling.
If I understand the question right, you're asking whether it's a good practice to mix markup with back-end code. No. While this is commonly done, it's still a bad idea.
You should read up on the MVC paradigm, as well as on existing questions on the matter, such as What is the best way to migrate an existing messy webapp to elegant MVC? and Best practices for refactoring classic ASP?
The point is to keep the display logic separate from the rest of the code. In any complex site you'll have code mixed in with your HTML, but the code should be for display purposes only. It shouldn't be doing any complex calculations.
For example, templates will contain loops and conditionals. Plus you'll probably have a library of HTML-specific routines, like printing out an <option> list based on a list object.
Imagine you were writing an application that has two output modes: HTML and something else. How would you write it, to avoid duplicating code? That will probably point you in the right direction.
The HTML that makes up the view has to get sent to the browser in some way. In .net, each server control emits its own HTML markup as part of the page lifecycle. So yes it is OK to use HTML in server side code.
Perhaps you should try following the ASP.net pattern. Create a bunch of controls that represent UI elements and make them responsible for emitting their own HTML based on their state.
It's fugly, and not type safe. But people do it without consequence. I'd prefer using a DOM or, at a minimum, classes designed to write HTML using type-safe semantics. Also, it's not all that good to mix UI with logic...
If I need methods that generate HTML I usually isolate them in an HtmlHelpers class. That way you keep some level of separation. The ASP.NET MVC Framework does this quite successfully.
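As a loose illustration of isolating HTML generation in a helper class (written in JavaScript here rather than .NET, with made-up names), the `<option>` list mentioned earlier in the thread could be produced like this:

```javascript
// Escape values before placing them into markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical helper class: the only place that knows about <option> markup.
class HtmlHelpers {
  // items: array of { value, label }; selectedValue marks one option as selected.
  static optionList(items, selectedValue) {
    return items
      .map(({ value, label }) => {
        const selected = value === selectedValue ? " selected" : "";
        return `<option value="${escapeHtml(value)}"${selected}>${escapeHtml(label)}</option>`;
      })
      .join("\n");
  }
}

console.log(HtmlHelpers.optionList(
  [{ value: "nl", label: "Netherlands" }, { value: "tt", label: "Trinidad & Tobago" }],
  "tt"
));
```

The application logic passes plain data in and never concatenates tags itself, which keeps some level of separation even without a full template engine.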
If you mean printing out HTML in your code, then no. Unless you have a good reason not to, you should use templates.
Even if you think you don't need this now, there's always a good chance you'll need it later. Maybe you want to output in a different format than HTML, or you want different presentation for the same data. You usually have the need for these things further down the road, so it's best to use one from the start.
I hate when developers print() a bunch of html. It's completely unnecessary and looks ugly in any text editor that shows print/echo strings in red.
I agree with everyone else that you should try as hard as you can to separate the HTML/XHTML markup from the application logic. However, sometimes you do need to generate HTML/XHTML in the application logic for various reasons.
In these cases, what I have been trying to do is ensure that the bare minimum amount of presentation code is mixed in with the application logic, and migrate everything else over to the presentation code. It is worth noting that in some cases you could move everything over to the presentation layer, but it might be a bit easier to generate the markup as part of the application logic. In those cases, your best bet is likely to go the route that makes the most sense in terms of time.
I don't think there's any excuse for generating HTML inside your business logic. Don't even do it when it's just a "quick fix" or when you'll "go back and fix it later", because that never happens.
To reiterate my position from other questions, using some control logic (conditionals, loops) within HTML to construct it is OK. Do NOT do any data massaging or business logic in the HTML. You have to be disciplined, but it's worth it. Maintenance is much easier if your concerns (like logic and display) are separated.
Ideally you are aiming for a separation of concerns between your presentation (UI) code and your domain (business logic) code.
The reason why you should avoid coupling these two concerns (in either direction) is simple...
You will only have one reason to change a piece of code. Whether the change comes from structural/styling changes in your HTML design or from your business rules changing, you should only have to make it in one place.
To a lesser extent, although many purists would disagree, by sprinkling HTML code through your domain code or vice versa you are creating noise for the next developer who comes along to read/maintain it.
I try to avoid using code to print HTML "directly". It is difficult to maintain, edit, add styles to, etc. In some cases, like generating an HTML email in the code, I create a text file or HTML file with markers like [name], [verification code], etc. I load this from the code and replace those markers. This way, you can edit the style of the email without recompiling your code. Separating "presentation" and "logic" is a good practice in my opinion.
Mixing code within HTML is generally not a good practice, for reasons similar to those given in #1. However, I do use code in HTML for things like simple dynamic strings that are displayed multiple times on a page or pages. I think this is better than creating multiple server controls just to set the exact same values. Since this is not code "logic" mixed in the HTML, I think this is OK.
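The marker-based email template described above can be sketched roughly as follows. The function name `renderTemplate` is hypothetical, and the template string is inlined here only to keep the example self-contained; in practice it would be loaded from a separate file so it can be edited without touching the code.

```javascript
// Escape values so user data cannot inject markup into the email.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Replace [marker] placeholders with values; unknown markers are left untouched.
function renderTemplate(template, values) {
  return template.replace(/\[([^\[\]]+)\]/g, (marker, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? escapeHtml(values[key]) : marker
  );
}

const template = "<p>Hello [name], your code is <b>[verification code]</b>.</p>";
console.log(renderTemplate(template, { name: "Ann", "verification code": "1234" }));
```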