Is there an HTML command to bypass an HTML filter?

I am trying to add an HTML link to a website, but the website strips out the HTML. When I view the source I get this:
<a href="http://www.soandso.com">http://www.soandso.com/</a>
instead of what I actually want, which is this:
www.soandso.com
Is there an HTML command to bypass the filter?

Almost certainly not.
Most sites quite rightly don't let users inject arbitrary HTML. That's the source of XSS (cross-site scripting) vulnerabilities.
If the site strips (or escapes) tags, just put in www.example.com and that will have to do.

No. The filters are there for a reason: to prevent you from adding your own HTML to the website. There is no standard for how the filters work, but they will generally either escape or strip out any HTML that isn't allowed. There is no general-purpose way to get around a filter.

First check whether the site uses some sort of special markup. For instance, Stack Overflow supports a variation of Markdown; other sites support Textile or BBCode. If that is the case, check the associated reference for how to include a link (a couple of common syntaxes are shown below). If none of those apply, just use the URL without the <a> element wrapper.
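For illustration, this is roughly what link syntax looks like in two common lightweight markups (dialects vary, so check the target site's own reference):

[link text](http://www.example.com)           (Markdown)
[url=http://www.example.com]link text[/url]   (BBCode)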

Related

How can I prevent Cloudflare from deleting an HTML tag?

A client is using Cloudflare to deliver his web content, but their filters are removing tags that are unknown to them. Unfortunately, that breaks the service that we provide.
For example, before the DOCTYPE we use a comment of our own, like <!-- Example -->, which tells our server-side filter to encrypt the HTML that follows. But Cloudflare's filters are removing that comment, and thus breaking the service.
Do they have a whitelist or something that can be used to prevent the corruption?
Just don't minify the HTML, CSS, and JavaScript: skip those options while you are adding the domain. It works for me.
So you want to preserve HTML tags, similar to comments, that don't normally appear on a page?
Page-speed modifiers strip such tags because they are not important to a page and are therefore considered unnecessary. By removing all comments, a few bytes can be shaved off the download of a page. On most pages that will make little difference, but some websites, especially those running a CMS with a multitude of plugins, can contain a lot of comments.
So it is the page-speed enhancement that you need to disable to preserve such tags.
Cloudflare provides a control panel to make adjustments. In the top menu, click on "Rules" to create a Page Rule for your website. Then enter the URL of the page that you want to exempt. Enclosing the URL in asterisks (*) will cater for all similar URLs, like example.com/special. Then pick a setting by selecting "Disable Performance".
This will create a rule to disable page-speed enhancement for all pages that include "example.com/special" in their URL.

How to protect against the HTML <base> tag?

Documentation about the tag says it must be located inside the <head>, and that only one <base> tag is permitted.
However, this tag successfully replaces the base URL for relative links even when it has been put somewhere inside the <body>.
The behavior was noticed in a support ticket system with many relative links. The system renders emails, and if an email's HTML code contains a <base> tag, then after the email is rendered, all relative links resolve against the base URL specified in that tag.
The behavior is confirmed for Firefox, Chromium-based browsers, and Edge; IE11 ignores it.
Is it possible to protect against such behavior without changing the website's HTML markup?
Don't blindly insert an email into an HTML document. Treat it like any other potential source for an XSS attack.
If you are going to allow HTML, then run it through a DOM-aware, white-list-based filter (e.g. HTML Purifier if you are using PHP).
<base> should not be allowed on the white-list.
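For a rough feel of what such filtering involves, here is a browser-side sketch (the stripForbiddenTags helper is hypothetical, and this is not a substitute for a real filter like HTML Purifier; a proper white-list keeps only known-good markup, whereas this deletes known-bad tags for brevity):

// Sketch: parse untrusted HTML into a detached document, then drop
// <base> together with the classic script-bearing elements.
function stripForbiddenTags(untrustedHtml) {
  const doc = new DOMParser().parseFromString(untrustedHtml, 'text/html');
  for (const el of doc.querySelectorAll('base, script, iframe, object, embed')) {
    el.remove(); // a real filter would also scrub on* attributes, etc.
  }
  return doc.body.innerHTML;
}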
As @Quentin suggested, the best way to protect your HTML from unwanted HTML is simply to get rid of it before displaying it. Unfortunately, sometimes this will break things! In this particular example, if the user sent an email containing a <base> tag and relative links, removing the tag would break all the links.
To circumvent this, one could use an iframe. Iframes are a very useful tool for sandboxing foreign code. This should not be used blindly, though...
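A minimal sketch of that iframe approach (the renderEmailSandboxed helper is hypothetical for illustration, and srcdoc is not supported by IE11):

// Render untrusted email HTML inside a sandboxed iframe. An empty
// sandbox attribute disables scripts, forms, plugins, and top-level
// navigation inside the frame.
function renderEmailSandboxed(emailHtml, container) {
  const frame = document.createElement('iframe');
  frame.setAttribute('sandbox', ''); // deliberately no allow-scripts
  frame.srcdoc = emailHtml;
  container.appendChild(frame);
}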
To answer the OP: if you have no control over the application used to read the e-mails, your only hope is to tinker with the e-mails themselves. You could create a hook on the inbox to strip out any unwanted HTML (using @Quentin's suggestion) before putting each message into the monitored inbox.

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Given the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of grabbing everything that's between <a>...</a> and <p>...</p>, but that is not working that well.
Keep in mind that this is for a school project, so I am not allowed to use any kind of external library (the idea is that I have to do the parsing myself). Also, this will be implemented as the HTML of the page is being downloaded; that is, I can't assume I already have the whole HTML page downloaded. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL cases, just to be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, and certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is to use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates, looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself but also on any in-page or external styling. One solution would be simply to strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script and style), and a second pass to remove the remaining tags themselves, as sketched below.
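A minimal JavaScript sketch of that two-pass idea (naive by design: it ignores HTML comments, entities, and styling-based visibility; the stripToVisibleText name is just for illustration):

// Pass 1: drop entire elements whose content is never rendered.
// Pass 2: drop the remaining tags, leaving only the text.
function stripToVisibleText(html) {
  const withoutHidden = html.replace(
    /<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ');
  return withoutHidden
    .replace(/<[^>]+>/g, ' ')   // remove remaining tags
    .replace(/\s+/g, ' ')       // collapse whitespace
    .trim();
}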
I'd consider writing a regex to remove all HTML tags; you should then be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using regular expressions:

// JavaScript code; should (could) work in C# too (needs escaping for quotes):
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g, '');

This regexp will remove HTML tags; notice, however, that you first need to remove the script, link, style, ... elements.
If you decide to go this way, I can help you with the regular expressions needed.
HTML5 includes a detailed description of how to build a parser. It is probably more complicated than what you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, detect whether that DOM element is visible (e.g. el.style.display == 'block' or 'inline'), and then detect whether the element is positioned in such a manner that it isn't outside the viewable area of the page. Then you'll need to check the z-index and the background of each element in order to detect whether any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
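For what it's worth, a rough browser-side sketch of just the visibility part (helper names are hypothetical; it deliberately ignores overlap, z-index, and off-screen positioning):

// Approximate visibility test. Script and style elements are caught
// automatically because browsers give them display:none by default.
function isRoughlyVisible(el) {
  if (window.getComputedStyle(el).visibility === 'hidden') return false;
  for (let e = el; e; e = e.parentElement) {
    if (window.getComputedStyle(e).display === 'none') return false;
  }
  return true;
}

// Collect the text of every text node whose parent passes the test.
function collectVisibleText(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const parts = [];
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    const text = node.textContent.trim();
    if (text && node.parentElement && isRoughlyVisible(node.parentElement)) {
      parts.push(text);
    }
  }
  return parts.join(' ');
}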

Embed sandboxed HTML on a webpage + SEO

I want to embed some HTML on my website... I would like the following:
SEO: the content can be crawled and indexed
Integration: it renders nicely (does not break my DOM tree, for instance, or inherit my styles)
Security: it remains safe for our users (JavaScript disabled)
Flexibility: the HTML can be completely free (I don't want any BBCode or Markdown or even TinyMCE; it's our users who are writing the HTML code...)
I saw that I might be able to use an iframe for that, but I am not sure it is a very good solution with respect to my SEO constraint.
Any answer would be greatly appreciated!!! Thanks.
For your requirements (rendering and security, primarily), an iframe seems to be your only option, especially since no rules are specified for the HTML content except the JS removal. Even some CSS plus an <a> tag can pose a serious security risk, like overlaying outgoing links on your standard interface.
For the SEO part, you can use sitemaps to show search engines the relation between the content and the container, and use HTML tags like <link> to make the connection.
To make sure the user's HTML is safe, you should use HTML Purifier. As for the rest of the question, you should split it up into multiple questions.

Equivalent of LaTeX's \label and \ref in HTML

I have an FAQ in HTML (example) in which the questions refer to each other a lot. That means whenever we insert/delete/rearrange the questions, the numbering changes. LaTeX solves this very elegantly with \label and \ref -- you give items simple tags and LaTeX worries about converting to numbers in the final document.
How do people deal with that in HTML?
ADDED: Note that this is no problem if you don't have to actually refer to items by number, in which case you can set a tag with
<a name="foo">
and then link to it with
<a href="#foo">some non-numerical way to refer to foo</a>.
But I'm assuming "foo" has some auto-generated number, say from an <ol> list, and I want to use that number to refer to and link to it.
There is nothing like this in HTML.
The way you would normally solve this is by having the HTML for the links generated, either by parsing the HTML itself and inserting the TOC (you can do that on the server, before you send the HTML out to the browser, or on the client, by traversing the DOM with a little piece of ECMAScript and simply collecting and inspecting all <a> elements), or by generating the entire HTML document from a higher-level source like a database, an XML document, Markdown or, why not, even LaTeX. A client-side sketch of the first approach follows.
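A minimal client-side sketch of that DOM-traversal idea (the markup convention here is hypothetical: questions are <li> items inside <ol id="faq">, and each reference is an <a class="ref"> whose href points at a question's id):

// Rewrite every reference link's text to the 1-based number of the
// question it points to - a poor man's \ref.
const items = Array.from(document.querySelectorAll('ol#faq > li'));
document.querySelectorAll('a.ref').forEach(function (link) {
  const target = document.querySelector(link.getAttribute('href'));
  const number = items.indexOf(target) + 1;
  if (number > 0) link.textContent = 'Q' + number;
});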
I know it's not widely supported by browsers, but you can do this using CSS counters.
Also, consider using ids instead of names for your anchors.
Instead of \label{key}, use <a name="key" />. Then link using <a href="#key">Link</a>.
PrinceXML can do that, but that's about it. I suppose it'd be best to use server-side scripting.
Here's how I ended up solving this with a PHP script:
http://yootles.com/genfaq
It's roughly as convenient as \label and \ref in LaTeX and even auto-generates the index of questions.
And I put it on an Etherpad instance, which is handy when multiple people are contributing questions to the FAQ.