Would a browser ever try to parse img> - html

Is it likely, or even possible, for an img tag (or any other tag) to be parsed when the < is several characters earlier, or omitted altogether? Would this happen in any notable HTML parser?
For example
<div>$test</div>.
Here $test could be any string containing a > but not a <, such as img> but not <img.
Full disclosure: This question is specifically to see whether or not the comment I posted was correct.

You don't technically need either < or >. Load this up in IE, and it'll run a javascript alert. Not sure if it's possible without messing with the charset though.
<HTML>
<HEAD>
<META charset="UTF-7">
</HEAD>
<BODY>
<DIV>+ADw-script+AD4-alert(+ACI-XSS+ACI-)+ADw-/script+AD4-</DIV>
</BODY>
</HTML>
Source: http://securityoverride.org/articles.php?article_id=13
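To see why the UTF-7 payload above contains no literal angle brackets, here is a minimal sketch that decodes it. It assumes Node.js (for Buffer), and decodeUtf7 is a hypothetical helper name, not part of any library; it handles only the simple "+base64-" blocks used in the example and is not a complete UTF-7 decoder.

```javascript
// Hypothetical helper: decode the "+...-" blocks of a UTF-7 string.
// Each block is base64-encoded UTF-16BE; "+-" stands for a literal "+".
function decodeUtf7(input) {
  return input.replace(/\+([A-Za-z0-9+/]*)-?/g, (_, b64) => {
    if (b64 === "") return "+";
    const bytes = Buffer.from(b64, "base64"); // Node.js accepts unpadded base64
    let out = "";
    // Combine each pair of bytes into one UTF-16 code unit (big-endian).
    for (let i = 0; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  });
}

console.log(decodeUtf7("+ADw-script+AD4-alert(+ACI-XSS+ACI-)+ADw-/script+AD4-"));
```

A browser that honors the UTF-7 charset declaration performs this decoding before parsing, so the parser sees ordinary script tags.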

Well, out of curiosity, I changed one of my test pages so its script section began with this:
< script>
The result was completely broken and just printed all of my javascript. This happened in IE9, GC28, and Firefox. I didn't really have an image on-hand to test with, but I think we can derive from this that HTML tags are always required to have no white-space between the angle bracket and tag declaration.
If you'd like even further confirmation, I suggest you browse the W3C standardization documents to see if you can find where they declare the generic pattern for HTML element tags. Many HTML parsers probably base themselves off those documents to ease their coding.

White space is allowed after the tagname
< script> is invalid
while
<script> is valid

Related

Why does `<span/ >` produce an empty span?

In order to break long lines of text in the HTML source (because I prefer source code that approaches human readability) without introducing whitespace when
rendered, I have used source similar to
<! DOCTYPE HTML>
<html>
<body>
<p>
Span<span/
>in<span/
>the<span/
>place<span/
>where<span/
>you<span/
>live.
</p>
</body>
</html>
Which renders something like
Spanintheplacewhereyoulive.
However, I am not sure why this seems to work (using recent Chrome and Firefox, and a version of Konqueror). The standard seems not to cover this situation, unless I have missed it. A related post suggests to me that the above example is not valid, insofar as the <span/ > tags are concerned.
Not sure it matters, but I want to emphasize that there is whitespace between the / and > in <span/ >. This is lexically distinct from <span /> and <span/>, although I don't know if it's semantically different.
Why does <span/ > render, producing an empty span? Am I accessing some browser-specific behavior?

Can I safely replace "<ul>" tags within HTML using regexes?

I am trying to solve this issue, where users paste invalid HTML that we have to deal with, of the form <ol><ul><li>item</li></ul></ol>. We are currently parsing using lxml. In legal HTML, <ol> cannot have a (direct) child of a <ul> (it must be in an <li>) so lxml closes the ol tag too soon to try to "repair" the HTML, producing <div><ol/><ul><li>item</li></ul>.
The user-pasted text also might be invalid XML (e.g., bare <br> tag), so we can't just parse it as XML.
Thus, we can neither parse it as HTML nor XML, because it might be invalid.
To make this certain (common) case of invalid HTML into valid HTML, can we just replace all <ul> tags with <ol> tags using regexes?
If I use lxml to parse <ol><ol><li>item</li></ol></ol>, the output looks fine (does not close a tag too soon).
However, I don't want to break actual user-typed text, and I'm wondering if there are edge cases I haven't thought of (like "<ul>" within a <pre> tag or some other crazy thing that isn't actually a tag, though I've tested that particular case).
Yes, it would change unnumbered lists to numbered lists. I'm okay with that.
Yes, I have read this fun regex answer.
In general, there is no guarantee of a 'non-edge case' transform with HTML and regular expressions. HTML, more so than XML, has rules that make a direct text replacement of things that look like tags problematic.
The following text validates as HTML using w3c.org validation checker without any warnings.
<!DOCTYPE html>
<html lang="en">
<head>
<title><!--<ul>--></title>
<style lang="css">s {content: "<ul>";}</style>
<script>"<ul>"</script>
</head>
<body data-ul="<ul>"></body>
</html>
That aside, using some regular expression heuristics might solve the issue at hand - at least insofar as a reasonable scope. A streaming HTML token parser that does not attempt to apply any validation or DOM/tree building might also be useful for the initial replacement stage.
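As a rough illustration of the "regular expression heuristics" idea, the sketch below tokenizes just enough to skip comments, raw-text elements, and quoted attribute values, and renames only real ul start and end tags. The function name replaceUlWithOl is hypothetical, and this is a heuristic under those assumptions, not a conforming HTML tokenizer.

```javascript
// Hypothetical helper: rename <ul> start/end tags to <ol> while leaving
// comments, raw-text elements, and attribute values untouched.
function replaceUlWithOl(html) {
  const token = new RegExp(
    [
      "<!--[\\s\\S]*?-->", // comments
      "<(script|style|title|textarea)\\b[\\s\\S]*?<\\/\\1\\s*>", // elements whose content is not parsed as tags
      "<\\/?[a-zA-Z][^\\s\\/>]*(?:\"[^\"]*\"|'[^']*'|[^>\"'])*>", // any other tag, including quoted attributes
    ].join("|"),
    "gi"
  );
  // Only a token that is itself a ul start or end tag gets rewritten.
  return html.replace(token, (match) =>
    match.replace(/^<(\/?)ul(?=[\s\/>])/i, "<$1ol")
  );
}

console.log(replaceUlWithOl("<ol><ul><li>item</li></ul></ol>"));
```

Run against the validating document above, this leaves every "<ul>" string alone, since none of them is an actual tag. Inputs with unterminated comments or unbalanced quotes can still defeat it.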

HTML tag that causes other tags to be rendered as plain text [duplicate]

I'd like to add an area to a page where all of the dynamic content is rendered as plain text instead of markup. For example:
<myMagicTag>
<b>Hello</b> World
</myMagicTag>
I want the <b> tag to show up as just text and not as a bold directive. I'd rather not have to write the code to convert every "<" to an "&lt;".
I know that <textarea> will do it, but it has other undesirable side effects like adding scroll bars.
Does myMagicTag exist?
Edit: A jQuery or javascript function that does this would also be ok. Can't do it server-side, unfortunately.
You can do this with the script element (bolded by me):
The script element allows authors to include dynamic script and data blocks in their documents.
Example:
<script type="text/plain">
This content has the media type text/plain, so characters reserved in HTML have no special meaning here: <div> ← this will be displayed.
</script>
(Note that the allowed content of the script element is restricted, e.g. you can’t have </script> as text content (it would close the script element).)
Typically, script elements have display:none by default in browser’s CSS, so you’d need to overwrite that in your CSS, e.g.:
script[type="text/plain"] {display:block;}
You can use a function to escape the < >, eg:
'span.name': function(){
return this.name.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
Also take a look at <plaintext>. I haven't used it myself, but it is known to render everything that follows as plain text (by everything I mean that it ignores even its own closing tag, so all the following code is rendered as text).
The tag used to be <XMP> but in HTML 4 it was already deprecated. Browsers don't seem to have dropped its support but I would not recommend it for anything beyond quick debugging. The MDN article about <XMP> lists two other tags, <plaintext> and <listing>, that were deprecated even earlier. I'm not aware of any current alternative.
Whatever, the code to encode plain text into HTML is pretty straightforward in most programming languages.
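For instance, since the question allows a JavaScript solution, a minimal sketch of such an encoder could look like this. The name escapeHtml is hypothetical; the ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.

```javascript
// Hypothetical helper: encode plain text so it can be inserted into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml("<b>Hello</b> World"));
```

In a browser, assigning the raw string to an element's textContent has the same visible effect without any manual escaping.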
Note: the term similar means exactly that—all three are designed to inject plain text into HTML. I'm not implying that they are synonyms or that they behave identically—they don't.
There is no specific tag except the deprecated <xmp>.
But a script tag is allowed to store unformatted data.
Here is the only solution so far showing dynamic content, as you wanted.
<script id="myMagicTag" type="text/plain" style="display:block;">
<b>Hello</b> World
</script>
Use Visible Data-blocks
<script>
document.querySelector("#myMagicTag").innerHTML = "<b>Unformatted</b> dynamic content"
</script>
No, that's not possible, you need to HtmlEncode it.
If you're using a server-side language, that's not really difficult though.
In .NET you would do something like this:
string encodedtext = HttpContext.Current.Server.HtmlEncode(plaintext);
In my application, I need to prevent HTML from rendering
"if (a<b || c>100) ..."
and
"cout << ...".
Also the entire C++ code region HTML must pass through the GCC compiler with the desired effect. I've hit on two schemes:
First:
//<xmp>
#include <string>
//</xmp>}
For reasons that escape me, the <xmp> tag is deprecated. I find (2016-01-09) that Chrome and FF, at least, render the tag the way I want. While researching my problem, I saw a remark that <xmp> is required in HTML 5.
Second, in <head> ... </head>, insert:
<style type="text/css">
textarea { border: none; }
</style>
Then in <body> ... </body>, write:
//<br /> <textarea rows="4" disabled cols="80">
#include <stdlib.h>
#include <iostream>
#include <string>
//</textarea> <br />
Note: Set cols="80" to prevent the following text from appearing on the right. Set rows to one more line than you enclose in the tag; this prevents scroll bars. This second technique has several disadvantages:
The "disabled" attribute shades the region
Incomprehensible, complex comments in the code sent to the compiler
Harder to understand
More typing
However, this method is neither obsolete nor deprecated. The gods of HTML will make their faces to shine unto you.

Is there a HTML/CSS way to display HTML tags without parsing?

Is there any way that I could display HTML tags without parsing? Tags like XMP worked before perfectly but now it's replaced with PRE that isn't so cool. Take a look at this example:
//This used to NOT PARSE HTML even if you used standard < and >.
<XMP>
<a href="http://example.com">Link</a>
</XMP>
//New PRE tag requires &lt; and &gt; as replacement for < and >.
<PRE>
&lt;a href="http://example.com"&gt;Link&lt;/A&gt;
</PRE>
What I'm looking for is equivalent of old XMP tag. New PRE tag will parse code.
You can use a script element with its type set to denote plain text, and set its display property to block. This only affects the parsing behavior: no markup (tags or entity or character references) is recognized, except for the end tag of the element itself </script>. (So it is not quite the same as xmp, where the recognized tag is </xmp>.) You can separately make white space handling similar to that of xmp and pre and/or set the font to monospace as in those elements by default.
Example:
<style>
script {
display: block;
}
</style>
Then within document body:
<script type="text/plain">
<i>&eacute;</i>
</script>
Tested on newest versions of IE, Chrome, Firefox, Opera. Didn’t work in IE 8 and IE 7 emulation on IE 9, but that’s probably a bug in the emulation.
However, I don’t see why you would use this instead of xmp, which hasn’t stopped working. It’s not in the specs, but if you are worried about that, you should have always been worried. Mentioned in HTML 2.0 (the first HTML spec ever) as avoidable, it was deprecated in HTML 3.2 and completely removed in HTML 4.0 (long ago: in 1997).
The xmp is making a comeback rather than dying. The W3C HTML5 (characterized as the current HTML specification by W3C staff) declares xmp as obsolete and non-conforming, but it also imposes a requirement on browsers: “User agents must treat xmp elements in a manner equivalent to pre elements in terms of semantics and for purposes of rendering. (The parser has special behavior for this element though.)” The old parsing behavior is thus not explicitly required, but clearly implied.
Personally, I think the <code> </code> tags only work this way in Dreamweaver, whereas <xmp> </xmp> never stopped working: as long as the content does not contain </xmp>, it works fine. Using <textarea> </textarea> lets visitors edit the displayed code on the page, so I recommend continuing to use <xmp> </xmp>.
The modern way is to use textarea with (boolean) attribute readonly. You could use XMP, but that is deprecated, so it may eventually stop being supported.
example:
<textarea readonly='true'>
<p>This is some text</p>
</textarea>
And then... a few years go by, I have the same problem while converting my blog from wordpress to a vuejs spa backed by lambda and dynamodb.
And the answer, at least in my situation: escape the entity.
< becomes &lt;
> becomes &gt;
etc. etc.
Hope this helps.
There isn't.
In theory you could use a CDATA block, but no browser supports that in text/html mode.
Use character references.
If you want to be more complex, another way is to create a custom tag using jQuery. For this example, I used <noparse>.
$('noparse').each(function(){
if($(this).attr('tagchecked') != 'true'){ //checks if already changed tag
$(this).text($(this).html()).attr('tagchecked', 'true'); //makes the html into plaintext
}
});
I suggest using the HTML iframe tag and putting the text you'd like to display in the src attribute. You only have to URL-encode or base64-encode it first.
example (urlencoded):
<iframe src="data:text/plain,%22%3Chello%3E%22"></iframe>
example (base64):
<iframe src="data:text/plain;base64,IjxoZWxsbz4i"></iframe>
Result displayed as:
"<hello>"
Technically you could use <textarea>, but it would require that there be no </textarea> tag in the code you are trying to show. It'd just be easier to escape the <.
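For the iframe approach a few answers up, the data: URLs can be generated rather than encoded by hand. The sketch below is an assumption-laden illustration: toPlainTextDataUrl is a hypothetical name, and the base64 branch only handles Latin-1 text (it uses btoa where available and falls back to Node's Buffer).

```javascript
// Hypothetical helper: build a data: URL that displays `text` as plain text.
function toPlainTextDataUrl(text, useBase64 = false) {
  if (!useBase64) {
    // Percent-encode everything that is not safe in a URL.
    return "data:text/plain," + encodeURIComponent(text);
  }
  // btoa exists in browsers and recent Node.js; Buffer is the Node.js fallback.
  const base64 = typeof btoa === "function"
    ? btoa(text)
    : Buffer.from(text, "latin1").toString("base64");
  return "data:text/plain;base64," + base64;
}

console.log(toPlainTextDataUrl('"<hello>"'));
console.log(toPlainTextDataUrl('"<hello>"', true));
```

The two calls reproduce the src values used in the iframe examples.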
Well, one way would be to use jQuery. The jQuery .text() method will encode special characters, and the original un-encoded text will remain if you view source.
<div id="text">
<a href="#">This is an anchor</a>
</div>
<script>
var t = $('#text'); t.text(t.html());
</script>

Are there any issues with always self closing empty tags in html?

Are there any browser issues with always collapsing empty tags in HTML?
So for example an empty head tag can be written like this
<head></head>
but it can also be written like this
<head/>
Will the second case cause issues in any scenario?
Thanks
Self-closing <script> tags can mess up some browsers really badly. I remember my whole page disappearing into thin air in IE after I self-closed a script tag - everything after it was read as a script.
Assuming that you are serving your XHTML as XML, no. <head></head> is entirely equivalent to <head />. In fact, an XML parser won't even bother to tell you which one you have.
(There is, however, an issue in that the <head> tag must contain a <title>.)
You shouldn't use minimized form for head in XHTML.
http://www.w3.org/TR/xhtml1/#guidelines
About empty elements:
http://www.w3.org/TR/xhtml1/#C_3
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
In other words, a paragraph should always be closed in XHTML, whereas in HTML you could get away with only the opening tag. But if an element is supposed to have content, it should be properly opened and closed.
For example, a line break has an EMPTY content model and can be written as <br /> (the same goes for <hr />), but <div /> cannot be written that way.
Also see this SO question.
Empty Elements (XHTML)
Shorthand markup in HTML
Self-closing tags don't exist in HTML. The / is always ignored, that is, <foo/> and <foo> are equivalent. For elements such as br, that's fine, because you want <br>. However, <script src="..." /> means the same as <script src="...">, which is a problem (as noted in other answers). <head/> is less of a problem, because the </head> end tag is optional anyway.
In XML, on the other hand, self-closing tags do what you want. However, you probably aren't using XML, even if you've got an XHTML doctype. Unless you send your documents with a text/xml, application/xml or application/xhtml+xml MIME type (or any other XML MIME type), particularly if you send them as text/html, they will not be treated as XML.
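If XHTML-style markup has to be served as text/html anyway, one workaround is to expand self-closing syntax on non-void elements into explicit end tags first. The following is a rough sketch under that assumption: expandSelfClosing is a hypothetical name, and the regex does not understand comments, script content, or unquoted attribute values ending in a slash.

```javascript
// Elements that never take an end tag in HTML; their trailing "/" is harmless.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: turn <foo ... /> into <foo ...></foo> for non-void elements.
function expandSelfClosing(markup) {
  const selfClosing = /<([a-zA-Z][^\s\/>]*)((?:"[^"]*"|'[^']*'|[^>"'])*?)\s*\/>/g;
  return markup.replace(selfClosing, (match, name, attrs) =>
    VOID_ELEMENTS.has(name.toLowerCase()) ? match : `<${name}${attrs}></${name}>`
  );
}

console.log(expandSelfClosing('<script type="text/javascript" src="somefile.js" />'));
```

With this, the self-closed script tag mentioned in the other answers gets an explicit end tag, while br and hr are left as written.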
Not that I am aware of. One caveat that has bitten me in the past is self closing my script tag: <script type="text/javascript" src="somefile.js" />
This results in some interesting failures.
In general an empty element can be written as a self closing tag, or opening and closing tags.
However, the HTML4 DTD specifies that the document HEAD must contain a TITLE element.
"Every HTML document must have a TITLE element in the HEAD section."
http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.4.1
I believe some older browsers had problems with the lack of whitespacing - in particular
<head/> would be interpreted as a "head/" tag, whereas <head /> will be interpreted as a "head" tag with a blank attribute "/" which is ignored.
This only affects a few browsers, AFAIK. Either is valid XHTML, but older HTML-only browsers might have trouble.
This is in fact documented in the XHTML guidelines as C.2
Even considering only browser issues (i.e. disregarding validity) and narrowing the question down to the head tag alone, the answer is still yes.
Compare
<head/>
<object>Does this display?</object>
against
<head></head>
<object>Does this display?</object>
each served as text/html to any version of IE.
Does this display? will be shown only in the latter example.