I am trying to solve this issue, where users paste invalid HTML that we have to deal with, of the form <ol><ul><li>item</li></ul></ol>. We are currently parsing using lxml. In valid HTML, an <ol> cannot have a <ul> as a direct child (the <ul> must be inside an <li>), so lxml tries to "repair" the HTML by closing the ol tag too soon, producing <div><ol/><ul><li>item</li></ul>.
The user-pasted text also might be invalid XML (e.g., bare <br> tag), so we can't just parse it as XML.
Thus, we can neither parse it as HTML nor XML, because it might be invalid.
To make this certain (common) case of invalid HTML into valid HTML, can we just replace all <ul> tags with <ol> tags using regexes?
If I use lxml to parse <ol><ol><li>item</li></ol></ol>, the output looks fine (does not close a tag too soon).
However, I don't want to break actual user-typed text, and I'm wondering if there are edge cases I haven't thought of (like "<ul>" within a <pre> tag or some other crazy thing that isn't actually a tag, though I've tested that particular case).
Yes, it would change unnumbered lists to numbered lists. I'm okay with that.
Yes, I have read this fun regex answer.
In general, there is no guarantee of a 'non-edge case' transform with HTML and regular expressions. HTML, more so than XML, has rules that make a direct text replacement of things that look like tags problematic.
The following text validates as HTML using w3c.org validation checker without any warnings.
<!DOCTYPE html>
<html lang="en">
<head>
<title><!--<ul>--></title>
<style lang="css">s {content: "<ul>";}</style>
<script>"<ul>"</script>
</head>
<body data-ul="<ul>"></body>
</html>
That aside, using some regular expression heuristics might solve the issue at hand - at least insofar as a reasonable scope. A streaming HTML token parser that does not attempt to apply any validation or DOM/tree building might also be useful for the initial replacement stage.
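A minimal sketch of that token-level idea, assuming Python's standard-library html.parser as the streaming tokenizer (the question uses lxml; the class name UlToOlRewriter and the function replace_ul_with_ol are made up for illustration). It rewrites only real ul start and end tags, and passes comments, script/style text and attribute values through. It is not a full round-tripping serializer: end tags are re-emitted in normalized form.

```python
from html.parser import HTMLParser


class UlToOlRewriter(HTMLParser):
    """Hypothetical helper: re-emits the token stream, renaming ul to ol."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # original source text of the tag
        if tag == "ul":
            raw = "<ol" + raw[3:]  # swap the name, keep attributes as typed
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</ol>" if tag == "ul" else "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_ul_with_ol(html):
    parser = UlToOlRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_ul_with_ol("<ol><ul><li>item</li></ul></ol>"))
print(replace_ul_with_ol('<script>"<ul>"</script>'))
```

The rewritten string can then be handed to lxml as before. Whether this tokenizer agrees with a browser on every malformed input is not something this sketch guarantees.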
Is it likely, or even possible, for an img tag (or any other tag) to be parsed when the < is several characters before the tag name, or omitted entirely? Would this happen in any notable HTML parsers?
For example
<div>$test</div>.
Here $test could be any string containing a > but not a <, such as img> but not <img.
Full disclosure: This question is specifically to see whether or not the comment I posted was correct.
You don't technically need either < or >. Load this up in IE, and it'll run a javascript alert. Not sure if it's possible without messing with the charset though.
<HTML>
<HEAD>
<META charset="UTF-7">
</HEAD>
<BODY>
<DIV>+ADw-script+AD4-alert(+ACI-XSS+ACI-)+ADw-/script+AD4-</DIV>
</BODY>
</HTML>
Source: http://securityoverride.org/articles.php?article_id=13
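To see why that payload works, it helps to decode it. The snippet below is only an illustration using Python's built-in utf-7 codec; it shows what a UTF-7-aware consumer would see, not how any particular browser behaves.

```python
# The DIV content from the example above, as raw ASCII bytes.
payload = b"+ADw-script+AD4-alert(+ACI-XSS+ACI-)+ADw-/script+AD4-"

# The bytes themselves contain no angle brackets or quotes...
assert b"<" not in payload and b">" not in payload

# ...but decoding them as UTF-7 produces a complete script element.
decoded = payload.decode("utf-7")
print(decoded)  # <script>alert("XSS")</script>
```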
Well, out of curiosity, I changed one of my test pages so its script section began with this:
< script>
The result was completely broken: the page just printed all of my JavaScript. This happened in IE9, GC28, and Firefox. I didn't really have an image on hand to test with, but I think we can conclude from this that HTML tags must have no whitespace between the opening angle bracket and the tag name.
If you'd like even further confirmation, I suggest you browse the W3C standardization documents to see if you can find where they declare the generic pattern for HTML element tags. Many HTML parsers probably base themselves off those documents to ease their coding.
Whitespace is allowed after the tag name, but not before it:
< script> is invalid
while
<script> (or <script >, with whitespace after the name) is valid
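A quick way to check both points (a stray > without a <, and whitespace before the tag name) against one tokenizer is sketched below. It assumes Python's standard-library html.parser, which is not a browser engine, so treat the result as indicative only. TagLogger and tokenize are names invented for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records start tags and text separately."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(html):
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.start_tags, "".join(parser.text)


# A ">" with no "<" in front of it never opens a tag.
print(tokenize("<div>img></div>"))

# Whitespace right after "<" means no tag is opened; it stays text.
print(tokenize("< script>x</script>"))

# Whitespace after the tag name is fine.
print(tokenize("<script >x</script>"))
```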
Instead of <!--, I used <!- ...and it is working.
How?
It's not actually working - it's just interpreting it as an actual tag, and then throwing that tag out as invalid.
<!- foo bar -->
is treated as a tag, <!-foo bar--> which obviously isn't a standard HTML tag, and thus is ignored.
Try this, and you'll see it's not truly working as a comment:
<!- >foo bar-->
Modern browser parsers (i.e. those that use the HTML5 parsing algorithm) work like this. If they are expecting text or a new tag next, and they see <! then they check the next few characters to see if they are -- or DOCTYPE or, if they are processing embedded SVG or MathML, [CDATA[. (See http://dev.w3.org/html5/spec/tokenization.html#markup-declaration-open-state)
If, as in the case of <!- foo, none of these match, then the parser enters the bogus comment state, where all the characters following, up to the next >, are read and converted into a comment to be put into the DOM.
Hence the behaviour you see with <!- working like a comment start. Note that such behaviour is "repair" behaviour for broken markup and it's wise not to rely on it.
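The "up to the next >" rule can be observed outside a browser too. The sketch below assumes Python's standard-library html.parser, which handles <! followed by something unexpected in a similar way; it is a stand-in for illustration, not the HTML5 algorithm itself. CommentLogger and comments_and_text are made-up names.

```python
from html.parser import HTMLParser


class CommentLogger(HTMLParser):
    """Hypothetical helper: records comments and text separately."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.text.append(data)


def comments_and_text(html):
    parser = CommentLogger()
    parser.feed(html)
    parser.close()
    return parser.comments, "".join(parser.text)


# Everything up to the first ">" becomes the comment.
print(comments_and_text("<!- foo bar -->"))

# With an early ">", the comment ends there and the rest is plain text.
print(comments_and_text("<!- >foo bar-->"))
```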
You can see how such markup forms a DOM here: Live DOM Viewer
Also note that this is different from what @Amber says. It is not treated as a tag in any meaningful sense, and it is certainly not ignored.
I want to create something like
<menu>
<lunch>
<dish>aaa</dish>
<dish>bbb</dish>
</lunch>
<dinner>
<dish>ccc</dish>
</dinner>
</menu>
Can it be done in HTML5?
I know I can do it with
<ul id="menu">
<li>
<ul id="lunch">
<li class="dish">aaa</li>
<li class="dish">bbb</li>
</ul>
</li>
<li>
<ul id="dinner">
<li class="dish">ccc</li>
</ul>
</li>
</ul>
but it is so much less readable :(
You can use custom tags in browsers, although they won’t be HTML5 (see Are custom elements valid HTML5? and the HTML5 spec).
Let's assume you want to use a custom tag element called <stack>. Here's what you should do...
STEP 1
Normalize its attributes in your CSS Stylesheet (think css reset) -
Example:
stack{display:block;margin:0;padding:0;border:0; ... }
STEP 2
To get it to work in old versions of Internet Explorer, you need to append this script to the head (Important if you need it to work in older versions of IE!):
<!--[if lt IE 9]>
<script> document.createElement("stack"); </script>
<![endif]-->
Then you can use your custom tag freely.
<stack>Overflow</stack>
Feel free to set attributes as well...
<stack id="st2" class="nice"> hello </stack>
I'm not so sure about these answers. As I've just read:
"CUSTOM TAGS HAVE ALWAYS BEEN ALLOWED IN HTML."
http://www.crockford.com/html/
The point here being, that HTML was based on SGML. Unlike XML with its doctypes and schemas, HTML does not become invalid if a browser doesn't know a tag or two. Think of <marquee>. This has not been in the official standard. So while using it made your HTML page "officially unapproved", it didn't break the page either.
Then there is <keygen>, which was Netscape-specific, forgotten in HTML4 and rediscovered and now specified in HTML5.
And also we have custom tag attributes now, like data-XyZzz="..." allowed on all HTML5 tags.
So, while you shouldn't invent a whole custom unspecified markup salad of your own, it's not exactly forbidden to have custom tags in HTML. That is however, unless you want to send it with an +xml Content-Type or embed other XML namespaces, like SVG or MathML. This applies only to SGML-confined HTML.
I just want to add to the previous answers that there is a reason to use only hyphenated, two-word tag names for custom elements.
Such names should never be standardised.
For example, suppose you want to use the tag <icon>, because you don't like <img>, and you don't like <i> either...
Well, keep in mind that you're not the only one. Maybe in the future, the W3C and/or browsers will specify or implement this tag.
At that point, browsers will probably implement native styles for this tag, and your website's design may break.
So I'm suggesting that you use (for this example) <img-icon>.
As a matter of fact, the tag <menu> is already defined (not much used, but defined). It should contain <menuitem> elements, which behave like <li>.
As Michael suggested in the comments, what you want to do is quite possible, but your nomenclature is wrong. You aren't "adding tags to HTML 5," you are creating a new XML document type with your own tags.
I did this for some projects at my last job. Some practical advice:
When you say you want to "add these to HTML 5," I assume what you really mean is that you want the pages to display correctly in a modern browser, without having to do a lot of work on the server side. This can be accomplished by inserting a "stylesheet processing instruction" at the top of the xml file, like <?xml-stylesheet type="text/xsl" href="menu.xsl"?>. Replace "menu.xsl" with the path to the XSL stylesheet that you create to convert your custom tags into HTML.
Caveats: Your file must be a well-formed XML document, complete with the XML declaration <?xml version="1.0"?>. XML is pickier than HTML about things like mismatched tags. Also, unlike HTML, tags are case-sensitive. You must also make sure that the web server is sending the files with the appropriate MIME type "application/xml". Often the web server will be configured to do this automatically if the file extension is ".xml", but check.
Big Caveat: Finally, using the browsers' automatic XSL transformation, as I've described, is really best only for debugging and for limited applications where you have a lot of control. I used it successfully in setting up a simple intranet at my last employer, that was accessed only by a few dozen people at most. Not all browsers support XSL, and those that do don't have completely compatible implementations. So if your pages are to be released into the "wild," it's best to transform them all into HTML on the server side, which can be done with a command line tool, or with a button in many XML editors.
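As an illustration of the server-side route, here is a small sketch that converts the menu document from the question into nested lists. Note that it swaps XSL for plain Python with the standard-library xml.etree.ElementTree, purely to keep the example self-contained; the function name menu_to_html and the choice of ul/li output are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

MENU_XML = """<menu>
  <lunch><dish>aaa</dish><dish>bbb</dish></lunch>
  <dinner><dish>ccc</dish></dinner>
</menu>"""


def menu_to_html(xml_text):
    """Hypothetical converter: custom XML tags -> nested ul/li markup."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    outer = ET.Element("ul", {"id": root.tag})
    for meal in root:  # <lunch>, <dinner>
        li = ET.SubElement(outer, "li")
        inner = ET.SubElement(li, "ul", {"id": meal.tag})
        for dish in meal:  # <dish>
            item = ET.SubElement(inner, "li", {"class": dish.tag})
            item.text = dish.text
    return ET.tostring(outer, encoding="unicode", method="html")


print(menu_to_html(MENU_XML))
```

The output is ordinary HTML that any browser can render, so nothing depends on the client's XSL support.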
Creating your own tag names in HTML is not possible / not valid. That's what XML, SGML and other general markup languages are for.
What you probably want is
<div id="menu">
<div id="lunch">
<span class="dish">aaa</span>
<span class="dish">bbb</span>
</div>
<div id="dinner">
<span class="dish">ccc</span>
</div>
</div>
Or instead of <div/> and <span/> something like <ul/> and <li/>.
In order to make it look and function right, just hook up some CSS and Javascript.
Custom tags can be used in Safari, Chrome, Opera, and Firefox, at least as far as using them in place of "class=..." goes.
green {color: green} in css works for
<green>This is some text.</green>
<head>
<style type="text/css">
lunch{
color:blue;
font-size:32px;
}
</style>
</head>
<body>
<lunch>
This is how you create custom tags like what he is asking for. It's very simple: just do what I wrote. It works, with no JS or convoluted workarounds needed, and it lets you do exactly what he wrote.
</lunch>
</body>
For embedding metadata, you could try using HTML microdata, but it's even more verbose than using class names.
<div itemscope>
<p>My name is <span itemprop="name">Elizabeth</span>.</p>
</div>
<div itemscope>
<p>My name is <span itemprop="name">Daniel</span>.</p>
</div>
Besides writing an XSL stylesheet, as I described earlier, there is another approach, at least if you are certain that Firefox or another full-fledged XML browser will be used (i.e., NOT Internet Explorer). Skip the XSL transform, and write a complete CSS stylesheet that tells the browser how to format the XML directly. The upside here is that you wouldn't have to learn XSL, which many people find to be a difficult and counterintuitive language. The downside is that your CSS will have to specify the styling very completely, including what are block nodes, what are inlines, etc. Usually, when writing CSS, you can assume that the browser "knows" that <em>, for instance, is an inline node, but it won't have any idea what to do with <dish>.
Finally, it's been a few years since I tried this, but my recollection is that IE (at least a few versions back) refused to apply CSS stylesheets directly to XML documents.
The point of HTML is that the tags included in the language have an agreed meaning, that everyone in the world can use and base decisions on - like default styling, or making links clickable, or submitting a form when you click on an <input type="submit">.
Made-up tags like yours are great for humans (because we can learn English and thus know, or at least guess, what your tags mean), but not so good for machines.
Polymer or X-tags allow you to build your own HTML tags. They are based on the browser's native "shadow DOM".
In some circumstances, it may look like creating your own tag names just works fine.
However, this is just your browser's error handling routines at work. And the problem is, different browsers have different error handling routines!
See this example.
The first line contains two made-up elements, what and ever, and they get treated differently by different browsers. The text comes out red in IE11 and Edge, but black in other browsers.
For comparison, the second line is similar, except it contains only valid HTML elements, and it will therefore look the same in all browsers.
body {color:black; background:white;} /* reset */
what, ever:nth-of-type(2) {color:red}
code, span:nth-of-type(2) {color:red}
<p><what></what> <ever>test</ever></p>
<p><code></code> <span>test</span></p>
Another problem with made-up elements is that you won't know what the future holds. If you created a website a couple of years ago with tag names like picture, dialog, details, slot, template etc, expecting them to behave like spans, are you in trouble now!
This is not an option in any HTML specification :)
You can probably do what you want with <div> elements and classes. From the question, I'm not sure exactly what you're after, but no, creating your own tags is not an option.
As Nick said, custom tags are not supported by any version of HTML.
But, it won't give any error if you use such markup in your HTML.
It seems like you want to create a list. You can use an unordered list <ul> to create the root elements, and use the <li> tag for the items underneath.
If that's not what you want to achieve, please specify exactly what you want. We can come up with an answer then.
You can add custom attributes through HTML 5 data- attributes.
For example: Message
That is valid for HTML 5. See http://ejohn.org/blog/html-5-data-attributes/ to get details.
You can just do some custom CSS styling. This will create a tag that makes the background color red:
redback {background-color:red;}
<redback>This is red</redback>
you can use this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>MyExample</title>
<style>
bloodred {color: red;}
</style>
</head>
<body>
<bloodred>
this is BLOODRED (not to scare you)
</bloodred>
<script>
var btn = document.createElement("BLOODRED");
</script>
</body>
</html>
I found this article on creating custom HTML tags and instantiating them. It simplifies the process and breaks it down into terms anyone can understand and utilize immediately -- but I'm not entirely sure the code samples it contains are valid in all browsers, so caveat emptor and test thoroughly. Nevertheless, it's a great introduction to the subject to get started.
Custom Elements : Defining new elements in HTML
Are there any browser issues with always collapsing empty tags in HTML?
So, for example, an empty head tag can be written like this
<head></head>
but it can also be written like this
<head/>
Will the second case cause issues in any scenario?
Thanks
Self-closing <script> tags can mess up some browsers really badly. I remember my whole page disappearing into thin air in IE after I self-closed a script tag - everything after it was read as a script.
Assuming that you are serving your XHTML as XML, no. <head></head> is entirely equivalent to <head />. In fact, an XML parser won't even bother to tell you which one you have.
(There is, however, an issue in that the <head> tag must contain a <title>.)
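The equivalence claim for XML parsers is easy to check. This sketch assumes Python's standard-library xml.etree.ElementTree as the XML parser; describe is a made-up helper that lists what the parser exposes for an element.

```python
import xml.etree.ElementTree as ET

collapsed = ET.fromstring("<head/>")
expanded = ET.fromstring("<head></head>")


def describe(element):
    """Hypothetical helper: everything the parser reports about an element."""
    return (element.tag, dict(element.attrib), element.text, list(element))


# Both forms yield the same tag, no attributes, no text and no children,
# and they serialize back to identical markup.
print(describe(collapsed) == describe(expanded))
print(ET.tostring(collapsed) == ET.tostring(expanded))
```

This only says something about XML parsing. A document served as text/html goes through an HTML parser, where the trailing slash is handled differently.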
You shouldn't use minimized form for head in XHTML.
http://www.w3.org/TR/xhtml1/#guidelines
About empty elements:
http://www.w3.org/TR/xhtml1/#C_3
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
In other words, a paragraph should always be explicitly closed in XHTML, whereas in HTML you could get away with only the opening tag. But if the element is supposed to have content, it should be properly opened and closed.
For example, a line break has an EMPTY content model and can be written as <br /> (the same goes for <hr />), but a div cannot be written as <div />.
Also see this SO question.
Empty Elements (XHTML)
Shorthand markup in HTML
Self-closing tags don't exist in HTML. The / is always ignored, that is, <foo/> and <foo> are equivalent. For elements such as br, that's fine, because you want <br>. However, <script src="..." /> means the same as <script src="...">, which is a problem (as noted in other answers). <head/> is less of a problem, because the </head> end tag is optional anyway.
In XML, on the other hand, self-closing tags do what you want. However, you probably aren't using XML, even if you've got an XHTML doctype. Unless you send your documents with an XML MIME type such as text/xml, application/xml or application/xhtml+xml, they will not be treated as XML. In particular, documents sent as text/html are parsed as HTML.
Not that I am aware of. One caveat that has bitten me in the past is self closing my script tag: <script type="text/javascript" src="somefile.js" />
This results in some interesting fail.
In general an empty element can be written as a self closing tag, or opening and closing tags.
However, the HTML4 DTD specifies that the document HEAD must contain a TITLE element.
"Every HTML document must have a TITLE element in the HEAD section."
http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.4.1
I believe some older browsers had problems with the lack of whitespace - in particular
<head/> would be interpreted as a "head/" tag, whereas <head /> will be interpreted as a "head" tag with a blank attribute "/" which is ignored.
This only affects a few browsers, AFAIK. Either is valid XHTML, but older HTML-only browsers might have trouble.
This is in fact documented in the XHTML guidelines as C.2
Even considering only browser issues (i.e. disregarding validity) and narrowing the question down to the head tag alone, the answer is still yes.
Compare
<head/>
<object>Does this display?</object>
against
<head></head>
<object>Does this display?</object>
each served as text/html to any version of IE.
Does this display? will be shown only in the latter example.