My question is:
Which HTML elements can contain other elements with the same tag name (like a <div> inside another <div>, which is allowed)? And which HTML elements, among those that can have content, are not allowed to have elements with the same tag name among their descendants (like a <p> inside another <p>, which is not allowed)?
Background
I want to write an HTML parser (a lexer, to be more precise) to automatically process HTML documents that my script reads from the internet. I know there are out-of-the-box parsers (and lexers) for almost every language, but I want to try to write my own.
In doing so, I have to handle malformed HTML, and one of the problems is closing HTML elements that have a valid opening tag but no closing tag. So you have to make an educated guess about where a <div> without a matching </div> ends, and where a <p> without a matching </p> ends.
You can split HTML elements into three classes:
1. Elements that by definition can't have any content, like <img> or <br>
2. Elements that can contain descendants of the same type (<div> in <div> is allowed)
3. Elements that can have content, but not of the same element type (<p> can contain text, <a>, and many other elements, but <p> in <p> is not allowed)
Here I'm not interested in the "void elements" described in 1, because those elements can't have closing tags (and so will never be missing one).
When it comes to creating the closing tags that are missing, types 2 and 3 must be handled differently.
If you receive a document that contains this:
<body> a <div> b <div> c </body>
most browsers will internally transform it into something like this:
<body>
a
<div>
b
<div>
c
</div> <!-- inserted -->
</div> <!-- inserted -->
</body>
All divs are closed at the same point: just before the first existing closing tag that is not a </div> and whose matching opening tag encloses the divs whose closing tags are missing. This algorithm produces nested elements, where each unclosed element becomes the child of the previous unclosed one.
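A minimal Python sketch of this nested-closing behaviour, assuming a deliberately naive regex tokenizer (the names TOKEN and close_nested are made up for illustration; comments, doctypes, void elements and attribute values containing ">" are not handled):

```python
import re

# Matches a start tag, an end tag, or a run of text.
# Assumption: the input contains nothing trickier than plain tags and text.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")

def close_nested(html):
    """Close every unclosed element at the next end tag of an ancestor."""
    out, stack = [], []
    for match in TOKEN.finditer(html):
        slash, name, text = match.groups()
        if text is not None:
            if text.strip():
                out.append(text.strip())
            continue
        name = name.lower()
        if not slash:
            # Start tag: remember it as open and copy it through unchanged.
            stack.append(name)
            out.append(match.group(0))
        elif name in stack:
            # End tag of an open ancestor: first close everything opened
            # after it, then close the ancestor itself.
            while stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            stack.pop()
            out.append(f"</{name}>")
        # An end tag with no open partner is silently dropped.
    while stack:
        # End of input: close whatever is still open.
        out.append(f"</{stack.pop()}>")
    return " ".join(out)

print(close_nested("<body> a <div> b <div> c </body>"))
# <body> a <div> b <div> c </div> </div> </body>
```

The stack is what makes each unclosed element a child of the previous one: nothing is popped until an end tag for something further down the stack shows up.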
But if you get this
<body> a <p> b <p> c </body>
most browsers will convert it into this:
<body>
a
<p>
b
</p> <!-- inserted -->
<p>
c
</p> <!-- inserted -->
</body>
In this case a p element is closed when the next p element begins, or when a closing tag other than </p> is found whose opening tag lies before the opening <p> tag that is missing its closing tag. This algorithm does not produce nested elements of the same type; it produces siblings that are children of the same parent.
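The sibling-producing behaviour could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not how any browser implements it: the tokenizer is a naive regex, the names TOKEN, SIBLING_ONLY and close_siblings are hypothetical, and SIBLING_ONLY contains only p. Elements such as li follow a similar rule, but with scoping (an li may legitimately sit inside a nested list inside another li), which this sketch does not model.

```python
import re

# Naive tokenizer: start tag, end tag, or a run of text.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")

# Assumption: elements whose start tag closes an open element of the
# same name. Kept to "p" only; the real rules cover more elements.
SIBLING_ONLY = frozenset({"p"})

def close_siblings(html):
    out, stack = [], []

    def close_through(name):
        # Pop and close open elements until `name` itself is closed.
        while stack:
            top = stack.pop()
            out.append(f"</{top}>")
            if top == name:
                break

    for match in TOKEN.finditer(html):
        slash, name, text = match.groups()
        if text is not None:
            if text.strip():
                out.append(text.strip())
            continue
        name = name.lower()
        if slash:
            if name in stack:
                close_through(name)
            # Unmatched end tags are dropped.
        else:
            if name in SIBLING_ONLY and name in stack:
                # A new <p> ends the <p> that is still open.
                close_through(name)
            stack.append(name)
            out.append(match.group(0))
    while stack:
        out.append(f"</{stack.pop()}>")
    return " ".join(out)

print(close_siblings("<body> a <p> b <p> c </body>"))
# <body> a <p> b </p> <p> c </p> </body>
```

Elements outside SIBLING_ONLY still fall back to the nested behaviour, so the same function handles the <div> case too.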
To decide which algorithm should be used to close an element, I need to know which elements belong to which class.
Related
As far as I know, this is right:
<div>
<p>some words</p>
</div>
But this is wrong:
<p>
<div>some words</div>
</p>
The first one passes the W3C validator (XHTML 1.0), but the second doesn't. I know that nobody would write code like the second one. I just want to know why.
And what about the containment relationships of other tags?
An authoritative place to look for allowed containment relations is the HTML spec. See, for example, http://www.w3.org/TR/html4/sgml/dtd.html. It specifies which elements are block elements and which are inline. For those lists, search for the section marked "HTML content models".
For the P element, it specifies the following, which indicates that P elements are only allowed to contain inline elements.
<!ELEMENT P - O (%inline;)* -- paragraph -->
This is consistent with http://www.w3.org/TR/html401/struct/text.html#h-9.3.1, which says that the P element "cannot contain block-level elements (including P itself)."
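In the DTD syntax, the two characters after the element name are the tag-omission indicators: "-" means the tag is required and "O" means it may be omitted, first for the start tag and then for the end tag. A small Python sketch that pulls these out of a declaration (the regex and the function name parse_element_decl are hypothetical helpers, written only for single-line declarations like the one above):

```python
import re

# Assumption: one declaration per string, optional trailing "-- comment --".
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-O])\s+(?P<end>[-O])\s+"
    r"(?P<model>.+?)\s*(?:--.*?--)?\s*>",
    re.DOTALL,
)

def parse_element_decl(decl):
    """Split an HTML 4 DTD element declaration into its parts."""
    match = ELEMENT_DECL.search(decl)
    if match is None:
        raise ValueError(f"not an element declaration: {decl!r}")
    return {
        "name": match["name"],
        "start_tag_optional": match["start"] == "O",
        "end_tag_optional": match["end"] == "O",
        "content_model": match["model"],
    }

p_decl = parse_element_decl("<!ELEMENT P - O (%inline;)* -- paragraph -->")
print(p_decl["end_tag_optional"])   # True
print(p_decl["content_model"])      # (%inline;)*
```

So for P the start tag is required, the end tag may be left out, and the content is limited to whatever the %inline; entity expands to.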
In short, you cannot place a <div> element inside a <p> through HTML markup, because the opening <div> tag automatically closes the <p> element.
According to HTML5, the content model of div elements is flow content:
Most elements that are used in the body of documents and applications are categorized as flow content.
That includes p elements, which can only be used where flow content is expected.
Therefore, div elements can contain p elements.
However, the content model of p elements is phrasing content:
Phrasing content is the text of the document, as well as elements that
mark up that text at the intra-paragraph level. Runs of phrasing
content form paragraphs.
That doesn't include div elements, which can only be used where flow content is expected.
Therefore, p elements can't contain div elements.
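The reasoning above amounts to a lookup: each element has a content model, and a child is allowed only if it belongs to that model. A toy Python version, assuming tiny hand-picked subsets of the real categories (the sets and the can_contain helper are illustrative and far from complete):

```python
# Assumption: small illustrative subsets of the HTML5 content categories.
PHRASING = frozenset({"#text", "a", "em", "strong", "span", "img", "br"})
FLOW = PHRASING | frozenset({"div", "p", "ul", "table", "section"})

# Content model per element: which category its children must belong to.
CONTENT_MODEL = {
    "div": FLOW,
    "p": PHRASING,
    "span": PHRASING,
}

def can_contain(parent, child):
    """Return True if `child` fits the content model of `parent`."""
    if parent not in CONTENT_MODEL:
        raise ValueError(f"no content model recorded for <{parent}>")
    return child in CONTENT_MODEL[parent]

print(can_contain("div", "p"))    # True
print(can_contain("p", "div"))    # False
print(can_contain("div", "div"))  # True
print(can_contain("p", "p"))      # False
```

The last two lines show the same-tag case: div is flow content and accepts flow content, so it can nest in itself, while p is flow content but only accepts phrasing content.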
Since the end tag of p elements can be omitted when the p element is immediately followed by a div element (among others), the following
<p>
<div>some words</div>
</p>
is parsed as
<p></p>
<div>some words</div>
</p>
and the last </p> is an error.
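This parse can be replayed with Python's standard html.parser module, which only reports tags as events and leaves the implied-end-tag logic to the caller. The sketch below adds that logic by hand. The CLOSES_P set is a partial list chosen for illustration, the class name is made up, and void elements such as <br> are not treated specially:

```python
from html.parser import HTMLParser

# Assumption: partial list of start tags that close an open <p>.
CLOSES_P = frozenset({"address", "blockquote", "div", "dl", "form",
                      "h1", "h2", "h3", "ol", "p", "pre", "table", "ul"})

class ImpliedEndTags(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # names of currently open elements
        self.events = []   # human-readable log of what happened

    def _close_through(self, tag, note=""):
        # Close open elements until `tag` itself has been closed.
        while self.stack:
            top = self.stack.pop()
            self.events.append(f"</{top}>{note}")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            self._close_through("p", " (implied)")
        self.stack.append(tag)
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close_through(tag)
        else:
            self.events.append(f"</{tag}> (error: no matching start tag)")

parser = ImpliedEndTags()
parser.feed("<p>\n<div>some words</div>\n</p>")
parser.close()
print(parser.events)
# ['<p>', '</p> (implied)', '<div>', '</div>', '</p> (error: no matching start tag)']
```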
Look at this example from the HTML spec
<!-- Example of data from the client database: -->
<!-- Name: Stephane Boyera, Tel: (212) 555-1212, Email: sb#foo.org -->
<DIV id="client-boyera" class="client">
<P><SPAN class="client-title">Client information:</SPAN>
<TABLE class="client-data">
<TR><TH>Last name:<TD>Boyera</TR>
<TR><TH>First name:<TD>Stephane</TR>
<TR><TH>Tel:<TD>(212) 555-1212</TR>
<TR><TH>Email:<TD>sb#foo.org</TR>
</TABLE>
</DIV>
Did you notice something? There is no closing tag for the <p> element. A mistake in the spec? No.
Tip #1: The closing tag of <p> is OPTIONAL
You may ask: but then how does a <p> element know where to stop?
From w3docs:
If the closing tag is omitted, it is considered that the end of the paragraph matches with the start of the next block-level element.
In simple words: a <div> is a block element, and its opening tag causes the open <p> to be closed, so a <div> can never be nested inside a <p>.
But what about the inverse situation, you may ask?
Well...
Tip #2: The closing tag of the <div> element is REQUIRED
According to O’Reilly HTML and XHTML Pocket Reference, Fourth Edition (page 50)
<div> . . . </div>
Start/End Tags
Required/Required
That is, the end of a <div> element is determined only by its closing tag </div>, so a <p> element inside it will NOT close it.
After XHTML, the conventions changed, and they are now a mixture of XML and HTML conventions. That is why the second approach is wrong: the W3C validator accepts only markup that follows the standards and conventions.
Because the div tag has higher precedence than the p tag. The p tag represents a paragraph, whereas the div tag represents a division (a generic section) of the document.
You can write many paragraphs in a section of a document, but you can't write a section in a paragraph, just as in a DOC file.
Let's say I want to create a simple responsive one-page homepage. I have found several alternatives for doing this, but what is the best option? I have read several articles on the net, including the ones from W3C, but I don't get a clear answer!
I'm going to have a two-column layout with text on the left and an image on the right. On a desktop computer they will be beside each other, styled left and right. But on smaller devices like a mobile phone, the right column will move to the left and be placed below the text column.
Is alternative 1 bad from an HTML5 point of view? My thought was to divide the page into several parts using alternative 1 or 2. There is also a third alternative (I guess there are almost endless other options as well): use two article elements inside the section element, with an article element for the image instead of the aside element.
I guess some of you might also suggest using article elements instead of section elements, with nested articles. It's confusing with all these options!
Should I also use the article and header elements in alternative 1?
I'd appreciate some feedback and guidelines! Sorry for all my questions, I just want to improve my coding skills!
Alternative 1:
<div id="intro">
<div class="content-left">
<h2>Headline</h2>
<p>Text</p>
</div><!-- end class content-left -->
<div class="content-right">
<img src="...."/>
</div><!-- end class content-right -->
</div><!-- end id intro -->
Alternative 2 with HTML5 elements:
<section id="intro">
<article>
<header>
<h1>Headline</h1>
</header>
<p>Text</p>
</article>
<aside>
<img src="...."/>
</aside>
</section>
The answer is: it doesn't really matter much, apart from code readability. Please see Why use HTML5 tags? for more on that.
You could have a <section class="articles"> that contains all <article> elements. You could have a <div class="articles"> that contains all <div class="article"> elements. I think it's safe to say there's no doubt the first one is easier to read for developers. Your pick.
There is, however, one issue: you self-close <img>; there is no need for that in HTML5. See Are (non-void) self-closing tags valid in HTML5?.
In HTML 5, <foo /> means <foo>, the start tag. It is not a "self-closing tag". Instead, certain elements are designated as having no end tag, for example <br>. These are collectively called void elements. The slash is just syntactic sugar for people who are addicted to XML. Using the slash in a non-void element tag is invalid, but browsers parse it as the start tag anyway, leading to a mismatch in end tags.
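To find such slashes in existing markup, one option is Python's standard html.parser, which calls handle_startendtag for any tag written with a trailing "/>". The sketch below sorts those tags into void and non-void. The VOID_ELEMENTS set is the list of void elements as I know it from the HTML Living Standard, while the class, the function name and the verdict strings are made up for illustration:

```python
from html.parser import HTMLParser

VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

class SlashChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.notes = []   # (tag name, verdict) for every "<tag ... />"

    def handle_startendtag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            self.notes.append((tag, "void: slash is harmless"))
        else:
            self.notes.append((tag, "non-void: treated as a start tag"))

def check_slashes(html):
    checker = SlashChecker()
    checker.feed(html)
    checker.close()
    return checker.notes

print(check_slashes('<img src="a.png" /><div />text</div><br>'))
```

The plain <br> at the end is not reported because it has no slash at all, so the parser hands it to handle_starttag instead.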
My HTML is as below. I have opened all elements and closed them. Still, when I check it with the W3C validator, it shows an error. I can't figure it out.
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Untitled Document</title>
</head>
<body>
<p>
<div class="inr_content clearfix">
<div class="col2 first fl">
to provide a drive-in services.
</div>
<div class="col2 last fr">
to provide a drive-in services.
</div>
</div>
</p>
</body>
</html>
That's because you are nesting a block-level element inside the p tag, which is invalid. You can only nest inline elements such as span, a, and img inside a p tag. So your markup is invalid; consider something like this instead:
<div class="inr_content clearfix">
<div class="col2 first fl">
<p>to provide a drive-in services.</p>
</div>
<div class="col2 last fr">
<p>to provide a drive-in services.</p>
</div>
</div>
From W3C [1]:
The P element represents a paragraph. It cannot contain block-level
elements (including P itself).
[1] Reference
Since the syntax of the p element does not allow a div child, and the end tag </p> may be omitted, the validator (and a browser) implies </p> when it encounters a <div> tag when parsing a p element. That is, when p is being parsed (or “is open”), the start tag of a div element implicitly closes it, as if the markup had:
<body>
<p>
</p><div class="inr_content clearfix">
<div class="col2 first fl">
This means that there is a p element with only whitespace in it. The </p> tag that appears later thus has no matching start tag, and it is reported as invalid. Browsers ignore such homeless end tags, but validators have to report them.
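A rough Python sketch of how a checker might locate such a homeless end tag, using the standard html.parser and its getpos() method to get the line number. The VOID and CLOSES_P sets are small illustrative subsets, and the class and function names are invented here; a real validator follows the full parsing rules of the spec:

```python
from html.parser import HTMLParser

# Assumption: small illustrative subsets, not the full lists from the spec.
VOID = frozenset({"br", "hr", "img", "input", "link", "meta"})
CLOSES_P = frozenset({"div", "p", "ul", "ol", "table", "h1", "h2", "h3"})

class StrayEndTagReporter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # names of currently open elements
        self.problems = []   # (line number, tag name) of unmatched end tags

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return           # void elements never stay open
        if tag in CLOSES_P and "p" in self.stack:
            # The start tag implies </p>: drop everything up to the open p.
            while self.stack.pop() != "p":
                pass
        self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # "<tag />" counts as a plain start tag here.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass
        else:
            line, _column = self.getpos()
            self.problems.append((line, tag))

def stray_end_tags(html):
    reporter = StrayEndTagReporter()
    reporter.feed(html)
    reporter.close()
    return reporter.problems

doc = "<body>\n<p>\n<div>\ntext\n</div>\n</p>\n</body>\n"
print(stray_end_tags(doc))
```

For this input the only entry reported is the </p>, because the p element was already closed by the time the <div> was opened.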
The minimal change is to remove the </p> tag. Whether this is adequate depends on what you want. Removing the <p> tag as well would remove the p element, and this would affect rendering. Even though the p element has no content rendered (content height is 0), it has default top and bottom margins, possibly creating some empty vertical space.
If you do not want such space, just remove the <p> tag (along with </p> of course). If you do want some space, it is usually still best to remove the tag, but then you would additionally set, in CSS, some suitable margin-top value on the top-level div element.
Even though p elements containing only whitespace are allowed in HTML5, they are not recommended. This is part of the general recommendation related to so-called palpable content: “elements whose content model allows any flow content or phrasing content should have at least one node in its contents that is palpable content”. And text is usually palpable content, but not if it consists of whitespace only.
You can't semantically put a div inside a <p> tag.
I have always used either a <br /> or a <div/> tag when something more advanced was necessary.
Is use of the <p/> tag still encouraged?
Modern HTML semantics are:
Use <p></p> to contain a paragraph of text in a document.
Use <br /> to indicate a line break inside a paragraph (i.e. a new line without the paragraph block margins or padding).
Use <div></div> to contain a piece of application UI that happens to have block layout.
Don't use <div /> or <p /> on their own. Those tags are meant to contain content. They appear to work as paragraph breaks only because, when the browser sees them, it "helpfully" closes the current block tag before opening the empty one.
A <p> tag wraps around something, unlike an <input/> tag, which is a singular item. Therefore, there isn't a reason to use a <p/> tag.
I've been told that im using <br /> when i should use <p /> instead. – maxp
If you need to use <p> tags, I suggest wrapping the entire paragraph inside a <p> tag, which will give you a line break at the end of the paragraph. But I don't suggest just substituting something like <p/> for <br/>.
<p> tags are for paragraphs and signifying the end of a paragraph. <br/> tags are for line breaks. If you need a new line then use a <br/> tag. If you need a new paragraph, then use a <p> tag.
Paragraph is a paragraph, and break is a break.
A <p> is like a regular Return in Microsoft Office Word.
A <br> is like a soft return, Shift + Return in Office Word.
The first one applies all paragraph settings/styles, and the second one merely breaks a line of text.
Yes, <p> elements are encouraged and won't get deprecated any time soon.
A <p> signifies a paragraph. It should be used only to wrap a paragraph of text.
It is more appropriate to use the <p> tag for this as opposed to <div>, because this is semantically correct and expected for things such as screen readers, etc.
Using <p /> has never been encouraged:
From the XHTML 1.0 HTML Compatibility Guidelines:
C.3. Element Minimization and Empty Element Content
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
From the HTML 4.01 Specification:
We discourage authors from using empty P elements. User agents should ignore empty P elements.
While they are syntactically correct, empty p elements serve no real purpose and should be avoided.
The HTML DTD does not prohibit you from using an empty <p> (a <p> element may contain PCDATA including the empty string), but it doesn't make much sense to have an empty paragraph.
Use it for what? All tags have their own little purpose in life, but no tag should be used for everything. Find out what you are trying to make, and then decide on what tag fits that idea best:
If it is a paragraph of text, or at least a few lines, then wrap it in <p></p>
If you need a line break between two lines of text, then use <br />
If you need to wrap many other elements in one element, then use the <div></div> tags.
The <p> tag defines a paragraph. There's no reason for an empty paragraph.
For any practical purpose in HTML, you don’t need to add the </p> to your markup. But if there is a strict XHTML adherence requirement, then you would need to close all your tags, including <p>; an XHTML validator would report a missing </p> as an error.