Say I'm building a typical document editor:
Where the preview (in red) is an up-to-date, formatted view of the form's data.
The preview element contains semantic elements (e.g. h1, h2, main, header, etc.). It's kind of a document in itself, which does make sense, conceptually. But this makes the structure of the real document quite confusing for crawlers and screen readers. There might be, for instance, two h1 or main elements. I'm looking for a way to avoid that.
Plus, there's the problem of repetitive content (see image).
For the accessibility part of the problem, I could just add an aria-hidden="true" attribute to the preview element. Visually impaired people don't need the preview; it's redundant for them, and they only need the form.
But for crawlers, here are my options:
Don't use semantic elements inside the preview element, use divs instead (😥).
Host the preview at another URL and insert it via an iframe (that's what I'm doing right now, but it seems hacky to me).
Leave it like that, crawlers don't care.
Any idea/resource/suggestion?
As long as your preview area is clearly indicated for assistive technology, it's perfectly fine to have redundant information. If you have an <iframe>, make sure there's a title attribute on it.
<iframe title="preview area"...>
However, you might have validator issues with multiple structure elements.
For example, HTML only allows one <main> element:
A document must not have more than one main element that does not have the hidden attribute specified.
You can have multiple <header> elements but a <header> has a default role of banner and the banner role says:
Within any document or application, the author SHOULD mark no more than one element with the banner role.
The key here is "should", meaning it's a strong recommendation but not required. You can also get away with multiple banner roles if your preview section has role="document".
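The single-main rule quoted above can be checked mechanically before a validator complains. The following is only a minimal sketch using Python's standard html.parser; the function name count_visible_main is made up for this illustration. It looks only at the markup of one document, so a preview served inside an iframe is not counted, because it is a separate document.

```python
from html.parser import HTMLParser


class MainCounter(HTMLParser):
    """Counts <main> elements that do not carry the hidden attribute."""

    def __init__(self):
        super().__init__()
        self.visible_main = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare `hidden` has value None.
        if tag == "main" and "hidden" not in dict(attrs):
            self.visible_main += 1


def count_visible_main(html):
    """Hypothetical helper: number of <main> elements without `hidden`."""
    parser = MainCounter()
    parser.feed(html)
    parser.close()
    return parser.visible_main


if __name__ == "__main__":
    page = "<main>form</main><section><main>preview</main></section>"
    # More than one means the document breaks the rule quoted above.
    print(count_visible_main(page))
```

A result greater than 1 means the editor page and its inline preview together violate the rule, which is one argument for the iframe or the "show in new tab" approach.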
I would recommend not using non-semantic elements (div) because an assistive technology user might want to check the actual semantic structure of what's generated, although I suppose you could also have a "show in new tab" option for the preview that uses all full semantics, kind of like your second bullet but not using an iframe.
After reading thousands of posts, questions, blog articles and opinions, I'm still a bit confused about how to mark up a web page with microdata. If the main purpose of microdata is to help search engines better understand the content of a web page (and the web page is assumed implicitly), is it correct to start with itemtype WebPage on the body element, and then continue to mark up the rest of the nested elements, defining which one is the main entity? Or is it better to start with an itemtype that is ideally the main topic of the web page and associate properties at the top level? Or is it better to have different itemtypes at the top level (i.e. web page, blog post and main topic of the page)?
An example will explain my question better: if I have to mark up a web page that contains a blog post about a specific topic (let's say about wireless technology), what should be the item at the top level? Should it be WebPage, BlogPosting, or wireless technology?
The more the better (with exceptions)
When it comes to structured data, the guideline should be, in the typical case: the more the better. If you provide more structured data (i.e., you make things explicit instead of keeping them implicit), the chance is higher that a consumer finds something it can make use of.
Reasons not to follow this guideline might include:
You know exactly which consumers you want to support, and what they look for, and you don’t care about other (e.g., unknown or new) consumers.
You know that a consumer is buggy in a way that it can’t cope with certain structures.
You need to save as many characters as possible (bandwidth/performance).
It’s too complex/expensive to provide additional structured data.
The structured data is most likely useless to any conceivable consumer.
…
What WebPage offers
So unless you have a reason not to, it’s probably a good idea to provide the WebPage type … if you can provide possibly interesting data. For example:
It allows you to provide different URIs for the page and the thing(s) on the page, or what the page represents, like a person, a building, etc. (see why this can be useful and a slightly more technical answer with details).
hasPart allows you to connect items which might otherwise be top-level items, for which it wouldn’t necessarily be clear in which relation they are.
isPartOf allows you to make this WebPage part of something else (e.g., of the website if you provide a WebSite item, or of a CollectionPage).
You have breadcrumbs on the page: use breadcrumb to make clear that they represent the breadcrumbs for this page.
You provide accessibility information: use accessibilityAPI, accessibilityControl, accessibilityFeature, accessibilityHazard.
The author/contributor/copyrightHolder/editor/funder/etc. of the page is different from the author/… of e.g. the page’s main content.
The page has a different license than some of the parts included in the page.
You provide actions that can be done on/with the page: use potentialAction.
…
Of course it also allows you to use mainEntity, but if this were the only thing you need the WebPage item for, you could as well use the inverse property mainEntityOfPage.
More specific WebPage types
And the same is true for the more specific types, which give additional signals:
AboutPage if it’s a page about e.g. the site, you, or your organization.
CheckoutPage if it’s the checkout page in a web shop.
CollectionPage if it’s a page about multiple things (e.g., a pagination page listing blog posts, a gallery, a product category, …).
ContactPage if it’s the contact page.
ItemPage if it’s about a single thing (e.g., a blog posting, a photograph, …).
ProfilePage e.g. for user profiles.
QAPage if it’s … well, this very page.
SearchResultsPage for the result pages of your search function.
…
Your example
Your three cases are:
<!-- A - only the topic -->
<div itemscope itemtype="http://schema.org/Thing">
<span itemprop="name">wireless technology</span>
</div>
<!-- B - the blog post + the topic -->
<div itemscope itemtype="http://schema.org/BlogPosting">
<div itemprop="about" itemscope itemtype="http://schema.org/Thing">
<span itemprop="name">wireless technology</span>
</div>
</div>
<!-- C - the web page + the blog post + the topic -->
<div itemscope itemtype="http://schema.org/ItemPage">
<div itemprop="mainEntity" itemscope itemtype="http://schema.org/BlogPosting">
<div itemprop="about" itemscope itemtype="http://schema.org/Thing">
<span itemprop="name">wireless technology</span>
</div>
</div>
</div>
A conveys: there is something with the name "wireless technology".
B conveys: there is a blog post about "wireless technology".
C conveys: there is a web page that contains a single blog post (as main content for that page) about "wireless technology".
While I wouldn’t recommend using A, using B is perfectly fine and probably sufficient for most use cases. While C already provides more details than B (namely that the page is for a single thing, and that this thing is the blog post, and not some other item that might also be on the page), it’s probably not needed for such a simple case. But this changes as soon as you can provide more data, in which case I’d go with C.
I'm pretty inexperienced as far as html goes and even less so with html5.
I have a question regarding modal popups - page sections that are interacted with using javascript/ajax, but not necessarily displayed on the page all the time. These are not generally in the main html flow - I might for instance place all my modal code at the end of the page for maintainability. The question is - should I be declaring these chunks of the page using html section tags, or something else?
To shed more light on the situation I'm describing, I have an application page. This contains a number of sections (I'm not referring to html5 here). The first section is modal on entering the page - it's a "click to continue if you agree" section. The next 5 chunks belong to a stepped application form - each step is displayed one at a time using a multiview control. Then another modal - a UI block, followed by a final decision section.
Since they are modal, and appear out of the flow, it is probably most suitable to use a div for them. If you do want to use a semantic block, then which you use will depend on what the content is, and how it relates to the rest of the page. The following articles should help you make that decision:
http://html5doctor.com/the-section-element/
http://html5doctor.com/the-article-element/
http://html5doctor.com/avoiding-common-html5-mistakes/ (particularly the first section of that article - "Don’t use section as a wrapper for styling")
The question is - should I be declaring these chunks of the page as sections, or something else
One of the big advantages of HTML5 is that it's semantically readable. If you feel that your modal pop-ups are better described by something like an article tag, then use an article. Use the tag you feel most accurately describes your functionality.
For example, let's say I have a sample page like so:
<html>
<head></head>
<body>
<article>
<!-- Some stuff here -->
</article>
</body>
</html>
I would expect the content of that article tag to fit this definition:
The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication. This could be a forum post, a magazine or newspaper article, a blog entry, a user-submitted comment, an interactive widget or gadget, or any other independent item of content.
W3C Specification. The Article Element.
Note: In this context, an article is designed to represent flow content. Given that your aim is not to write flow content (as you correctly put) this is not a good example. This is very clear from the definition I've provided.
Similarly, if I replaced article with section, I would expect it to fit this definition:
Examples of sections would be chapters, the various tabbed pages in a tabbed dialog box, or the numbered sections of a thesis. A Web site's home page could be split into sections for an introduction, news items, and contact information.
W3C Specification. The Section Element
If I were you I would have a look through the spec and think about the following questions:
What does my content actually mean to the user?
How will my content appear to other programmers?
Does the use of this content give me a hint at the correct semantics?
It depends what you have in your modal.
You could have a login form, subscribe stuff, advertisements, articles, or a frame of another page, so it would only make sense to use <section> if the modal is actually an interesting section of the page. For example, if you have an article and want to display the author info in a modal box, then I would say that it would be acceptable to use <section>.
So overall, if it is part of the content then it sounds OK to use that; if it is not, you should use a <div>.
I would also say that no one has the answer for this as it is purely opinionated, and quite frankly doesn't matter.
There is also another way to incorporate modals. As they are dependent on JavaScript, you could also load the popup contents via AJAX without having them in the document flow. A recent project I worked on first renders links to a normal and complete HTML page for popup contents (e.g. contact forms). If JS is enabled, a parameter is added to the links to load only the main content, without header, menu and sidebars, via AJAX.
As the modal content does not really belong to your site content (if it does, it shouldn't be a popup but within the document's main content), it shouldn't get marked up with a section, main or article tag. Instead use a div to render the popups, or use an iframe if that is admissible for your project.
It doesn't really matter what tag is used for a modal, as long as it's appropriate for the purpose (don't use a <fieldset>, for example). Usually we see a <div> representing a modal.
You can use the role attribute for semantic information about the purpose of an element. In this case role="dialog" would be appropriate. You can find more info on the role attribute in HTML5 here.
Also note ARIA attributes, which enhance accessibility. For example, aria-hidden="true" removes the element from what is exposed to assistive technology, so screen readers skip its content.
I am using the section tag to group topics and replies on the forum page. In cases where I need to load the topic and its replies on another article page, I use a div tag for the same block and change the topic title from h1 to h2. This is valid, but will it make navigating a bit confusing for assistive technology?
Assuming that the assistive technology you are talking about concerns mainly screenreaders, the best way for you to know how accessible your pages are is by downloading one yourself and testing it out. A free screenreader that I have used to do this is called NVDA but there are more out there.
In general, screenreaders work best when a page has a logical structure behind it. If you are displaying several articles, make sure that each article is located in a similar hierarchical location on the page and that each article itself resembles the others in terms of its structure. Using HTML5 semantic tags like article, aside and the like can be helpful, but they are not necessary. Screenreaders and other assistive technologies have made do for a lot longer than these tags have been around. They are certainly good to use when possible, but there are other, more important ways to make your page accessible to as wide an audience as possible.
Another good thing to do is to use heading tags (h1 to h6) for titles, and to use them in order. Screenreaders often give users the option to skip from heading to heading in order to get a summary of what is on the page. You can also include visually hidden links (placed far off the edge of the page using CSS) at the top of the page, or in sections where placing a heading may not be appropriate visually. These will be read in context by screenreaders without your sighted users seeing them.
If you are concerned about accessibility, a good way to get a clearer picture of how accessible your pages are is by following the WCAG (Web Content Accessibility Guidelines) standard recommendations. WCAG is managed by the W3C, and has various levels of accessibility that you can consider respecting when developing your content. The W3C has a list of validators that can be found here.
To answer your question from comments:
How it sound when read a topic title as h2, click it, then arrive the forum page and this topic title become h1?
This shouldn't confuse most people, especially if you do it consistently. I am assuming that you are making a news-like site.
Above, Levi mentioned article tags. I would recommend using them if you have multiple stories per page. The div tag is roughly the garbage can of the HTML world; you should only use it when nothing else is available. Article tags give your code better semantic value, and they also have another feature, called a role. Roles allow a person using a screen reader to jump around a page, like they can with heading tags.
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with well-formed markup, and terrible markup. Whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to implement, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique contents, so that these nodes are prioritized for other pages.
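The comparison step above can be sketched roughly as follows. This is a simplified illustration, not a full implementation of the idea: it compares normalized text blocks instead of walking real DOM trees, it takes the pages as already-downloaded HTML strings instead of following links, and the names text_blocks, unique_blocks and main_candidate are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags whose boundaries end the current text block (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "td", "article", "section",
              "header", "footer", "nav", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockTextParser(HTMLParser):
    """Splits a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def text_blocks(html):
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_blocks(pages):
    """pages maps URL -> HTML. Keeps only blocks that occur on a single page."""
    blocks = {url: text_blocks(html) for url, html in pages.items()}
    seen_on = Counter()
    for page_blocks in blocks.values():
        seen_on.update(set(page_blocks))  # count pages, not occurrences
    return {url: [b for b in page_blocks if seen_on[b] == 1]
            for url, page_blocks in blocks.items()}


def main_candidate(pages, url):
    """Largest text block on `url` that no other page of the site repeats."""
    return max(unique_blocks(pages)[url], key=len, default="")
```

Navigation, footers and other template text show up on several pages and are dropped; what remains is ranked by length, matching the "largest unique text blocks" step.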
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element such as a div as a document), gather some properties of each section, and convert them to a vector. (As other people suggested, these could be the number of words, number of links, number of images; the more the better.)
First start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
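As a rough illustration of the "convert it to a vector" step only, the sketch below turns every div into a [words, links, images] vector using the standard library. The choice of these three features and the name div_vectors are assumptions made for the example; training an SVM or a Bayesian model on such vectors is left to whatever machine-learning library you prefer.

```python
from html.parser import HTMLParser


class DivFeatures(HTMLParser):
    """Builds one [words, links, images] vector per <div>, nested content included."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # vectors of the divs currently open
        self.vectors = []    # one vector per div, in the order the divs open

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            vector = [0, 0, 0]
            self.vectors.append(vector)
            self.open_divs.append(vector)
        elif tag == "a":
            for vector in self.open_divs:
                vector[1] += 1
        elif tag == "img":
            for vector in self.open_divs:
                vector[2] += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        words = len(data.split())
        for vector in self.open_divs:
            vector[0] += words


def div_vectors(html):
    """Hypothetical helper: feature vectors for all divs in a page."""
    parser = DivFeatures()
    parser.feed(html)
    parser.close()
    return parser.vectors
```

Each vector, paired with a hand-assigned label such as "main" or "not main", would form one training example.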
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
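One possible reading of that heuristic, as a sketch: credit each piece of text to its innermost div, discard divs where too much of the text sits inside links, and pick the one with the most text left. The 0.3 threshold and the names best_div and max_link_density are arbitrary choices for this illustration, not established values.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Collects, for every <div>, the text that sits directly inside it."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # stack of divs currently open, innermost last
        self.all_divs = []   # every div seen, in document order
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div = {"attrs": dict(attrs), "parts": [], "chars": 0, "link_chars": 0}
            self.open_divs.append(div)
            self.all_divs.append(div)
        elif tag == "a":
            self.link_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth or not self.open_divs:
            return
        div = self.open_divs[-1]            # credit text to the innermost div only
        size = len("".join(data.split()))   # count non-whitespace characters
        div["parts"].append(data)
        div["chars"] += size
        if self.link_depth:
            div["link_chars"] += size


def best_div(html, max_link_density=0.3):
    """Returns the div with the most text among those with few links, or None."""
    parser = DivCollector()
    parser.feed(html)
    parser.close()
    candidates = [d for d in parser.all_divs
                  if d["chars"] and d["link_chars"] / d["chars"] <= max_link_density]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d["chars"])
    return {"attrs": best["attrs"],
            "text": " ".join("".join(best["parts"]).split())}
```

Crediting text to the innermost div is a deliberate choice here: counting nested text cumulatively would make the outermost page wrapper win every time.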
The content area is almost always the area with the greatest width on the page.
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (ie h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
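A small sketch of this idea, assuming the page has already been split into (tag, text) pairs by some other step. The four-word minimum and the names has_sentence and content_span are assumptions made for the example.

```python
import re

# A run of non-terminal characters followed by ., ! or ? and then whitespace or the end.
SENTENCE_RE = re.compile(r"([^.!?]+)[.!?](?=\s|$)")
HEADING_RE = re.compile(r"h[1-6]")


def has_sentence(text, min_words=4):
    """True if the text contains a punctuated sentence of at least min_words words."""
    return any(len(match.group(1).split()) >= min_words
               for match in SENTENCE_RE.finditer(text))


def content_span(blocks, min_words=4):
    """blocks is a list of (tag, text) pairs in document order.

    Returns everything from the first to the last block containing a sentence,
    plus a heading that sits immediately before the first such block.
    """
    hits = [i for i, (_, text) in enumerate(blocks) if has_sentence(text, min_words)]
    if not hits:
        return []
    start, end = hits[0], hits[-1]
    if start > 0 and HEADING_RE.fullmatch(blocks[start - 1][0]):
        start -= 1  # keep the Hn element right before the first sentence
    return blocks[start:end + 1]
```

The word minimum is what keeps short punctuated fragments such as copyright notices from being mistaken for article text.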
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites use a blogging platform.
So I would create a set of rules by which I would search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could turn to the other solutions, such as identifying the biggest chunk of text.
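The rule-based lookup could look something like this sketch. The two class names come from the examples above; real themes vary, so treat the list as a starting point. The name extract_by_class is invented here, and a None result is the signal to fall back to another heuristic.

```python
from html.parser import HTMLParser

# Class names taken from the examples above; extend per platform as needed.
KNOWN_CONTENT_CLASSES = ("entry", "post-body")


class ClassExtractor(HTMLParser):
    """Collects the text of the first <div> whose class list contains a wanted name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0            # > 0 while inside the matching div (counts nesting)
        self.parts = []
        self.matched_class = None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1       # nested div inside the match
            return
        if self.matched_class is None:
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name in self.wanted:
                    self.matched_class = name
                    self.depth = 1
                    break

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_by_class(html, wanted=KNOWN_CONTENT_CLASSES):
    """Returns the matched div's text, or None if no rule applies."""
    parser = ClassExtractor(wanted)
    parser.feed(html)
    parser.close()
    if parser.matched_class is None:
        return None
    return " ".join("".join(parser.parts).split())
```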
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the Readability.js mentioned above.