Multi-level numbered headings in HTML5 for outlining the document

So I am writing a manual in HTML5, and it's going to need numbering.
The headings will need to be numbered, e.g. "Section 4: Some Stuff".
Some subheadings will need to be numbered, e.g. "4.01: the first point
you need to know about some stuff".
Just to be difficult, the manual will have tables and images, so those will need to be numbered too, e.g.
"Fig 4.03 A cat. Most of the images on the internet are of cats."
Also, there are lots of process lists in the manual. It would be nice if these were numbered under the subheadings, e.g.
4.05 A simple process
4.05.01 Pull a leaf from the tree
4.05.02 Eat it
4.05.03 Now you are a caterpillar
4.05.04 Turn into a beautiful butterfly
I've been researching the different ways to number my headings, subheadings, figures, and lists. I'm finding answers, just not good answers.
Imperfect solution 1: use CSS counters.
The generated numbers can't be copied into editing programs (Word etc.).
They also apparently don't work with screen readers.
Imperfect solution 2: use ordered lists.
These won't 'fail gracefully' AFAIK: if all my headings are a 'heading' class of ordered list, they will just look like a plain list without CSS.
Has someone solved this problem already? What's the solution?
Super extra kudos for anyone who can supply a smart way of auto-updating my figure cross-references!

Use text.
<h3>4.01: the first point you need to know about some stuff</h3>
The numbering is not just styling (you might want to reference these numbers, right?), so a CSS solution is out of the question.
Using ol can work in some cases, but it has many drawbacks:
You don't want to use an ol for your whole document, do you?
User agents don't have to render any numbers at all.
Many user agents will not let you search for or copy the numbers.
You can't get the exact kind of numbering scheme you want (e.g., nested ol typically don't render a delimiter like "." but start again at the first value).
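If maintaining those literal numbers by hand becomes tedious, a middle ground is to generate them with a script at build time (or on page load), so they still end up as real, copyable text. Here is a minimal sketch of the numbering logic; all names are illustrative, and in a browser the heading data would come from document.querySelectorAll('h1, h2'), with the labels written back into each heading's textContent:

```javascript
// Generate "Section 4" / "4.01" style labels for a flat list of headings.
// Input: [{ level: 1|2, text: '...' }, ...] in document order.
function numberHeadings(headings) {
  let section = 0; // current h1 counter
  let sub = 0;     // current h2 counter within the section
  return headings.map(h => {
    if (h.level === 1) {
      section += 1;
      sub = 0;
      return `Section ${section}: ${h.text}`;
    }
    sub += 1;
    const padded = String(sub).padStart(2, '0'); // zero-padded, "4.01" style
    return `${section}.${padded}: ${h.text}`;
  });
}
```

Running something like this at build time sidesteps both objections from the question: by the time the page ships, the numbers are ordinary text, so they survive copy-paste and are read by screen readers.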

Related

Filtering (hide or show) schema in HTML5 only

Using HTML5 only, without CSS or anything else, is it possible to have a sentence like this,
She loves him.
be schema'd with HTML5 so that the word "She" is tagged with metadata such as "Subject", "loves" is tagged with "verb", "him" is tagged with "object", and all of it together including the period, "She loves him." is tagged with "complete sentence".
Then, a user could filter what they want to display - elements tagged as Subjects, verbs, objects, or complete sentences... or mixtures of these.
For example, maybe you want to see all complete sentences and all objects, regardless if they are in a complete sentence or not.
Another example, if you had a bunch of sentences on a webpage, you could quickly filter to see only the verbs.
I'm looking for a way to accomplish this - a game plan for how to structure the tags. If I use divs, will that be robust enough to let me tag complete sentences and the individual words inside them, or is there a cleaner way, a more concise way, etc?
You're not going to be able to do this with just HTML. You're going to need JavaScript to show and hide those elements, and CSS to make it look good.
A way you could get started is to put spans around different parts of the sentence.
<span class="full-sentence"><span class="subject">She</span> <span class="verb">loves</span> <span class="object">him</span>.</span>
Then you need to look into using jQuery to show and hide the different parts of the sentence.
If you don't want to use JavaScript, you could alternatively use CSS on the span classes to color the subject, verb, and object.
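The filtering itself reduces to selecting parts by role. Here is a testable sketch of that selection logic, with plain objects standing in for the spans (the names are illustrative; in the browser you would hide and show the matching elements instead):

```javascript
// Each tagged word becomes an object; in the browser these would be the
// spans above, and hiding would be el.style.display = 'none' (or jQuery's
// .hide()/.show()).
const sentence = [
  { role: 'subject', text: 'She' },
  { role: 'verb', text: 'loves' },
  { role: 'object', text: 'him' },
];

// Return only the parts whose roles the user asked to display.
function visibleParts(parts, wantedRoles) {
  return parts.filter(p => wantedRoles.includes(p.role)).map(p => p.text);
}
```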

Paragraphs vs ordered lists

I have coded a "terms and conditions" page with numbered paragraphs as follows:
<p>1. content</p>
<p>2. more content</p>
<p>3. even more content</p>
as opposed to:
<ol>
<li>content</li>
<li>more content</li>
<li>even more content</li>
</ol>
I have been told by someone that this is extremely bad practice and is generally wrong.
Now, my question to you lot is: why?
This still validates and it is still picked up by search engines etc.
Am I being stupid or is the other person just being picky?
It's mainly a question of presentation versus contents.
The first approach makes you maintain the numbering as contents, whereas the second approach displays the numbering by itself, also allowing you to customize the presentation via CSS.
If you have to modify the text, for example to add or remove paragraphs, using explicit numbers will force you to edit every paragraph to increment or decrement the numbers, possibly causing errors.
I can't be sure about SEO, but if it's important that the numbering appears in Google results, for instance if people search for legal article paragraphs by their official numbers, then it's important that the number appears explicitly in the text, since the ordered list doesn't specify the number format the crawler should understand.
It comes down to semantics. Yes, the other person was being picky. Yes, what you were doing is not best practice. No, what you're doing is not technically WRONG, but you can ADD VALUE to the content by using the ordered list.
Since the ordered list is a tag, it is by definition machine-readable, without any special parsing. Using plain-text "1.", "2.", etc. embedded within paragraph content is not immediately exposed to whatever machine readers / code / other automation might want to work with your content.
The OL identifies that section of your content as an ordered list. That's adding (a little) value to the content because now it's providing a little bit of meta-data about what KIND of content it is. It's a numbered list.
I know, you're saying so what?
Well, let's suppose this content is being "read" by a blind person using screen-reader software, where the computer reads the HTML page, parses the content, and then speaks it to the user. The user may have enabled a setting saying: okay, for every ordered list, I want you to read me one item at a time and then pause and wait for me to say "next". Or whatever, just an arbitrary example.
Basically, by providing additional (semantic) information about your content, you are enabling software, devices and other processes to work with your content in more meaningful, contextual ways.
I highly doubt it has any effect on your rankings.
You should use an ordered list because that is what it was made for. Use a paragraph for paragraphs and unordered lists for lists of things with no order, etc...
The ordered list is semantically correct: the markup describes the list as a list. In the stack-of-paragraphs example, the markup is semantically meaningless.
Personally, I find semantic markup easier to style, scan, and maintain (SirDarius explained the last point perfectly.)

What does "semantically correct" mean?

I have seen it a lot in CSS talk. What does semantically correct mean?
Labeling correctly
It means that you're calling something what it actually is. The classic example is that if something is a table, it should contain rows and columns of data. To use that for layout is semantically incorrect - you're saying "this is a table" when it's not.
Another example: a list (<ul> or <ol>) should generally be used to group similar items (<li>). You could use a div for the group and a <span> for each item, and style each span to be on a separate line with a bullet point, and it might look the way you want. But "this is a list" conveys more information.
Fits the ideal behind HTML
HTML stands for "HyperText Markup Language"; its purpose is to mark up, or label, your content. The more accurately you mark it up, the better. New elements are being introduced in HTML5 to more accurately label common web page parts, such as headers and footers.
Makes it more useful
All of this semantic labeling helps machines parse your content, which helps users. For instance:
Knowing what your elements are lets browsers use sensible defaults for how they should look and behave. This means you have less customization work to do and are more likely to get consistent results in different browsers.
Browsers can correctly apply your CSS (Cascading Style Sheets), describing how each type of content should look. You can offer alternative styles, or users can use their own; as long as you've labeled your elements semantically, rules like "I want headlines to be huge" will be usable.
Screen readers for the blind can help them fill out a form more easily if the logical sections are broken into fieldsets with one legend for each one. A blind user can hear the legend text and decide, "oh, I can skip this section," just as a sighted user might do by reading it.
Mobile phones can switch to a numeric keyboard when they see a form input of type="tel" (for telephone numbers).
Semantics basically means "The study of meaning".
Usually when people are talking about code being semantically correct, they're referring to the code that accurately describes something.
In (x)HTML, there are certain tags that give meaning to the content they contain. For example:
An H1 tag describes the data it contains as a level-1 heading. An H2 tag describes the data it contains as a level-2 heading. The implied meaning behind this is that each H2 under an H1 is in some way related (i.e. heading and subheading).
When you code in a semantic way, you basically give meaning to the data you're describing.
Consider the following 2 samples of semantic VS non-semantic:
<h1>Heading</h1>
<h2>Subheading</h2>
VS a non-semantic equivalent:
<p><strong>Heading</strong></p>
<p><em>Subheading</em></p>
Sometimes you might hear people in a debate saying "You're just talking semantics now" and this usually refers to the act of saying the same meaning as the other person but using different words.
"Semantically correct usage of elements means that you use them for what they are meant to be used for. It means that you use tables for tabular data but not for layout, it means that you use lists for listing things, strong and em for giving text an emphasis, and the like."
From: http://www.codingforums.com/archive/index.php/t-53165.html
HTML elements have meaning. "Semantically correct" means that your elements mean what they are supposed to.
For instance, your definition lists are represented by <dl> lists in code, your abbreviations are <abbr>s, etc.
It means that HTML elements are used in the right context (not like tables are used for design purposes), CSS classes are named in a human-understandable way and the document itself has a structure that can be processed by non-browser clients like screen-readers, automatic parsers trying to extract the information and its structure from the document etc.
For example, you use lists to build up menus. This way a screen reader for disabled people will know these list items are parts of the same menu level, so it will read them in sequence for the person to make a choice.
I've never heard it in a purely CSS context, but when talking about CSS and HTML, it means using the proper tags (for example, avoiding the use of the table tag for non-tabular data), providing proper values for the class and id that identify what the contained data is (and using microformats as appropriate), and so on.
It's all about making sure that your data can be understood by humans (everything is displayed properly) and computers (everything is properly identified and marked up).

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an html document? As an example, think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with well-formed markup and with terrible markup alike, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenu's possibly), ads, comments, and the prize - our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to implement, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
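The core of that cross-page diff can be sketched in a few lines. This assumes each page has already been reduced to a list of text blocks (the DOM-walking step is omitted):

```javascript
// Blocks that appear on several pages of the same site are template chrome
// (navigation, ads, footers); blocks unique to one page are content candidates.
function uniqueBlocks(pages) {
  const counts = new Map();
  for (const page of pages) {
    for (const block of new Set(page)) { // count each block once per page
      counts.set(block, (counts.get(block) || 0) + 1);
    }
  }
  // Keep only the blocks that occur on exactly one page.
  return pages.map(page => page.filter(block => counts.get(block) === 1));
}
```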
Sometimes there's a CSS media type defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert those to a vector. (As other people suggested, this could be the number of words, number of links, number of images; the more the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you don't necessarily need an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
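The vector-building step can be as simple as counting things. A hedged sketch follows; the choice of features and the field names are assumptions, not anything prescribed above:

```javascript
// Convert one page section into a numeric feature vector for a classifier
// (SVM, naive Bayes, ...): word count, link count, image count, and
// punctuation density. Training itself would use a separate library.
function toFeatureVector(section) {
  const words = section.text.split(/\s+/).filter(Boolean);
  const sentenceMarks = (section.text.match(/[.!?]/g) || []).length;
  return [
    words.length,
    section.linkCount,
    section.imageCount,
    sentenceMarks / Math.max(words.length, 1), // punctuation per word
  ];
}
```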
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
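That heuristic is easy to make concrete. Below is a sketch with illustrative field names and a guessed weighting; in a browser, each block would be built from a div's innerText length and the combined text length of its <a> descendants:

```javascript
// Score each block by how much text it holds, discounted by how link-heavy
// it is, then pick the winner. Ads and navigation are short and link-dense,
// so they score low.
function pickMainBlock(blocks) {
  const score = b =>
    b.textLength * (1 - b.linkTextLength / Math.max(b.textLength, 1));
  return blocks.reduce((best, b) => (score(b) > score(best) ? b : best));
}
```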
I would probably start with the Title and anything else in the Head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
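A quick sketch of that punctuation test (the thresholds are guesses):

```javascript
// Menus, headers, and footers are mostly bare words; real prose contains
// multi-word sentences that end in punctuation.
function looksLikeProse(text) {
  const endsSentences = /[.!?](\s|$)/.test(text);
  const wordCount = text.trim().split(/\s+/).length;
  return endsSentences && wordCount > 3;
}
```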
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are built on a blogging platform.
So I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you can fall back to the other solutions: identifying the biggest chunk of text, and so on.
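That rule set could start as a list of known selectors with a heuristic fallback. A sketch follows; the selector list is illustrative, and the fake document object in usage would be a real DOM in the browser:

```javascript
// Try well-known platform selectors first; return null so the caller can
// fall back to text-based heuristics when none of them match.
const PLATFORM_SELECTORS = ['div.entry', 'div.post-body']; // WordPress, Blogspot
function findPostNode(doc) {
  for (const selector of PLATFORM_SELECTORS) {
    const node = doc.querySelector(selector);
    if (node) return node;
  }
  return null;
}
```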
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in code showing how this can be done and prefer JavaScript, there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.

What is semantic markup, and why would I want to use that?

Like it says.
Using semantic markup means that the (X)HTML code you use in a page contains metadata describing its purpose -- for example, an <h2> that contains an employee's name might be marked class="employee-name". Originally some people hoped search engines would use this information, but as the web has evolved, semantic markup has mostly been used to provide hooks for CSS.
With CSS and semantic markup, you can keep the visual design of the page separate from the markup. This results in bandwidth savings, because the design only has to be downloaded once, and easier modification of the design because it's not mixed in to the markup.
Another point is that the elements used should have a logical relationship to the data contained within them. For example, tables should be used for tabular data, <p> should be used for textual paragraphs, <ul> should be used for unordered lists, etc. This is in contrast to early web designs, which often used tables for everything.
Semantics literally means using "meaningful" language; in Web Development, this basically means using tags and identifiers which describe the content.
For example, applying IDs such as #Navigation, #Header and #Content to your <div> tags, rather than #Left and #Main, or using unordered lists for a list of navigational links rather than a table.
The main benefits are in future maintenance; you can easily change the layout or the presentation without losing the meaning of your content. Your navigation bar can move from the left to the right, or your links can be displayed horizontally rather than vertically, without losing the meaning.
From http://www.digital-web.com/articles/writing_semantic_markup/ :
semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information.
Besides the already mentioned goal of allowing software to 'understand' the data, there are more practical applications in using it to translate between ontologies, or for mapping between dis-similar representations of data - without having to translate or standardize the data (which can result in a loss of information, and typically prevents you from improving your understanding in the future).
There were at least 2 sessions at OSCon this year related to the use of semantic technologies. One was on BigData (slides are available here: http://en.oreilly.com/oscon2008/public/schedule/proceedings); the other was given by the folks from FreeBase.
BigData was using it to map between two dis-similar data models (including the use of query languages which were specifically created for working with semantic data sets). FreeBase is mapping between different data sets and then performing further analysis to derive meaning across those data sets.
Related topics to look into: OWL, OQL, SPARQL, Franz (AllegroGraph, RacerPRO and TopBraid).
Here is an example of an HTML5, semantically tagged website that I've been working on that uses the recently accepted microformats as specified at http://schema.org along with the new, more semantic tagging elements of HTML5.
http://blog-to-book.com/view/stuff/about/semantic%20web
Google has a handy semantic tagging test tool that will show you how adding semantic tags to content enables search engines to 'understand' far more about your web pages.
Here is the test tool: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fblog-to-book.com%2Fview%2Fstuff%2Fabout%2Fsemantic+web&view=
Notice how Google now knows that the 'things' on the page are books and that they have an ISBN-13 identifier. Adding additional metadata, such as price and author, enables further inferences to be made.
Hope this points you in some interesting directions. More detailed semantic tagging can be achieved using the Good Relations Ontology which is pretty much the most comprehensive I can think of right now.