I have coded a "terms and conditions" page with numbered paragraphs as follows:
<p>1. content</p>
<p>2. more content</p>
<p>3. even more content</p>
as opposed to:
<ol>
<li>content</li>
<li>more content</li>
<li>even more content</li>
</ol>
I have been told by someone that this is extremely bad practice and is generally wrong.
Now, my question to you lot is: why?
This still validates and it is still picked up by search engines etc.
Am I being stupid or is the other person just being picky?
It's mainly a question of presentation versus contents.
The first approach makes you maintain the numbering as contents, whereas the second approach displays the numbering by itself, also allowing you to customize the presentation via CSS.
If you have to modify the text, for example to add or remove paragraphs, explicit numbers force you to edit every following paragraph to increment or decrement its number, which invites errors.
I can't be sure about SEO, but if it's important that the numbering appears in Google results, for instance if people search for legal article paragraphs via official numbers, then it's important that the number appears explicitly in the text, since an ordered list leaves the numbers out of the markup and a crawler would have to derive them.
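As a rough illustration of moving from the first approach to the second, here is a minimal Python sketch that strips the hand-written numbers and emits an ordered list. The function name and the regular expression are assumptions for this example, and it only copes with simple `<p>N. text</p>` paragraphs like the ones in the question.

```python
import re

# Matches paragraphs of the form <p>12. some text</p> and captures the text.
NUMBERED_PARAGRAPH = re.compile(r"<p>\s*\d+\.\s*(.*?)\s*</p>", re.DOTALL)


def paragraphs_to_ordered_list(html):
    """Rewrite hand-numbered paragraphs as <li> items inside one <ol>."""
    items = NUMBERED_PARAGRAPH.findall(html)
    lines = ["<ol>"] + [f"<li>{item}</li>" for item in items] + ["</ol>"]
    return "\n".join(lines)


terms = "<p>1. content</p>\n<p>2. more content</p>\n<p>3. even more content</p>"
print(paragraphs_to_ordered_list(terms))
```

Once the numbers are gone from the content, inserting or removing an item no longer requires touching its neighbours.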
It comes down to semantics. Yes the other person was being picky. Yes what you were doing is not the best practice. No, what you're doing is not WRONG technically, but you can ADD VALUE to the content by using the ordered list.
Since the ordered list is a tag, it is machine-readable without any special parsing. Numbers such as 1. and 2. embedded as plain text within paragraph content are not immediately exposed to whatever machine readers, code, or other automation might want to work with your content.
The OL identifies that section of your content as an ordered list. That's adding (a little) value to the content because now it's providing a little bit of meta-data about what KIND of content it is. It's a numbered list.
I know, you're saying so what?
Well, let's suppose this content is being "read" by a blind person using text-reading software, where the computer reads the HTML page, parses the content and then speaks it to the user. The user may have enabled a setting saying: okay, for every ordered list, I want you to read me one item at a time and then pause and wait for me to say "next". Or whatever; it's just an arbitrary example.
Basically, by providing additional (semantic) information about your content, you are enabling software, devices and other processes to work with your content in more meaningful, contextual ways.
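To make the "machine-readable" point concrete, here is a small sketch using Python's standard `html.parser`. The class and function names are made up for this example, and nested lists are not handled. A program can find the items of an `<ol>` by tag alone, while the numbered paragraphs give it nothing to recognise.

```python
from html.parser import HTMLParser


class OrderedListReader(HTMLParser):
    """Collect the text of each <li> found inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished list items, in document order
        self._in_ol = False
        self._current = None   # text pieces of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self._in_ol = True
        elif tag == "li" and self._in_ol:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None
        elif tag == "ol":
            self._in_ol = False

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def read_ordered_list(html):
    reader = OrderedListReader()
    reader.feed(html)
    reader.close()
    return reader.items


terms = "<ol><li>content</li><li>more content</li><li>even more content</li></ol>"
for number, item in enumerate(read_ordered_list(terms), start=1):
    print(number, item)
```

The numbers printed here come from the item positions, which is the same information a browser or screen reader derives from the markup.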
I highly doubt it has any effect on your rankings.
You should use an ordered list because that is what it was made for. Use a paragraph for paragraphs and unordered lists for lists of things with no order, etc...
The ordered list is semantically correct: the markup describes the list as a list. In the stack-of-paragraphs example, the markup is semantically meaningless.
Personally, I find semantic markup easier to style, scan, and maintain (SirDarius explained the last point perfectly.)
So I am writing a manual in html5, and it's going to need numbering.
The headings will need to be numbered eg "Section 4: Some Stuff"
Some subheadings will need to be numbered eg "4.01: the first point you need to know about some stuff"
Just to be difficult, the manual will have tables and images, so they will need to be numbered, also eg
"Fig 4.03 A cat. Most of the images on the internet are of cats."
Also, there are lots of process lists in the manual. It would be nice if these were numbered under the subheadings eg
4.05 A simple process
4.05.01 Pull a leaf from the tree
4.05.02 Eat it
4.05.03 Now you are a caterpillar
4.05.04 Turn into a beautiful butterfly
I've been researching the different ways to number my headings, subheadings, figures, and lists. I'm finding answers, just not good answers.
imperfect solution 1: use CSS counters
These can't be copied to editing programs (word etc)
They also apparently don't work with screen-readers
imperfect solution 2: Use ordered lists
These won't 'fail gracefully' afaik: if all my headings are a 'heading' class of ordered list, they will just look like a plain list without CSS.
Has someone solved this problem already? What's the solution?
Super extra kudos for anyone who can supply a smart way of auto-updating my figure cross references!
Use text.
<h3>4.01: the first point you need to know about some stuff</h3>
The numbering is not just styling (you might want to reference these numbers, right?), so a CSS solution is out of the question.
Using ol can work in some cases, but it has many drawbacks:
You don’t want to use an ol for your whole document, do you?
User-agents don’t have to render any numbers at all.
Many user-agents will not let you search for or copy the numbers.
You can’t get the exact kind of numbering scheme you want to have (e.g., a nested ol typically doesn’t render a delimiter like . but starts again with the first value).
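If the numbers live in the text, they do not have to be typed by hand. One possible approach is to generate them in a build step before the HTML is written. The sketch below is only an illustration: the `number_outline` function, its `(level, title)` input format, and the two-digit padding are assumptions modelled on the numbering style in the question.

```python
def number_outline(entries, first_section=1):
    """Return (number, title) pairs for a list of (level, title) entries.

    Level 1 is a section; deeper levels are zero-padded to two digits,
    giving numbers such as "4", "4.05" and "4.05.01".
    """
    counters = []
    numbered = []
    for level, title in entries:
        if level < 1 or level > len(counters) + 1:
            raise ValueError(f"unexpected level {level} for {title!r}")
        # Forget counters of deeper levels, then start or advance this level.
        del counters[level:]
        if len(counters) < level:
            counters.append(first_section if level == 1 else 1)
        else:
            counters[level - 1] += 1
        parts = [str(counters[0])] + [f"{n:02d}" for n in counters[1:]]
        numbered.append((".".join(parts), title))
    return numbered


manual = [
    (1, "Some Stuff"),
    (2, "A simple process"),
    (3, "Pull a leaf from the tree"),
    (3, "Eat it"),
]
for number, title in number_outline(manual, first_section=4):
    print(f"{number}: {title}")
```

The resulting strings can be written straight into the heading text, so they stay searchable and copyable, and the numbers update themselves when an entry is added or removed.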
You can use <h1> to <h6> to structure your text. But:
What's the correct / best way, to make a separate Index (Table of Contents) to the text? (on the same page as the text)
Usually, I would just make a <ul><li>...</li></ul> list, listing every title from every <h1> ... <h6>, but that probably wouldn't be a good solution, right? I'm especially thinking of external readers (for blind people, for example, or maybe other programs) or even search engines that read the HTML: they would basically see every title twice.
I'm just wondering, if there is a predefined / best practice way to do that.
Table of contents and index are two different things. You probably mean the former.
It matters very little which markup you use, as long as you use normal links with descriptive link text (normally echoing the headings). You normally don’t want any browser-generated bullets or numbers, but the simplest choice is to use ul with list-style-type: none, using nested ul elements for different levels of headings.
A table of contents means duplication, of course, and assistive software might be capable of constructing a table of contents from the heading elements. Still, for most purposes, an explicit table of contents at the start is good usability and accessibility for any longish page. It lets the reader (seeing or blind) go through the headings first and then choose to jump somewhere or read sequentially.
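A table of contents like this can also be generated rather than maintained by hand. The following sketch uses Python's standard `html.parser`. The class and function names are invented for this example, and it assumes every heading that should appear carries an `id` attribute to link to. It emits nested `ul` elements with one link per heading.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every <h1>..<h6> that has an id."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open = None  # [level, id, text pieces] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._open = [int(tag[1]), dict(attrs).get("id"), []]

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._open is not None:
            level, anchor, pieces = self._open
            if anchor:  # headings without an id cannot be linked to
                self.headings.append((level, anchor, "".join(pieces).strip()))
            self._open = None


def build_toc(html):
    """Return nested <ul> markup linking to each heading in the page."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    lines = []
    open_levels = []  # heading levels that currently have an open <ul><li>
    for level, anchor, text in collector.headings:
        link = f'<a href="#{escape(anchor)}">{escape(text)}</a>'
        while open_levels and open_levels[-1] > level:
            lines.append("</li></ul>")
            open_levels.pop()
        if open_levels and open_levels[-1] == level:
            lines.append("</li>")
            lines.append(f"<li>{link}")
        else:
            lines.append(f"<ul><li>{link}")
            open_levels.append(level)
    lines.extend("</li></ul>" for _ in open_levels)
    return "\n".join(lines)


page = '<h1 id="a">A</h1><p>x</p><h2 id="b">B</h2><h2 id="c">C</h2><h1 id="d">D</h1>'
print(build_toc(page))
```

The bullets can then be hidden with list-style-type: none as described above.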
I'm transforming some XML, which I have no control over, to XHTML. The XML schema defines a <para> tag for paragraphs and <unordered-list> and <ordered-list> for lists.
Frequently in this XML, I find lists nested within paragraphs. So, a straight-forward transformation causes <ul>s to get nested within <p>s, which is illegal in XHTML.
I've created a list of ways to deal with it and here are the most obvious:
Just don't worry about it. The browsers will do fine. Who cares. (I don't like this option, but it's an option!)
Write a fancy-pants component to my transform that makes sure all <para> tags get closed before unordered lists start, and re-opened afterward. (I like this option the most, but it's complicated due to multiple levels of nesting, and we may not have the budget for this)
Just transform <para> to <div> and set the margins on the divs so it looks like a paragraph in the browser. This is the easiest solution that emits valid XHTML, but it takes from the semantic value of the markup.
My questions are:
how much value do I lose if I go with option 3?
Does it really matter?
What is the actual effect on the user experience?
If you can cite references, please do (this is easy to speculate on). For example, I was thinking it might affect search results from a Google Search Appliance that we are using.
If search terms appear in divs, do they carry less weight?
Or is there less of an association between them and preceding header tags?
How can I find this out?
I've come up against this too.
Personally, I consider it a grave mistake on part of the standard that a p cannot contain lists. I think it's typographically legal, so it should be legal in what was originally intended to be a markup for text.
I may be flamed for this, but XHTML has crashed and burned in the real world, regardless of whether it was a good idea or not. The often horrible tag soup that is today's HTML markup will continue to survive for a goodly long time, if only because bad markup and lenient browsers will continue to perpetuate each other forever.
Thus, I tend to go with Option 1.
Option 3 is also viable, in my opinion. While I don't have proof, I'm pretty sure no search engine is crazy enough to actually put any trust in most of the formatting tags we apply to our HTML. meta and a tags are obvious exceptions, of course.
First of all, unless you set every CSS property available now plus every one possibly available in the future, then you can't guarantee your <div> will match up, WRT styles, with <p>. (Though I agree you can get close and this is probably good enough, but read on.) I don't know of any visual browsers or other tools that would seriously treat them differently, but this is just as much an artifact, IMHO, of the current widespread loose interpretation on the web, as it is of them being close in meaning.
Is <ul> the right transformation for every <unordered-list> in your source data? If the lists are always displayed as block-level content, rather than run into the sentence like "1) an, 2) inline, 3) list", then that's a safe bet. If so, you can break the paragraph into two (and wrap the whole thing in <div> if you like).
Example input:
<para>Yadda yadda: <unordered-list/> And so fin.</para>
Output:
<div>
<p>Yadda yadda:</p>
<ul/>
<p>And so fin.</p>
</div>
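A transformation along these lines could look roughly like the following Python sketch using the standard `xml.etree.ElementTree` module. It is not the schema's real vocabulary: the `<item>` element name for list entries is an assumption, inline markup inside the paragraph is flattened to plain text, and lists nested inside list items are not handled.

```python
import xml.etree.ElementTree as ET

LIST_TAGS = {"unordered-list": "ul", "ordered-list": "ol"}


def convert_list(source):
    """Convert a source list element to <ul>/<ol> with one <li> per item."""
    html_list = ET.Element(LIST_TAGS[source.tag])
    for item in source.findall("item"):  # "item" is an assumed element name
        ET.SubElement(html_list, "li").text = (item.text or "").strip()
    return html_list


def convert_para(para):
    """Turn one <para> into a <div> holding <p> and list siblings."""
    div = ET.Element("div")
    pieces = []  # text collected for the paragraph currently being built

    def flush():
        # Close the current paragraph, if it contains any real text.
        text = " ".join("".join(pieces).split())
        pieces.clear()
        if text:
            ET.SubElement(div, "p").text = text

    pieces.append(para.text or "")
    for child in para:
        if child.tag in LIST_TAGS:
            flush()  # the paragraph ends before the list starts
            div.append(convert_list(child))
        else:
            pieces.append("".join(child.itertext()))  # inline markup flattened
        pieces.append(child.tail or "")  # text after the child opens a new <p>
    flush()
    return div


source = ("<para>Yadda yadda: <unordered-list><item>one</item>"
          "</unordered-list> And so fin.</para>")
print(ET.tostring(convert_para(ET.fromstring(source)), encoding="unicode"))
```

Because the paragraph is closed whenever a list starts and reopened only if text follows, the output never nests a list inside a `<p>`.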
The good news is that any of these 3 options would work.
There are many, many people on SO that will tell you "if it works, forget semantics and do it." So Option 1 would probably be a site favorite if everyone here was asked.
Option 2 is my favorite and would be the best semantically. I would definitely do it if time/budget allows.
However, Option 3 is a close second and hopefully this will answer your question: the <div> element and the <p> element are near-identical, and the biggest difference is semantics. Most browsers' default style sheets apply display: block to both; the <p> additionally gets default top and bottom margins, which you can reproduce on the <div> with CSS.
I have seen it a lot in css talk. What does semantically correct mean?
Labeling correctly
It means that you're calling something what it actually is. The classic example is that if something is a table, it should contain rows and columns of data. To use that for layout is semantically incorrect - you're saying "this is a table" when it's not.
Another example: a list (<ul> or <ol>) should generally be used to group similar items (<li>). You could use a div for the group and a <span> for each item, and style each span to be on a separate line with a bullet point, and it might look the way you want. But "this is a list" conveys more information.
Fits the ideal behind HTML
HTML stands for "HyperText Markup Language"; its purpose is to mark up, or label, your content. The more accurately you mark it up, the better. New elements are being introduced in HTML5 to more accurately label common web page parts, such as headers and footers.
Makes it more useful
All of this semantic labeling helps machines parse your content, which helps users. For instance:
Knowing what your elements are lets browsers use sensible defaults for how they should look and behave. This means you have less customization work to do and are more likely to get consistent results in different browsers.
Browsers can correctly apply your CSS (Cascading Style Sheets), describing how each type of content should look. You can offer alternative styles, or users can use their own; as long as you've labeled your elements semantically, rules like "I want headlines to be huge" will be usable.
Screen readers for the blind can help them fill out a form more easily if the logical sections are broken into fieldsets with one legend for each one. A blind user can hear the legend text and decide, "oh, I can skip this section," just as a sighted user might do by reading it.
Mobile phones can switch to a numeric keyboard when they see a form input of type="tel" (for telephone numbers).
Semantics basically means "The study of meaning".
Usually when people are talking about code being semantically correct, they're referring to the code that accurately describes something.
In (x)HTML, there are certain tags that give meaning to the content they contain. For example:
An H1 tag describes the data it contains as a level-1 heading. An H2 tag describes the data it contains as a level-2 heading. The implied meaning behind this is that each H2 under an H1 is in some way related (i.e. heading and subheading).
When you code in a semantic way, you basically give meaning to the data you're describing.
Consider the following 2 samples of semantic VS non-semantic:
<h1>Heading</h1>
<h2>Subheading</h2>
VS a non-semantic equivalent:
<p><strong>Heading</strong></p>
<p><em>Subheading</em></p>
Sometimes you might hear people in a debate saying "You're just talking semantics now", which usually means that someone is expressing the same meaning as the other person but in different words.
"Semantically correct usage of elements means that you use them for what they are meant to be used for. It means that you use tables for tabular data but not for layout, it means that you use lists for listing things, strong and em for giving text an emphasis, and the like."
From: http://www.codingforums.com/archive/index.php/t-53165.html
HTML elements have meaning. "Semantically correct" means that your elements mean what they are supposed to.
For instance, your definition lists are represented by <dl> lists in code, your abbreviations are <abbr>s etc.
It means that HTML elements are used in the right context (not like tables are used for design purposes), CSS classes are named in a human-understandable way and the document itself has a structure that can be processed by non-browser clients like screen-readers, automatic parsers trying to extract the information and its structure from the document etc.
For example, you use lists to build up menus. This way a screen reader for disabled people will know these list items are parts of the same menu level, so it will read them in sequence for a person to make choice.
I've never heard it in a purely CSS context, but when talking about CSS and HTML, it means using the proper tags (for example, avoiding the use of the table tag for non-tabular data), providing proper values for the class and id that identify what the contained data is (and using microformats as appropriate), and so on.
It's all about making sure that your data can be understood by humans (everything is displayed properly) and computers (everything is properly identified and marked up).
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (with submenus possibly), ads, comments, and the prize: our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with well-formed markup, and terrible markup. Whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (with submenus possibly), ads, comments, and the prize: our article/blog/news-body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that these nodes are prioritized for other pages.
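A heavily simplified version of this idea can be sketched in Python. Note the substitution: instead of comparing full DOM trees, this sketch compares runs of text between tags, and it assumes the pages have already been fetched into a dict of URL to HTML. All names here are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect each run of text between tags, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip:
            self.blocks.append(text)


def text_blocks(html):
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def unique_blocks(pages):
    """Map each URL to its text blocks that occur on no other page."""
    blocks = {url: text_blocks(html) for url, html in pages.items()}
    seen_on = Counter(b for page in blocks.values() for b in set(page))
    return {url: [b for b in page if seen_on[b] == 1]
            for url, page in blocks.items()}


def main_candidate(pages, url):
    """Largest unique text block of one page, or None if nothing is unique."""
    return max(unique_blocks(pages)[url], key=len, default=None)


pages = {
    "/a": "<div>Home | About</div><p>Caterpillars eat leaves all day.</p>",
    "/b": "<div>Home | About</div><p>Butterflies drink nectar.</p>",
}
print(main_candidate(pages, "/a"))
```

Text shared by several pages, such as the navigation bar in this toy input, is treated as template content and dropped before the largest remaining block is chosen.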
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into sections (say, treat each structural element such as a div as a document), gather some properties of each section, and convert them to a vector. (As other people suggested, these could be the number of words, number of links, and number of images; the more properties, the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the dom tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
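As a rough sketch of that heuristic, the following Python code scores each div by the amount of text it holds directly, minus the text that sits inside links. The scoring formula and all names are assumptions made for this example. Text is credited only to the innermost enclosing div, so an outer wrapper does not automatically win.

```python
from html.parser import HTMLParser


class DivScorer(HTMLParser):
    """Count plain text and link text for every <div> in a page."""

    def __init__(self):
        super().__init__()
        self.divs = []      # one record per <div>, in document order
        self._stack = []    # currently open <div>s, innermost last
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div = {"attrs": dict(attrs), "text": 0, "link_text": 0}
            self.divs.append(div)
            self._stack.append(div)
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        length = len(data.strip())
        if self._stack and length:
            key = "link_text" if self._in_link else "text"
            self._stack[-1][key] += length  # credit the innermost div only


def best_div(html):
    """Return the record of the div with the most non-link text, or None."""
    scorer = DivScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.divs, key=lambda d: d["text"] - d["link_text"], default=None)


page = (
    '<div id="nav"><a href="/">Home</a> <a href="/about">About us</a></div>'
    '<div id="main"><p>A long article body with several sentences. '
    'It has one <a href="#">link</a>.</p></div>'
    '<div id="ad">Buy now!</div>'
)
print(best_div(page)["attrs"]["id"])
```

A real page would need more care, for instance ignoring script and style content, but the ranking principle is the same.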
The content area is almost always the area with the greatest width on the page.
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (ie h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
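The punctuation idea can be sketched in a few lines of Python. The function names, the five-word minimum, and the input format (a list of text blocks already extracted from the page) are assumptions for this example.

```python
def looks_like_prose(text, min_words=5):
    """True if the text has enough words and at least one sentence ending."""
    words = text.split()
    has_sentence_end = any(
        word.rstrip("\"')").endswith((".", "!", "?")) for word in words
    )
    return len(words) >= min_words and has_sentence_end


def content_span(blocks):
    """Return the blocks from the first to the last prose-like block."""
    hits = [i for i, block in enumerate(blocks) if looks_like_prose(block)]
    if not hits:
        return []
    return blocks[hits[0]:hits[-1] + 1]


blocks = [
    "Home",
    "About us",
    "The caterpillar ate a leaf, then slept.",
    "A subheading",
    "Later it became a butterfly.",
    "Copyright 2010",
]
print(content_span(blocks))
```

The unpunctuated subheading survives because it lies between two prose blocks, while the menu entries and the footer fall outside the span.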
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news and blog websites use a blogging platform.
So I would create a set of rules by which I would search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by css classes fails you could turn to the other solutions, identifying the biggest chunk of text and so on.
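A rule-based lookup of this kind might be sketched as follows in Python. The class names are the two from the examples above (real themes may use others), and the extractor class and function names are invented for this example.

```python
from html.parser import HTMLParser

# Class names taken from the examples above; real themes may differ.
CONTENT_CLASSES = {"entry", "post-body"}


class ClassRuleExtractor(HTMLParser):
    """Collect the text inside any <div> whose class matches a known rule."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._depth = 0  # greater than 0 while inside a matched <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested <div> inside the matched one
        elif CONTENT_CLASSES.intersection(classes):
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text.append(data)


def extract_post(html):
    """Return the matched post text, or None so a fallback can be tried."""
    parser = ClassRuleExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.text).split())
    return text or None


page = (
    '<div class="sidebar">Ads</div>'
    '<div class="entry clearfix"><p>Hello <div>nested</div> world</p></div>'
    '<div>footer</div>'
)
print(extract_post(page))
```

Returning None when no rule matches makes it easy to fall back to one of the text-size heuristics.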
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the Readability.js mentioned above.