Can tags be a replacement for taxonomy? - usability

My question is about usability. In most of the sites I have seen and developed, I see taxonomy as the way a user finds what they are looking for on the site. But quite recently I have come across the concept of tagging, where products, services, and questions are tagged and can be found by tag name. Is tagging an alternative to taxonomy, or should they work together?

I'd say that like most things, it depends on what kind of information you're trying to organize.
For example, here on Stack Overflow, there isn't really a rigid hierarchy by which to sort the questions. They're much more organic in the sense that they can span multiple, and even unrelated, disciplines or fields and create a whole host of dynamic connections. For organizing this type of information, I think tags are an appropriate replacement for a traditional, hierarchical taxonomy. The decentralized, non-hierarchical nature of tagging dovetails perfectly with the general organization of the site's content, especially when the site's users and community are encouraged to participate in cataloguing and organizing the information. Many blogs and social networking sites like Delicious organize their content with a series of tags as well.
Conversely, if you're trying to sell products or provide technical support, you'll probably find that tagging is not a suitable replacement for traditional taxonomic organization. If you're familiar with MSDN, which provides online documentation for developers in the Microsoft ecosystem, you'll observe that most of its content is organized into a natural hierarchy by technology/language, feature, sub-feature, etc. If you want to buy a computer from Dell, you start by narrowing down your choices: do you want a desktop, notebook, or tablet? Do you want a performance-oriented notebook, a desktop-replacement notebook, or an ultra-portable? Etc. Of course, that doesn't mean that you shouldn't consider implementing tags as an alternative way for users to explore the information that you have available, but in the best of cases, they will work together.
Think about the type of content you plan to host on your site and consider the most natural way to organize that information. Your users will appreciate more than anything a site that is intuitive and where they feel it is easy to locate exactly what they're looking for.

That is an argument I have always found interesting, and I basically reduce it to this question:
In order to find something, is it better to have a hierarchical taxonomy or a flat, tag-based taxonomy (maybe a collaborative one, i.e. a folksonomy)?
Well, there's no single answer: depending on the search context, sometimes the former is more convenient and sometimes the latter is.
The best thing would be to have both kinds of taxonomy, but that can be difficult to manage, in particular if content is created by users and the classification is therefore up to them.
One solution could be tag inheritance, like in Drupal's taxonomy system.
So, for instance, when you want to classify a picture of your dog, you just select the tag 'dogs', and your picture automatically belongs to the chain 'dogs' --> 'animals' --> 'living beings', and so on.
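A minimal sketch of how that kind of tag inheritance could be resolved in code (the parent map and function are hypothetical illustrations, not Drupal's actual API):

```python
# Hypothetical parent map mimicking a Drupal-style hierarchical vocabulary.
PARENTS = {
    "dogs": "animals",
    "cats": "animals",
    "animals": "living beings",
}

def expand_tags(tag):
    """Return the tag plus all of its ancestors, most specific first."""
    chain = [tag]
    while tag in PARENTS:
        tag = PARENTS[tag]
        chain.append(tag)
    return chain

print(expand_tags("dogs"))  # ['dogs', 'animals', 'living beings']
```

Storing only the most specific tag and expanding it at query time keeps the classification effort for users minimal while still allowing hierarchical browsing.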

This question comes down to how people think:
Sure, it is great if you can find something by a tagged word. But if you don't know the word/tag exactly, you won't be able to find it. Others may have tagged the thing you are searching for with a similar but different tag. In that case a (binary) tag search will not give you the correct (or complete) answer.
That said, it is possible to extract a taxonomy out of tags (as long as the words/tags are related). This, combined with a vector-oriented search, can be presented to the user and will help them find what they need, as the sketch below illustrates.
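A rough sketch of that idea, assuming tag co-occurrence across items is used as the similarity signal behind the vector-oriented search (the data and function names are purely illustrative):

```python
from collections import Counter
from itertools import combinations

# Illustrative tagged items; in practice these come from your content store.
items = [
    {"puppy", "dog", "pet"},
    {"dog", "pet", "animal"},
    {"kitten", "cat", "pet"},
]

# Count how often each pair of tags appears on the same item.
cooccurrence = Counter()
for tags in items:
    for a, b in combinations(sorted(tags), 2):
        cooccurrence[(a, b)] += 1

def related_tags(tag):
    """Tags that co-occur with `tag`, most frequent first."""
    scores = Counter()
    for (a, b), count in cooccurrence.items():
        if tag == a:
            scores[b] += count
        elif tag == b:
            scores[a] += count
    return [t for t, _ in scores.most_common()]

print(related_tags("dog"))  # e.g. ['pet', 'puppy', 'animal']
```

A search for 'dog' can then also surface items tagged with the related terms, which softens the exact-tag-match problem described above.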

Although I'd just upvote Cody's answer (I did), I would also like to add something:
The field of usability used to be within the realm of ergonomics before it grew up. So I think it is appropriate to refer to one of ergonomics' core principles.
Every person has a unique set of dimensions, so there is no single set of “correct dimensions” for e.g. a chair. The best dimensions are adjustable dimensions that provide a reasonable range of variability.
It is possible to apply this principle to website navigation as well and provide multiple ways of reaching the same content, so that people with different habits can find stuff using the way they are most comfortable with.

Related

When should I use HTML5 Microdata for SEO?

So I've been looking into HTML5 microdata, but I'm not sure if or when it is appropriate to use it. I know that if it is used for ratings and your site comes up in a search, it can pull in things like video ratings, article ratings, etc. But for microdata like People or Places, is it so useful that I should start implementing it into all my websites, big and small? How big an impact will this really have on my SEO if I start using microdata on everything?
Maybe I would use something like http://schema.org/ as my standard vocabulary; I think that is what Google suggests. Here's a link to the microdata spec, http://dev.w3.org/html5/md/, which will be helpful if you are unfamiliar with microdata.
Following the "Schema.org - Why You're Behind if You're Not Using It..." article on SEOmoz, I must say this question is not just about microdata and Google SERP positions. I think it has to be taken in a much wider sense:
Some advantages:
Implementing microdata on a website DOES increase the CHANCE of rich snippets being displayed next to your site in Google search results. You can't say 'microdata = rich snippets', but you also can't say 'no microdata = no rich snippets' :)
Having rich snippets draws users' attention to that single search result, and it CAN result in more clicks => more visitors to your page.
Some cons:
Some rich snippets, which can be a result of using microdata, let users find the information they're looking for directly on the search results page, without ever reaching your page. E.g. if a user is looking for a phone number and sees it in the rich snippet, they don't have to click through and visit your page.
You have to decide on your own whether you can take that risk. From my own experience (and that article's comments as well), the risk is quite small, and if you can, you should implement microdata. Of course, 'if you can' should really mean 'if you can and it won't require the whole site to be rebuilt' :) If you have more serious things to do on your site, put them at the front of the queue. Today, microdata is only 'nice-to-have', not 'must-have'.
And finally - I know my answer is not a simple yes or no, but that's because the question is not that kind of question. However, I hope it helps you make your own decision.
My answer would be "Always."
It's the emerging standard for categorizing all forms of information on the web.
Raven Tools (no affiliation) has a schema.org microdata generator that's a good place to start:
http://schema-creator.org/product.php
They have a couple of stock schema templates on that page (look in the left column).
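For anyone who has not seen what such generated markup looks like, here is a minimal sketch of schema.org Product microdata and one way to read the properties back programmatically (BeautifulSoup is assumed purely for illustration, and the HTML is a made-up example):

```python
from bs4 import BeautifulSoup

# Made-up schema.org/Product microdata, similar to what a generator emits.
html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Anvil</span>
  <span itemprop="description">A sturdy anvil for all your roadrunner needs.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(attrs={"itemprop": True}):
    print(tag["itemprop"], "->", tag.get_text(strip=True))
# name -> Acme Anvil
# description -> A sturdy anvil for all your roadrunner needs.
```

The attributes to note are itemscope, itemtype (pointing at the schema.org type) and itemprop (the individual properties that search engines read).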

Tree view navigation, good idea?

I'm thinking of using a tree view for page navigation in my web application, similar to Windows Explorer. There are a lot of things for administrators to configure in the application so I figured listing all links in a single page in tree form would keep things organized. Related page links are grouped in a "folder", and all folders will show closed initially.
Obviously, this page is for administrators only, so they'd be provided with some training. That being said, is this a good design from the user's point of view? Do you see any usability or potential implementation issues?
The best answer involves empirical evidence. A yes-or-no answer could really vary based on the specific task and your intended audience. Try doing a simple five-minute usability test with your users. Draw out your page layouts on paper and have a couple of users pretend to use the site (see Paper Prototyping). Give them a few simple tasks to complete using your interface and observe what they do.
If they get confused or have trouble with the concept, then it's probably best to find another way to provide navigation.
It totally depends on how your users are using your site. If they're often jumping from one part of the site to a completely different, unrelated place in the site, a tree may be the best way to let them quickly find that "other page" they were looking for.
However, for the vast majority of websites I've ever seen or used, I'd prefer to find what I'm looking for either via Search functionality, or by links on the page I'm looking at that lead me to related data.

Why should I use XFN in my HTML?

What is the benefit of using XFN (XHTML Friends Network)? I've seen this on multiple blogs and social networking sites but I don't really understand why it's useful. Other than being able to style these elements with CSS3 and select them with JavaScript, what's the benefit? Do you know of any sites out there that really utilize XFN to enhance the user experience? Also, are there similar alternatives to XFN?
Do you know of any sites out there that really utilize XFN to enhance the user experience?
Microformats aren't meant to show extra information on the website itself; if that were the goal, it could be done with an ordinary link (plain text like 'John'). You should think in another direction: for example, maybe browsers will support microformats one day.
Search engines may find this XFN information interesting for one reason or another, to see how the world is connected; I'm not sure what they could actually do with it. You can read about that on Wikipedia.
By the way, you can find out who your friends on the web are using Google's Social Graph API.
Also, are there similar alternatives to XFN?
Take a look at microformat.org's wiki
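To make the 'how the world is connected' point concrete, here is a minimal sketch of harvesting XFN rel values from a page's links (BeautifulSoup is assumed for illustration; the markup is invented):

```python
from bs4 import BeautifulSoup

# Invented page fragment with XFN rel values on its outbound links.
html = """
<a href="http://example.com/alice" rel="friend met">Alice</a>
<a href="http://example.com/bob" rel="colleague">Bob</a>
<a href="http://example.com/me-elsewhere" rel="me">My other profile</a>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", rel=True):
    # rel is a multi-valued attribute, so BeautifulSoup returns it as a list.
    print(link.get_text(), "->", ", ".join(link["rel"]))
# Alice -> friend, met
# Bob -> colleague
# My other profile -> me
```

This is essentially the kind of data Google's Social Graph API collects: rel='me' and friendship links from public pages, joined into a graph.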

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?
Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.
Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.
Postscript the second: To be clear, following Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.
Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.
Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and voila! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved-accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:
Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a professor at UC Berkeley). Her PhD thesis and other papers on subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):
In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.
One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style - the first paragraph or two has the meat of who/what/where/when, and the following paragraphs expand on that. Like I said, this is the easy stuff.
How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.
Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.
Strangely enough, a big reason for Google being as successful as it is (from a search engine perspective) is that it places a lot of weight on the words in and surrounding a link from another site. That link text represents a sort of mini-summary, done by a human, of the site/page it's linking to - exactly what you want when you are searching. And it works across nearly all genres and layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
/mini-rant-and-trip-down-memory-lane
One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (avg/median)
Demo: http://boilerpipe-web.appspot.com/
Code: http://code.google.com/p/boilerpipe/
Presentation: http://videolectures.net/wsdm2010_kohlschutter_bdu/
Dataset and slides: http://www.l3s.de/~kohlschuetter/boilerplate/
PhD thesis: http://www.kohlschutter.com/pdf/Dissertation-Kohlschuetter.pdf
Also quite language independent (today, I've learned it works for Nepali, too).
Disclaimer: I am the author of this work.
Have you seen boilerpipe? Found it mentioned in a similar question.
I have come across http://www.keyvan.net/2010/08/php-readability/
Last year I ported Arc90's Readability to use in the Five Filters project. It's been over a year now and Readability has improved a lot - thanks to Chris Dary and the rest of the team at Arc90.
As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online.
For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.
It's also very handy for content extraction, which is why I wanted to port it to PHP in the first place.
There are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose, which was open-sourced by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.
Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.
To rank articles even with the simplest metric of article length you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to large volumes characteristic of Google and similar sources. I've found that 30% or more of mined articles break these packages as a rule of thumb. This has caused us to refocus on developing very low level, intelligent, character based parsers to separate the raw text from the markup and scripting. The more fine grained your parsing (i.e. partitioning of content) the more intelligent (and hand made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service based information delivery as opposed to "shrink wrapped" applications.
Looking back over the years there appears to have been very few scholarly papers written about the low level mechanics (i.e. the "practice of the former" you refer to) of such extraction, probably because it's so domain and content specific.
Beautiful Soup is a robust HTML parser written in Python.
It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access `<foo><bar/></foo>` using `doc.foo.bar`) and seamless Unicode handling.
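A quick sketch of the behaviour just described, using the bs4 package (the broken HTML is an invented example):

```python
from bs4 import BeautifulSoup

# Bad markup on purpose: an unquoted attribute, an unclosed <p>, no </html>.
broken = "<html><body><foo><bar>déjà vu</bar></foo><p class=lead>first</p><p>second"

doc = BeautifulSoup(broken, "html.parser")

print(doc.foo.bar.string)                         # dot-notation child access: déjà vu
print([p.get_text() for p in doc.find_all("p")])  # search: ['first', 'second']
```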
If you are out to extract content from pages that heavily utilize JavaScript, Selenium Remote Control can do the job. It works for more than just testing. The main downside is that you'll end up using a lot more resources. The upside is that you'll get a much more accurate data feed from rich pages/apps.
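The answer refers to Selenium Remote Control; a minimal sketch with the newer Selenium WebDriver Python bindings (a substitution on my part, and the URL is a placeholder) would look roughly like this:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # launches a real browser, hence the extra resources
try:
    driver.get("http://example.com/js-heavy-page")  # placeholder URL
    # page_source returns the browser's current DOM, after scripts have had
    # a chance to run -- which is the whole point of driving a real browser.
    rendered_html = driver.page_source
finally:
    driver.quit()

text = BeautifulSoup(rendered_html, "html.parser").get_text(separator="\n", strip=True)
print(text[:500])
```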

Should I make it a priority to semantically mark up my pages? Or is the Semantic Web a good idea that will never really get off the ground?

The Semantic Web is an awesome idea, and there are a lot of really cool things that have been done using the Semantic Web concept. But after all this time I am beginning to wonder if it is all just a pipe dream. If we never truly succeed in making a fully semantic web, and if we are not going to be able to use the Semantic Web to provide our users a deeper experience, is it worth spending the time and extra effort to ensure that fully semantic web pages are created by myself or my team?
I know that semantic pages usually just turn out better (more from the attention to detail than anything else, I would think), so I am not questioning attempting semantic page design. What I am currently mulling over is dropping the review-and-revision process of making a partially semantic page fully semantic, in hopes of some return in the future.
On a practical level, some aspects of the semantic web are taking off:
1) Semantic markup helps search engines identify key content and improves keyword results.
2) Online identity is a growing concern, and semantic markup in links, like rel='me', helps to disambiguate these things. Autodiscovery of social connections is definitely upcoming. (Twitter uses XFN markup for all of your information and your friends, for example.)
3) Google (and possibly others) are starting to pay attention to microformats like hCard and hCalendar to gather more information about people and upcoming events. This support is still very new, but these microformats are useful examples of the semantic web in practice.
It may take some time for it all to get there, but there are definite possible benefits. I wouldn't put a huge amount of effort into it these days, but its definitely worth keeping in mind when you're developing a site.
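As a concrete illustration of points 2 and 3 above, here is a minimal sketch of rel='me' plus hCard markup and how such properties might be harvested (BeautifulSoup is assumed for illustration; the profile fragment is invented):

```python
from bs4 import BeautifulSoup

# Invented profile fragment combining XFN's rel="me" with an hCard.
html = """
<div class="vcard">
  <a class="fn url" rel="me" href="http://example.com/jane">Jane Doe</a>
  <span class="org">Example Corp</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find(class_="vcard")
print("Name:", card.find(class_="fn").get_text(strip=True))
print("Org:", card.find(class_="org").get_text(strip=True))
print("Identity link:", card.find("a", rel="me")["href"])
```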
Yahoo and Google have both announced support for RDFa annotations in your HTML content. Check out Yahoo SearchMonkey and Google Rich Snippets. If you care about SEO and driving traffic to your site, these are good ways to get better search engine coverage today.
Additionally, the Common Tag vocabulary is an RDFa vocabulary for annotating and organizing your content using semantic tags. Yahoo and Google will make use of these annotations, and existing publishing platforms such as Drupal 7 are investigating adopting the Common Tag format.
I would say no.
The reason I would say this is that the return for creating a fully semantic web page right now is practically zero. You will have to spend extra time and effort, and there is very little to show for it today.
Effort is not like an investment, however, so doing it now gives you no practical advantage. If the semantic web does start to show potential, you can always revisit it and tap into that potential later.
It should be friendly to search engines, but going further is not going to provide good ROI.
Furthermore, what are you selling? A lot of the purpose behind being semantic, beyond being indexable, is easier third-party integration and data mining (creating those ontologies). Are these desirable traits for your data sets? If you are selling advertising, making it easier for others to pull in your content is probably not going to be helpful.
It's all about where you want to spend your time.
You shouldn't do anything without a requirement. Otherwise, how do you know if you've succeeded? Do you have a requirement for being semantic? How much? How do you measure success? How do you measure return on investment?
Don't do anything just because of fads, unless keeping up with fads is a requirement.
Let me ask you a question - would you live in a house or buy a car that wasn't built according to a spec?
"So is this 4x4 lumber, upheld with a steel T-Beam?"
"Nope...we managed to rig the foundation on on PVC Piping...pretty cool, huh."