Is XML really more semantic than HTML with classes/ids?

I'm coming from an HTML / JavaScript / PHP background and have recently started learning XML.
I was reading this excerpt from "No Nonsense XML Web Development with PHP" which includes this comparison:
<div>
<div>
<h2>Product One</h2>
<p>Product One is an exciting new widget that will simplify your life.</p>
<p><b>Cost: $19.95</b></p>
<p><b>Shipping: $2.95</b></p>
</div>
</div>
Take a good look at this – admittedly simple – code sample from a computer’s perspective. A human can certainly read this document and make the necessary semantic leaps to understand it, but a computer couldn’t. [...]
A computer program (and even some humans) that tried to decipher this document wouldn’t be able to make the kinds of semantic leaps required to make sense of it. The computer would be able only to render the document to a browser with the styles associated with each tag. HTML is chiefly a set of instructions for rendering documents inside a Web browser; it’s not a method of structuring documents to bring out their meaning.
The author then compares this to XML with this:
If the above document were created in XML, it might look a little like this:
<productListing title="ABC Products">
<product>
<name>Product One</name>
<description>Product One is an exciting new widget that will simplify your life.</description>
<cost>$19.95</cost>
<shipping>$2.95</shipping>
</product>
</productListing>
In theory, we should be able to look at any XML document and understand instantly what’s going on. In the example above, we know that a product listing contains products, and that each product has a name, a description, a price, and a shipping cost. You could say, rightly, that each XML document is self-describing, and is readable by both humans and software.
I get the author's point to a degree. Of course a computer would not be able to discern meaning from this HTML; there's no context.
However, I would never expect the HTML to be written in this way. Rather I would expect the HTML to use classes and/or ids to provide the necessary context more like:
<div class="productListing">
<div class="product">
<h2 class="name">Product One</h2>
<p class="description">Product One is an exciting new widget that will simplify your life.</p>
<p class="cost"><b>Cost: $19.95</b></p>
<p class="shipping"><b>Shipping: $2.95</b></p>
</div>
</div>
Given this example, my question is:
Is XML really more semantic than HTML that utilizes classes/ids to provide context to the data it contains?
(Note that I simplified the code examples to avoid TL;DR)

This is an interesting question. I'll give you my two cents.
I jumped onto XML a few years ago when I had to build a dynamic website and my client didn't have access to the database (just FTP access). What I essentially coded was an XML backend and PHP which fetched it through SimpleXML parsing.
In retrospect, I do think XML is semantically richer than HTML. As a comment pointed out above, the HTML class attribute has mostly been a styling construct; I don't personally remember using, or hearing of anyone using, classes or ids for purposes other than CSS/JS-based styles or animations.
The key advantage of XML over HTML with classes was the flexibility to throw it around. For another project, updating values of XML elements from one system and then having them read and displayed by another system made a lot of things smoother. Additionally, the XML parsing libraries offer a number of functions for walking the nodes.
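For example, here is a minimal SimpleXML sketch (not the actual code from that project) that walks the product XML shown in the question:

<?php
// Minimal sketch: the question's product XML parsed with SimpleXML.
$listing = simplexml_load_string(
    '<productListing title="ABC Products">
       <product>
         <name>Product One</name>
         <description>Product One is an exciting new widget.</description>
         <cost>$19.95</cost>
         <shipping>$2.95</shipping>
       </product>
     </productListing>'
);

echo $listing['title'], "\n";               // attribute: ABC Products
foreach ($listing->product as $product) {   // each <product> element
    echo $product->name, ' costs ', $product->cost,
         ' plus ', $product->shipping, " shipping\n";
}

The element names themselves drive the traversal; nothing has to be inferred from classes.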
It's also important to note that XML allows you to define attributes. These can be viewed as something similar to classes and ids in HTML.
Also, let's not forget that RSS feeds are essentially XML, not HTML with more tags.
Therefore, answering your question specifically with respect to semantics, I definitely think XML has the advantage there.
TL;DR: XML is more semantic, in my view.

You are correct that, in terms of just looking at the markup, there is little to no difference between XML's "meaningful" element names and HTML's classes/ids. However, keep in mind that for XML there is a set of technologies and tools that let you easily work with element names. You can write schemas and validate against them. You can compose schemas by using namespaces. You can extract structures with simple XPath expressions. All of this is much harder with the HTML approach.
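As a hedged illustration of that tooling in PHP, assuming the question's product XML is saved as products.xml and a schema exists as products.xsd (both filenames are hypothetical):

<?php
// XPath: pull out structures by element name (sketch; products.xml is hypothetical).
$listing = simplexml_load_file('products.xml');
foreach ($listing->xpath('//product/name') as $name) {
    echo $name, "\n";                            // "Product One"
}

// Schema validation (products.xsd is hypothetical).
$dom = new DOMDocument();
$dom->load('products.xml');
var_dump($dom->schemaValidate('products.xsd'));  // bool(true) if the structure holds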
So if you have requirements to capture and process "meaningful" structures, then XML is your friend. If all you want is a snapshot of something where you can say "this is a product", then there really might not be such a big difference.
My advice would be: if you store and process data through multiple publishing pipelines, XML is very likely the better starting point. If all you want is to capture snapshots that will be delivered to HTML-based consumers, then "semantically enriched" HTML may be the easier way to go.

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as a complement to HTML. I know what they are used for, but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited, I want to save the edits in a separate file, so that the template is still available. (Just so you guys have a little bit of background regarding what I need this for.)
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML: it uses the same magic characters for its markup and the same logic for where to put the data.
The characters < and > are used to separate the markup from the content.
The character & together with an entity name, as in &lt;, is used to encode characters which would otherwise lead to trouble.
elements can contain attributes like <someElement someAttribute="attr value">
elements can contain text or sub elements
The big difference is that XML leaves you absolutely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>); conversely, XML is absolutely strict about structure, while HTML allows a lot of laxity (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read arbitrary HTML as XML, but you can easily create XML which is accepted by any browser as a valid web page.
Your issue cries out for XSLT. This is a method to transform a given XML document into a new format. It allows you, for example, to export your data as XML and create a nice web page from it. Different XSLT stylesheets will present the same data in different ways.
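If you'd rather run the transformation server-side than in the browser, PHP's XSLTProcessor does the same job. A minimal sketch, with a trivial stylesheet inlined as a string (the <items>/<item> element names are made up):

<?php
// Minimal server-side XSLT sketch; requires PHP's xsl extension.
$xml = new DOMDocument();
$xml->loadXML('<items><item>First</item><item>Second</item></items>');

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/items">
    <ul>
      <xsl:for-each select="item">
        <li><xsl:value-of select="."/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($xml);   // an HTML <ul> list built from the XML data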
There are several online tools to test this feature; you might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not at all clear... How you send data to a web application, and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit, and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of in separate XML attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML, part of the larger parser project Marpa, which parses any language that can be described in BNF; it is documented on the author's blog, which is very interesting, but it is much newer and more experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil), you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, though I'm not sure about tag properties) or even Regexp::Grammars, but again this means reinventing the wheel somewhat; I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear: is the Perl parsing for the purpose of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT, which is designed to transform XML documents?

What are the advantages of creating web pages with XML instead of HTML?

From time to time, I see web pages whose content is solely written in XML (not HTML or XHTML). These pages usually have some style sheets (either XSLT or CSS) attached to them which makes them look like any other ordinary web page.
My question is, what are the advantages of such an approach (if any), and why would anyone choose to work this way?
EDIT: If this is a good thing, why is it not widespread?
EDIT 2: Thanks everyone for the great responses. They really enlightened me. I also found this question whose content is also related.
It's easier to generate it programmatically and to reuse it for purposes other than displaying it as a web page.
Update:
EDIT: If this is a good thing, why is it not widespread?
Not everyone needs to generate it programmatically or reuse it for purposes other than displaying it as a web page. For them, it's simply easier to use plain HTML.
One possible advantage would be using the page's data in something other than a web browser; that would (presumably) be easier to do if the page's content were well-formed XML. Of course, in theory a well-formed, semantic XHTML page should be nearly as parseable.
It can also be easier to generate XML instead of XHTML, depending on the data source.
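As a rough sketch of what "generate" means here (all names below are made up), building XML from application data is just a short loop, and the same document can then feed a web page, a feed, or another system:

<?php
// Sketch: emit XML straight from application data (all names are made up).
$products = [
    ['name' => 'Product One', 'cost' => '19.95'],
    ['name' => 'Product Two', 'cost' => '24.95'],
];

$xml = new SimpleXMLElement('<productListing/>');
foreach ($products as $p) {
    $node = $xml->addChild('product');
    $node->addChild('name', $p['name']);   // note: addChild does not escape '&',
    $node->addChild('cost', $p['cost']);   // which is fine for this simple data
}
echo $xml->asXML();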
When you are getting XML data into your system and you are supposed to present that data, it is much easier to write some XSLT for the XML than to run it through some sort of parser and then present the data.
That can be a valid argument for using XML instead of XHTML or HTML.
Update
To answer your question on why this is not widespread: because XSLT is tedious and hard to work with, specifically XPath, which some people find quite difficult to use.
Those pages use XSLT to get rendered on the client side. Not every browser (especially older ones) supports rendering XML + XSLT. XML can, however, be used server-side as a template and be transformed to HTML by the application running on the server. I personally don't see any advantages to this approach.
There are a lot more web pages that are written solely in XML than you know. You're only seeing the ones that do the XSLT transformation on the client side. Server-side transformation of XML is not at all unusual, because there's a plethora of things that produce data in XML, and transforming XML to HTML in XSLT is straightforward. You'll never know this is happening if you just look at the HTML, which bears no signs of having been generated via XSLT.
Personally, I don't understand it either, though one of the biggest problems is support in IE. I created a skeleton e-commerce site serving XML, transformed by XSLT and styled using CSS. I sorely missed the ability to use XLink and other wonderful XML features. It's also nice to be able to tag the data for what it is: I used a 'menu' tag for the restaurant menus, 'price' tags for prices, and so on. If a user clicked a link to change menus, all I had to do was send the name of the item, the price, and the description instead of the complete page. IIRC, an HTML menu page of 4 KB or more came to only about 200 bytes of sent data.
As far as the "one error makes everything crash in XML" type of comment goes, the same is true of any programming language, so proper coding should be no bother for programmers and careful HTML/CSS types.
Before anyone says that what I did was actually XHTML... no. I served XML. I did call up XHTML namespaces when needed for links, images, and HTML-type things, but only when necessary.

Convert doc/docx to semantic HTML

I would like to convert doc/docx documents to semantic HTML.
Some wishes/requirements:
Semantic HTML such that headers in the document are <h1>, <h2> etc., tables are <table> and so forth.
Should preferably be able to handle headings, lists, tables and images. Graphs and math formulas are a nice extra.
Doesn't have to be converted straight from doc/docx to HTML; it could use an intermediary format, such as XML or DocBook.
Should work programmatically, and with a large number of documents.
The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs, a small user base, and it can't handle a lot of documents. It's more of a proof of concept.
" headers in the document are "
I think this is impossible.
Because MS Word only write down the result, with different styles of <p>
just like printed text on paper, the original info are not recorded.
Your other wishes could be approached.
There are two commercial tools that can do this (don't trust the free tools or online tools; they don't do the real work):
1. Word Cleaner by Zapadoo (www.zapadoo.com)
2. HTML Cleaner for Word by Wonder Studio (www.htmlcleaner.com)
I prefer the second one, which was released just last year. You can try them both.
There's a tool called upCast which is able to convert Word documents into XML.
docx4j (for docx only, not doc) writes clean HTML output. You'd need to change things a bit if you wanted <h1> instead of <p class="h1">, but it's open source, so you can do that.
I wrote a utility which implements the requirements you listed, excluding images, graphs and maths formulas. It's beta quality (i.e., it works on my machine). I published it at http://www.modeltext.com/word
Just one more idea: use Gmail to convert Word docs:
http://www.oreillynet.com/mac/blog/2006/05/use_gmail_to_convert_word_docs.html

What is semantic markup, and why would I want to use that?

Like it says.
Using semantic markup means that the (X)HTML code you use in a page contains metadata describing its purpose -- for example, an <h2> that contains an employee's name might be marked with class="employee-name". Originally, some people hoped that search engines would use this information, but as the web has evolved, semantic markup has mostly been used to provide hooks for CSS.
With CSS and semantic markup, you can keep the visual design of the page separate from the markup. This results in bandwidth savings, because the design only has to be downloaded once, and easier modification of the design because it's not mixed in to the markup.
Another point is that the elements used should have a logical relationship to the data contained within them. For example, tables should be used for tabular data, <p> should be used for textual paragraphs, <ul> should be used for unordered lists, etc. This is in contrast to early web designs, which often used tables for everything.
Semantics literally means using "meaningful" language; in Web Development, this basically means using tags and identifiers which describe the content.
For example, applying IDs such as #Navigation, #Header and #Content to your <div> tags, rather than #Left and #Main, or using unordered lists for a list of navigational links, rather than a table.
The main benefits are in future maintenance; you can easily change the layout or the presentation without losing the meaning of your content. Your navigation bar can move from the left to the right, or your links can be displayed horizontally rather than vertically, without losing the meaning.
From http://www.digital-web.com/articles/writing_semantic_markup/ :
semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information.
Besides the already-mentioned goal of allowing software to 'understand' the data, there are more practical applications: using it to translate between ontologies, or to map between dissimilar representations of data, without having to translate or standardize the data itself (which can result in a loss of information, and typically prevents you from improving your understanding in the future).
There were at least two sessions at OSCON this year related to the use of semantic technologies. One was on BigData (slides are available at http://en.oreilly.com/oscon2008/public/schedule/proceedings); the other was from the guys at FreeBase.
BigData was using it to map between two dissimilar data models (including the use of query languages specifically created for working with semantic data sets). FreeBase is mapping between different data sets and then performing further analysis to derive meaning across those data sets.
Related topics to look into: OWL, OQL, SPARQL, Franz (AllegroGraph, RacerPRO and TopBraid).
Here is an example of an HTML5, semantically tagged website that I've been working on; it uses the recently adopted microdata vocabulary specified at http://schema.org along with the new, more semantic tagging elements of HTML5.
http://blog-to-book.com/view/stuff/about/semantic%20web
Google has a handy semantic tagging test tool that will show you how adding semantic tags to content enables search engines to 'understand' far more about your web pages.
Here is the test tool: http://www.google.com/webmasters/tools/richsnippets?url=http%3A%2F%2Fblog-to-book.com%2Fview%2Fstuff%2Fabout%2Fsemantic+web&view=
Notice how Google now knows that the 'things' on the page are books, and that they have an ISBN-13 identifier. Adding further metadata, such as price and author, enables further inferences to be made.
Hope this points you in some interesting directions. More detailed semantic tagging can be achieved using the Good Relations Ontology which is pretty much the most comprehensive I can think of right now.