Alternative to HTML standard for expressing static documents content - html

The content tends to be mixed with it's form when expressed as a HTML+CSS+JS document. Almost every modern website requires CSS and/or JavaScript to be readable. Most of them are not easy to parse automatically because they relay on web browser to render it. Sections of the document are defined using visual clues, colors and formatting. One can use HTML5 tags like <article> but those are not a part of any bigger structure as far as I know, and still can contain non-content elements.
Websites are basically apps or clients.
Is there any standard that can be used to serve content of a website that has a well defined schema? An API for websites that could be used to express content in the form that is easy to server, parse, store, cryptographically sign...
I'm aware of formats like XML and JSON but I have not managed to find any standardized way to express a blog post as a JSON document.
An example of what I have in mind:
This question can be fetched as an JSON document using Stackexchange API. The result is machine readable and easy to parse but is in not standardized. It reflects details of Stackexchange specific data structures. Other QA website will have different API, with different structure and formats even though both have questions and answers.

There are two important standards out there dealing with the semantic aspect of a web page, like the one you are looking for. Microdata and RDFa. With their aid, you can pick a certain open vocabulary to describe your data or create your own based on them.
With JSON-LD also, you can create a schema for JSON documents like the XML schema is for the XML documents.

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited I want to save the edits on a separate file, so that the template is still available.(Just so you guys have a little bit of background regarding what I need this for).
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML, as it uses the same magic characters for its markup and the same logic where to put the data.
The characters <> are used to separate the markups from the content.
The character & together with an entity code like < is used to encode characters, which would lead to troubles otherwise
elements can contain attributes like <someElement someAttribute="attr value">
elements can contain text or sub elements
The big difference is, that XML is absolutely free how you name your elements and attributes, while HTML relys on dedicated names (like <body>), whereas XML is absolutely strict in structure while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read HTML as XML, but you can easily create XML which is taken by any browser as a valid web page.
Your issue cries for XSLT. This is a method to transform a given XML into a new format. This allows for example, to export your data as XML and create a nice web page from it. Different XSLT will present the same data in different ways.
There are several online tools to test this feature. you might have a look here.
Your statement After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data is not all clear... How you send data (to a web application), and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using Java Script to read, edit and send it back.
XML -> XSLT - HTML is often seen to create (rather static) reports for a web viewer

What is the difference between html and xml?

Blogger blogs use XML, and Tumblr blogs use HTML, but they both do the same thing. What is the difference between XML and HTML and what are the pros and cons of using them?
HTML is a markup language for describing hypertext ("with links") documents.
XML is a generic markup language designed for using as a base for building custom markup languages.
They are different tools for different jobs.
When you write a template for either Blogger or Tumblr, you are using a template language. Blogger's template language happens to be built in XML. The template language is combined with your data to generate the HTML documents that are sent to the browser.
XML was developed to describe data and to focalize on what the data represent.
HTML was developed to display data about to focalize on the way that data looks.
HTML is about displaying data, XML is about describing information.
Reference
In an ideal world, HTML would be just one of many XML vocabularies. The only reason it isn't is history: (a) it was wild on the web before XML was formulated, and (b) it developed a culture where you could write any old garbage and the browsers would try and make sense of it, whereas XML insisted that you stick to the rules.
The W3C spent about ten years trying to make this vision happen, through the medium of XHTML, but in the end the browser vendors decided to go their own way.
XML was developed to describe data and to focalize on what the data represents.
Whereas HTML describes the structure of Web pages using markup.
In other words, HTML is used to display data, XML is used to store and transport data.

REST, hypertext and non-browser clients

I am confused on how a REST api can both be hypertext driven, but also machine readable. Say I design an API and some endpoint lists contents of a collection.
GET /api/companies/
The server should return a list with company resources e.g:
/api/companies/adobe
/api/companies/microsoft
/api/companies/apple
One way would be to generate a hypertext (html) page with <a> links to the respective resources. However I would also like to make it easy for a non-browser client to read this list. For example, some client might want to populates a dropdown gui with companies. In this case returning a html document is inappropriate, and it might be better to return a list in JSON or XML format.
It is not clear to me how REST style can satisfy both. Is there a practical solution or examples of a REST api that is both nice to browsers and non-browser clients?
What you're looking for is nowadays referred to as HATEOAS API's. See this question for examples: Actual examples for HATEOAS (REST-architecture)
The ReST architectural style, as originally defined by Roy Fielding, prescribes "hypertext as the engine for application state" as one of the architectural contraints. However, this concept got more or less "lost in translation" when people started equaling "RESTful API's" with "using the HTTP verbs right" (plus a little more, if you're lucky). (Edit: Providing credence for my assertion are the first and second highest-ratest answers in What exactly is RESTful programming? . The first talks only about HTTP verbs).
Some thoughts on your question: (mainly because the subject keeps fascinating me)
In HATEOAS, standardized media types with precise meaning are very important. It's generally thought to be best to reuse an existing media type when possible, to benefit from general understanding and tooling around this. One popular method is using XML, because it offers both general structure for data and a way to define semantics, i.e. through an XML schema or with namespaces. XML in and by itself is more or less meaningless when considering HATEOAS. The same applies for JSON.
For supporting links, you want to choose a media type that either supports links "natively" (i.e. text/html, application/xhtml+xml) or a media type that allows defining what pieces in the document must be interpreted as links through some embedded metadata, such as XML can with for example XLINK. I don't think you could use application/json because JSON by itself has no pre-defined place to define metadata. I do think that it would be possible to design a media type based on json - call it application/x-descriptive-json - that defines up-front that the JSON document returned must consist of a "header" and "body" property where header may contain further specified metadata. You could also design a media type for JSON just to support embedded links. Simpler media type, less extenisble. I wouldn't be surprised if both things I describe already exist in some form.
To be both nice to browsers and non-browser clients, all it takes is respecting the Accept header. You must assume that a client who asks for text/html is truly happy with text/html. This could be an argument for not using text/html as the media type for your non-browser API entry point. In principle, I think it could work though if the only thing you want is links. Good HTML markup can be very well consumed by non-browser clients. HTML also defines way to do paging, through rel="next", rel="previous".
The three biggest problems of a singular media type for both browsers and non-browsers I see are:
you must ensure all your site html is outputted with non-browser consumption in mind, i.e. embed sufficient metadata. Perhaps add hidden links in some places. It's a bit comparable from thinking about accessibility for visually impaired people: Though now, you're designing for a consumer who cannot read English, or any natural language for that matter. :)
there may be lots of markup and content that may essentially irrelevant to a non-browser client. Think of repeating header and footer text, navigation area's that kind of things.
HTML may simply lack the expressiveness you need. In principle, as soon as you go "think up" some conventions specific to your site (i.e. say rel="original-image means the link to the full-size, original image), then you're not doing strictly HATEOS anymore (at least, that's my understanding). HTML leaves no room for defining new meaning to elements. XML does.
A work-around to problem three might be using XHTML, since XHTML, by the virtue of being XML, does allow specifying new kinds of elements through namespaces.
I see #robert_b_clarke mentioning Microformats, which is relevant in this discussion. It's indeed one way of trying to improve accessibility for non-human agents. The main issue with this from a technical point of view is that it essentially relies on "out-of-band" information. Microformats are not part of the text/html spec. In a way, it's comparable to saying: "Hey, if I say that there's a resource with type A and id X, you can access it at mysite.com/A/X." The example I gave with rel=original-image could be called a micro-format as well. But it is a way to go. "State in your API docs: We serve nicely formatted text/html. Our text/html also embeds the following microformats: ..." You can even define your own ones.
I think the following presentation as a nice down-to-earth explanation of HATEOAS:
http://www.slideshare.net/apigee/hateoas-101-opinionated-introduction-to-a-rest-api-style
Edit:
I only now read about HTML5 microdata (because of #robert_b_clarke). It seems like HTML5 does provide a way for supplying additional information beyond what's possible with standard HTML tags. Consider what I wrote dated. :) Edit edit: It's only a draft, phew. ;)
Edit 2
Re a "descriptive JSON" format: This has just been announced http://jsonapi.org/ . They have applied for their own mime type. It's by Yehuda Katz (Ember.js) and Steve Klabnib, who's writing Designing Hypermedia API's.
The HTTP Accept header can be used by clients to request a response in a specific content type. For example, your REST API clients might request JSON data using the following header:
GET http://yourdomain.com/api/companies
Accept: application/json
So your server app can then serve JSON or HTML for the same URL depending on the value of the Accept header. Of course all your REST client apps will have to include that header, which may or may not be practical.
There are numerous alternative approaches, one of which is to serve the same XHTML content to both browsers and client apps. You can use HTML5 microdata or Microformats to embed structured data within HTML. That approach has a number of limitations. API client requests will result in larger, more complicated responses than necessary as they will include a load of stuff that's only usable by a web browser. There are also other differences in behaviour you might like to enforce. For instance you would probably want an unauthorized GET request for a protected resource to result in an HTTP 401 response for a machine client, and a redirect to login page for a web browser.
You may find that the easiest way is to be less principled and serve the human friendly and machine friendly versions of your resources through separate URLs
http://yourdomain.com/companies
http://yourdomain.com/api/companies
I've seen this question answered several ways. Some developers add a request parameter to indicate the format of the response, as in /api/companies/?rtnType=json. This method may be acceptable in a small application. It is a departure from true RESTful theology though.
The better way (in Java at least) is to use something like the Spring Framework. Spring can provide dynamic response formatting based on the media type in the HTTP request. The book "Spring in Action" (Walls, 2011) has an excellent explanation of this in chapter 11. And there are similar ways to accomplish dynamic response formatting in other languages without breaking REST.

Datapower - To parse HTML

I have a situation where the underlying application provides a UI layer and this in turn has to be rendered as a portlet. However, I do not want all parts of the UI originally presented to be rendered in Portlet.
Proposed solution: Using Datapower for parsing an XML being a norm, I am wondering if it is possible to parse a HTML. I understand HTML may not be always well formed. But if there are very few HTML pages in underlying application, then a contract can be enforced..
Also, if one manages to parse and extract data out of HTML using DP, then the resultant (perhaps and XML) can be used to produce HTML5 with all its goodies.
So question: Is it advisable to use Datapower to parse an HTML page to extract an XML out of it? Prerequisite: number of HTML pages per application could vary in data but not with many pages.
I suspect you will be unable to parse HTML using DataPower. DataPower can parse well-formed XML, but HTML - unless it is explicitly designed as xHTML - is likely to be full of tags that break well-formedness.
Many web pages are full of tags like <br> or <ul><li>Item1<li>Item2<li>Item3</ul>, all of which will cause the parsing to fail.
If you really want to follow your suggested approach, you'll probably need to do something on a more flexible platform such as WAS where you can build (or reuse) a parser that takes care of all of that for you.
If you think about it, this is what your web browser does - it has all the complex rules that turn badly-formed XML tags (i.e. HTML) into a valid DOM structure. It sounds like you may be better off doing manipulation at the level of the DOM rather than the HTML, as that way you can leverage existing, well-tested parsing solutions and focus on the structure of the data. You could do this client-side using JavaScript or you could look at a server-side JavaScript option such as Rhino or PhantomJS.
All of this might be doing things the hard way, though. Have you confirmed whether or not the underlying application has any APIs or web services that IT uses to render the pages, allowing you to get to the data without the existing presentation layer getting in the way?
Cheers,
Chris
Question of parsing and HTML page originates when you want to do some processing over it. If this is the case you can face problems because datapower by default will not allow hyperlinks inside the well formed XML or HTML document [It is considered to be a security risk], however this can be overcome with appropriate settings in XML manager present.
As far as question of HTML page parsing is concerned, Datapower being and ESB layer is expected to provide message format translation and that it indeed does. So design wise it is a good place to do message format translation. Practically however you will face above mentioned problem when you try to parse HTML as XML document.
The parsing can produce any message format model you wish [theoretically] hence you can use the XSLT to achieve what you wish.
Ajitabh

What are the advantages of creating web pages with XML instead of HTML?

From time to time, I see web pages whose content is solely written in XML (not HTML or XHTML). These pages usually have some style sheets (either XSLT or CSS) attached to them which makes them look like any other ordinary web page.
My question is, what are the advantages of such an approach (if any), and why would anyone choose to work this way?
EDIT: If this is a good thing, why is it not widespread?
EDIT 2: Thanks everyone for the great responses. They really enlightened me. I also found this question whose content is also related.
It's easier to generate it programmatically and reuse it for other purposes than displaying as webpage.
Update:
EDIT: If this is a good thing, why is it not widespread?
Not everyone needs to generate it programmatically or reuse it for other purposes than displaying as webpage. It's then easier to use plain HTML.
One possible advantage would be for use of the data of the page in something other than a web browser; that would (presumably) be easier to do if a page's content were well-formed XML. Of course in theory a well-formed, semantic XHTML page should be nearly as able to be parsed, as well.
It can also be easier to generate XML instead of XHTML, depending on the data source.
When you are getting XML data in to your system, and you are supposed to present this XML data then it is much easier to write some XSLT for that XML instead of parsing it using some sort of parser and then presenting the data.
That can be a valid point for using XML instead of XHTML or HTML
Update
To answer your question on why this is not widespread, is because XSTL is tedious and hard to work with. Specifically XPath, which can be for some people quite difficult to use.
Those pages use XSLT to get rendered on the client side. Not every browser (especially older ones) supports rendering XML + XSLT. XML can however be used server-side as template and get transformed to HTML by the application running on the server. I personally don't see any advantages to this approach.
There are a lot more web pages that are written solely in XML than you know. You're only seeing the ones that do the XSLT transformation on the client side. Server-side transformation of XML is not at all unusual, because there's a plethora of things that produce data in XML, and transforming XML to HTML in XSLT is straightforward. You'll never know this is happening if you just look at the HTML, which bears no signs of having been generated via XSLT.
Personally, I don't understand it either though one of the biggest problems is support in IE. I created a skeleton ecommerce site serving XML, transformed by XSLT and styled using CSS. I sorely missed the ability to use XLink and other wonderful XML features. It's also nice to be able to tag the data for what it is. I used a 'menu' tag for the restaurant menus. 'price' tags for prices and so on. If a user clicked on a link to change menus, all I had to do was send the name of the item, the price and the description instead of the complete page. iirc, a 4K or more HTML menu page was only 200 bytes of sent data.
As far as the "one error makes everything crash in XML" type comments, the same is true of any programming language so proper coding should be no bother for programmers and careful HTML/CSS types.
Before anyone says that what I did was actually XHTML...no. I served XML. I did call up XHTML namespaces when needed for links, images and HTML type things but only when necessary.