Datapower - To parse HTML - html

I have a situation where the underlying application provides a UI layer, and this in turn has to be rendered as a portlet. However, I do not want every part of the original UI to be rendered in the portlet.
Proposed solution: since parsing XML is routine for DataPower, I am wondering whether it can also parse HTML. I understand that HTML is not always well formed, but if the underlying application has only a few HTML pages, a contract can be enforced.
Also, if one manages to parse and extract data from HTML using DP, the result (perhaps an XML document) can be used to produce HTML5 with all its goodies.
So the question is: is it advisable to use DataPower to parse an HTML page and extract XML from it? Prerequisite: the HTML pages of an application may vary in data, but there are not many of them.

I suspect you will be unable to parse HTML using DataPower. DataPower can parse well-formed XML, but HTML - unless it is explicitly designed as xHTML - is likely to be full of tags that break well-formedness.
Many web pages are full of tags like <br> or <ul><li>Item1<li>Item2<li>Item3</ul>, all of which will cause the parsing to fail.
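To make the failure concrete, here is a minimal sketch in Python (standard library only, not DataPower). It feeds the list snippet above to a strict XML parser and to a tolerant HTML tokenizer. The helper names `parses_as_xml` and `list_items` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<ul><li>Item1<li>Item2<li>Item3</ul>"


def parses_as_xml(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ItemCollector(HTMLParser):
    """Tolerant HTML tokenizer that records the text following each <li>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def list_items(markup: str) -> list:
    collector = ItemCollector()
    collector.feed(markup)
    collector.close()
    return collector.items


print(parses_as_xml(SNIPPET))  # the strict parser rejects the unclosed <li> tags
print(list_items(SNIPPET))     # the tolerant tokenizer still finds the items
```

The strict parser gives up at the first unclosed tag, which is roughly what an XML-only pipeline would do with typical HTML.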
If you really want to follow your suggested approach, you'll probably need to do something on a more flexible platform such as WAS where you can build (or reuse) a parser that takes care of all of that for you.
If you think about it, this is what your web browser does - it has all the complex rules that turn badly-formed XML tags (i.e. HTML) into a valid DOM structure. It sounds like you may be better off doing manipulation at the level of the DOM rather than the HTML, as that way you can leverage existing, well-tested parsing solutions and focus on the structure of the data. You could do this client-side using JavaScript or you could look at a server-side JavaScript option such as Rhino or PhantomJS.
All of this might be doing things the hard way, though. Have you confirmed whether or not the underlying application has any APIs or web services that it uses to render the pages, allowing you to get to the data without the existing presentation layer getting in the way?
Cheers,
Chris

The question of parsing an HTML page arises when you want to do some processing on it. If that is the case, you can run into problems, because DataPower by default does not allow hyperlinks inside a well-formed XML or HTML document [this is considered a security risk]; however, this can be overcome with appropriate settings in the XML manager.
As far as HTML page parsing itself is concerned, DataPower, being an ESB layer, is expected to provide message format translation, and it does. So design-wise it is a good place to do message format translation. In practice, however, you will face the problem mentioned above when you try to parse HTML as an XML document.
The parsing can produce any message format model you wish [theoretically], so you can use XSLT to achieve what you want.
Ajitabh

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited I want to save the edits on a separate file, so that the template is still available.(Just so you guys have a little bit of background regarding what I need this for).
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML, as it uses the same magic characters for its markup and the same logic where to put the data.
The characters <> are used to separate the markups from the content.
The character & introduces an entity code like &lt; which encodes characters that would otherwise cause trouble
elements can contain attributes like <someElement someAttribute="attr value">
elements can contain text or sub elements
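As a small illustration of the entity rule in the list above, here is a sketch using Python's standard library. The sample string is invented; `escape` and `unescape` come from `xml.sax.saxutils`.

```python
from xml.sax.saxutils import escape, unescape

# Text that contains the "magic characters" of XML and HTML markup.
raw = 'price < 10 & "fresh"'

# escape() replaces &, < and > with entity codes so the text is safe as content.
encoded = escape(raw)

# unescape() reverses the encoding.
decoded = unescape(encoded)

print(encoded)  # price &lt; 10 &amp; "fresh"
print(decoded)
```

The same encoding works in both XML and HTML, which is part of why the two look so similar.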
The big difference is that XML is completely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>); on the other hand, XML is absolutely strict in structure, while HTML allows a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read HTML as XML, but you can easily create XML which is taken by any browser as a valid web page.
Your issue cries for XSLT. This is a method to transform a given XML into a new format. This allows for example, to export your data as XML and create a nice web page from it. Different XSLT will present the same data in different ways.
There are several online tools to test this feature. You might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not entirely clear... How you send data to a web application, and how you send the manipulated data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.
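To show the "same data, different presentations" idea without an XSLT processor, here is a rough sketch that swaps in plain Python (`xml.etree.ElementTree`) for the stylesheet. It is not XSLT, only the same XML-in, HTML-out pattern. The `form` and `field` element names and the sample values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical data document; element and attribute names are made up.
DATA_XML = """
<form title="Contact">
  <field name="email">a@example.com</field>
  <field name="city">Oslo</field>
</form>
"""


def render_as_table(xml_text: str) -> str:
    """Build an HTML table from the data, one row per <field>."""
    data = ET.fromstring(xml_text)
    table = ET.Element("table")
    ET.SubElement(table, "caption").text = data.get("title")
    for field in data.findall("field"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "th").text = field.get("name")
        ET.SubElement(row, "td").text = field.text
    return ET.tostring(table, encoding="unicode", method="html")


def render_as_list(xml_text: str) -> str:
    """Present the same data differently, as a definition list."""
    data = ET.fromstring(xml_text)
    dl = ET.Element("dl")
    for field in data.findall("field"):
        ET.SubElement(dl, "dt").text = field.get("name")
        ET.SubElement(dl, "dd").text = field.text
    return ET.tostring(dl, encoding="unicode", method="html")


print(render_as_table(DATA_XML))
print(render_as_list(DATA_XML))
```

The data file stays untouched. Only the rendering function changes, which is the role a stylesheet would play in a real XSLT setup.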

HTML alternatives to make website?

I am using HTML to make websites. I know of some alternative markup languages for websites: XHTML, WML. Are there any more markup languages? Can I make a website with only XML or SGML?
Thank you for responses.
There aren't many.
To make your website viewable, you must provide it in some language the browser can understand. Only plain text and HTML work universally; there is quite some SVG and PDF support these days, but if you want to do all the things most people expect a website to do, you will have to use HTML or XHTML in some way, either generated through JS or by some templating system. As far as I'm aware, you'll have to use it to generate what's commonly accepted as being a web page.
That being said, there are some languages like Haml, which can be 'compiled' to HTML so you could use that instead. There are also converters for other XML-based languages and such.
If you want to deliver XML from your server, you can convert it to HTML in the browser using either
(a) XSLT 1.0: nearly all browsers have built in support for XSLT 1.0, which can be invoked using the xml-stylesheet processing instruction embedded in the XML
(b) XSLT 3.0: supported using Saxon-JS, which can be invoked using a small Javascript call in a skeletal HTML page.
(c) CSS: if the XML is reasonably close in structure to what you want to present to the user, you can attach styling properties to the XML elements using CSS.
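For options (a) and (c), the link between the XML and its stylesheet is a single `xml-stylesheet` processing instruction at the top of the document. Here is a minimal sketch of producing such a document. The file names `menu.xsl` and `menu.css`, the `menu` and `item` elements, and the helper `with_stylesheet` are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def with_stylesheet(root: ET.Element, href: str, kind: str = "text/xsl") -> str:
    """Serialize root with an xml-stylesheet processing instruction in front.

    The result is a str; encode it as UTF-8 when writing it out so that it
    matches the declared encoding.
    """
    pi = f'<?xml-stylesheet type="{kind}" href="{href}"?>'
    body = ET.tostring(root, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{pi}\n{body}'


# Hypothetical content document.
menu = ET.Element("menu")
ET.SubElement(menu, "item", price="4.50").text = "Soup"

xslt_doc = with_stylesheet(menu, "menu.xsl")             # option (a): XSLT 1.0
css_doc = with_stylesheet(menu, "menu.css", "text/css")  # option (c): CSS

print(xslt_doc)
```

The XML content is identical in both cases. Only the processing instruction tells the browser how to present it.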
You can also of course maintain your content in XML and convert it to rendered HTML using XSLT either at publishing time, or when each page is requested using code on the server to do the conversion on demand.
Many scientific journals use XML (specifically JATS XML) for publishing scientific articles. In this case the XML is transformed to HTML either on the server or on the client side by JavaScript. As an example you can look here, where a client-side transformation takes place. But Google will not index such XML.

What are the advantages of creating web pages with XML instead of HTML?

From time to time, I see web pages whose content is solely written in XML (not HTML or XHTML). These pages usually have some style sheets (either XSLT or CSS) attached to them which makes them look like any other ordinary web page.
My question is, what are the advantages of such an approach (if any), and why would anyone choose to work this way?
EDIT: If this is a good thing, why is it not widespread?
EDIT 2: Thanks everyone for the great responses. They really enlightened me. I also found this question whose content is also related.
It's easier to generate it programmatically and reuse it for other purposes than displaying as webpage.
Update:
EDIT: If this is a good thing, why is it not widespread?
Not everyone needs to generate it programmatically or reuse it for other purposes than displaying as webpage. It's then easier to use plain HTML.
One possible advantage would be for use of the data of the page in something other than a web browser; that would (presumably) be easier to do if a page's content were well-formed XML. Of course in theory a well-formed, semantic XHTML page should be nearly as able to be parsed, as well.
It can also be easier to generate XML instead of XHTML, depending on the data source.
When you are getting XML data in to your system, and you are supposed to present this XML data then it is much easier to write some XSLT for that XML instead of parsing it using some sort of parser and then presenting the data.
That can be a valid point for using XML instead of XHTML or HTML
Update
To answer your question on why this is not widespread: it is because XSLT is tedious and hard to work with, specifically XPath, which some people find quite difficult to use.
Those pages use XSLT to get rendered on the client side. Not every browser (especially older ones) supports rendering XML + XSLT. XML can however be used server-side as template and get transformed to HTML by the application running on the server. I personally don't see any advantages to this approach.
There are a lot more web pages that are written solely in XML than you know. You're only seeing the ones that do the XSLT transformation on the client side. Server-side transformation of XML is not at all unusual, because there's a plethora of things that produce data in XML, and transforming XML to HTML in XSLT is straightforward. You'll never know this is happening if you just look at the HTML, which bears no signs of having been generated via XSLT.
Personally, I don't understand it either though one of the biggest problems is support in IE. I created a skeleton ecommerce site serving XML, transformed by XSLT and styled using CSS. I sorely missed the ability to use XLink and other wonderful XML features. It's also nice to be able to tag the data for what it is. I used a 'menu' tag for the restaurant menus. 'price' tags for prices and so on. If a user clicked on a link to change menus, all I had to do was send the name of the item, the price and the description instead of the complete page. iirc, a 4K or more HTML menu page was only 200 bytes of sent data.
As far as the "one error makes everything crash in XML" type comments, the same is true of any programming language so proper coding should be no bother for programmers and careful HTML/CSS types.
Before anyone says that what I did was actually XHTML...no. I served XML. I did call up XHTML namespaces when needed for links, images and HTML type things but only when necessary.

Win32: How to scrape HTML without regular expressions?

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions - yet doesn't give an alternative.
I want to scrape search results, extracting values:
<div class="used_result_container">
...
...
<div class="vehicleInfo">
...
...
<div class="makemodeltrim">
...
<a class="carlink" href="[Url]">[MakeAndModel]</a>
...
</div>
<div class="kilometers">[Kilometers]</div>
<div class="price">[Price]</div>
<div class="location">
<span class='locationText'>Location:</span>[Location]
</div>
...
...
</div>
...
...
</div>
...and it repeats
You can see the values I want to extract, [enclosed in brackets]:
Url
MakeAndModel
Kilometers
Price
Location
Assuming we accept the premise that parsing HTML:
generally a bad idea
rapidly devolves into madness
What's the way to do it?
Assumptions:
native Win32
loose html
Assumption clarifications:
Native Win32
.NET/CLR is not native Win32
Java is not native Win32
perl, python, ruby are not native Win32
assume C++, in Visual Studio 2000, compiled into a native Win32 application
Native Win32 applications can call library code:
copied source code
DLLs containing function entry points
DLLs containing COM objects
DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects
Loose HTML
xml is not loose HTML
xhtml is not loose HTML
strict HTML is not loose HTML
Loose HTML implies that the HTML is not well-formed xml (strict HTML is not well-formed xml anyway), and so an XML parser cannot be used. In reality I was presenting the assumption that any HTML parser must be generous in the HTML it accepts.
Clarification#2
Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node that is a class of used_result_container, which has a child DIV of class of vehicleInfo. But the nodes don't necessarily have to be direct children of one another.
It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?
And would I not be writing a regular expression parser for DOM nodes? I'd be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.
I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:
walking a DOM tree
xpath queries
Clarification#3
The sample HTML I provided I trimmed down to the important elements and attributes. The mechanism I used to trim the HTML down was based on my internal bias that uses regular expressions. I naturally think that I need various "sign-posts" in the HTML that I look for.
So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.
Update 4
The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?
Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.
In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain vehicleInfo as a child are not relevant.
Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:
\\div[@class="used_result_container" && .\div[@class="vehicleInfo"]]\*
Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.
Python:
lxml - faster, perhaps better at parsing bad HTML
BeautifulSoup - if lxml fails on your input try this.
Ruby: (heard of the following libraries, but never tried them)
Nokogiri
hpricot
Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)
Edit: Your post has really grown since it first came out... I'll try to answer what I can.
I don't think XPath can select higher level nodes based on criteria of lower level nodes:
It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
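In full XPath 1.0 the same selection can also be written with a nested predicate, `//div[@class='used_result_container'][.//div[@class='vehicleInfo']]`, which keeps only containers with a vehicleInfo div somewhere below them. As a rough sketch of that idea using only Python's standard library (not lxml), the code below walks a well-formed stand-in for the page. This is an assumption for the sketch: loose HTML would need a tolerant parser first. The `wrapper` and `advert` classes are invented.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page; real pages would need a tolerant parser.
PAGE = """
<body>
  <div class="used_result_container">
    <div class="wrapper">
      <div class="vehicleInfo">
        <div class="price">[Price]</div>
      </div>
    </div>
  </div>
  <div class="used_result_container">
    <div class="advert">no vehicle here</div>
  </div>
</body>
"""


def has_class(element, name):
    """True if the element's class attribute lists the given class name."""
    return name in element.get("class", "").split()


def containers_with_vehicle(root):
    """Return containers that have a vehicleInfo div anywhere below them."""
    matches = []
    for div in root.iter("div"):
        if not has_class(div, "used_result_container"):
            continue
        # iter() also yields div itself, which is harmless here because
        # a container does not carry the vehicleInfo class.
        if any(has_class(d, "vehicleInfo") for d in div.iter("div")):
            matches.append(div)
    return matches


root = ET.fromstring(PAGE)
found = containers_with_vehicle(root)
prices = [div.findtext(".//div[@class='price']") for div in found]
print(len(found), prices)
```

The vehicleInfo div does not have to be a direct child of the container, which matches the requirement in the question.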
how then do you access repeating structures of data?
It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad, you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...
I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.
Native Win32
You can always use IHTMLDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).
Use Html Agility Pack for .NET
Update
Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces
Use Beautiful Soup.
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.
If you are really under Win32 you can use a tiny and fast COM object to do it
example code with vbs:
Set dom = CreateObject("htmlfile")
dom.write("<div>Click for <img src='http://www.google.com/images/srpr/logo1w.png'>Google</a></div>")
WScript.Echo(dom.Images.item(0).src)
You can also do this in JScript, or VB/Delphi/C++/C#/Python etc. on Windows. It uses the mshtml.dll DOM layout and parser directly.
The alternative is to use an html dom parser. Unfortunately, it seems like most of them have problems with poorly formed html, so in addition you need to run it through html tidy or something similar first.
If a DOM parser is out of the question - for whatever reason,
I'd go for some variant of PHP's explode() or whatever is available in the programming language that you use.
You could for example start out by splitting by <div class="vehicleInfo">, which would give you each result (remember to ignore the first place). After that you could loop the results split each result by <div class="makemodeltrim"> etc.
This is by no means an optimal solution, and it will be quite fragile (almost any change in the layout of the document would break the code).
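A minimal sketch of the splitting approach, written in Python instead of PHP (`str.split` stands in for `explode()`). The sample markup and its values are invented, and the helper names are hypothetical.

```python
def text_after(chunk: str, marker: str, terminator: str = "<") -> str:
    """Return the text between marker and the next terminator, or ''."""
    _, found, rest = chunk.partition(marker)
    if not found:
        return ""
    return rest.split(terminator, 1)[0].strip()


def scrape_by_splitting(html: str) -> list:
    """Split the page on the result marker and pick fields out of each piece."""
    results = []
    # The first piece is everything before the first result, so skip it.
    for chunk in html.split('<div class="vehicleInfo">')[1:]:
        results.append({
            "kilometers": text_after(chunk, '<div class="kilometers">'),
            "price": text_after(chunk, '<div class="price">'),
        })
    return results


# Invented sample data in the shape of the question's markup.
SAMPLE = (
    '<div class="used_result_container"><div class="vehicleInfo">'
    '<div class="kilometers">42,000</div><div class="price">$9,500</div>'
    '</div></div>'
    '<div class="used_result_container"><div class="vehicleInfo">'
    '<div class="kilometers">8,100</div><div class="price">$21,000</div>'
    '</div></div>'
)

print(scrape_by_splitting(SAMPLE))
```

The fragility is easy to see: an extra attribute, single quotes instead of double quotes, or a different attribute order in any of the marker tags makes the split find nothing.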
Another option would be to go after some CSS selector library like phpQuery or similar for your programming language.
Use a DOM parser
e.g. for java check this list
Open Source HTML Parsers in Java (I like to use cobra)
Or, if you are sure that you only want to parse a certain subset of your HTML which ideally is also valid XML, you could use an XML parser to parse only the fragment you pass in, and then even use XPath to request the values you are interested in.
Open Source XML Parsers in Java (e.g. dom4j is easy to use)
I think libxml2, despite its name, also does its best to parse tag soup HTML. It is a C library, so it should satisfy your requirements. You can find it here.
BTW, another answer recommended lxml, which is a Python library, but is actually built on libxml2. If lxml worked well for him, chances are libxml2 is going to work well for you.
How about using Internet Explorer as an ActiveX control? It will give you a fully rendered structure as it viewed the page.
The HTML::Parser and HTML::Tree modules in Perl are pretty good at parsing most typical so-called HTML on the web. From there, you can locate elements using XPath-like queries.
What do you think about IHTMLDocument2?
I think it should help.

HTML - xml data islands

I am designing a web app and I intend to embed data in an XML island so that I can dynamically render it in an HTML table on the client side, based on options the users will select.
I have the broad concepts, but I need pointers on how to use the DOM to navigate my XML. And how do I update my XML island, possibly for posting back to the server?
Please any links to online resources or a quick advice will be very appreciated.
NB: I understand most of the dynamic HTML concepts and server and client side stuff, so don't shy being very technical in your response:)
In W3C HTML there are no XML data islands (unless you're referring to external XML file linked via frames loaded using Javascript), but you can re-use HTML elements and insert metadata in class, title (if you care about HTML4 validity), data-* (HTML5) or your custom attributes.
For DOM navigation you've got DOM Core, like element.childNodes, .nextSibling, .getAttribute(), etc.
DOM can be verbose and tedious to use (e.g. when looking for elements in DOM you have to be careful to skip text nodes), so there are JS libraries like jQuery and Prototype built on top of it that offer more convenient API.
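To show what the DOM Core calls look like in practice, here is a small sketch using Python's `xml.dom.minidom`, which exposes the same DOM Core names (`childNodes`, `nextSibling`, `getAttribute`, `setAttribute`). In a browser the property names are the same, only the language differs. The `orders` island and its attributes are invented.

```python
from xml.dom import minidom

# Hypothetical island; the element and attribute names are invented.
ISLAND = """<orders>
  <order id="1" status="open">Widget</order>
  <order id="2" status="shipped">Gadget</order>
</orders>"""


def element_children(node):
    """Yield child elements only, skipping whitespace text nodes."""
    child = node.firstChild
    while child is not None:
        if child.nodeType == child.ELEMENT_NODE:
            yield child
        child = child.nextSibling


doc = minidom.parseString(ISLAND)
root = doc.documentElement

raw_count = len(root.childNodes)  # includes the whitespace text nodes
orders = list(element_children(root))
ids = [order.getAttribute("id") for order in orders]

# Update the island in memory and serialize it for posting back.
orders[0].setAttribute("status", "shipped")
payload = root.toxml()

print(raw_count, ids)
print(payload)
```

The gap between `raw_count` and the number of `order` elements is the text-node pitfall mentioned above.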
If you intend to do a lot of DOM transformations, then the JavaScript APIs for XPath and the XSLT processor will be handy.
What you describe can be done with XML.
However, I think it would be much easier if you used JSON instead of XML. That way, you can work directly with a JavaScript object, which is friendlier than navigating an XML DOM. Then you can send the serialized JSON form to the server using the JSON library.
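A rough sketch of the JSON round trip, in Python rather than browser JavaScript (`json.dumps` and `json.loads` play the role of `JSON.stringify` and `JSON.parse`). The record fields are invented.

```python
import json

# Hypothetical records that would otherwise live in the XML island.
records = [
    {"id": 1, "item": "Widget", "status": "open"},
    {"id": 2, "item": "Gadget", "status": "shipped"},
]

embedded = json.dumps(records)        # what the server writes into the page
working_copy = json.loads(embedded)   # what client code would work with

working_copy[0]["status"] = "shipped"  # edit a plain object, no DOM walking
post_body = json.dumps(working_copy)   # serialized form sent back to the server

print(post_body)
```

The edit is a plain dictionary assignment, with no node types or sibling pointers involved.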
Ajax Patterns has some good examples for using data islands: http://ajaxpatterns.org/wiki/index.php?title=XML_Data_Island