XML as a complement to HTML

I'm having trouble wrapping my head around using XML as a complement to HTML. I know what each is used for, but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML file?
I have a template form, which I want to be accessible on a server and for which I want to enable edits. Once edited, I want to save the edits to a separate file, so that the template is still available. (Just so you guys have a little bit of background regarding what I need this for.)
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.

XML is related to HTML in that it uses the same magic characters for its markup and the same basic logic for where to put the data.
The characters < and > are used to separate the markup from the content.
The character &, together with an entity name like &lt;, is used to encode characters that would otherwise cause trouble.
Elements can contain attributes, like <someElement someAttribute="attr value">.
Elements can contain text or sub-elements.
The big differences are that XML leaves you completely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>), and that XML is absolutely strict about structure, while HTML tolerates a lot (like unclosed tags).
In between the two sits XHTML, which is as strict as XML but sticks to the vocabulary of HTML.
It is almost impossible to parse arbitrary HTML as XML, but you can easily create XML that any browser will accept as a valid web page.
Your issue cries out for XSLT. This is a method of transforming a given XML document into a new format. It allows you, for example, to export your data as XML and create a nice web page from it. Different XSLT stylesheets will present the same data in different ways.
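To make that concrete, here is a rough, self-contained sketch of the XML -> XSLT -> HTML idea. I'm using Python's lxml package only to run the transform; the element names and the stylesheet are invented for illustration, and the same stylesheet could just as well be applied by the browser or any other XSLT engine.

# Minimal sketch: transform invented XML data into an HTML page via XSLT.
# Requires the lxml package; all names are illustrative only.
from lxml import etree

xml_data = etree.fromstring("""\
<records>
  <record><title>First entry</title></record>
  <record><title>Second entry</title></record>
</records>""")

stylesheet = etree.fromstring("""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/records">
    <html>
      <body>
        <ul>
          <xsl:for-each select="record">
            <li><xsl:value-of select="title"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)   # compile the stylesheet once
html_page = transform(xml_data)      # apply it to the data
print(str(html_page))                # a complete HTML page you can style with CSS

Swap in a different stylesheet and the same <records> document becomes a table, a report or a form instead of a list; that is the sense in which one set of XML data can feed several presentations.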
There are several online tools to test this feature; you might try one of them.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not entirely clear... How you send data (to a web application), and the way you send the (manipulated) data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.

Related

DataPower - to parse HTML

I have a situation where the underlying application provides a UI layer, and this in turn has to be rendered as a portlet. However, I do not want all parts of the originally presented UI to be rendered in the portlet.
Proposed solution: since using DataPower to parse XML is the norm, I am wondering whether it is possible to parse HTML as well. I understand HTML may not always be well-formed, but if there are very few HTML pages in the underlying application, then a contract could be enforced.
Also, if one manages to parse and extract data out of HTML using DP, then the result (perhaps an XML document) can be used to produce HTML5 with all its goodies.
So the question: is it advisable to use DataPower to parse an HTML page and extract XML from it? Prerequisite: the number of HTML pages per application may vary in content, but there will not be many pages.
I suspect you will be unable to parse HTML using DataPower. DataPower can parse well-formed XML, but HTML - unless it is explicitly authored as XHTML - is likely to be full of tags that break well-formedness.
Many web pages are full of tags like <br> or <ul><li>Item1<li>Item2<li>Item3</ul>, all of which will cause the parsing to fail.
If you really want to follow your suggested approach, you'll probably need to do something on a more flexible platform such as WAS where you can build (or reuse) a parser that takes care of all of that for you.
If you think about it, this is what your web browser does - it has all the complex rules that turn badly-formed XML tags (i.e. HTML) into a valid DOM structure. It sounds like you may be better off doing manipulation at the level of the DOM rather than the HTML, as that way you can leverage existing, well-tested parsing solutions and focus on the structure of the data. You could do this client-side using JavaScript or you could look at a server-side JavaScript option such as Rhino or PhantomJS.
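Just to make that concrete, a lenient HTML parser will happily repair the fragment above into something well-formed that a strict XML toolchain can then consume. A tiny illustration in Python with lxml's HTML parser (purely as an example of what such a parser does; the equivalent exists for JavaScript and Java platforms):

# Illustration only: feed tag soup to a lenient HTML parser and get back
# a well-formed tree that can be re-serialised for XML-only tools.
# Requires the lxml package.
from lxml import etree, html

tree = html.fromstring("<ul><li>Item1<li>Item2<li>Item3</ul>")
print(etree.tostring(tree, pretty_print=True).decode())
# Every <li> now has an explicit closing tag.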
All of this might be doing things the hard way, though. Have you confirmed whether or not the underlying application has any APIs or web services that it uses to render the pages, allowing you to get to the data without the existing presentation layer getting in the way?
Cheers,
Chris
The question of parsing an HTML page arises when you want to do some processing on it. If this is the case, you can face problems because DataPower by default will not allow hyperlinks inside a well-formed XML or HTML document [it is considered a security risk]; however, this can be overcome with the appropriate settings in the XML manager.
As far as HTML page parsing is concerned, DataPower, being an ESB layer, is expected to provide message format translation, and it indeed does. So design-wise it is a good place to do message format translation. Practically, however, you will face the above-mentioned problem when you try to parse HTML as an XML document.
The parsing can [theoretically] produce any message format you wish, hence you can use XSLT to achieve what you want.
Ajitabh

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separate XML attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML, which is part of the author's larger parser project Marpa (a parser for any language that can be described in BNF), documented on his blog; it is very interesting but much newer and more experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn is built on HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat; I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Using HTML/CSS, I would like to automatically generate a bibliography at the bottom of my website, akin to LaTeX's \bibliography command

I'll ask my question first, then give some background for those who are interested:
I would like to know if there is a command in HTML that will automatically generate a bibliography from a .bib file. This means that throughout the text, I would add something like <cite name="Jones2010">, and then at the bottom of the HTML (or CSS) file, I would write something like <makebib file="biblist.bib", format="APA">, and a bibliography would be generated using my .bib file and formatted according to the APA style. The functionality would be quite similar to footnotes, except that each footnote is populated according to some script that extracts the information from (essentially) an XML file and outputs the content in the desired format. It is not difficult to imagine somebody creating a tool to do just that; however, my Google search skills have not enabled me to find such a tool. It is easy to find tools that convert bib files to HTML or XML, but that is not sufficient for my needs. I do not want to publish my entire bib file online. Rather, for each document that I generate, I want several of the entries in the bib file to be included as footnotes. Any pointers will be greatly appreciated.
Now, the reason behind the question:
I have recently begun switching from writing all my manuscripts in LaTeX to writing them in HTML/CSS. The advantages of this approach are vast: only one file for versioning (instead of .dvi, .ps, .aux, .blg, etc.), it is much smaller to share, other people can edit the HTML file and compile it much more easily, it is more configurable to my tastes, easier to read on screen, etc. The disadvantage for me, however, is that while I've been writing in LaTeX for years, I've only just begun using HTML and CSS for scientific document creation. The main impetus for the switch was MathJax, which enables me to embed LaTeX equations in my HTML files and therefore allows me to combine the advantages of LaTeX with the advantages of CSS. I imagine that nearly all my colleagues will switch away from LaTeX to this simpler format, assuming a few remaining issues get resolved, like ease of creating bibliographies.
Many thanks.
What you're asking isn't possible, unless when you specify html/css you really mean html/css/php or html/css/python or some other combination that includes an actual programming language, rather than just a markup language.
I understand your motivation; I'd love to switch to HTML instead of LaTeX! However, I suspect an HTML-based solution would involve so much extra processing on top to sort out bibliographies etc. that the complexity would start approaching that of LaTeX by the time you got it all worked out.
I'd be pleased to be proven wrong on this!
I've done this in the past using XSLT and BibTeX. In outline, the steps are:
Mark up your document using some convention or other: I used <span class='citation'>Smith99</span>
Write an XSLT script to transform that file into a .aux file with \citation commands in it
Use BibTeX along with a .bst file which spits out HTML rather than LaTeX
Use another XSLT script (or the same one, in a different mode) to pull the bibliography in
It's not quite as fiddly as it sounds, but you can look at how I did it on google code. In particular, see structure.xslt and plainhtml.bst.
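If you would rather not write the XSLT for step 2, the same extraction is only a few lines in a scripting language. Here is an untested Python sketch of that step (the file names "paper.html", "biblist" and "plainhtml" are placeholders) that pulls the citation spans out of the page and writes the .aux file BibTeX expects:

# Rough stand-in for step 2 above, in Python instead of XSLT: collect the
# <span class='citation'>Key</span> markers and write a .aux file for BibTeX.
# File names are placeholders.
import re

with open("paper.html", encoding="utf-8") as f:
    page = f.read()

keys = re.findall(r"""<span class=["']citation["']>\s*([^<]+?)\s*</span>""", page)

with open("paper.aux", "w", encoding="utf-8") as aux:
    for key in keys:
        aux.write("\\citation{%s}\n" % key)   # one \citation per cited key
    aux.write("\\bibstyle{plainhtml}\n")      # the HTML-emitting .bst from step 3
    aux.write("\\bibdata{biblist}\n")         # the .bib file, without extension

Running bibtex on that .aux file with the HTML-emitting .bst then produces the formatted reference list that step 4 splices back into the page.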
If there's a more direct way, I'd be quite interested to hear about it.
Both answers so far are somewhat correct, although not quite what you were asking for. Part of the problem is that the question as it's phrased doesn't necessarily make sense.
HTML is just markup; you need something to process the markup, be it python, php, ruby, etc.
And you probably want to write in XML (or XHTML), not HTML.
XSLT may work for you (once it's in XML), but remember, an XSLT document just defines a set of rules. You would get an XSLT engine to apply your XSLT rules against your XML document.
You can create an HTML bibliography from a .bib file using bibtex2html. This package takes a series of command-line arguments, extracts the info from the BibTeX source, and outputs a file with HTML markup.
As far as I know, you cannot get it to read and parse the HTML document the way the LaTeX \cite command does, but there are several ways to indicate the references you want. I find that the easiest way is to just maintain a text file of the BibTeX keys I use in my manuscript and then pass it using the --citefile option. There is also a tool called bib2bib included that will take search commands.
It is a very flexible package and there are a lot of options, so it works in a lot of situations. For example, you can get it to omit the <html> headers from the output file so that you can paste the output directly into an existing HTML document.
The documentation is useful, but make sure you look at the PDF documentation file and the man pages.

What are the advantages of creating web pages with XML instead of HTML?

From time to time, I see web pages whose content is solely written in XML (not HTML or XHTML). These pages usually have some style sheets (either XSLT or CSS) attached to them which makes them look like any other ordinary web page.
My question is, what are the advantages of such an approach (if any), and why would anyone choose to work this way?
EDIT: If this is a good thing, why is it not widespread?
EDIT 2: Thanks everyone for the great responses. They really enlightened me. I also found this question whose content is also related.
It's easier to generate it programmatically and to reuse it for purposes other than displaying it as a web page.
Update:
EDIT: If this is a good thing, why is it not widespread?
Not everyone needs to generate it programmatically or reuse it for purposes other than displaying it as a web page. In that case it's easier to just use plain HTML.
One possible advantage would be using the data of the page in something other than a web browser; that would (presumably) be easier to do if the page's content were well-formed XML. Of course, in theory a well-formed, semantic XHTML page should be nearly as easy to parse.
It can also be easier to generate XML instead of XHTML, depending on the data source.
When you are getting XML data into your system and you are supposed to present that XML data, it is much easier to write some XSLT for the XML than to parse it with some sort of parser and then present the data.
That can be a valid reason for using XML instead of XHTML or HTML.
Update
To answer your question on why this is not widespread: because XSLT is tedious and hard to work with, specifically XPath, which some people find quite difficult to use.
Those pages use XSLT to get rendered on the client side. Not every browser (especially older ones) supports rendering XML + XSLT. XML can, however, be used server-side as a template and be transformed to HTML by the application running on the server. I personally don't see any advantages to this approach.
There are a lot more web pages that are written solely in XML than you know. You're only seeing the ones that do the XSLT transformation on the client side. Server-side transformation of XML is not at all unusual, because there's a plethora of things that produce data in XML, and transforming XML to HTML in XSLT is straightforward. You'll never know this is happening if you just look at the HTML, which bears no signs of having been generated via XSLT.
Personally, I don't understand it either, though one of the biggest problems is support in IE. I created a skeleton e-commerce site serving XML, transformed by XSLT and styled using CSS. I sorely missed the ability to use XLink and other wonderful XML features. It's also nice to be able to tag the data for what it is. I used a 'menu' tag for the restaurant menus, 'price' tags for prices, and so on. If a user clicked on a link to change menus, all I had to do was send the name of the item, the price and the description instead of the complete page. IIRC, a menu page of 4K or more of HTML came down to only 200 bytes of sent data.
As far as the "one error makes everything crash in XML" type of comment goes, the same is true of any programming language, so proper coding should be no bother for programmers and careful HTML/CSS types.
Before anyone says that what I did was actually XHTML...no. I served XML. I did call up XHTML namespaces when needed for links, images and HTML type things but only when necessary.

Django templatetag for rendering a subset of html

I have some HTML (in this case created via TinyMCE) that I would like to add to a page. However, for security reasons, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
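Something along these lines, for instance (an untested sketch; the whitelist is only an example and should be tuned to whatever your TinyMCE configuration can emit):

# Untested sketch of the parse / walk / re-serialise approach using
# BeautifulSoup (bs4). The allowed tags and attributes are examples only.
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"a": {"href", "title"}}

def sanitize(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # script and style lose their contents as well as their tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()                 # keep the text, drop the tag itself
            continue
        keep = ALLOWED_ATTRS.get(tag.name, set())
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)

print(sanitize('<p onclick="alert(1)">Hi <script>alert(2)</script>there</p>'))
# -> <p>Hi there</p>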
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
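For example, bleach (which sits on top of html5lib) drops straight into a custom template filter. A sketch, assuming bleach is installed; the module path and the whitelists are illustrative:

# myapp/templatetags/sanitize.py -- path and names are illustrative.
# Requires the bleach package, which uses html5lib under the hood.
from django import template
from django.utils.safestring import mark_safe
import bleach

register = template.Library()

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"a": ["href", "title"]}

@register.filter
def sanitize(value):
    # Whitelist-based cleanup of user-supplied rich text.
    cleaned = bleach.clean(value, tags=ALLOWED_TAGS,
                           attributes=ALLOWED_ATTRS, strip=True)
    return mark_safe(cleaned)

With the templatetags package in place you would load it and write {{ data|sanitize }} in the template, much like the removetags example below but with a whitelist instead of a blacklist.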
You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}