What does parsing mean? - language-agnostic

What does parsing mean in web development? Phrases like "parsing HTML" and "parsing style sheets" are common, and web development languages also provide parse methods. What, in essence, does parsing mean here?

Generally speaking, parsing means inspecting a text string, character by character, and performing certain actions based on the meaning inherent in those characters.
A web browser parses HTML to translate it into blocks of content on the page. It parses style sheets to apply/adjust the styles of those blocks.
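As a rough illustration of "character by character", here is a minimal sketch of a tokenizer that walks a markup string and splits it into tags and text. The function name `tokenize` and the sample markup are made up for this example, and a real HTML parser handles far more (attributes, comments, entities, malformed input).

```python
def tokenize(markup: str) -> list[tuple[str, str]]:
    """Split markup into ("tag", name) and ("text", content) tokens."""
    tokens = []
    buffer = ""
    in_tag = False
    for ch in markup:  # inspect the input one character at a time
        if ch == "<" and not in_tag:
            if buffer:
                tokens.append(("text", buffer))  # text before the tag ends here
            buffer = ""
            in_tag = True
        elif ch == ">" and in_tag:
            tokens.append(("tag", buffer))  # everything between < and >
            buffer = ""
            in_tag = False
        else:
            buffer += ch
    if buffer and not in_tag:
        tokens.append(("text", buffer))  # trailing text after the last tag
    return tokens


tokens = tokenize("<p>Hello <b>world</b></p>")
for kind, value in tokens:
    print(kind, repr(value))
```

A browser does something similar in spirit, then builds a tree of elements from the tokens and decides how to render each one.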

By definition:
Verb: Analyze (a sentence) into its component parts and describe their syntactic roles.
Noun: An act of, or the result obtained by, parsing a string or a text.
It's not specific to web development. Parsing just means analyzing input according to its structure and acting on the result, as when a web browser parses HTML and displays a web page.

Related

XML as complement to HTML

I'm having trouble wrapping my head around using XML as complement to HTML. I know what they are used for but I don't quite understand how to use them together.
I know that you can use JavaScript to convert an XML file to HTML, but I don't get how that's going to do the trick. How would I be able to style this HTML-file?
I have a template form that I want to make accessible on a server and editable. Once it has been edited, I want to save the edits to a separate file so that the template is still available. (Just so you have a little background on what I need this for.)
After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data.
Could anyone explain in more detail how exactly XML can be used as a complement to HTML?
If you need more details or information please let me know. I did do a lot of research and I read the other posts regarding how to convert XML to HTML with JavaScript, but that doesn't answer my question about how EXACTLY they complement each other.
I guess my problem here is that I have yet to manage to wrap my head around the concept.
XML is related to HTML: it uses the same magic characters for its markup and the same logic for where to put the data.
The characters < and > are used to separate the markup from the content.
The character & together with an entity code (as in &lt; for <) is used to encode characters that would otherwise cause trouble.
Elements can contain attributes, like <someElement someAttribute="attr value">.
Elements can contain text or sub-elements.
The big difference is that XML is absolutely free in how you name your elements and attributes, while HTML relies on dedicated names (like <body>). On the other hand, XML is absolutely strict in structure, while HTML tolerates a lot (like unclosed tags).
As a thing in the middle there is XHTML, which is as strict as XML but sticks to the rules of HTML.
It is almost impossible to read HTML as XML, but you can easily create XML which is taken by any browser as a valid web page.
Your issue calls for XSLT. This is a method to transform a given XML document into a new format. It allows you, for example, to export your data as XML and create a nice web page from it. Different XSLT stylesheets will present the same data in different ways.
There are several online tools to test this feature. You might have a look here.
Your statement "After a lot of research I came to the conclusion that I would need to use XML, as I will have to store and transport data" is not entirely clear. How you send data to a web application, and how you send the manipulated data back, is not bound to XML. This is very often done with JSON, using JavaScript to read, edit, and send it back.
XML -> XSLT -> HTML is often used to create (rather static) reports for a web viewer.
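To make the idea of "same data, different presentation" concrete, here is a minimal sketch in Python. Note that it does not use XSLT: it uses the standard library's `xml.etree.ElementTree` as a stand-in to do the kind of transformation an XSLT stylesheet would describe. The `<form>`/`<field>` element names and the sample values are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data file: the element and attribute names are assumptions.
FORM_XML = """<form>
  <field name="title">Quarterly report</field>
  <field name="author">A. Example</field>
</form>"""


def xml_to_html(xml_text: str) -> str:
    """Turn each <field> into one row of an HTML table."""
    root = ET.fromstring(xml_text)
    table = ET.Element("table")
    for field in root.findall("field"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "th").text = field.get("name")
        ET.SubElement(row, "td").text = field.text
    # method="html" serializes the tree using HTML rules
    return ET.tostring(table, encoding="unicode", method="html")


html_output = xml_to_html(FORM_XML)
print(html_output)
```

The resulting HTML can be styled with CSS like any other page, while the XML file stays a pure data file. A second function (or, with XSLT, a second stylesheet) could render the same fields as a list or a form instead.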

Embed HTML paragraphs in xsl-fo/fop

I'm currently considering using xsl-fo/fop to generate PDFs out of my Java web application. Parts of the content to be printed are "HTML fragments" (TinyMCE editor in web frontend) from different sources.
Is there a way to "embed" HTML into FOP?
I want to avoid an xslt transformation for individual paragraphs containing HTML fragments (the doc contains a lot of other content as well).
The alternative is to create one HTML/XML document containing all the paragraphs and other content and then apply an xslt transformation over everything, but somehow I'd like to avoid this if possible.
Note: I also considered HTML to PDF engines (e.g. Prince) but they seem to be ludicrously expensive.
Thanks!
No, FOP's only valid input is an XSL-FO file (or XML + XSLT producing XSL-FO, but that's just a convenience option as it's an external library that performs the transformation).
However, you can use or adapt an existing XHTML-to-FO stylesheet, like those provided by AntennaHouse and RenderX.
Besides, even if you say you don't want to do it, writing an xhtml -> XSL-FO transformation is not a daunting task; this is especially true if the input comes from TinyMCE, as you can configure it to allow only a limited subset of tags, which will require a small set of templates.
(disclosure: I'm a FOP developer, though not very active nowadays)

Alternative to HTML standard for expressing static documents content

Content tends to be mixed with its form when expressed as an HTML+CSS+JS document. Almost every modern website requires CSS and/or JavaScript to be readable. Most of them are not easy to parse automatically because they rely on a web browser to render them. Sections of the document are defined using visual cues, colors, and formatting. One can use HTML5 tags like <article>, but those are not part of any bigger structure as far as I know, and they can still contain non-content elements.
Websites are basically apps or clients.
Is there any standard with a well-defined schema that can be used to serve the content of a website? An API for websites that could be used to express content in a form that is easy to serve, parse, store, cryptographically sign...
I'm aware of formats like XML and JSON but I have not managed to find any standardized way to express a blog post as a JSON document.
An example of what I have in mind:
This question can be fetched as a JSON document using the Stackexchange API. The result is machine readable and easy to parse, but it is not standardized. It reflects details of Stackexchange-specific data structures. Another Q&A website will have a different API, with a different structure and formats, even though both have questions and answers.
There are two important standards that deal with the semantic aspect of a web page, which is what you are looking for: Microdata and RDFa. With their aid, you can pick an open vocabulary to describe your data, or create your own based on one.
JSON-LD plays a similar role for JSON documents: it ties the keys of a JSON document to a shared vocabulary, loosely comparable to what an XML schema does for XML documents.
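As a sketch of what a blog post might look like in JSON-LD, the example below builds a document using the schema.org `BlogPosting` type and serializes it with Python's `json` module. The headline, author, date, and body are placeholder values, and the choice of properties is only one reasonable selection from the vocabulary.

```python
import json

# Placeholder content; only the "@context"/"@type" keys and the
# schema.org property names come from the vocabulary itself.
post = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "What does parsing mean?",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2024-01-15",
    "articleBody": "Parsing means analyzing text according to its structure.",
}

serialized = json.dumps(post, indent=2)
print(serialized)
```

Because the keys are defined by a public vocabulary rather than by one site's API, any consumer that knows schema.org can interpret the document without site-specific code.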

Datapower - To parse HTML

I have a situation where the underlying application provides a UI layer, and this in turn has to be rendered as a portlet. However, I do not want all parts of the original UI to be rendered in the portlet.
Proposed solution: since using DataPower to parse XML is the norm, I am wondering whether it is possible to parse HTML with it. I understand that HTML may not always be well formed, but if there are very few HTML pages in the underlying application, a contract can be enforced.
Also, if one manages to parse and extract data out of HTML using DataPower, the result (perhaps an XML document) can be used to produce HTML5 with all its goodies.
So the question is: is it advisable to use DataPower to parse an HTML page and extract XML out of it? Prerequisite: the HTML pages per application could vary in data, but there would not be many pages.
I suspect you will be unable to parse HTML using DataPower. DataPower can parse well-formed XML, but HTML, unless it is explicitly written as XHTML, is likely to be full of tags that break well-formedness.
Many web pages are full of tags like <br> or <ul><li>Item1<li>Item2<li>Item3</ul>, all of which will cause the parsing to fail.
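The difference is easy to reproduce outside DataPower. The sketch below uses Python's standard library purely as an illustration: a strict XML parser (`xml.etree.ElementTree`) rejects the list above, while a lenient HTML parser (`html.parser`) still recovers the items. The helper names `is_well_formed_xml` and `ListItemCollector` are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<ul><li>Item1<li>Item2<li>Item3</ul>"


def is_well_formed_xml(text: str) -> bool:
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class ListItemCollector(HTMLParser):
    """Collect the text of each <li>, even when </li> is missing."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("li", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


collector = ListItemCollector()
collector.feed(SNIPPET)
collector.close()

print(is_well_formed_xml(SNIPPET))  # the strict parser rejects it
print(collector.items)              # the lenient parser still finds the items
```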
If you really want to follow your suggested approach, you'll probably need to do something on a more flexible platform such as WAS where you can build (or reuse) a parser that takes care of all of that for you.
If you think about it, this is what your web browser does - it has all the complex rules that turn badly-formed XML tags (i.e. HTML) into a valid DOM structure. It sounds like you may be better off doing manipulation at the level of the DOM rather than the HTML, as that way you can leverage existing, well-tested parsing solutions and focus on the structure of the data. You could do this client-side using JavaScript or you could look at a server-side JavaScript option such as Rhino or PhantomJS.
All of this might be doing things the hard way, though. Have you confirmed whether the underlying application has any APIs or web services that it uses to render the pages, which would let you get to the data without the existing presentation layer getting in the way?
Cheers,
Chris
The question of parsing an HTML page arises when you want to do some processing on it. If that is the case, you can run into problems, because DataPower by default will not allow hyperlinks inside a well-formed XML or HTML document (this is considered a security risk). However, this can be overcome with appropriate settings in the XML manager.
As far as parsing an HTML page is concerned, DataPower, being an ESB layer, is expected to provide message format translation, and it does. So design-wise it is a good place to do message format translation. In practice, however, you will face the problem mentioned above when you try to parse HTML as an XML document.
The parsing can produce any message format model you wish [theoretically] hence you can use the XSLT to achieve what you wish.
Ajitabh

GWT HTML Widget XSS security

Might be a noobish question (most likely) but according to the official developer documents GWT's HTML widget is not XSS safe and one must exercise caution when embedding custom HTML/Script text.
So I guess my question is, why does this:
HTML testLabel = new HTML("dada<script type='text/javascript'>document.write('<b>Hello World</b>');</script>");
not show a JavaScript popup? If GWT's HTML widget somehow does protect against XSS attacks, in what kinds of situations does it not (so I know what to expect)?
The GWT documentation contains a few articles about security (including dealing with XSS using SafeHtml).
Your example doesn't work because scripts defined via innerHTML don't get executed in Chrome/Firefox (I think there was a workaround for IE using the defer attribute).
But you shouldn't rely on this browser restriction. It is better to use SafeHtml and always validate input from users.
I don't know about this widget in particular, but in general it is worth knowing that XSS vectors come in many many flavours. Only a small percentage actually use the script tag.
One very important factor is that they are location-dependent. For example, a string that is xss-safe outside any tags, may not be safe inside a tag's attribute value, or within a delimited string that is inside a javascript block.
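Here is a minimal sketch of that location dependence, using Python's standard `html.escape` rather than anything GWT-specific. The payload and the `<input>` markup are invented for the example. The payload contains no angle brackets, so escaping that is sufficient for element content leaves it unchanged, yet the same string breaks out of a quoted attribute value.

```python
import html

# Hypothetical attacker-controlled string: no <, > or & at all.
payload = '" onmouseover="alert(1)'

# Escaping only &, < and > is enough between tags...
body_escaped = html.escape(payload, quote=False)

# ...but the quotes survive, so inside an attribute the payload
# closes the value and injects a new event-handler attribute.
unsafe_attr = f'<input value="{body_escaped}">'

# Escaping quotes as well keeps the payload inside the attribute value.
attr_escaped = html.escape(payload, quote=True)
safe_attr = f'<input value="{attr_escaped}">'

print(unsafe_attr)
print(safe_attr)
```

Other contexts, such as a string inside a script block or a URL in an href, need their own encoding rules again, which is why a single generic "sanitize" step is rarely enough.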
They can also be browser-dependent, as many exploit 'bugs' in the document parsing model.
To get a sense of the variety of vectors that can be abused to inject malicious JavaScript, please see these two cheat sheets.
I also recommend reading the prevention cheat sheet at OWASP.