Get an HTML page as XML code - html

I just learned how to parse data in Xcode using NSXMLParser.
To do that I obviously need XML files, but I am still a beginner with web programming.
I am having difficulty getting an XML file from a web page. I tried converting HTML to XML with several tools, but I am still not getting the format I want.
The format that I want should be similar to this:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
<Book id="1">
<title>Circumference</title>
<author>Nicholas Nicastro</author>
<summary>Eratosthenes and the Ancient Quest to Measure the Globe.</summary>
</Book>
<Book id="2">
<title>Copernicus Secret</title>
<author>Jack Repcheck</author>
<summary>How the scientific revolution began</summary>
</Book>
</Books>
So how can I get a format like this from a webpage?
One more thing: for anyone who knows NSXMLParser in Xcode, is this the right way to extract data from websites? That is, getting an XML file, adding it to the project's resources, and then extracting the data from it?

HTML is closely related to XML, and cleanly written HTML can often be parsed as XML. So if you want to extract data from any given website, you will need to get the HTML (the source of the page), parse it "as is", and then look for the data you need.
A simple website may look like this:
<html>
<head>
<title>My website</title>
</head>
<body>
<h1>welcome</h1>
Text
<p>paragraph</p>
</body>
</html>
As you can see, this is well-formed XML. If you are interested in the <title>, parse this XML and look for the <title> tag.
The problem is that browsers are not strict about the well-formedness of HTML. A missing end tag for <p> is often tolerated. An XML parser is normally not that "nice" and reports an error.
Websites very often have RSS/Atom feeds. These are pure XML and are meant to be well-formed. They exist precisely so that the data can be easily interpreted by XML parsers.
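To make the difference between a strict XML parser and a tolerant HTML parser concrete, here is a minimal sketch in Python (not the asker's Objective-C; the same idea applies to NSXMLParser). The two sample strings and the helper names title_via_xml, title_via_html and TitleGrabber are made up for illustration. It only uses the standard library modules xml.etree.ElementTree and html.parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample pages: the second one is missing the </p> end tag.
WELL_FORMED = "<html><head><title>My website</title></head><body><p>paragraph</p></body></html>"
SLOPPY = "<html><head><title>My website</title></head><body><p>paragraph</body></html>"


def title_via_xml(markup):
    """Strict route: only works when the markup is well-formed XML."""
    root = ET.fromstring(markup)  # raises ET.ParseError on malformed input
    return root.findtext("head/title")


class TitleGrabber(HTMLParser):
    """Tolerant route: collects the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_via_html(markup):
    grabber = TitleGrabber()
    grabber.feed(markup)
    grabber.close()
    return grabber.title


if __name__ == "__main__":
    print(title_via_xml(WELL_FORMED))
    try:
        title_via_xml(SLOPPY)
    except ET.ParseError as err:
        print("XML parser rejected the page:", err)
    print(title_via_html(SLOPPY))
```

The XML route is the simpler one when the source is a feed or another document you know to be well-formed; the tolerant route is the fallback for ordinary web pages.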

Related

Basic Working Example of an XXE Attack in HTML

I'm trying to run some tests with XXE attacks in an HTML page, but I'm having trouble coming up with a working example. After looking around the internet for a long time, I came up with this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<script id="embeddedXML" type="text/xml">
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo>&xxe;</foo>
</script>
</head>
<body>
<script type="application/javascript">
alert(document.getElementById('embeddedXML').innerHTML);
</script>
</body>
</html>
But, it doesn't work. The XML inside the script tag doesn't "run", per se, meaning that when the alert pops up, it just displays the xml as plaintext. It doesn't interpret the DOCTYPE header thing and get the information from the listed file.
It's been very hard to google around for this because apparently XML doesn't "run", but something needs to happen where this text is interpreted instead of just written out. I don't know what that thing is, or how to get it working inside an HTML page as written here.
Any tips are much appreciated. Thanks!
See OWASP
Among the Risk Factors is:
The application parses XML documents.
Now, script elements are defined (in HTML 4 terms) as containing CDATA, so markup in them (except </script>) has no special meaning. So there is no XML parsing going on there.
Meanwhile alert() deals in strings, not in markup, so there's still no XML parsing going on.
Since you have no XML parser, there's no vulnerability.
In general, if you want XML parsing in the middle of a web page then you need to use JavaScript, e.g. with DOMParser. I wouldn't be surprised if it were not DTD-aware and so not vulnerable, and even if it were vulnerable it might well block access to local external entities.
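The point that nothing happens until a parser actually reads the text can be sketched outside the browser. The snippet below uses Python's xml.etree.ElementTree as a stand-in for "some XML parser"; it says nothing about how any browser API behaves. To keep it harmless, it declares an internal entity named greeting (an assumption for this example) instead of the external file:///etc/passwd entity from the question.

```python
import xml.etree.ElementTree as ET

# Same shape as the question's payload, but with a harmless internal entity.
EMBEDDED = '<!DOCTYPE foo [<!ENTITY greeting "hello">]><foo>&greeting;</foo>'

# Handling the markup as a string (what innerHTML plus alert() does):
# the entity reference is just characters, nothing is interpreted.
as_text = EMBEDDED
print(as_text)

# Handing the same characters to an XML parser: the DOCTYPE is read
# and the entity reference is replaced by its declared value.
root = ET.fromstring(EMBEDDED)
print(root.text)
```

An XXE test therefore needs a target that parses attacker-supplied XML, which is usually server-side code rather than a static HTML page.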

Is it possible to make a selectable drop down menu using data from an XML file?

I'm trying to create a directory for an address book, and I was wondering if it would be possible to create a selectable drop down menu that would pull the contact data from an XML file. The ideal way I would want it is to have all of the names of the contacts in the drop down menu, and when one is selected the rest of the information would pop up above the drop down, such as Address, Phone Number, and Email.
Either use a server-side language such as PHP to extract the data from the XML and insert it into the HTML document, or use AJAX to pull the XML file to the client then use JavaScript to process it and insert it into the DOM.
There should be libraries, frameworks or plugins available to parse XML in whatever language you need. If you know how to insert content into the HTML document (in the case of PHP) or into the DOM (in the case of JavaScript), you can do this easily.
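The server-side route can be sketched as follows. This uses Python rather than PHP, and the XML layout (contacts/contact with name, address, phone and email children) and the function name build_select are assumptions, since the question does not show its file. Each contact's details are stored in data-* attributes so that a small client-side script could display them when an option is selected.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical address book; the real file's element names may differ.
CONTACTS_XML = """<contacts>
  <contact>
    <name>Ann Lee</name>
    <address>1 Main St</address>
    <phone>555-0100</phone>
    <email>ann@example.com</email>
  </contact>
  <contact>
    <name>Bob Ray</name>
    <address>2 Oak Ave</address>
    <phone>555-0101</phone>
    <email>bob@example.com</email>
  </contact>
</contacts>"""


def build_select(xml_text):
    """Return an HTML <select> with one <option> per contact."""
    root = ET.fromstring(xml_text)
    lines = ['<select id="contacts">']
    for contact in root.findall("contact"):
        name = contact.findtext("name", default="")
        # Escape every value so stray quotes or angle brackets cannot break the HTML.
        attrs = "".join(
            ' data-{}="{}"'.format(field, escape(contact.findtext(field, default="")))
            for field in ("address", "phone", "email")
        )
        lines.append("  <option{}>{}</option>".format(attrs, escape(name)))
    lines.append("</select>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_select(CONTACTS_XML))
```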
From what I understand you have an XML document. Using XSLT you can create an XHTML file from your XML and display that in your browser (XHTML is HTML that conforms to XML rules).
If that is the case then, yes, you can make links using XSLT. But the data needs to be in your XML source file and not in some database.
There is an article that describes it: http://www.ibm.com/developerworks/xml/library/x-tipxslt/index.html
You could attach an XSL to the XML using something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="cdcatalog.xsl"?>
... actual XML content...
If applying the XSL on the XML outputs an HTML page with JavaScript, you can get the actual result.
Outputting JavaScript is a bit of a pain because of character escaping but it can be done.

growl for windows - rss feed not being parsed - format error?

So I'm attempting to use the custom subscriber SDK for Growl for Windows. Trying to dynamically create a RSS feed. Using C#, with Razor views. This is a sample of what the view looks like to which I am pointing the url of the subscriber:
@model GrowlExtras.Subscriptions.FeedMonitor.FeedItem
<?xml version="1.0" encoding="UTF-8" ?>
@{
Response.ContentType = "application/rss+xml";
ViewBag.Title = "Feed";
}
<rss version="2.0">
<channel>
<title>@Model.Title</title>
<link>@Url.Action("Feed", "Home", null, "http")</link>
<description>@Model.Description</description>
<lastBuildDate>@Model.PubDate</lastBuildDate>
<language>en-us</language>
</channel>
</rss>
This page is accessed locally (for now) using this url: http://localhost:2751/Home/Feed. So, I'm putting this url in as the "Feed Url:" on the "subscribe to notifications popup".. but getting an error "could not parse feed" and the OpenReadCompletedEventArgs e result is throwing the exception "OpenReadCompletedEventArgs '(e.Result).Length' threw an exception of type 'System.NotSupportedException'"
Any help welcome! Am I barking up the wrong tree completely here, or just missing something with the formatting of the feed file? Don't suppose it has something to do with the fact that the page is hosted locally at the moment?
The real answer is that Razor verifies that what you are trying to write is valid HTML. If you fail to do so, Razor fails.
Your code tried to write incorrect HTML:
If you look at the documentation of the link tag on w3schools, you can read the same thing expressed in two ways:
"The <link> element is an empty element, it contains attributes only."
"In HTML the <link> tag has no end tag."
What this means is that link is a singleton tag, so you must write it as a self-closing tag, like this:
<link attrib1='value1' attrib2='value2' />
So you can't do what you were trying to do: use an opening and a closing tag with content inside.
That's why Razor fails to generate your XML document.
But there is one way you can deceive Razor: don't let it know that you're writing a tag, like so:
@Html.Raw("<link>")--your link's content--@Html.Raw("</link>")
Remember that Razor is for writing HTML so writing XML with it can become somewhat tricky.
Right. Got this one sorted now.
A lot more digging was required! The page above did not parse as valid XML: for example, HTML tags were being generated, and this threw off the RSS parser of the plugin in question.
What I ended up doing was to use the built in RSS Syndication classes, and some help from these posts/related answers on the topic.
<http://stackoverflow.com/a/825016/1152015>
<http://stackoverflow.com/a/684518/1152015>
<http://stackoverflow.com/a/1292769/1152015>
<http://stackoverflow.com/a/2690302/1152015>
<http://stackoverflow.com/a/3098559/1152015>
<http://msdn.microsoft.com/en-us/library/bb412174>
So, to clarify - I had the code to consume a rss feed, but the page being generated dynamically was not being parsed correctly by the parser I was using.
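The general idea of that fix, building the feed with an XML-aware API instead of an HTML template, can be sketched like this. It is written in Python with xml.etree.ElementTree, not with the .NET syndication classes used above, and build_feed and the sample values are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description):
    """Build a minimal RSS 2.0 document as a string."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    # In RSS, <link> is an ordinary element with text content,
    # unlike the empty <link> element of HTML.
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    ET.SubElement(channel, "language").text = "en-us"
    body = ET.tostring(rss, encoding="unicode")  # escapes &, < and > in text
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


feed = build_feed("Feed", "http://localhost:2751/Home/Feed", "Sample & test")

if __name__ == "__main__":
    print(feed)
```

Because the serializer only knows XML, it cannot inject HTML wrapper tags or apply HTML rules to element names.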

How best open xml, parse with xslt and show result in browser

I am currently studying ways to present transformed xml files in browsers. My experience with this is minimal, so a number of questions pop up.
I have a transformation test.xslt which transforms input xml to html, and an input file test.xml containing
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<root>...</root>
which, when opened in IE9, neatly displays the transformed xml contained above in the root element.
Question 1
Is there a processing instruction or similar available to include the source xml into the xml to be opened, somewhat like the following:
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<... instruction to include source file data.xml>
Question 2
The file opened has extension xml. Is there a way to change file contents so it is valid html, allowing the file to be saved with extension html, so that when opened, the default browser will be selected (simply changing extension to html obviously does not have the desired effect so some structural change is necessary) ?
Question 3
My goal is to query a db to get the data to be parsed by the xslt code. What is the best way to do this (no problem if this includes javascript)?
Question 4
Standard db utilities may export query results in attribute-centered fashion (column names and values being represented as attribute names and values). This may involve pre-parsing the xml from db in order to convert it to parent-child fashion (columns as children instead of attributes). What is the best way to do this pre-parsing (note: I already have the xslt for this; I wonder about the data flow and when/how to run two xslt's in sequence) and then apply test.xslt (preferably without saving intermediate xml result files on the server)?
Question 5
When I open the above XML in IE9, it works fine, as said. But opening it in Firefox produces an error (a result tree fragment issue; apparently I need to use Firefox's node-set function, but I still have to discover which namespace it is in), and Opera/Chrome/Safari do not show any content. What exactly are the prerequisites for the various browsers, and where can I find more information on this?
Q1 If you start by serving an HTML file which then accesses the XML and XSLT via JavaScript, it naturally has access to both the input and the output of the XSLT. If you are serving the XML and initiating the transformation using the xml-stylesheet processing instruction, then perhaps the best thing to do (depending on what you want to do) is to stuff the original source into the output, so that JavaScript in the generated page can access it if needed, e.g.
<xsl:template match="whatever">
<html>
<head>
<script id="source" type="x-xml-source">
<xsl:copy-of select="/"/>
</script>
.... whatever you were going to do
Then, if you need to access the source in response to a user action on the page, a script can retrieve the script element with id source and do whatever is needed. (If there is a possibility of the source including the string </script>, you have to code it a bit more defensively.)
Q2 If you want to use the xml-stylesheet processing instruction then you have to serve the file as XML. However, you can instead just serve HTML and then access the XML and XSLT from within a script in the HTML page using the browser's JavaScript XSLT API. As noted above, that is more flexible than the xml-stylesheet mechanism.
Q3 pass
Q4 If you are accessing the xslt from javascript then it is easy to chain the output of one to the input of another without writing back to the server as you just have access to the result as a DOM node (or string, depending)
Answer to question 5: Firefox/Mozilla, Opera, Safari, Chrome all support the EXSLT node-set extension function in the namespace http://exslt.org/common, for IE and MSXML you can use script (imported) inside the XSLT stylesheet to allow it to support that namespace too, see http://dpcarlisle.blogspot.de/2007/05/exslt-node-set-function.html. That way inside the main stylesheet where you need to use the node-set function you don't need to write different code to cater for the different namespaces.

converting html in xml and xslt

I am trying to write XSLT for an XML file.
One of the problems I am facing is that the values used by my XSLT file come from the XML file, and one of the fields (the description) contains HTML tags:
<span class="text"><xsl:value-of select="BusinessDescription">
</xsl:value-of></span><br />
so its outputting including the html tags like
<p> Hello there,</P>
<b>Hotel</b>
How do I get the web browser to render these HTML tags instead of showing them as text?
Like this:
Hello there,
Hotel
If I understand this question correctly, you are asking how to interpret escaped markup not as text but as markup.
The answer is:
This cannot be done in pure XSLT 1.0 or XSLT 2.0 (XPath 3.0, which XSLT 3.0 uses, adds the parse-xml() function for parsing a string as XML).
To do this you need to write an extension function that takes a string, parses it as an XML document, and returns the resulting XML document.
Therefore, instead of:
<xsl:value-of select="BusinessDescription"/>
the code that uses this extension function would look something like this:
<xsl:copy-of select="my:xml-parse(BusinessDescription)"/>
The extension function itself would be written in your favourite programming language. It simply creates an XmlDocument object, tries to load the string (with a method such as LoadXml()), and returns this XmlDocument as its result.
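The difference between inserting the description as text and inserting it as parsed nodes can be sketched outside XSLT. The snippet below is Python, not an actual XSLT extension function; xml_parse is a hypothetical helper that mirrors my:xml-parse() above, and the sample description is an assumed well-formed version of the one in the question (with a matching </p>).

```python
import xml.etree.ElementTree as ET

# Assumed sample value; the parse step only works if the markup is well-formed.
description = "<p>Hello there,</p><b>Hotel</b>"


def xml_parse(fragment):
    """Parse a markup string into nodes (hypothetical stand-in for my:xml-parse)."""
    # A wrapper element is needed because the fragment has two top-level elements.
    return ET.fromstring("<wrapper>" + fragment + "</wrapper>")


# Like xsl:value-of: the string becomes a text node, so the markup is escaped.
as_text = ET.Element("span", {"class": "text"})
as_text.text = description
print(ET.tostring(as_text, encoding="unicode"))

# Like xsl:copy-of on the parsed result: real child elements are inserted.
as_nodes = ET.Element("span", {"class": "text"})
as_nodes.extend(list(xml_parse(description)))
print(ET.tostring(as_nodes, encoding="unicode"))
```

The first print shows the escaped form a browser displays as literal tags; the second shows markup the browser would actually render.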
If you could post the XSL and XML (try to reduce them to the smallest code that still produces the problem) we could give a more accurate answer. One likely possibility is that your XSL does not produce the <html><body>...</body></html> tags.
Your HTML content should be enclosed inside the <body>...</body> element.