Using an NSXMLParser to parse HTML - html

I'm working on an app that aggregates some feeds from the internet and reformats the content, so I'm looking for a way to parse some HTML. Given that XML and HTML are very similar in structure, I was thinking "maybe I should just use an NSXMLParser". I'm already using it to parse my RSS feeds and I've become comfortable with it, but I'm running into a problem.
The parser will not recognize <p> as an element. It has no problem extracting elements like <title> or <img>, but it doesn't like <p>. Has anyone tried doing this, and if so, do you have any suggestions or workarounds for this issue? I think NSXMLParser is good for what I'm doing and I would like to use it, but obviously, if I can't get the text in <p> elements, it's completely useless to me.
Any suggestions are welcome, even ones suggesting a different method entirely. I've looked into some third party libraries for doing this but from what I've read they all have some bugs and I would much prefer to use something provided by Apple.

There's absolutely nothing special about "p" as the name of an element. While it is hard to be sure because you haven't provided an example of the HTML you are parsing, the problem is most likely caused by HTML that is not well-formed XML. In other words, using NSXMLParser would work on XHTML, but not necessarily plain-old HTML.
The "p" element is frequently found in HTML without the matching closing tag, which is not valid XML. My guess is that you would have to convert the HTML to XHTML before trying to parse it with an NSXMLParser.

HTML is not necessarily well-formed XML, and that's the trouble when you parse it as XML.
Take the following example:
<body>
<p>123
<p>abc
<p>789
</body>
If you view this chunk of HTML in a browser, it displays just as you would expect. But if you parse it as XML, the parser will fail, because those p tags are not closed.
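The difference can be reproduced outside of Cocoa. The following is a minimal sketch in Python (standard library only, chosen purely for illustration; it is not NSXMLParser or DTHTMLParser). It feeds the snippet above to a strict XML parser, which rejects it, and to a lenient HTML tokenizer, which still yields the paragraph text. The names parses_as_xml, ParagraphCollector and paragraph_texts are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The snippet from the answer above: three <p> tags, none of them closed.
SNIPPET = "<body>\n<p>123\n<p>abc\n<p>789\n</body>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ParagraphCollector(HTMLParser):
    """Collect the text that follows each <p> start tag, closed or not."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends the previous one.
            self.paragraphs.append("")
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("p", "body"):
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def paragraph_texts(markup):
    """Return the stripped text of every paragraph found in the markup."""
    collector = ParagraphCollector()
    collector.feed(markup)
    collector.close()
    return [text.strip() for text in collector.paragraphs]
```

Here parses_as_xml(SNIPPET) is False because the XML parser sees mismatched tags, while paragraph_texts(SNIPPET) recovers the three paragraphs. An HTML-aware parser on iOS plays the same role as the lenient tokenizer in this sketch.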

I recommend you use my DTHTMLParser, which is modeled after NSXMLParser and uses libxml2 to parse HTML perfectly. You generally cannot rely on HTML being well-formed and parseable as XML.
libxml2 has an HTML mode in which it is able to ignore things like unclosed tags and other idiosyncrasies of HTML.
HTML parsing explained:
http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/
http://www.cocoanetics.com/2012/01/taming-html-parsing-with-libxml-2/
DTHTMLParser documentation:
https://docs.cocoanetics.com/DTFoundation/Classes/DTHTMLParser.html
Source, part of DTFoundation:
DTHTMLParser.h
DTHTMLParser.m

Related

Look for Nested XML tag with Regex

This is my first post here, and I'm hoping to get some responses. I've read through a few similar posts and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the cases in those other posts, so I'm giving it a shot.
I'm trying to find all the nested tags, here are some examples
I want to catch:
<a><a></a></a>
I don't want to catch
<a></a><a></a>
So in plain English, I want to catch every
<a> that follows another <a> without a </a> in between them. I also want to look through the entire string, so matching should proceed even if it sees a newline or line break.
Hoping to have this problem solved.
Thanks all!
I hope you are ready for parsing XML with regex.
First of all, let's define what XML tags would look like!
<tag_name␣(optional space (then whatever that doesnt end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>
To match one of these tags we can then use the following regex:
/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:
/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):
/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s
Done - The regex should do.
No seriously, try it out.
I stole an XML file fragment from w3schools XML tutorial and tried it with my regex, I copied a Maven project .xml from aliteralmind's question and tried it with my regex as well. Works best with heavily nested elements.
Cheers.
If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.
And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.
No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
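As a rough illustration of the parser-based route, here is a minimal sketch that uses Python's xml.etree.ElementTree in place of the XSLT suggested above (a swapped-in tool, not the answerer's method). It assumes the input is well-formed XML with a single root element, and the function name nested_same_name is made up for this example. It reports every element that has a descendant with the same tag name, at any depth.

```python
import xml.etree.ElementTree as ET


def nested_same_name(xml_text):
    """Return the elements that contain a descendant with the same tag name."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter():
        # iter(tag) also yields the element itself when its tag matches,
        # so the element is excluded explicitly.
        has_nested = any(
            descendant is not element
            for descendant in element.iter(element.tag)
        )
        if has_nested:
            hits.append(element)
    return hits
```

With the two cases from the question wrapped in a root element, "<root><a><a></a></a></root>" yields one hit (the outer a), while "<root><a></a><a></a></root>" yields none. Comments and CDATA sections are handled by the parser, which is the part a regular expression cannot do reliably.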

Perl HTML::TreeBuilder adding <html>, <head> and <body> tags to parsed content, how to stop or work around it?

Background:
I'm using HTML::TreeBuilder to parse an entire html page, say "whole_page" for reference's sake. I'm then using the inherited parse_content method (same as for whole_page) of a new TreeBuilder object to parse a chunk of html, say "html_to_insert". The root element of html_to_insert should be a <div> tag. Ultimately, the html_to_insert tree needs to be inserted into the whole_page tree.
Problem:
The html_to_insert tree is being wrapped with <html>, <head> and <body> tags, which I obviously don't need. I looked at HTML::Parser to see if there was a parameter that might solve the problem, but I couldn't find anything.
Question:
Is there a simple way to stop the parse method from wrapping html_to_insert with the un-needed tags? Knowing what I'm trying to do, am I doing this ass backwards (is there a better way)?
Thanks for any help.
You might want to look at the guts method in HTML::Tree. It returns only the non-implicit nodes as a list.
If you can ensure your HTML is XHTML-compliant, that is, it's a proper XML document, you may be able to use XML tools to do the job instead. In the past, I've used XML::Twig for this type of job, it was a bit easier that way.
Of course, if you're parsing arbitrary web pages from the internet, you may not have this type of guarantee.

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separated xml attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node\'s bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML, which is part of the larger Marpa parser project. Marpa parses any language that can be described in BNF and is documented on the author's blog, which is very interesting, but it is much newer and more experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat, I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?
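Whichever HTML parser is chosen, the style information still has to be split into name/value pairs before it can be read like the old XML attributes. The following is a minimal sketch of that step in Python (used only for illustration; the question itself is about Perl). The names parse_declarations and StyledNodeCollector are hypothetical, and the splitting assumes simple declarations without semicolons inside values, which holds for the samples in the question but not for CSS in general.

```python
from html.parser import HTMLParser


def parse_declarations(css_text):
    """Split 'name:value;name:value' into a dict of declarations."""
    declarations = {}
    for chunk in css_text.split(";"):
        name, separator, value = chunk.partition(":")
        if separator and name.strip():
            declarations[name.strip()] = value.strip()
    return declarations


class StyledNodeCollector(HTMLParser):
    """Record each start tag with its class and its parsed inline style."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        self.nodes.append({
            "tag": tag,
            "class": attributes.get("class"),
            # A missing or valueless style attribute becomes an empty dict.
            "style": parse_declarations(attributes.get("style") or ""),
        })


def styled_nodes(markup):
    """Return one record per start tag found in the markup."""
    collector = StyledNodeCollector()
    collector.feed(markup)
    collector.close()
    return collector.nodes
```

The same parse_declarations helper can be applied to the body of a rule such as .ft19 once it has been cut out of the style block, so that a lookup like style["font-style"] takes the place of getAttribute('italic').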

What is the safest way to extract <title> from an HTML file using xpath?

Here is my current xpath code "/html/head/title".
But you know, in real-world HTML the markup is often broken; for example, a missing <html> tag could cause an exception. So I would like to know if there's a safe way to extract the <title> tag (something like getElementsByTagName)?
"//title" perhaps?
Because of the unruly nature of html markup you should use an html parsing library. You didn't specify a platform or language but there are a number of open source libraries out there.
Actually /html/head/title should work just fine, even on badly malformed mark-up, assuming:
there is a title element;
your HTML parser behaves the same way browser parsers do;
your HTML parser puts the HTML elements into the null namespace.
You will have to allow for the possibility of there being multiple title elements in invalid HTML, so /html/head/title[1] is possibly better.
If you can use JavaScript, you can simply read:
document.title
If you have something that an XML parser can parse (which is not the case with most HTML, but needs to be the case to use XPath), then you could use //title to get the element.
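When the markup cannot be parsed as XML at all, a lenient HTML tokenizer can stand in for the //title lookup. Here is a minimal sketch using Python's standard html.parser (the question does not name a platform, so the language is an assumption, and TitleFinder and first_title are made-up names). It returns the text of the first title element wherever it appears and ignores any later ones.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Capture the text of the first <title> element, wherever it occurs."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # Only the first title counts; later ones are ignored.
        if tag == "title" and self.title is None:
            self.title = ""
            self._capturing = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.title += data


def first_title(markup):
    """Return the first title's text, or None if there is no title element."""
    finder = TitleFinder()
    finder.feed(markup)
    finder.close()
    return finder.title.strip() if finder.title is not None else None
```

This works even when the <html> and <head> tags are missing, because the tokenizer does not depend on the document structure; the trade-off is that it does not build a tree, so it is no substitute for XPath beyond this one lookup.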

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (HTML is too crappy to be parsed) and I would like to remove <a href> tags, but keep text in the tags.
I've been searching for a regex but still haven't found a good way to do that.
All advice is welcome!
You could remove anything from a string that looks like a HTML link via regex. Results heavily depend on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.
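For the throw-away case, the replacement described above could look like this minimal Python sketch. The function name strip_anchor_tags is made up, and the case-insensitive flag is an addition that the original pattern does not include. It assumes no ">" character appears inside an attribute value.

```python
import re

# Matches opening and closing <a> tags only; \b keeps it from
# matching other tags that start with "a", such as <abbr>.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_anchor_tags(html_text):
    """Remove <a ...> and </a> tags while keeping the text between them."""
    return ANCHOR_TAG.sub("", html_text)
```

Because only the tags are removed, the link text stays in place and tags such as <abbr> are left untouched.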