Grep and Extract Data in Perl - html

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA kept between a set of tags which one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash

Since it is HTML I think this could work for you?
https://metacpan.org/pod/XML::XPath
XPath is the way.

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($file_name);
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
my $data = $node->findvalue('.'); // the content of the node
print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.

Use HTML parsing modules as described in answers to this Q - HTML::TreeBuilder or HTML::Parser.
Purely theoretically you could try doing this using Regular Expressions to do this but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters - too easy to get wrong, too hard to get well, and impossible to get 100% right since HTML is not a regular language.

You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttributes('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.

Failing to extract html table rows

I try to extract all five rows listed in the table above.
I'm using Ruby hpricot library to extract the table rows using xpath expression.
In my example, the xpath expression I use is /html/body/center/table/tr. Note that I've removed the tbody tag from the expression, which is usually the case for successful extraction.
The weird thing is that I'm getting the first three rows in the result with the last two rows missing. I just have no idea what's going on there.
EDIT: Nothing magic about the code, just attaching it upon request.
require 'open-uri'
require 'hpricot'
faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
(faculty/"/html/body/center/table/tr").each do |text|
puts text.to_s
end
The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it in another way than your browser — hence the different results — but it can't really be blamed. Until HTML5, there was no standard on how to parse invalid HTML documents.
I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:
require 'open-uri'
require 'nokogiri'
faculty = Nokogiri.HTML(open("http://www.utm.utoronto.ca/7800.0.html"))
faculty.search("/html/body/center/table/tr").each do |text|
puts text
end
Maybe you should switch?
The path table/tr does not exist. It's table/tbody/tr or table//tr. When you use table/tr, you're specifically looking for a <tr> that is a direct descendant of <table>, but from your image, this isn't how the markup is structured.

How to rewrite a stream of HTML tokens into a new document?

Suppose I have an HTML document that I have tokenized, how could I transform it into a new document or apply some other transformations?
For example, suppose I have this HTML:
<html>
<body>
<p>text</p>
<p>Hello <span class="green">world</span></p>
</body>
</html>
What I have currently written is a tokenizer that outputs a stream of tokens. For this document they would be (written in pseudo code):
TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]
But now I don't have any idea how could I use this stream to create some transformations.
For example, I would like to rewrite TAG_ATTRIBUTE_VALUE[/foo] in TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
Another transformation I would like to do is make it output TAG_ATTRIBUTE[href] attributes after the TAG_OPEN[a] in parenthesis, for example,
text
gets rewritten into
text(/foo)
What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
Do I need to create the parse tree? I have never done it and don't know how to create a parse tree from a stream of tokens. Or can I do it somehow else?
Any suggestions are welcome.
And one more thing - I would like to learn all this parsing myself, so I am not looking for a library!
Thanks beforehand, Boda Cydo
If we can assume that the html is xml compliant, then xslt would be a way to go. But I am assuming that would be out as you seem to want to write your own parser (not sure why).
If you really want to write a parser (I'd write parse rules, not your own parser engine) take a look at antlr and MS oslo.
There are various ways of parsing/traversing an XML/HTML tree. Perhaps I can point you to:-
http://razorsharpcode.blogspot.com/2009/10/combined-pre-order-and-post-order-non.html
If you want to do pre-order or post-order manipulation of DOM elements, you can use the algorithm described there.

Match multiple terms within <body> tags

I've want to match any occurrence of a search term (or list of search terms) within the tags of a document. My current solution uses preg (within a Joomla plugin)
$pattern = '/matchthisterm/i';
$article->text = preg_replace($pattern,"<span class=\"highlight\">\\0</span>",$article->text);
But this replaces everything within the HTML of the document so I need to match the tags first. Is this even the best way to achieve this?
EDIT:
OK, I've used simplehtmldom, but just need some help getting to the correct term. So far I've got:
$pattern = '/(matchthisterm)/i';
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
if (preg_match($pattern, $term->plaintext)) {
$term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
}
}
This makes the entire node text bold, am I ok to use the preg_replace in here?
SOLUTION:
//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
$term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
}
No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:
bof
gives quite thoroughly broken output:
matchthisterm</span>/bar">bof
The proper way to do it would be to use a proper HTML/XML parser (for example DOMDocument.loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally re-save the HTML back to a string.
An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See eg. this question for an example.
I agree processing HTML with regex is not a good solution.
I just read the argument about why regex can't parse HTML here: RegEx match open tags except XHTML self-contained tags
I quite agree with the whole thing, but the problem is MUCH simpler here: we just need to know whether we are inside some HTML tag or not. We don't have to parse an HTML structure and interpreting a tree and mismatching tags or some other errors. We just know that a HTML tag is something between < and >. I believe the regex is a very good, adapted and consistent tool here.
It's not because we're dealing with some HTML that we don't want to use regex. We need to focus on the real problem here, which I believe doesn't really process HTML. We only need to know whether we're inside a tag or not. I hope I won't get too much downvotes for this, but I completely assume my position.
I'm redirecting you to a previous post (where you put a link to this topic) I made sooner this day: Highlight text, except html tags
On the same idea, and I hope we know all we need to, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string and deal with case insensivity, don't use regex. Keep it simple. (I'm assuming you didn't simplify the replacement you're trying to make on purpose to explain your problem here).
I haven't used preg but I've done pattern matching in perl, java and actionscript before. If this is anything similar you have to escape special characters. For example "\<span class.... I found a website that talks about using preg, in case you haven't come across this site, that can be found here

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
simple_replace {
my $n = shift;
my $text = $n->textContent;
$text =~ s/foo/<foo>/g;
return XML::LibXML::Text->new($text);
} '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags
To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*