Failing to extract html table rows

I'm trying to extract all five rows listed in the table above.
I'm using the Ruby Hpricot library to extract the table rows with an XPath expression.
In my example, the xpath expression I use is /html/body/center/table/tr. Note that I've removed the tbody tag from the expression, which is usually the case for successful extraction.
The weird thing is that I'm getting the first three rows in the result with the last two rows missing. I just have no idea what's going on there.
EDIT: Nothing magic about the code, just attaching it upon request.
require 'open-uri'
require 'hpricot'
faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
(faculty/"/html/body/center/table/tr").each do |text|
puts text.to_s
end

The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently from your browser, hence the different results, but it can't really be blamed. Until HTML5, there was no standard on how to parse invalid HTML documents.
I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:
require 'open-uri'
require 'nokogiri'
# On Ruby 3.0 and later, open-uri no longer patches Kernel#open; use URI.open.
faculty = Nokogiri::HTML(URI.open("http://www.utm.utoronto.ca/7800.0.html"))
faculty.search("/html/body/center/table/tr").each do |text|
puts text
end
Maybe you should switch?

The path table/tr does not exist. It's table/tbody/tr or table//tr. When you use table/tr, you're specifically looking for a <tr> that is a direct child of <table>, but from your image, this isn't how the markup is structured.
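The difference between the three paths can be seen in a minimal sketch. It uses Python's standard-library ElementTree rather than Hpricot, and the markup is a made-up, well-formed stand-in for the real page, so treat it as an illustration of the path semantics only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: one table whose rows sit inside <tbody>.
doc = ET.fromstring(
    "<html><body><table>"
    "<tbody><tr><td>a</td></tr><tr><td>b</td></tr></tbody>"
    "</table></body></html>"
)

# table/tr: only <tr> elements that are direct children of <table>.
direct_rows = doc.findall("./body/table/tr")

# table/tbody/tr: the rows one level further down, inside <tbody>.
tbody_rows = doc.findall("./body/table/tbody/tr")

# table//tr: every <tr> below <table>, however deeply nested.
all_rows = doc.findall("./body/table//tr")

print(len(direct_rows), len(tbody_rows), len(all_rows))  # prints: 0 2 2
```

Whether the real document has a <tbody> depends on the source markup and on the parser, so it is worth checking what the parser actually produced before choosing a path.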

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.

You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttribute('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
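For comparison, the same parser-based idea can be sketched outside PowerShell. The following is a minimal illustration using Python's standard-library html.parser; the class name ZoomCollector is invented for this example, and the sample markup is the snippet from the question.

```python
from html.parser import HTMLParser


class ZoomCollector(HTMLParser):
    """Collect data-imagezoom values of <img> tags inside <a class="aaImg">."""

    def __init__(self):
        super().__init__()
        self.inside_anchor = False  # True while inside an <a class="aaImg">
        self.zoom_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "aaImg" in (attrs.get("class") or "").split():
            self.inside_anchor = True
        elif tag == "img" and self.inside_anchor and attrs.get("data-imagezoom"):
            self.zoom_urls.append(attrs["data-imagezoom"])

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside_anchor = False


sample = '''<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>'''

collector = ZoomCollector()
collector.feed(sample)
print(collector.zoom_urls)
```

Because the parser handles the quoting, the result does not change if the attribute values are wrapped in single quotes instead of double quotes.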

So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.

Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close-target candidate) that pop up once or twice a week.
I'm developing an application that needs to parse a website with tables in it. As deriving XPath expression for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.
Example input looks like this:
<!-- snip -->
<table id="example">
<tr>
<th>Example Cell</th>
<th>Another one</th>
</tr>
<tr>
<td>foobar</td>
<td>42</td>
</tr>
</table>
<!-- snip -->
I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression
//table[@id="example"]/tbody/tr[2]/td[1]
which works fine in any XPath tester plugins, but not my own application (no results found). If I cut down the query to //table[@id], it works again.
What's going wrong?

The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tool, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says
The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.
There is an in-depth explanation in another answer on stackoverflow.
On the other hand, HTML does not necessarily require that tag to be used:
The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.
Most XPath Processors Work on raw XML
Excluding JavaScript, most XPath processors work on raw XML, not the DOM, thus do not add <tbody/> tags. Also HTML parser libraries like tag-soup and htmltidy only output XHTML, not "DOM-HTML".
This is a common problem posted on Stackoverflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser), or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), whereas Firebug will always show them.
Solution 1: Remove /tbody Axis Step
Check if the table you're stuck at really does not contain a <tbody/> element (see last paragraph). If it does, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
Replace the /tbody axis step by a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
Solution 3: Allow Both Input With and Without <tbody/> Tags
If you're not sure in advance whether your table contains <tbody/> tags, or you use the query in both "HTML source" and DOM contexts, and you don't want to or cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
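Not every library implements the union operator. As a rough sketch of the same "try both shapes" idea, here is a version for Python's standard-library ElementTree, which supports only a subset of XPath 1.0 and has no "|", so the two alternatives are tried one after the other. The function name and the wrapping <div> are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def second_row_first_cell(markup):
    """Return the text of row 2, cell 1 of table#example, with or without <tbody>."""
    root = ET.fromstring(markup)
    base = ".//table[@id='example']"
    # No union operator in ElementTree, so try both alternatives in turn.
    for path in (base + "/tr[2]/td[1]", base + "/tbody/tr[2]/td[1]"):
        cell = root.find(path)
        if cell is not None:
            return cell.text
    return None


rows = (
    "<tr><th>Example Cell</th><th>Another one</th></tr>"
    "<tr><td>foobar</td><td>42</td></tr>"
)
# The table as it appears in the raw source, and as a browser DOM would expose it.
as_source = f"<div><table id='example'>{rows}</table></div>"
as_dom = f"<div><table id='example'><tbody>{rows}</tbody></table></div>"

print(second_row_first_cell(as_source))  # prints: foobar
print(second_row_first_cell(as_dom))     # prints: foobar
```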

Just came across the same problem. I almost wrote a recursive function to check for every tbody tag if it exists and traverse the dom that way, then I remembered I know regex. :)
Before parsing, get the html as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.
Jens Erat gives a good explanation, but here is
Solution 4: Make sure the HTML source always has the <tbody> tags with regex
JavaScript
var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
html = html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>").replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4"); // strings are immutable, so keep the result
PHP
$html = $dom->saveHTML();
$html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
$dom->loadHTML($html);
Just the regex:
matches `<table>` tag with whatever else junk inside the tag and between this and the next tag if the next tag is NOT `<tbody>` also with stuff inside the tag
/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/
replace with
$1<tbody>
the $1 referencing the captured `<table>` tag with contents.
Do the same for the closing tag like this:
/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/
replace with
$1</tbody>$4
This way the dom will ALWAYS have the <tbody> tags where necessary.

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash

Since it is HTML I think this could work for you?
https://metacpan.org/pod/XML::XPath
XPath is the way.

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
my $data = $node->findvalue('.'); # the content of the node
print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
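To tie this back to the question's goal of a DATA_2 => DATA_1 mapping, here is a minimal sketch of the pairing logic. It is written in Python with the standard-library html.parser rather than Perl, and the class name CellPairs and the sample values are invented for illustration; it assumes each "jumlah" cell is followed by its "ud" cell.

```python
from html.parser import HTMLParser


class CellPairs(HTMLParser):
    """Map the text of each <td class="ud"> to the preceding <td class="jumlah">."""

    def __init__(self):
        super().__init__()
        self.current = None   # class of the <td> currently being read, if relevant
        self.buffer = []      # text fragments of that cell
        self.pending = None   # text of the last "jumlah" cell seen
        self.mapping = {}

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("jumlah", "ud"):
                self.current = cls
                self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.current:
            return
        text = "".join(self.buffer).strip()
        if self.current == "jumlah":
            self.pending = text
        elif self.pending is not None:  # an "ud" cell completes the pair
            self.mapping[text] = self.pending
            self.pending = None
        self.current = None


# Hypothetical sample content standing in for the page held in the variable.
content = """
<tr><td class="jumlah">10</td>
<td class="ud">alpha</td></tr>
<tr><td class="jumlah">20</td>
<td class="ud">beta</td></tr>
"""

parser = CellPairs()
parser.feed(content)
print(parser.mapping)  # prints: {'alpha': '10', 'beta': '20'}
```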

Use HTML parsing modules as described in answers to this Q - HTML::TreeBuilder or HTML::Parser.
Purely theoretically, you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters: too easy to get wrong, too hard to get right, and impossible to get 100% right since HTML is not a regular language.

You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Match multiple terms within <body> tags

I want to match any occurrence of a search term (or list of search terms) within the <body> tags of a document. My current solution uses preg (within a Joomla plugin)
$pattern = '/matchthisterm/i';
$article->text = preg_replace($pattern,"<span class=\"highlight\">\\0</span>",$article->text);
But this replaces everything within the HTML of the document so I need to match the tags first. Is this even the best way to achieve this?
EDIT:
OK, I've used simplehtmldom, but just need some help getting to the correct term. So far I've got:
$pattern = '/(matchthisterm)/i';
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
if (preg_match($pattern, $term->plaintext)) {
$term->outertext = '<span class="highlight">' . $term->outertext . '</span>';
}
}
This makes the entire node text bold, am I ok to use the preg_replace in here?
SOLUTION:
//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
$es = $html->find('text');
foreach ($es as $term) {
//Match to the terms within the text nodes
$term->outertext = str_ireplace('matchthis', '<span class="highlight">matchthis</span>', $term->outertext);
}

No, processing [X][HT]ML with regex is largely disastrous. In the simplest case for your example, this input:
<a href="/matchthisterm/bar">bof</a>
gives quite thoroughly broken output:
<a href="/<span class="highlight">matchthisterm</span>/bar">bof</a>
The proper way to do it would be to use a proper HTML/XML parser (for example DOMDocument.loadHTML or simplehtmldom), then scan and replace the contents of each text node separately. Finally re-save the HTML back to a string.
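As a rough sketch of that parse, replace in text nodes, re-serialize cycle, here is a version using Python's standard-library html.parser instead of the PHP libraries named above. The names Highlighter and highlight are made up for the example, and it makes no attempt to skip the contents of <script> or <style> elements, which a real implementation would want to leave alone.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emit the markup unchanged, wrapping the term only inside text nodes."""

    def __init__(self, term):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(re.escape(term), re.IGNORECASE)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(
            self.pattern.sub(r'<span class="highlight">\g<0></span>', data)
        )

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def highlight(html, term):
    parser = Highlighter(term)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(highlight('<a href="/matchthisterm/bar">matchthisterm bof</a>', "matchthisterm"))
# prints: <a href="/matchthisterm/bar"><span class="highlight">matchthisterm</span> bof</a>
```

The href keeps its original value because attribute text never reaches handle_data; only the link text is wrapped.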
An alternative for search term highlighting is to do it in JavaScript. Since the browser has already parsed the HTML to a DOM, that saves you a processing step. See eg. this question for an example.

I agree processing HTML with regex is not a good solution.
I just read the argument about why regex can't parse HTML here: RegEx match open tags except XHTML self-contained tags
I quite agree with the whole thing, but the problem is MUCH simpler here: we just need to know whether we are inside some HTML tag or not. We don't have to parse an HTML structure, interpret a tree, or deal with mismatched tags or other errors. We just know that an HTML tag is something between < and >. I believe regex is a very good, well-suited and consistent tool here.
Dealing with some HTML is not in itself a reason to avoid regex. We need to focus on the real problem here, which I believe doesn't really involve processing HTML. We only need to know whether we're inside a tag or not. I hope I won't get too many downvotes for this, but I stand by my position.
I'm redirecting you to a previous post (where you put a link to this topic) I made earlier today: Highlight text, except html tags
Along the same lines, you're using preg_replace() where a simpler function like str_ireplace() would be sufficient. If you just need to replace a word (or a set of words) inside a string and deal with case insensitivity, don't use regex. Keep it simple. (I'm assuming you didn't deliberately simplify the replacement you're trying to make in order to explain your problem here.)

I haven't used preg but I've done pattern matching in perl, java and actionscript before. If this is anything similar you have to escape special characters. For example "\<span class.... I found a website that talks about using preg, in case you haven't come across this site, that can be found here

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.

If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
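For readers working outside Perl, a similar dependency-free effect can be sketched in Python. Python's re module has no \G anchor, so this swaps in a different technique: split the string into tags and text, then replace only in the text pieces. It relies on the same assumption, namely that angle brackets appear only as tag delimiters. The function name and the example.com URL are made up for illustration.

```python
import re


def bold_outside_tags(html, key):
    # The capturing group keeps the tags in the result, so even indexes hold
    # the text between tags and odd indexes hold the tags themselves.
    parts = re.split(r"(<[^>]*>)", html)
    needle = re.compile(re.escape(key))
    for i in range(0, len(parts), 2):
        parts[i] = needle.sub(lambda m: f"<b>{m.group(0)}</b>", parts[i])
    return "".join(parts)


print(bold_outside_tags('This is the <a href="http://example.com/foo">foo</a> link', "foo"))
# prints: This is the <a href="http://example.com/foo"><b>foo</b></a> link
```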

In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
simple_replace {
my $n = shift;
my $text = $n->textContent;
$text =~ s/foo/<foo>/g;
return XML::LibXML::Text->new($text);
} '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)

The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.

Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags

To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*
