Only parsing outer element - html

I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.
Currently, I am scraping a large table; here is a small fragment:
<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>
Airfield Lighting
<div>
<div>
<table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
"black">
<tr>
<td bgcolor="white">Abstract:<br />
This project is for the purchase and delivery, of various airfield
lighting, for a period of 36 months, with two optional 1 year renewals,
in accordance with the specifications, terms and conditions specified in
the solicitation.</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
And here is the Ruby code I am using to scrape:
document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
document[1..-1].each do |v|
cells = v.search 'td'
if cells.inner_html.length > 0
data = {
number: cells[0].text,
}
end
ScraperWiki::save_sqlite(['number'], data)
end
Unfortunately this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I only extract the content of the first td?
Keep in mind that under this table, there are many table rows following the same format.

In CSS, table tr means tr anywhere underneath the table, including nested tables. But table > tr means the tr must be a direct child of the table.
Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):
doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)
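To make the difference between the two selectors concrete, here is a minimal sketch on a trimmed copy of the table from the question. It is only an illustration: it uses REXML (bundled with Ruby) rather than Nokogiri so it runs without extra gems, and the XPath steps `.//tr` and `tr` play the roles of the CSS `table tr` and `table > tr`. The constant and method names are made up for this example.

```ruby
require 'rexml/document'

# Trimmed, well-formed copy of the table from the question.
BID_TABLE = <<~HTML
  <table id="rptBidTypes__ctl0_dgResults">
    <tr>
      <td align="left">S24327</td>
      <td>
        Airfield Lighting
        <table>
          <tr><td bgcolor="white">Abstract: ...</td></tr>
        </table>
      </td>
    </tr>
  </table>
HTML

# Rows anywhere below the table, the analogue of CSS "table tr".
def descendant_rows(table)
  REXML::XPath.match(table, './/tr')
end

# Rows that are direct children of the table, the analogue of CSS "table > tr".
def direct_rows(table)
  table.elements.to_a('tr')
end

table = REXML::Document.new(BID_TABLE).root

puts descendant_rows(table).size # 2: the outer row plus the nested table's row
puts direct_rows(table).size     # 1: only the outer row
# First cell of each direct row only.
p direct_rows(table).map { |row| row.elements['td'].text } # => ["S24327"]
```

The nested table's row only shows up in the descendant search, which is why the original loop picked up rows it was never meant to see.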

The content of the first td would be:
doc.at("table#rptBidTypes__ctl0_dgResults td").text

The problem is that your search is matching two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1] you're actually selecting the second <tr> tag rather than the first one.
To select just the direct child <tr> tag, use:
document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")
Then you can get the text for the <td> tag with:
document.css('td')[0].text #=> "S24327"

Related

Selecting variations in Nokogiri

I'm scraping these two sites:
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Law
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=BSL.
Unfortunately, they have variations. One has the level name (e.g. Level 2) inside an a[href] tag, while the other is just plain text. How can I select one or the other, depending on which one is there?
I tried this to no avail:
level.css(/"a[href]"|".left"/).text
Here are shortened versions of the 2 HTML sections:
<table class="chart">
<tr valign="middle">
<td class="left">Level 2</td> <!-- the problem -->
<td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
</tr>
</table>
<table class="chart">
<tr valign="middle">
<td class="left">Level 1</td>
<td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
</tr>
</table>
My Code (edited from section of code to whole method)
def self.scrape_details_page(library_url)
details_page = Nokogiri::HTML(open(library_url))
details_page.css("table.chart tr").collect do |level|
right = level.css(".right").text.split
{level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
end
end
If what you want to do is grab the text that is within the innermost div, you should be able to dive all the way down just by calling #text on the parsed td element. No need to account for and walk extra tags that might be present inside, e.g. the link tag. Given your code as written:
levels = details_page.css("table.chart tr").collect do |level|
  level.text
end
For each row, that pulls the inner text (level label and percentage value) as a single string, and collect gathers those strings into the levels array.
Edit: also, if all you care about is getting the level label, you can just filter the elements by class up front:
levels = details_page.css("table.chart tr td.left").collect do |level|
  level.text
end
The answer by jk_ should work in this particular case.
In the more general case, if you're going to use a CSS selector, you need to use CSS syntax for "or" (a comma). So if you were going to use the selectors you originally asked about, it'd be
level.css('a[href], .left').text
Thanks to inspiration from @jk_, I fixed it using .css(".left").text. That selects all the text in the left td inside the tr, whether or not it is wrapped in a link.
The working code:
def self.scrape_details_page(library_url)
details_page = Nokogiri::HTML(open(library_url))
details_page.css("table.chart tr").collect do |level|
right = level.css(".right").text.split
{level: level.css(".left").text, available: right[0], out_of_available: right[3]}
end
end
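The reason `.left` plus `text` covers both page variants is that the text of an element is the concatenation of all text nodes below it, so a wrapping link makes no difference. Here is a rough sketch of that idea. It uses REXML instead of Nokogiri so it runs on a stock Ruby install, `deep_text` is a hand-written stand-in for what Nokogiri's `text` does, and the `<a href="#">` wrapper is a hypothetical placeholder for the link on one of the pages.

```ruby
require 'rexml/document'

# Two rows modelled on the question. The <a href="#"> wrapper is a
# hypothetical stand-in for the link that one of the pages uses.
CHART = <<~HTML
  <table class="chart">
    <tr><td class="left"><a href="#">Level 2</a></td></tr>
    <tr><td class="left">Level 1</td></tr>
  </table>
HTML

# Concatenate every text node below a node, similar in spirit to Nokogiri's #text.
def deep_text(node)
  case node
  when REXML::Text    then node.value
  when REXML::Element then node.children.map { |child| deep_text(child) }.join
  else ''
  end
end

# Label of every td with class "left", linked or not.
def level_labels(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//td[@class='left']").map { |cell| deep_text(cell).strip }
end

p level_labels(CHART) # => ["Level 2", "Level 1"]
```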

Excel VBA web scraping from table

I am trying to extract some info from the table below into Excel using VBA, without any success. The values I need do not seem to have any element ID, tag name or class name assigned to them. I'm after the Fuel Usage value (89218) and the time value in the same row (01:15). Can anyone point me in the right direction on how to scrape values from a table, or how to extract data from a specific TR or TD?
HTML source of the table:
<h3>Airbus A300-600-PW4158 Fuel Planner</h3>
<p>London to Chicago EGKK-KORD (3441 NM)<br /></p>
<h2>Total Fuel: 101901 POUNDS</h2>
<table width="100%" border=1>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:left;">Fuel</td>
<th style="text-align:left;">Time</th>
</tr>
<tr>
<td>Fuel Usage</td>
<td>89218</td>
<td>08:47</td>
</tr>
<tr>
<td>Reserve Fuel</td>
<td>12682</td>
<td>01:15</td>
</tr>
<tr>
<td>Fuel on Board</td>
<td>101901</td>
<td>10:02</td>
</tr>
</table>
Much appreciated.
CSS Selectors:
Without seeing more of the HTML, you can use the following CSS selectors for the snippet shown:
tr td:nth-child(2)
tr td:nth-child(3)
Each selector returns a nodeList of every td that is the second (or third) child of its parent tr.
You can access individual items from a nodeList by index.
VBA:
The overall syntax in VBA will be something like:
.document.querySelectorAll("tr td:nth-child(2)")(0).innerText
or possibly
.document.querySelectorAll("tr td:nth-child(2)").Item(0).innerText
The 0 is hypothetical. You would need to inspect your full HTML to ascertain the correct index to use.
The .document can be obtained, for example, by using IE to navigate to the page, or by populating an HTML document's innerHTML from the .responseText of an HTTP request.
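To show how the selector and the index work together, here is a small sketch of the same idea. It is written in Ruby with REXML rather than VBA, on a cleaned-up copy of the table from the question, and `nth_cells` is a made-up helper name. For this table, taking the n-th td of each row gives the same cells as `tr td:nth-child(n)`.

```ruby
require 'rexml/document'

# Cleaned-up copy of the table from the question.
FUEL_TABLE = <<~HTML
  <table>
    <tr><th></th><th>Fuel</th><th>Time</th></tr>
    <tr><td>Fuel Usage</td><td>89218</td><td>08:47</td></tr>
    <tr><td>Reserve Fuel</td><td>12682</td><td>01:15</td></tr>
    <tr><td>Fuel on Board</td><td>101901</td><td>10:02</td></tr>
  </table>
HTML

# Text of the n-th td (1-based) of every row that has one.
# The header row contains only th cells, so it contributes nothing.
def nth_cells(markup, n)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//tr').map do |row|
    cell = row.elements.to_a('td')[n - 1]
    cell&.text
  end.compact
end

fuel  = nth_cells(FUEL_TABLE, 2)
times = nth_cells(FUEL_TABLE, 3)

p fuel     # => ["89218", "12682", "101901"]
p fuel[0]  # => "89218" (the Fuel Usage row)
p times[0] # => "08:47"
```

The list holds one entry per matching row, so the index picks the row, which is why the right index depends on what else is on the full page.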

How to match nearest tag backward with XPath

I have HTML like this:
html =<<EOS
<table><!-- outer table -->
<tr><td>
<table><!-- inner table 1 -->
<tr><td>Foo</td></tr>
</table>
<table><!-- inner table 2 -->
<tr><td>Bar</td></tr>
</table>
</td></tr>
</table>
EOS
I want to get a changing value Bar from a static value Foo.
With this code I can get the value.
doc = Nokogiri::HTML(html)
doc.xpath("//table[tr/td[text()='Foo']]/following-sibling::table//td").text
And I wanted to rewrite like this:
doc.xpath("//table[//td[text()='Foo']]/following-sibling::table//td").text
But this code doesn't work because //table[//td[text()='Foo']] matches outer table not the inner table.
Is there an expression for the nearest backward match in XPath, like this?
//table[(nearest match expression)td[text()='Foo']]
Yes, //table[//td[text()='Foo']] gives the outer table as the first result (not the only result), but //table[//td[text()='Foo']]/following-sibling::table//td still retrieves <td>Bar</td>.
The problematic part of //table[//td[text()='Foo']] is the // in front of td, because it selects all descendant td elements:
<table>
<tr>
<td>This is selected</td>
<td>
<table>
<tr>
<td>This is also selected</td>
</tr>
</table>
</td>
</tr>
</table>
You should use // only sparingly. I would use the expression
//table[tr/td = 'Foo']/following-sibling::table[1]/tr/td
EDIT: As suggested by Phrogz, in Nokogiri, instead of [1] in the expression above, you can use at_xpath as in
doc.at_xpath("//table[tr/td = 'Foo']/following-sibling::table/tr/td").text
to only get the first result node that was found. That is, if you actually intend to only find one node and if the wanted node is the first one in document order.
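As for the "nearest match" part of the question: in XPath 1.0 the closest enclosing table of a node is ancestor::table[1], so starting from the td and walking upward is another way to phrase it. The sketch below does that walk by hand in plain Ruby on the HTML from the question. It uses REXML instead of Nokogiri so it runs without extra gems, and `nearest_table` and `value_after` are made-up helper names, not library methods.

```ruby
require 'rexml/document'

NESTED = <<~HTML
  <table><!-- outer table -->
    <tr><td>
      <table><!-- inner table 1 -->
        <tr><td>Foo</td></tr>
      </table>
      <table><!-- inner table 2 -->
        <tr><td>Bar</td></tr>
      </table>
    </td></tr>
  </table>
HTML

# Walk up from a node to the closest enclosing <table>, or nil if there is none.
def nearest_table(node)
  node = node.parent until node.nil? || (node.is_a?(REXML::Element) && node.name == 'table')
  node
end

# Text of the first cell in the table that follows the table holding `label`.
def value_after(markup, label)
  doc = REXML::Document.new(markup)
  cell = REXML::XPath.match(doc, '//td').find { |td| td.text == label }
  sibling = cell && nearest_table(cell)&.next_element
  target = sibling && sibling.elements['tr/td']
  target&.text
end

p value_after(NESTED, 'Foo') # => "Bar"
```

Because the walk starts at the matching cell, it stops at inner table 1 and never considers the outer table.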

Use TBODY for every row in a TABLE? (and does it matter for semantics?)

I'm reworking a table class in PHP. One of its functions is that one block of data can span multiple TRs. For this feature I'm now using TBODY tags to group these rows together.
However, this got me thinking about the TBODY semantic. I know the convention is that tables have one TBODY, and use one block of data per TR. But shouldn't every TR be contained in a TBODY then?
"The convention is that tables have one TBODY, and use one set of data per TR."
That is not necessarily true. Each TR represents a row, and nothing more. If you have a group of rows that are related, it's alright to contain each group in its own TBODY. It's perfectly fine for a single table to have multiple table bodies; the HTML 4.01 spec demonstrates a table with two bodies or blocks of data:
<TABLE>
<THEAD>
<TR> ...header information...
</THEAD>
<TFOOT>
<TR> ...footer information...
</TFOOT>
<TBODY>
<TR> ...first row of block one data...
<TR> ...second row of block one data...
</TBODY>
<TBODY>
<TR> ...first row of block two data...
<TR> ...second row of block two data...
<TR> ...third row of block two data...
</TBODY>
</TABLE>
HTML allows for multiple TBODY tags in one table (but only one THEAD and TFOOT). So while it may not be conventional (and waste some bytes), I don't see any good reason not to wrap each TR in a distinct TBODY if this fits your application.
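As a rough illustration of that layout, here is a small sketch that renders one TBODY per block of rows. It is written in Ruby rather than PHP, `render_table` is a made-up helper, and it assumes the cell values are already HTML-safe.

```ruby
# Render one <tbody> per group of rows.
# Assumption: cell values are already HTML-safe, so nothing is escaped here.
def render_table(groups)
  bodies = groups.map do |rows|
    trs = rows.map { |cells| "<tr>#{cells.map { |c| "<td>#{c}</td>" }.join}</tr>" }
    "<tbody>\n#{trs.join("\n")}\n</tbody>"
  end
  "<table>\n#{bodies.join("\n")}\n</table>"
end

groups = [
  [%w[a1 a2], %w[a3 a4]], # block one spans two rows
  [%w[b1 b2]]             # block two is a single row
]

puts render_table(groups)
```

A block that spans several rows and a block with a single row both end up in their own TBODY, so scripts and styles can treat every block the same way.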

Hierarchical html table, putting last td on next line

I'm creating a simple hierarchical table with HTML and CSS, and I'm having trouble formatting the last td element, with class .child, so that it appears on the next line.
I want to keep the nested table inside table > tr > td.child because each table can be sorted, and JavaScript sorters don't implement any grouping of rows. (My problem could easily be solved by moving the .child > table element into the next table > tr, but that would break the nice nesting structure.)
Is there a way to put td.child on the next row with CSS?
html sample:
<table>
<tr>
<td>I have</td>
<td>1</td>
<td>pie</td>
<td class="child">
<table>
<tr>
<td>I have</td>
<td>1</td>
<td>pie</td>
</tr>
</table>
</td>
</tr>
</table>
You could do something like this. You'd need to be careful across browsers, though (I only checked in Chrome).