Selecting variations in Nokogiri - html

I'm scraping these two sites:
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Law
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=BSL.
Unfortunately, the two pages differ. One has the level name (e.g. Level 2) inside an <a href> tag, while the other has it as plain text. How can I select whichever one is present?
I tried this to no avail:
level.css(/"a[href]"|".left"/).text
Here are shortened versions of the 2 HTML sections:
<table class="chart">
<tr valign="middle">
<td class="left">Level 2</td> <!-- the problem -->
<td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
</tr>
</table>
<table class="chart">
<tr valign="middle">
<td class="left">Level 1</td>
<td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
</tr>
</table>
My Code (edited from section of code to whole method)
def self.scrape_details_page(library_url)
details_page = Nokogiri::HTML(open(library_url))
details_page.css("table.chart tr").collect do |level|
right = level.css(".right").text.split
{level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
end
end

If what you want to do is grab the text that is within the innermost div, you should be able to dive all the way down just by calling #text on the parsed td element. No need to account for and walk extra tags that might be present inside, e.g. the link tag. Given your code as written:
levels = details_page.css("table.chart tr").collect do |level|
  level.text
end
For each row, that pulls the inner text (the level label and the percentage value) as a single string, and collect gathers those strings into the levels variable.
Edit: also, if all you care about is getting the level label, you can just filter the elements by class up front:
levels = details_page.css("table.chart tr td.left").collect do |level|
  level.text
end

The answer by jk_ should work in this particular case.
In the more general case, if you're going to use a CSS selector, you need to use CSS syntax for "or" (a comma). So if you were going to use the selectors you originally asked about, it'd be
level.css('a[href], .left').text

Thanks to inspiration from @jk_ I fixed it using .css(".left").text. That just selects all the text in the left td inside the tr.
The working code:
require 'nokogiri'
require 'open-uri'

def self.scrape_details_page(library_url)
  # URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3.0+
  details_page = Nokogiri::HTML(URI.open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css(".left").text, available: right[0], out_of_available: right[3]}
  end
end

Related

How to use Nokogiri to insert a block of HTML with proper indentation?

I have a block of HTML that I want to insert into an HTML document in Nokogiri. The issue is when inserting the block anywhere in the HTML it doesn't take the indentation of where it is inserted. Here is an example:
HTML document (shortened for example):
<div>
<div id="insertHere">
</div>
</div>
HTML to insert:
<table>
<tbody>
<tr>
<td>Hi</td>
</tr>
</tbody>
</table>
Result after inserting. This happens because it doesn't take into account the indentation. I want to be able to take into account the indentation where it is being inserted and pad the left of every line that is being inserted with that indentation:
<div>
<table>
<tbody>
<tr>
<td>Hi</td>
</tr>
</tbody>
</table>
</div>
It gets inserted using Nokogiri's node.replace('<table>....</table>').
What I am wanting it to look like:
<div>
  <table>
    <tbody>
      <tr>
        <td>Hi</td>
      </tr>
    </tbody>
  </table>
</div>
Is there a way to get the indentation left of a block where I am inserting or replacing it?
Edit: If not with Nokogiri, is there another way I could accomplish this? Maybe set a unique ID on each element, like the data-react-id set on React elements, and then, once I have a place where I need to insert an element, use a regex to find it and match the whitespace indentation to its left? I'm open to other approaches outside of Nokogiri and am trying to brainstorm options.
I know Nokogiri can't "pretty print", but is there a way to get the whitespace "to the left of" the current element, or the whitespace "before the current element and after a line break", so I can count the indentation and pad what is being inserted manually? Maybe there is a way in Nokogiri to get the node's parent and then somehow use the contents of the parent to get the whitespace to the left of the current node.
Turns out Nokogiri is really not meant for pretty-printing, as @Casper mentioned. Instead I just run it all through an HTML pretty printer; I am using the https://github.com/threedaymonk/htmlbeautifier gem for that.
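For anyone who still wants the manual padding idea from the question, here is a rough sketch in plain Ruby, without Nokogiri. `indentation_of` and `indent_block` are hypothetical helper names invented for this example, and the approach assumes the placeholder sits on its own lines in the source string.

```ruby
# Hypothetical helpers for illustration; they are not part of Nokogiri.

# Returns the leading whitespace of the first line that contains `marker`.
def indentation_of(source, marker)
  line = source.lines.find { |l| l.include?(marker) }
  line ? line[/\A[ \t]*/] : ""
end

# Prefixes every non-blank line of `block` with `indentation`.
def indent_block(block, indentation)
  block.lines.map { |l| l.strip.empty? ? l : indentation + l }.join
end

document = <<~HTML
  <div>
    <div id="insertHere">
    </div>
  </div>
HTML

snippet = <<~HTML
  <table>
    <tbody>
      <tr>
        <td>Hi</td>
      </tr>
    </tbody>
  </table>
HTML

indent = indentation_of(document, 'id="insertHere"')
padded = indent_block(snippet, indent)

# Swap the placeholder's two lines for the padded block (block form avoids
# backslash interpretation in the replacement string).
result = document.sub(/^[ \t]*<div id="insertHere">\n[ \t]*<\/div>\n/) { padded }
puts result
```

This only handles the simple case shown here; for arbitrary documents a dedicated beautifier, as in the answer above, is likely the more reliable route.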

Is td allowed inside thead?

I have a couple of <th> elements within a <thead> element. One of them, usually the first, is an empty <th> used as a placeholder and does not contain any text.
The WAVE tool reports an error that a <th> cannot be empty and suggests I change it to a <td>.
If I put a <td> within the <thead>, it resolves the error and passes HTML validation too.
Is there any reason I should not have a <td> within a <thead>?
From HTML view:
<td> is allowed inside a <thead>. Permitted content of a <thead> are zero or more <tr> elements. In a <tr> element you can put a <td> and/or <th> element. It doesn’t matter.
From WCAG view:
A table cannot have any empty table headers. This can be really confusing for screen reader users. There is one special case: layout tables. Tables that are used only for layout can have empty <td> cells as "column headers". But if I understand your case correctly, your table has regular data content, so you must add a column header for every column.
So in your case it is not ok to have an empty <td> as column header.

XPath using the sibling or following axes across two different cells

Put bluntly, I want to locate TestCoupon10% inside a td, then move to a sibling td and locate //a[contains(@id,"cmdOpen")]. I tried the sibling and following axes, but I probably didn't use them correctly, because
//span[./text()="TestCoupon10%"]/following-sibling:a[contains(@id,"cmdOpen")]
results in an invalid XPath. The HTML structure looks as follows:
<tr>
<td>
<span id="oCouponGrid_ctl03_lblCode">TestCoupon10%</span>
</td>
<td>...</td>
<td>...</td>
<td valign="middle" align="right">
<a id="oCouponGrid_ctl03_cmdOpen">
</td>
</tr>
I need to find both the test coupon and cmdOpen. Does anyone have an idea how?
Axes are delimited with double colons, not single ones (those are used for namespace prefixes). You wanted to say this:
//span[./text()="TestCoupon10%"]/following-sibling::a[contains(@id,"cmdOpen")]
But the <a> is not a following sibling of the <span> in question. You need to do some navigating:
//span[./text()="TestCoupon10%"]/parent::td/following-sibling::td/a[contains(@id,"cmdOpen")]
Or simply avoid descending into the tree, so you don't have to "climb up" again in the first place:
//td[span = "TestCoupon10%"]/following-sibling::td/a[contains(@id,"cmdOpen")]

Only parsing outer element

I am writing a scraper with Nokogiri, and I want to scrape a large HTML file.
Currently, I am scraping a large table; here is a small fragment:
<table id="rptBidTypes__ctl0_dgResults">
<tr>
<td align="left">S24327</td>
<td>
Airfield Lighting
<div>
<div>
<table cellpadding="5px" border="2" cellspacing="1px" width="100%" bgcolor=
"black">
<tr>
<td bgcolor="white">Abstract:<br />
This project is for the purchase and delivery, of various airfield
lighting, for a period of 36 months, with two optional 1 year renewals,
in accordance with the specifications, terms and conditions specified in
the solicitation.</td>
</tr>
</table>
</div>
</div>
</td>
</tr>
</table>
And here is the Ruby code I am using to scrape:
document = doc.search("table#rptBidTypes__ctl0_dgResults tr")
document[1..-1].each do |v|
cells = v.search 'td'
if cells.inner_html.length > 0
data = {
number: cells[0].text,
}
end
ScraperWiki::save_sqlite(['number'], data)
end
Unfortunately this isn't working for me. I only want to extract S24327, but I am getting the content of every table cell. How do I only extract the content of the first td?
Keep in mind that under this table, there are many table rows following the same format.
In CSS, table tr means tr anywhere underneath the table, including nested tables. But table > tr means the tr must be a direct child of the table.
Also, it appears you only want the cell values, so you don't need to iterate. This will give you all such cells (the first in each row):
doc.search("table#rptBidTypes__ctl0_dgResults > tr > td[1]").map(&:text)
The content of the first td would be:
doc.at("table#rptBidTypes__ctl0_dgResults td").text
The problem is that your search is matching two different things: the <tr> tag nested directly within the table with id rptBidTypes__ctl0_dgResults, and the <tr> tag within the table nested inside that parent table. When you loop through document[1..-1] you're actually selecting the second <tr> tag rather than the first one.
To select just the direct child <tr> tag, use:
document = doc.search("table#rptBidTypes__ctl0_dgResults > tr")
Then you can get the text for the <td> tag with:
document.css('td')[0].text #=> "S24327"

XPath:: Get following Sibling

I have the following HTML structure. I am trying to build a robust method to extract the second Color Digest element, since there will be many of these tags within the DOM.
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<td>Color Digest </td>
<td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
</tr>
<tr>
<td>Color Digest </td>
<td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
I am trying to extract the second "Color Digest" td element, the one that has the decoded value.
I wrote the following XPath, but I am not getting the second td element:
//td[text() = ' Color Digest ']/following-sibling::td[2]
When I change td[2] to td[1], I get both elements.
You should be looking for the second tr that has the td that equals ' Color Digest ', then you need to look at either the following sibling of the first td in the tr, or the second td.
Try the following:
//tr[td='Color Digest'][2]/td/following-sibling::td[1]
or
//tr[td='Color Digest'][2]/td[2]
http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89
You can go for identifying a list of elements with xPath:
//td[text() = ' Color Digest ']/following-sibling::td[1]
This will give you a list of two elements; then you can use the second element as your intended one. For example:
List<WebElement> elements = driver.findElements(By.xpath("//td[text() = ' Color Digest ']/following-sibling::td[1]"));
Now you can use the second element as your intended element, which is elements.get(1).
/html/body/table/tbody/tr[9]/td[1]
In Chrome (possibly Safari too) you can inspect an element, right-click the tag you want the XPath for, and copy the XPath that selects that element.