I'm scraping these two sites:
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=Law
https://www.library.uq.edu.au/uqlsm/availablepcsembed.php?branch=BSL.
Unfortunately, the two pages differ. One has the level name (e.g. Level 2) inside a link (an a tag with an href), while the other has it as plain text. How can I select one or the other, depending on which one is there?
I tried this, to no avail:
level.css(/"a[href]"|".left"/).text
Here are shortened versions of the 2 HTML sections:
<table class="chart">
  <tr valign="middle">
    <td class="left">Level 2</td> <!-- the problem -->
    <td class="middle"><div style="width:86%;"><strong>86%</strong></div></td>
  </tr>
</table>
<table class="chart">
  <tr valign="middle">
    <td class="left">Level 1</td>
    <td class="middle"><div style="width:32%;"><strong>32%</strong></div></td>
  </tr>
</table>
My code (edited to show the whole method rather than just a section):
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css("a[href]").text, available: right[0], out_of_available: right[3]}
  end
end
If all you want is the text inside an element, you can get it by calling #text on the parsed node. #text returns the text of every descendant node, so there is no need to account for extra tags, such as the link, that may be nested inside. Given your code as written:
levels = details_page.css("table.chart tr").collect do |level|
  level.text
end
For each row, that returns the inner text of its cells (the level label and the percentage value) as a single string, and collect gathers those strings into the levels array.
Edit: also, if all you care about is the level label, you can filter the elements by class up front:
levels = details_page.css("table.chart tr td.left").collect do |level|
  level.text
end
The answer by jk_ should work in this particular case.
In the more general case, if you're going to use a CSS selector, you need to use CSS syntax for "or" (a comma). So if you were going to use the selectors you originally asked about, it'd be
level.css('a[href], .left').text
Thanks to inspiration from @jk_, I fixed it using .css(".left").text. That selects all the text in the left td inside the tr, whether or not it is wrapped in a link.
The working code:
def self.scrape_details_page(library_url)
  details_page = Nokogiri::HTML(open(library_url))
  details_page.css("table.chart tr").collect do |level|
    right = level.css(".right").text.split
    {level: level.css(".left").text, available: right[0], out_of_available: right[3]}
  end
end
I have a couple of <th> elements within a <thead> element. One of them (usually the first) is an empty th used as a placeholder and does not contain any text.
The WAVE tool reports an error that a th cannot be empty and suggests changing it to a <td>.
If I use a <td> within the <thead> instead, the error goes away and the page passes HTML validation too.
Is there any reason I should not have a <td> within a <thead>?
From the HTML point of view:
A <td> is allowed inside a <thead>. The permitted content of a <thead> is zero or more <tr> elements, and a <tr> element may contain <td> and/or <th> elements. It doesn't matter which.
From the WCAG point of view:
A table cannot have empty table headers, because they can be really confusing for screen reader users. There is one special case: layout tables. Tables that are used only for layout can have empty <td> elements in the "column header" position. But if I understand your case correctly, your table holds regular tabular content, so you must provide a column header for every column.
So in your case it is not OK to have an empty <td> as a column header.
I have the following HTML structure. I am trying to build a robust way to extract the second "Color Digest" element, since there will be many of these tags within the DOM.
<table>
<tbody>
<tr bgcolor="#AAAAAA">
<tr>
<tr>
<tr>
<tr>
<td>Color Digest </td>
<td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
</tr>
<tr>
<td>Color Digest </td>
<td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
</tbody>
</table>
I am trying to extract the td element next to the second "Color Digest" label, the one that holds the decoded value.
I wrote the following XPath, but it does not return that td element:
//td[text() = ' Color Digest ']/following-sibling::td[2]
When I change td[2] to td[1], I get both value elements.
The td[2] in your expression asks for the second following sibling within the same row, and each row has only one. Instead, look for the second tr that contains a td equal to 'Color Digest', then take either the following sibling of the first td in that tr or its second td.
Try the following:
//tr[td='Color Digest'][2]/td/following-sibling::td[1]
or
//tr[td='Color Digest'][2]/td[2]
http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89
You can also select a list of elements with XPath:
//td[text() = ' Color Digest ']/following-sibling::td[1]
This gives you a list of two elements, and you can then pick the second one as your intended element. For example:
List<WebElement> elements = driver.findElements(By.xpath("//td[text() = ' Color Digest ']/following-sibling::td[1]"));
The element you want is then elements.get(1).
/html/body/table/tbody/tr[9]/td[1]
In Chrome (and possibly Safari too) you can inspect an element, right-click the tag you are interested in, and copy its XPath to select that element.