Ruby and Nokogiri parsing table?

This is my HTML:
<tbody><tr><th>SHOES</th></tr>
<tr>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>
This is my code:
nodes = page.css("tr").select do |el|
el.css('th').text =~ /SHOES/
end
nodes.each do |value|
puts value.css("td").text
end
I wish to get the values shoe 1, shoe 2 and shoe 3, but there is no output. I suspect there is an extra <tr></tr> in between <tr><th>SHOES</th></tr>. Or are the <br> the culprit?
There are other structures like:
<tr>
<th>SHOES</th>
<td>NBA</td>
</tr>
and I got the desired output "NBA".
What did I do wrong?
I have two kinds of structures:
Name1: value
Name1: value2
The above would give:
<tr>
<th>Name1</th>
<td>Value</td>
</tr>
but sometimes it's:
Name:
value
value2
value3
So the HTML is:
<tbody><tr><th>Name</th></tr>
<tr>
<td>value<br>value2<br> ....</td>

In HTML, tables are composed of rows. When you iterate over those rows, only one of them is the header row. Although you see a logical relation between the body rows and the header row, for HTML (and therefore for Nokogiri) there is none.
If you want every value in the cells under a specific header, you can find that header's column position and then read the cells at that position.
Using this HTML as source
html = '<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>'
Next, we get the position of the right <th> in the first row of the table:
page = Nokogiri::HTML(html)
shoes_position = page.css("tr")[0].css('th').find_index do |el|
el.text =~ /SHOES/
end
With that, we find the <td> at that position in every other row and get its text:
shoes_tds = page.css('tr').map {|row| row.css('td')[shoes_position] }.compact
shoes_names = shoes_tds.map { |td| td.text }
I use compact to remove the nil values: the first row (the one with the headers) has no <td>, so it returns nil.

You can get there with css:
td = doc.at('tr:has(th[text()=SHOES]) + tr td')
td.children.map{|x| x.text.strip}.reject(&:empty?)
#=> ["Shoe 1", "shoe 2", "shoe3"]
but maybe mixing it up with xpath is better:
td.search('./text()').map{|x| x.text.strip}
#=> ["Shoe 1", "shoe 2", "shoe3"]

Related

Print two different things in a single line on a web page

If I have to print something like this --
Reason : reason number 1
reason number 2
reason number 3
Code : code number
Remarks : remark
So Reason, Code, and Remarks are headings, and thus I have them in a heading tag; the rest (reason number 1, reason number 2, code number, and remark) are the values.
Now, if I use a heading tag for the heading, the value automatically goes to a new line. What if I want to print them as key-value pairs on a single line? How can I do so?
Preferably without float.
Try like this:
.html
<table border="1">
<ng-container *ngFor="let item of list1;let i = index">
<tr>
<td> {{list1[i]}}</td>
<td *ngIf="list2[i].length > 1">
<li *ngFor="let li of list2[i]">
{{li}}
</li>
</td>
<td *ngIf="list2[i].length <= 1">
{{list2[i]}}
</td>
</tr>
</ng-container>
</table>
.ts
list1 = ["Instructions ", "Reason Code "];
list2 = [["inst 1 ", "inst 2 "], ["Reason code is so and so"]];

XPath - Selection TD next to the selected XPath table which contains SPAN

I have a table and I'm trying to get data from it via XPath. A simple example of the table looks like this:
horse id1 id2 id3 id4
abc 1 1 1 1
123 2 2 2 2
cba 3 3 <span>3</span> 3
321 4 4 4 4
What I want to do is look at column id3 and find the row that contains the span code (in this case it's row 3). Once I have this I would like to get the value in column 1 of that row (the one that span is on) which would be cba.
Can anyone help?
If you want to match a tr that contains a span, you might use the XPath below:
//table[1]//tr[.//span]/td[6]/a[1]
Also note that you can write simpler, more descriptive expressions by using attributes of the target, parent, child, or sibling element.
Suppose you need to match link
<tr>
<td>
<a class="link new-link" href="/some/source">Click me!</a>
</td>
</tr>
you can use
//a[text()="Click me!"]
and
//a[@href="/some/source"]
and
//a[@class="link new-link" and text()="Click me!"]
... and a lot of other combinations
Try the code below.
List<WebElement> elements = driver.findElements(By.xpath("//td/span"));
for(int i=0;i<elements.size();i++)
{
System.out.println(elements.get(i).getText()); // Prints only the text of <span> elements that sit inside a <td>.
}
Updated Answer
If you want only the fourth-column <td> cells that contain a <span> tag, refer to the code below.
Suppose your HTML looks like this:
<table>
<tr>
<th>horse</th>
<th>number1</th>
<th>number2</th>
<th>number3</th>
</tr>
<tr>
<td>horse1</td>
<td>3424</td>
<td>data1</td>
<td>-----</td>
</tr>
<tr>
<td>horsename2</td>
<td>123</td>
<td><span>data2</span></td>
<td>-----</td>
</tr>
<tr>
<td>horsename2</td>
<td>123</td>
<td>-----</td>
<td><span>data3</span></td>
</tr>
</table>
Refer to this code:
int b = 1; // b is the index of the <tr> being checked
int[] array_list = new int[] {1,2,3};
for(int i =0; i<array_list.length;i++)
{
WebElement span_source = driver.findElement(By.xpath("//th[4]/..//following::tr["+b+"]/td[4]"));
try
{
WebElement span = driver.findElement(By.xpath("//th[4]/..//following::tr["+b+"]/td[4]/span"));
System.out.println(span.getText());
}
catch(Exception e)
{
System.out.println("TD tag not contains data with span tag.");
}
b++;
}

How to use xpath to select a value in a drop down list located in a particular row?

I am testing a page using selenium web driver. I have rows of data that represent 'requests', and in the last column of each of those rows the user can click a drop down list (with the option to either approve or reject) element that allows them to 'approve' or 'reject' the request.
I need to be able to select the approve option on the drop down list of a row whose 'Name' column is equal to a variable (in this instance say the variable is 'John').
In this test the user will be approving 'John's' request by selecting approve. How do I use xpath to ensure I am selecting the correct drop down element for the right person (right row)? Will I need to include a select element within an xpath somehow?
An example of the select element method to select a drop down element:
new SelectElement(this.Driver.FindElement(By.Name("orm")).FindElement(By.Name("Tutors"))).SelectByText(tutorName);
<form name="RequestsForm" action="SubmitRequest.aspx" method="POST">
<h2 class="blacktext" align="center">Course approvals</h2>
<table class="cooltable" width="90%" border="0" cellspacing="1" cellpadding="1">
<tbody>
<tr>
<td class="heading">
<b>Name</b>
</td>
<td class="heading">
<b>Request Date</b>
</td>
<td class="heading">
<b>Approved</b>
</td>
</tr>
<tr>
<td>
John
<input id="T1" type="text" value="888" name="T1">
</td>
<td>1/3/2015</td>
<td>
<select id="D1" class="selecttext" size="1" name="D1">
<option>?</option>
<option value="Approved">Approved</option>
<option>Rejected</option>
</select>
</td>
</tr>
</tbody>
</table>
Using XPath, this gets the position of the Name column in your table:
count(//table[@class='cooltable']/tbody/tr[1]/td[b = 'Name']/preceding-sibling::td)+1
You can use that position to get the corresponding table cell in the other rows. This selects the corresponding td in the second row (where the ... stands for the argument of count() in the expression above):
//table[@class='cooltable']/tbody/tr[2]/td[count( ... )+1]
Appending /text() will extract the text (with spaces). Using normalize-space() will trim the text so you can compare it with John:
normalize-space(//table[@class='cooltable']/tbody/tr[2]/td[count( ... )+1]/text()) = 'John'
To select only the tr that contains John in the Name column, move the comparison into a predicate on tr and keep only the td path inside it. It now returns a node-set of every tr whose Name cell text equals John:
//table[@class='cooltable']/tbody/tr[normalize-space(td[count( ... )+1]/text()) = 'John']
Finally, if you append //select/option[@value='Approved'] to that expression, you will select the option whose value attribute is Approved in the context of that tr. Here is the full XPath expression:
//table[@class='cooltable']/tbody/tr[normalize-space(td[count(//table[@class='cooltable']/tbody/tr[1]/td[b = 'Name']/preceding-sibling::td)+1]/text()) = 'John']//select/option[@value='Approved']

Find element neighbor

I have a document with the following two formats:
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
and
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
In both cases, you can see that I need a value sitting in an un-named element, and its adjacent neighbor is a matching tag with a <b>FieldName:</b> body.
My question is, how can I use the neighbor tags to get the values I need? I can target the neighbor with
doc.xpath('//p/b[contains(text(), "Referral Description:")]')
but how do I take that and say "Give me your neighbor"?
I would do it as below, using the following-sibling:: axis:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
html
node = doc.xpath('//p[./b[contains(text(), "Referral Description:")]]/following-sibling::p')
puts node.text
# >>
# >> This is the body of the referral's detailed description.
# >> I want to get this text out of the document.
Or, using wild-card character * :
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
html
["Referral Description:", "FieldName:", "Field1Name:"].map do |header|
doc.xpath("//*[./b[contains(text(), '#{header}')]]/following-sibling::*")
end
# >>
# >> ["This is the body of the referral's detailed description.\nI want to get this text out of the document.", "field value", "field value"]
For the second format, the HTML table:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
html
field_ary = %w(FieldName Field2Name Field3Name)
nodeset = field_ary.map{|n| doc.xpath("//td[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
Or, another approach:
nodeset = field_ary.map{|n| doc.xpath("//*[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
In css, the next adjacent sibling selector is +:
doc.at('p:has(b[text()="Referral Description:"]) + p').text

Is there a better way to get this element/node with Nokogiri?

I think the best way to explain this is via some code. Basically the only way to identify the TR I need inside the table (i've already reached the table itself and named it annual_income_statement) is by the text of the first TD in the TR, like this:
this may be helpful to know, too:
actual html:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
html snippet:
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
original xpath
irb(main):161:0> annual_income_statement = doc.xpath("//div[@id='incannualdiv']/table[@id='fs-table']/tbody")
irb(main):121:0> a = nil
=> nil
irb(main):122:0> annual_income_statement.children.each { |e| if e.text.include? "Net Income" and e.text.exclude? "Ex"
irb(main):123:2> a = e.text
irb(main):124:2> end }
=> 0
irb(main):125:0> a
=> "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"
irb(main):127:0> a.split "\n"
=> ["Net Income", "", "191.00", "611.00", "254.00", "-1,151.00"]
but is there a better way?
more details:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
div = doc.at "div[@id='incannualdiv']" #div containing the table i want
table = div.at 'table' #table containing tbody i want
tbody = table.at 'tbody' #tbody containing tr's I want
trs = tbody.at 'tr' #SHOULD be all tr's of that table/tbody - but it's only the first TR?
I expect that last bit to give me ALL the TR's (which would include the TD i'm looking for)
but in fact it only gives me the first TR
Best is probably:
table.at 'tr:has(td[1][text()="Net Income"])'
Edit
More info:
doc = Nokogiri::HTML <<EOF
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
EOF
table = doc.at 'table'
table.at('tr:has(td[1][text()="Net Income"])').to_s
#=> "<tr>\n<td>Net Income</td>\n <td>100</td>\n </tr>\n"