I have a document with the following two formats:
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
and
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
In both cases, you can see that I need a value sitting in an un-named element, and its adjacent neighbor is a matching tag with a <b>FieldName:</b> body.
My question is, how can I use the neighbor tags to get the values I need? I can target the neighbor with
doc.xpath('//p/b[contains(text(), "Referral Description:")]')
but how do I take that and say "Give me your neighbor"?
I would do it as below, using the following-sibling axis:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
html
node = doc.xpath('//p[./b[contains(text(), "Referral Description:")]]/following-sibling::p')
puts node.text
# >>
# >> This is the body of the referral's detailed description.
# >> I want to get this text out of the document.
Or, using the wildcard character * (with both snippets in the same document):
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
</table>
html
values = ["Referral Description:", "FieldName:", "Field2Name:"].map do |header|
  # [1] keeps only the nearest following sibling element
  doc.xpath("//*[./b[contains(text(), '#{header}')]]/following-sibling::*[1]").text.strip
end
p values
# >> ["This is the body of the referral's detailed description.\nI want to get this text out of the document.", "field value", "field value"]
For the second part, the HTML table:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
html
field_ary = %w(FieldName Field2Name Field3Name)
nodeset = field_ary.map{|n| doc.xpath("//td[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
Or, another approach:
nodeset = field_ary.map{|n| doc.xpath("//*[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
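One caveat with interpolating the field name straight into the XPath string: it breaks as soon as a name contains a single quote, because XPath 1.0 has no escape character. Below is a minimal sketch of one way around that. The helpers xpath_literal and sibling_xpath are hypothetical names of my own, not part of Nokogiri, and they only build the expression string.

```ruby
# Hypothetical helper (not part of Nokogiri): turn a Ruby string into an
# XPath 1.0 string literal. XPath 1.0 has no escape character, so a value
# containing both quote kinds has to be assembled with concat().
def xpath_literal(value)
  return "'#{value}'" unless value.include?("'")
  return "\"#{value}\"" unless value.include?('"')

  # Split on single quotes and glue the pieces back with a quoted "'".
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

# Hypothetical helper: builds the sibling lookup used above for any header.
def sibling_xpath(header)
  "//*[./b[contains(., #{xpath_literal(header)})]]/following-sibling::*"
end

puts sibling_xpath("FieldName")
puts sibling_xpath("Referral's Description")
```

The resulting string can be passed to doc.xpath exactly like the hand-written expressions above.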
In CSS, the next adjacent sibling selector is +:
doc.at('p:has(b[text()="Referral Description:"]) + p').text
I apply a relative XPath (./) to an HtmlElement and it doesn't return any results. When I try using double dots (../), it returns all results matching from root HTML instead of descendant results of that specific HtmlElement. I am not sure what is wrong here.
The version of lxml is 4.5.2
Example:
<html>
<h3>
<p>
<table>
<tr>
<td>Sample</td>
<td>Sample</td>
</tr>
</table>
</p>
<h3>
<p>
<table>
<tr>
<td>Sample 2</td>
<td>Sample 2</td>
</tr>
</table>
</p>
</html>
Code
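import requests
from lxml import html

r = requests.get('http://website.com')
tree = html.fromstring(r.content)
tables = tree.xpath("(//p/table)")
for table in tables:
    # xpath() returns a list of elements, so read the text of each cell
    cells = table.xpath('.//td')
    texts = [cell.text_content() for cell in cells]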
The first iteration in the loop should return "Sample" texts and the second iteration should return "Sample 2" texts.
The problem was with the HTML itself. When I inspect the document in a browser, it shows <p> as the parent of the <table> elements; however, the HTML returned by the request revealed that <p></p> is actually a sibling element preceding <table>.
This is my HTML:
<tbody><tr><th>SHOES</th></tr>
<tr>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>
This is my code:
nodes = page.css("tr").select do |el|
el.css('th').text =~ /SHOES/
end
nodes.each do |value|
puts value.css("td").text
end
I wish to get the values shoe 1, shoe 2 and shoe 3, but there is no output. I suspect there is an extra <tr></tr> in between <tr><th>SHOES</th></tr>. Or are the <br> the culprit?
There are other structures like:
<tr>
<th>SHOES</th>
<td>NBA</td>
</tr>
and I got the desired output "NBA".
What did I do wrong?
I have two kinds of structures:
Name1: value
Name1: value2
The above would give:
<tr>
<th>Name1</th>
<td>Value</td>
</tr>
but sometimes it's:
Name:
value
value2
value3
So the HTML is:
<tbody><tr><th>Name</th></tr>
<tr>
<td>value<br>value2<br> ....</td>
In HTML, tables are composed of rows. When you iterate over those rows, only one of them is the header. Although you see a logical relation between the body rows and the header row, for HTML (and therefore for Nokogiri) there is none.
If you want the value of every cell under a specific header, you can find the position of that header's column and then read the cells at that position.
Using this HTML as the source:
html = '<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>'
We then find the position of the right <th> in the first row of the table:
page = Nokogiri::HTML(html)
shoes_position = page.css("tr")[0].css('th').find_index do |el|
el.text =~ /SHOES/
end
And with that, we find the <td>s in that position in every other row, and get the text from them:
shoes_tds = page.css('tr').map {|row| row.css('td')[shoes_position] }.compact
shoes_names = shoes_tds.map { |td| td.text }
I use compact to remove the nil values, as the first row (the one with the headers) has no td and therefore returns nil.
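One thing to watch in the column-position approach: find_index returns nil when no header matches, and indexing with nil then raises an error. Here is a rough sketch of the same idea with that case guarded. Plain arrays stand in for the parsed rows so it runs without Nokogiri, and column_values is a hypothetical helper name.

```ruby
# Plain arrays stand in for the parsed rows; in the answer above these come
# from page.css("tr"), row.css("th") and row.css("td").
header_row = ["HATS", "SHOES"]
body_rows  = [
  ["hat 1 hat 2 hat3", "Shoe 1 shoe 2 shoe3"]
]

# Hypothetical helper: returns the cells under the first header matching
# `pattern`, or an empty array when no header matches.
def column_values(header_row, body_rows, pattern)
  position = header_row.find_index { |text| text =~ pattern }
  return [] if position.nil?

  # compact drops rows that have no cell at that position
  body_rows.map { |cells| cells[position] }.compact
end

p column_values(header_row, body_rows, /SHOES/)
p column_values(header_row, body_rows, /SOCKS/)
```

Returning an empty array for a missing header keeps the calling code free of nil checks; raising an error instead would be just as reasonable if a missing column should never happen.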
You can get there with css:
td = doc.at('tr:has(th[text()=SHOES]) + tr td')
td.children.map{|x| x.text.strip}.reject(&:empty?)
#=> ["Shoe 1", "shoe 2", "shoe3"]
but maybe mixing it up with xpath is better:
td.search('./text()').map{|x| x.text.strip}
#=> ["Shoe 1", "shoe 2", "shoe3"]
I think the best way to explain this is via some code. Basically, the only way to identify the TR I need inside the table (I've already reached the table itself and named it annual_income_statement) is by the text of the first TD in the TR, like this:
this may be helpful to know, too:
actual html:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
html snippet:
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
original xpath
irb(main):161:0> annual_income_statement = doc.xpath("//div[@id='incannualdiv']/table[@id='fs-table']/tbody")
irb(main):121:0> a = nil
=> nil
irb(main):122:0> annual_income_statement.children.each { |e| if e.text.include? "Net Income" and e.text.exclude? "Ex"
irb(main):123:2> a = e.text
irb(main):124:2> end }
=> 0
irb(main):125:0> a
=> "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"
irb(main):127:0> a.split "\n"
=> ["Net Income", "", "191.00", "611.00", "254.00", "-1,151.00"]
but is there a better way?
more details:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
div = doc.at "div[@id='incannualdiv']" #div containing the table i want
table = div.at 'table' #table containing tbody i want
tbody = table.at 'tbody' #tbody containing tr's I want
trs = tbody.at 'tr' #SHOULD be all tr's of that table/tbody - but it's only the first TR?
I expect that last bit to give me ALL the TR's (which would include the TD i'm looking for)
but in fact it only gives me the first TR
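Note that at returns only the first matching node; use search (or css) to get all of them. To pick out the row directly, the best option is probably: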
table.at 'tr:has(td[1][text()="Net Income"])'
Edit
More info:
doc = Nokogiri::HTML <<EOF
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
EOF
table = doc.at 'table'
table.at('tr:has(td[1][text()="Net Income"])').to_s
#=> "<tr>\n<td>Net Income</td>\n <td>100</td>\n </tr>\n"
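Once the row is found, its text still has to be turned into usable values. As a small sketch, assuming the row text looks like the string from the irb session in the question (copied verbatim below), the label and figures could be separated like this:

```ruby
# Text of the matching row, copied from the irb session in the question.
row_text = "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"

# Drop blank lines, then separate the label from the figures.
label, *figures = row_text.split("\n").map(&:strip).reject(&:empty?)

# Remove thousands separators before converting to numbers.
amounts = figures.map { |figure| figure.delete(",").to_f }

p label
p amounts
```

Floats are used here only for brevity; for financial figures BigDecimal would be the safer choice.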
I need to extract nodes from HTML (not just the inner text, so that I can preserve the formatting for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.
This is the real html to parse. http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm
Basically I need to extract Item10 and part between "Executive Officers of the Registrant" to the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not in all documents. I need to get the nodes rather than just text because I want to preserve the tables, so in my next step I can parse tables in these sections if there are any.
Sample html:
html = "
<BODY>
<P>Dont need this </P>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
I want to get:
html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
Start extracting when the start_keyword appears.
Stop extracting when the end_keyword appears.
There are multiple sections I need to extract from one html. The keywords can appear in nodes with different names.
doc.at_css('body').traverse do |node|
inMySection = false
if node.text.match(/#{start_keyword}/)
inMySection = true
elsif node.text.match(/#{end_keyword}/)
inMySection = false
end
if inMySection
#Extract the nodes
end
end
I also tried to use xpath to achieve this without success after referring to these posts:
XPath axis, get all following nodes until
XPath to find all following siblings up until the next sibling of a particular type
The problem is not Nokogiri but your algorithm. You set the flag inMySection inside the loop, so it is reset to false at every step and you lose the fact that it was previously set to true.
Based on your sample HTML input and output, the following snippet works:
nodes = Nokogiri::HTML(html)
inMySection = false
nodes.at_xpath('//body').traverse do |node|
if node.text.match(/Start/)
inMySection = true
elsif node.text.match(/End/)
inMySection = false
end
node.remove unless inMySection
end
print nodes
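To show the flag logic on its own, here is a minimal sketch without Nokogiri. An array of strings stands in for the text of each node in the sample HTML, section is a hypothetical helper name, and "Trailing text" is an extra item I added to show that extraction stops. Unlike the snippet above, it collects the matching items (both keywords included) instead of removing the others.

```ruby
# Stand-in for the text of each node in the sample HTML.
lines = ["Dont need this", "Start", "Text To Extract 1", "Text to Extract 2",
         "Text to Extract 3", "Text to Extract 4", "End", "Trailing text"]

# Hypothetical helper: returns the items from the one matching start_keyword
# through the one matching end_keyword, both included.
def section(items, start_keyword, end_keyword)
  in_section = false # declared once, outside the loop, so it survives each step
  items.each_with_object([]) do |item, kept|
    in_section = true if item.match?(/#{start_keyword}/)
    kept << item if in_section
    in_section = false if item.match?(/#{end_keyword}/)
  end
end

p section(lines, "Start", "End")
```

Because the flag is only switched off after the item has been kept, the item containing the end keyword stays in the result.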
I installed Ruby and Mechanize. It seems to me that it is possible to do what I want with Nokogiri, but I do not know how.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The name will always be the same (I hope). Can I use the tbody and the name in the code?
<table >
<tbody>
<tr> <!-- table header --> </tr>
</tbody>
<!-- show threads -->
<tbody id="threadbits_forum_251">
<tr>
<td></td>
<td></td>
<td>
<div>
<a href="showthread.php?t=230708" >Vb4 Gold Released</a>
</div>
<div>
<span><a>Paul M</a></span>
</div>
</td>
<td>
06 Jan 2010 <span class="time">23:35</span><br />
by shane943
</div>
</td>
<td>24</td>
<td>1,320</td>
</tr>
</tbody>
</table>
#!/usr/bin/ruby1.8
require 'nokogiri'
require 'pp'
html = <<-EOS
(The HTML from the question goes here)
EOS
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
detail = {}
[
[:title, 'td[3]/div[1]/a/text()'],
[:name, 'td[3]/div[2]/span/a/text()'],
[:date, 'td[4]/text()'],
[:time, 'td[4]/span/text()'],
[:number, 'td[5]/a/text()'],
[:views, 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
pp details
# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]
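Every value in that hash is still a string. If typed values are needed, something along these lines might work. This is only a sketch, assuming the date and time always come in the "06 Jan 2010" and "23:35" formats shown in the output above; the hash below is copied from that output rather than scraped.

```ruby
require 'date'

# One entry copied from the output above (not scraped here).
detail = { :title => "Vb4 Gold Released", :name => "Paul M",
           :date => "06 Jan 2010", :time => "23:35",
           :number => "24", :views => "1,320" }

# Combine date and time; the format string is an assumption based on the sample.
posted_at = DateTime.strptime("#{detail[:date]} #{detail[:time]}", "%d %b %Y %H:%M")

# Integer() raises on unexpected input, which surfaces scraping mistakes early.
replies = Integer(detail[:number])
views   = Integer(detail[:views].delete(","))

puts posted_at
puts replies
puts views
```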