Nokogiri extract nodes from html

I need to extract nodes from html (not the inner text, so that I can preserve the format for further manual investigation). I wrote the code below, but because of how traverse works, I got duplicates in the new html file.
This is the real html to parse. http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm
Basically, I need to extract Item 10 and the part between "Executive Officers of the Registrant" and the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not. I need to get the nodes rather than just the text because I want to preserve the tables, so that in my next step I can parse any tables in these sections.
Sample html:
html = "
<BODY>
<P>Dont need this </P>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
I want to get:
html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
Extraction should start when the start_keyword appears.
Extraction should end when the end_keyword appears.
There are multiple sections I need to extract from one html. The keywords can appear in nodes with different names.
doc.at_css('body').traverse do |node|
inMySection = false
if node.text.match(/#{start_keyword}/)
inMySection = true
elsif node.text.match(/#{end_keyword}/)
inMySection = false
end
if inMySection
#Extract the nodes
end
end
I also tried to use xpath to achieve this without success after referring to these posts:
XPath axis, get all following nodes until
XPath to find all following siblings up until the next sibling of a particular type

The problem is not Nokogiri but your algorithm. The flag inMySection is initialized inside the loop, so it is reset to false on every iteration and you lose the fact that it was previously set to true.
Based on your sample HTML input and output, the following snippet works:
nodes = Nokogiri::HTML(html)
inMySection = false
nodes.at_xpath('//body').traverse do |node|
  if node.text.match(/Start/)
    inMySection = true
  elsif node.text.match(/End/)
    inMySection = false
  end
  node.remove unless inMySection
end
print nodes

Related

lxml relative xPath doesn't return result relative to the given HtmlElement

I apply a relative XPath (./) to an HtmlElement and it doesn't return any results. When I try using double dots (../), it returns all results matching from root HTML instead of descendant results of that specific HtmlElement. I am not sure what is wrong here.
The version of lxml is 4.5.2
Example:
<html>
<h3>
<p>
<table>
<tr>
<td>Sample</td>
<td>Sample</td>
</tr>
</table>
</p>
<h3>
<p>
<table>
<tr>
<td>Sample 2</td>
<td>Sample 2</td>
</tr>
</table>
</p>
</html>
Code
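import requests
from lxml import html

r = requests.get('http://website.com')
tree = html.fromstring(r.content)
tables = tree.xpath("(//p/table)")
for table in tables:
    # xpath() returns a list, so text_content() is called on each cell
    result = table.xpath('.//td')
    text = [td.text_content() for td in result]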
The first iteration in the loop should return "Sample" texts and the second iteration should return "Sample 2" texts.
The problem was with the HTML itself. When I inspect the document in a browser, it shows <p> as the parent of the <table> elements; however, the HTML returned by the request revealed that <p></p> is actually a sibling element preceding <table>.

Scraping HTML by Class in VBA

I have the HTML shown below:
<div class="property-title visible-xs">
<a href="/property/473902/Office-Lot">
<h2><b> 2nd Floor, Block D5, Solaris Dutamas, No. 1, Jalan Dutamas 1, 50480, Kuala Lumpur</b></h2>
</a>
</div>
<p style="color: #0071ee;">Office Lot</p>
<h4><b>RM 880,000</b></h4>
<div>
<table>
<!-- <tr><td>Office Lot</td></tr> -->
<tr>
<td>Property Code</td><td>:</td><td>PB473902</td>
</tr>
<tr>
<td>Auction Date</td><td>:</td><td>2016-02-26</td>
</tr>
<tr>
<td>Built up </td><td>:</td><td>754 sq.ft </td>
</tr>
<tr>
<td>Tenure</td><td>:</td><td>Freehold</td>
</tr>
and I used the following code to extract the details "2nd Floor, Block D5,....":
objIE1.Document.getElementsByClassName("property-title visible-xs").getElementsByTagName ("a")
but it doesn't seem to give the result I need. Please help.
The html code shown is in multiple form.
This will work:
extract1 = objIE1.Document.getElementsByClassName("property-title visible-xs")(0).getElementsByTagName ("a")(0).innerText
Cells(1,1).Value = extract1
When a method name contains getElementsBy (plural, "Elements"), such as getElementsByClassName or getElementsByTagName, it returns a collection of elements, so you need to specify which one you want. In this case it is the first, which has index 0. When a method uses getElementBy (singular, "Element"), such as getElementById, it returns a single element, so no index is needed.

How to get value of data from td in ruby

I have the following HTML table, and I want to fetch the value of the data attribute of a <td> using Ruby:
<table>
<tr>
<td data="1">Hello 1</td>
<td data="2">Hello 2</td>
<td data="3">Hello 3</td>
<td data="4">Hello 4</td>
</tr>
</table>
I'm using nokogiri to read xpath of html as below
# Crawl HTML elements using Nokogiri
def crawlTableData()
  require 'open-uri'
  require 'nokogiri'
  open("http://localhost:/somepage",http_basic_authentication: ["username", "somepassword"]) do |f|
    doc = Nokogiri::HTML(f.read)
    return doc.xpath('//*[@id="main-panel"]/table[1]/tbody/tr[2]/td[2]').text
  end
end
In the end it returns nothing.
Can anyone suggest the correct way to fetch the data value of a <td>?
If you want the value of a single <td>'s data attribute, use:
doc.at('//*[@id="main-panel"]/table[1]/tbody/tr[2]/td[2]')['data']
at searches for the first node matching the path. It returns nil if nothing is found, otherwise a Node.
To get the value of an attribute by name, use the [](name) method.

Find element neighbor

I have document with the following two formats:
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
and
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
In both cases, you can see that I need a value sitting in an un-named element, and its adjacent neighbor is a matching tag with a <b>FieldName:</b> body.
My question is, how can I use the neighbor tags to get the values I need? I can target the neighbor with
doc.xpath('//p/b[contains(text(), "Referral Description:")]')
but how do I take that and say "Give me your neighbor"?
I would do it as below, using the following-sibling:: axis:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
html
node = doc.xpath('//p[./b[contains(text(), "Referral Description:")]]/following-sibling::p')
puts node.text
# >>
# >> This is the body of the referral's detailed description.
# >> I want to get this text out of the document.
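Or, using the wildcard character *:
require 'nokogiri'
# Here doc holds both fragments from the question. The table comes first so
# that the paragraph's only following sibling is the description itself.
doc = Nokogiri::HTML.parse <<-html
<table>
<tr><td><b>FieldName:</b></td><td>field value</td></tr>
<tr><td><b>Field2Name:</b></td><td>field value</td></tr>
</table>
<p><b>Referral Description:</b></p>
<p>
This is the body of the referral's detailed description.
I want to get this text out of the document.
</p>
html
["Referral Description:", "FieldName:", "Field2Name:"].map do |header|
  doc.xpath("//*[./b[contains(text(), '#{header}')]]/following-sibling::*").text.strip
end
# => ["This is the body of the referral's detailed description.\nI want to get this text out of the document.", "field value", "field value"]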
For the second case, the HTML table:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<table>
<tr>
<td><b>FieldName:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field2Name:</b></td>
<td>field value</td>
</tr>
<tr>
<td><b>Field3Name:</b></td>
<td>field value</td>
</tr>
</table>
html
field_ary = %w(FieldName Field2Name Field3Name)
nodeset = field_ary.map{|n| doc.xpath("//td[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
Or, another approach:
nodeset = field_ary.map{|n| doc.xpath("//*[./b[contains(.,'#{n}')]]/following-sibling::*")}
nodeset.map{|n| n.text }
# => ["field value", "field value", "field value"]
In css, the next adjacent sibling selector is +:
doc.at('p:has(b[text()="Referral Description:"]) + p').text

Ruby and Nokogiri parsing table?

This is my HTML:
<tbody><tr><th>SHOES</th></tr>
<tr>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>
This is my code:
nodes = page.css("tr").select do |el|
el.css('th').text =~ /SHOES/
end
nodes.each do |value|
puts value.css("td").text
end
I wish to get the values shoe 1, shoe 2 and shoe 3, but there is no output. I suspect there is an extra <tr></tr> in between <tr><th>SHOES</th></tr>. Or are the <br> the culprit?
There are other structures like:
<tr>
<th>SHOES</th>
<td>NBA</td>
</tr>
and I got the desired output "NBA".
What did I do wrong?
I have two kinds of structures:
Name1: value
Name1: value2
The above would give:
<tr>
<th>Name1</th>
<td>Value</td>
</tr>
but sometimes it's:
Name:
value
value2
value3
So the HTML is:
<tbody><tr><th>Name</th></tr>
<tr>
<td>value<br>value2<br> ....</td>
In HTML, tables are composed of rows. When you iterate over those rows, only one of them is the header. Although you see a logical relation between the body rows and the header row, for HTML (and therefore for Nokogiri) there is none.
If what you want is every value in the cells under a specific header, you can find the index of that column and then read the values at that index.
Using this HTML as source
html = '<tbody><tr><th>HATS</th><th>SHOES</th></tr>
<tr>
<td>
hat 1 <br>hat 2<br> hat3 <br>
</td>
<td>
Shoe 1 <br>shoe 2<br> shoe3 <br>
</td>
</tr>
</tbody>'
We then get the position of the right <th> in the first row of the table:
page = Nokogiri::HTML(html)
shoes_position = page.css("tr")[0].css('th').find_index do |el|
el.text =~ /SHOES/
end
With that, we find the <td>s at that position in every other row and get their text:
shoes_tds = page.css('tr').map {|row| row.css('td')[shoes_position] }.compact
shoes_names = shoes_tds.map { |td| td.text }
I use compact to remove the nil values: the first row (the one with the headers) has no <td>, so it returns nil.
You can get there with css:
td = doc.at('tr:has(th[text()=SHOES]) + tr td')
td.children.map{|x| x.text.strip}.reject(&:empty?)
#=> ["Shoe 1", "shoe 2", "shoe3"]
but maybe mixing it up with xpath is better:
td.search('./text()').map{|x| x.text.strip}
#=> ["Shoe 1", "shoe 2", "shoe3"]