How to get the value of the data attribute from a td in Ruby

I have the following HTML table, and I want to fetch the value of the data attribute of a <td>:
<table>
<tr>
<td data="1">Hello 1</td>
<td data="2">Hello 2</td>
<td data="3">Hello 3</td>
<td data="4">Hello 4</td>
</tr>
</table>
I'm using Nokogiri to read the HTML via XPath, as below:
# Crawl HTML elements using Nokogiri
def crawlTableData()
  require 'open-uri'
  require 'nokogiri'
  open("http://localhost:/somepage", http_basic_authentication: ["username", "somepassword"]) do |f|
    doc = Nokogiri::HTML(f.read)
    return doc.xpath('//*[@id="main-panel"]/table[1]/tbody/tr[2]/td[2]').text
  end
end
But it returns nothing.
Can anyone suggest the correct way to fetch the data value of a <td>?

If you want the data attribute's value of a single <td>, use:
doc.at('//*[@id="main-panel"]/table[1]/tbody/tr[2]/td[2]')['data']
at searches for the first occurrence of the path. It returns nil if nothing is found, otherwise a Node.
To get the value of an attribute by name, use the [](name) method.
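For comparison, here is the same kind of lookup sketched in Python using only the standard library. It is not Nokogiri: it parses the sample table from the question with xml.etree.ElementTree, which works here only because that snippet happens to be well-formed XML. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample table from the question (well-formed, so an XML parser can read it).
TABLE = """
<table>
<tr>
<td data="1">Hello 1</td>
<td data="2">Hello 2</td>
<td data="3">Hello 3</td>
<td data="4">Hello 4</td>
</tr>
</table>
""".strip()

root = ET.fromstring(TABLE)

# find() returns the first match or None, much like Nokogiri's `at`.
second_cell = root.find("./tr[1]/td[2]")

# get() reads an attribute by name and returns None if it is absent.
data_value = second_cell.get("data") if second_cell is not None else None

# All data values in document order.
all_values = [td.get("data") for td in root.iter("td")]

print(data_value)
print(all_values)
```

As with at in Nokogiri, the None check matters: reading an attribute from a missing node would raise an error instead of returning an empty result.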

Related

lxml relative xPath doesn't return result relative to the given HtmlElement

I apply a relative XPath (./) to an HtmlElement and it doesn't return any results. When I try using double dots (../), it returns all results matching from root HTML instead of descendant results of that specific HtmlElement. I am not sure what is wrong here.
The version of lxml is 4.5.2
Example:
<html>
<h3>
<p>
<table>
<tr>
<td>Sample</td>
<td>Sample</td>
</tr>
</table>
</p>
<h3>
<p>
<table>
<tr>
<td>Sample 2</td>
<td>Sample 2</td>
</tr>
</table>
</p>
</html>
Code
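import requests
from lxml import html

r = requests.get('http://website.com')
tree = html.fromstring(r.content)
tables = tree.xpath("//p/table")
for table in tables:
    # xpath() returns a list, so read the text of each cell individually
    cells = table.xpath('.//td')
    texts = [td.text_content() for td in cells]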
The first iteration in the loop should return "Sample" texts and the second iteration should return "Sample 2" texts.
The problem was with the HTML itself. When I inspect the document in a browser, it shows <p> as the parent of the <table> elements; however, the HTML returned by the request revealed that <p></p> is actually a sibling element preceding <table>.
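A minimal sketch of why the original XPath matched nothing, assuming the fetched HTML has the shape described above: an empty <p></p> followed by <table> as its sibling. The markup below is a hand-written stand-in for the real response, and the standard library's xml.etree.ElementTree is used in place of lxml so the example runs without extra packages; its path syntax is only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the HTML as actually served:
# each <p></p> is empty and the <table> follows it as a sibling.
SERVED = """
<html><body>
<p></p>
<table><tr><td>Sample</td><td>Sample</td></tr></table>
<p></p>
<table><tr><td>Sample 2</td><td>Sample 2</td></tr></table>
</body></html>
""".strip()

root = ET.fromstring(SERVED)

# No <table> is a child of <p>, so this path finds nothing.
nested = root.findall(".//p/table")

# Selecting the tables directly works, and './/td' stays relative to each one.
rows = []
for table in root.findall(".//table"):
    rows.append([td.text for td in table.findall(".//td")])

print(len(nested))
print(rows)
```

Printing the served markup (rather than trusting the browser's inspector, which shows a repaired DOM) is the quickest way to spot this kind of mismatch.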

Transferring variables across HTML pages

I am taking in an input from a text box in one HTML page and saving it to local storage using this function.
function writeMail(emailInput){
  console.log(emailInput);
  localStorage.setItem('emailInput', emailInput);
  let theEmail = localStorage.getItem('emailInput');
  console.log(theEmail);
}
This works fine and I can check the inputs are correct through my console logs.
Yet when I try to read this value from local storage and put it in the table in my emailList HTML file, it does not work at all.
<body>
email = localstorage["emailInput"];
<table>
<caption> Email list
</caption>
<tr>
<th> Name </th>
<th> Email </th>
</tr>
<tr>
<td>email </td>
</tr>
</table>
</body>
To manipulate the contents of the HTML, you need to modify the specific DOM node. In this case you should put an id attribute on the <td> and then use the innerHTML property of that node to set the desired value.
For example:
<td id="xpto"></td>
then on the code:
let theEmail = localStorage.getItem('emailInput');
document.getElementById("xpto").innerHTML = theEmail;
You should also put that code inside a function that is called once the document has finished loading, something like:
JAVASCRIPT:
function go(){
  let theEmail = localStorage.getItem('emailInput');
  document.getElementById("xpto").innerHTML = theEmail;
}
HTML:
<body onload="go()">

Nokogiri extract nodes from html

I need to extract nodes from HTML (not just the inner text, so that I can preserve the format for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.
This is the real html to parse. http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm
Basically I need to extract Item10 and part between "Executive Officers of the Registrant" to the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not in all documents. I need to get the nodes rather than just text because I want to preserve the tables, so in my next step I can parse tables in these sections if there are any.
Sample html:
html = "
<BODY>
<P>Dont need this </P>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
I want to get:
html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"
Start extracting when the start_keyword appears.
Stop extracting when the end_keyword appears.
There are multiple sections I need to extract from one html. The keywords can appear in nodes with different names.
doc.at_css('body').traverse do |node|
  inMySection = false
  if node.text.match(/#{start_keyword}/)
    inMySection = true
  elsif node.text.match(/#{end_keyword}/)
    inMySection = false
  end
  if inMySection
    # Extract the nodes
  end
end
I also tried to use xpath to achieve this without success after referring to these posts:
XPath axis, get all following nodes until
XPath to find all following siblings up until the next sibling of a particular type
It's not a problem with Nokogiri but with your algorithm. You've put your flag inMySection inside the loop, which means that at each step you reset it to false and lose the fact that it was previously set to true.
Based on your sample HTML input and output, the following snippet works:
nodes = Nokogiri::HTML(html)
inMySection = false
nodes.at_xpath('//body').traverse do |node|
  if node.text.match(/Start/)
    inMySection = true
  elsif node.text.match(/End/)
    inMySection = false
  end
  node.remove unless inMySection
end
print nodes
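The flag logic can be sketched independently of Nokogiri. The Python below is an illustration, not a port of the snippet above: it walks only the direct children of <BODY> instead of every node, keeps the flag outside the loop, and collects the matching elements whole so that tables survive. It uses xml.etree.ElementTree on the sample from the question (which happens to be well-formed), and extract_section is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<BODY>
<P>Dont need this </P>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
""".strip()


def extract_section(body, start_keyword, end_keyword):
    """Return the child elements from the start keyword through the end keyword."""
    in_section = False  # lives outside the loop, so it persists between nodes
    kept = []
    for child in body:
        text = "".join(child.itertext())
        if start_keyword in text:
            in_section = True
        if in_section:
            kept.append(child)
        if in_section and end_keyword in text:
            in_section = False  # close the section after keeping the end node
    return kept


body = ET.fromstring(SAMPLE)
section = extract_section(body, "Start", "End")
texts = [" ".join("".join(el.itertext()).split()) for el in section]
print(texts)
```

Working on whole child elements also avoids the duplicates mentioned in the question, since a nested node is never visited separately from its parent. It does assume the keywords sit in top-level children of the body, which may not hold for the real filings.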

How to get the proper values after a html table parse with ruby/nokogiri

I have searched and searched for 3 days straight now trying to get a data scraper to work and it seems like I have successfully parsed the HTML table that looks like this:
<tr class='ds'>
<td class='ds'>Length:</td>
<td class='ds'>1/8"</td>
</tr>
<tr class='ds'>
<td class='ds'>Width:</td>
<td class='ds'>3/4"</td>
</tr>
<tr class='ds'>
<td class='ds'>Color:</td>
<td class='ds'>Red</td>
</tr>
However, I can not seem to get it to print to csv correctly.
The Ruby code is as follows:
Specifications = {
  :length => ['Length:','length','Length'],
  :width => ['width:','width','Width','Width:'],
  :Color => ['Color:','color'],
  .......
}.freeze
def specifications
  @specifications ||= xml.css('tr.ds').map{|row| row.css('td.ds').map{|cell| cell.children.to_s } }.map{|record|
    specification = Specifications.detect{|key, value| value.include? record.first }
    [specification.to_s.titleize, record.last] }
end
And the csv is printing into one column (what seems to be the full arrays):
[["", nil], ["[:finishtype, [\"finish\", \"finish type:\", \"finish type\", \"finish type\", \"finish type:\"]]", "Metal"], ["", "1/4\""], ["[:length, [\"length:\", \"length\", \"length\"]]", "18\""], ["[:width, [\"width:\", \"width\", \"width\", \"width:\"]]", "1/2\""], ["[:styletype, [\"style:\", \"style\", \"style:\", \"style\"]]"........
I believe the issue is that I have not specified which values to return but I wasn't successful anytime I tried to specify the output. Any help would be greatly appreciated!
Try changing
[specification.to_s.titleize, record.last]
to
[specification.last.first.titleize, record.last]
detect returns a [key, value] pair, e.g. [:length, ["Length:", "length", "Length"]], which to_s turns into the string
"[:length, [\"Length:\", \"length\", \"Length\"]]". With last.first you extract just the "Length:" part of it.
If you encounter attributes that do not match any entry in Specifications, detect returns nil; you can drop those rows by changing the code to:
xml.css('tr.ds').map{|row| row.css('td.ds').map{|cell| cell.children.to_s } }.map{|record|
  specification = Specifications.detect{|key, value| value.include? record.first }
  [specification.last.first.titleize, record.last] if specification
}.compact
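The lookup itself is easy to check in isolation. The sketch below is Python rather than Ruby and only illustrates the idea: find the specification whose alias list contains the cell's label, emit its first alias as the display name, and skip rows with no match. SPECIFICATIONS, display_name and the sample records are made up for the example, and str.title() is used as a rough stand-in for titleize.

```python
# Hypothetical alias table mirroring the shape of Specifications in the question.
SPECIFICATIONS = {
    "length": ["Length:", "length", "Length"],
    "width": ["width:", "width", "Width", "Width:"],
    "color": ["Color:", "color"],
}

# Made-up [label, value] records, as produced by reading each row's two cells.
records = [
    ["Length:", '1/8"'],
    ["Width:", '3/4"'],
    ["Color:", "Red"],
    ["Unknown:", "n/a"],  # matches no alias and should be dropped
]


def display_name(label):
    """Return the first alias of the matching specification, title-cased, or None."""
    for aliases in SPECIFICATIONS.values():
        if label in aliases:
            # str.title() stands in for Rails' titleize here.
            return aliases[0].title()
    return None


rows = []
for label, value in records:
    name = display_name(label)
    if name is not None:  # same effect as `if specification` plus compact
        rows.append([name, value])

print(rows)
```

Each resulting row is a plain two-element list, which is the shape a CSV writer expects for one line with two columns.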

Read td values using prototype

I would like to read the values of HTML td using prototype. For example, say you have a table as follows
<table id="myTable">
<tr>
<td>apple</td>
<td>orange</td>
</tr>
<tr>
<td>car</td>
<td>bus</td>
</tr>
</table>
I would like to read only the values: apple, orange, car and bus. I have been unable to find a way to do it. Any help would be appreciated.
Thanks,
J
This should work:
var values = $$('#myTable td').collect(function(element) {
  // stripTags(), if you're only interested in the actual content
  return element.innerHTML.stripTags();
});
The following also returns an array of strings, without stripping any nested tags:
$$('#myTable td').pluck('innerHTML');