How do I parse an HTML table with Nokogiri?

I installed Ruby and Mechanize. It seems to me that it is possible to do what I want with Nokogiri, but I do not know how to do it.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The name will always be the same (I hope). Can I use the tbody and the name in the code?
<table >
<tbody>
<tr> <!-- table header --> </tr>
</tbody>
<!-- show threads -->
<tbody id="threadbits_forum_251">
<tr>
<td></td>
<td></td>
<td>
<div>
<a href="showthread.php?t=230708" >Vb4 Gold Released</a>
</div>
<div>
<span><a>Paul M</a></span>
</div>
</td>
<td>
06 Jan 2010 <span class="time">23:35</span><br />
by shane943
</div>
</td>
<td>24</td>
<td>1,320</td>
</tr>
</tbody>
</table>

#!/usr/bin/ruby1.8
require 'nokogiri'
require 'pp'
html = <<-EOS
(The HTML from the question goes here)
EOS
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
detail = {}
[
[:title, 'td[3]/div[1]/a/text()'],
[:name, 'td[3]/div[2]/span/a/text()'],
[:date, 'td[4]/text()'],
[:time, 'td[4]/span/text()'],
[:number, 'td[5]//text()'],
[:views, 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
pp details
# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]
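Every value in the resulting hash is still a String. As a rough sketch of a follow-up step (not part of the answer above), the counts and the date/time pair could be converted to typed values with the standard library alone. The helper name to_count is hypothetical, the hash is copied from the sample output, and the code assumes Ruby 1.9 or later and that the site always uses the "06 Jan 2010" and "23:35" formats.

```ruby
require 'date'

# Sample hash copied from the output above; every value is a String.
detail = {
  title: 'Vb4 Gold Released',
  name: 'Paul M',
  date: '06 Jan 2010',
  time: '23:35',
  number: '24',
  views: '1,320'
}

# Hypothetical helper: strip thousands separators, then convert strictly.
# Integer() raises ArgumentError on unexpected input instead of returning 0.
def to_count(text)
  Integer(text.delete(','), 10)
end

# Assumption: day, abbreviated month name, 4-digit year, 24-hour time.
posted_at = DateTime.strptime("#{detail[:date]} #{detail[:time]}", '%d %b %Y %H:%M')
replies = to_count(detail[:number])
views = to_count(detail[:views])

puts posted_at.strftime('%Y-%m-%d %H:%M')
puts replies
puts views
```

Using Integer() rather than to_i is a deliberate choice here: a layout change on the page would then surface as an error rather than as silent zeros.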

Related

How to read an HTML table and account for line breaks within cells

I have an HTML table output from a program that separates values within a cell with <br>. I've tried using XML::readHTMLTable and htmltab but they glom together the values without any separators. I need them to be comma-separated, but I don't see any arguments to those functions to account for this. I've posted a pseudo example of the file below. Currently it reads into two vectors c("ABC","DEF","GHI") and c("JKLMNO","PQR","STU") but I need the "JKLMNO" element to instead be "JKL,MNO".
<table>
<tr>
<td>
ABC<br/>
</td>
<td>
DEF<br/>
</td>
<td>
GHI<br/>
</td>
</tr>
<tr>
<td>
JKL<br/>
MNO<br/>
</td>
<td>
PQR<br/>
</td>
<td>
STU<br/>
</td>
</tr>
</table>
I had this problem with <br/> in X being deleted by:
xTabs <- XML::readHTMLTable(X)
I fixed the problem as follows:
X1 <- gsub('<br/>', '\n', X)
xTabs <- XML::readHTMLTable(X1)
If I wanted '<br/>', I could then do a find and replace in xTabs. However, I'm happier with '\n'.
library(rvest)
library(dplyr)
doc <- read_html("<table>
<tr>
<td>
ABC<br/>
</td>
<td>
DEF<br/>
</td>
<td>
GHI<br/>
</td>
</tr>
<tr>
<td>
JKL<br/>
MNO<br/>
</td>
<td>
PQR<br/>
</td>
<td>
STU<br/>
</td>
</tr>
</table>")
tab <- html_table(doc)[[1]]
mutate(tab, X1=gsub("[\r\n][[:space:]]+", ",", X1))
## X1 X2 X3
## 1 ABC DEF GHI
## 2 JKL,MNO PQR STU
UPDATE
For folks who have HTML in a different format and may not feel up to the strain of posting, if you had, say:
doc <- read_html("<table>
<tr>
<td>ABC<br/></td>
<td>DEF<br/></td>
<td>GHI<br/></td>
</tr>
<tr>
<td>JKL<br/>MNO<br/></td>
<td>PQR<br/></td>
<td>STU<br/></td>
</tr>
</table>")
the aforementioned solution won't work because it's not the same data the OP had. I know…it's shocking.
If that is the case, copying and pasting a solution is definitely easier than typing a new question and you can use the following:
library(rvest)
library(dplyr)
library(purrr)
map(1:3, function(col) {
html_nodes(doc, xpath=sprintf(".//tr/td[%d]", col)) %>%
map_chr(~paste0(html_nodes(., xpath=".//text()"), collapse=","))
}) %>%
set_names(sprintf("X%d", 1:3)) %>%
as_data_frame()
But — amazingly enough — if you had different tags and data in the TD tags or had to work with a more complex table structure, this solution would likely require adaptation as well. The mind, boggles.

How to parse a date using Nokogiri in Ruby

I am trying to parse this page and pull the date that begins after
<p>From Date:
I get the error
Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)
The xpath from "inspect element" is
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
This is an example of the code:
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end
This is file://china.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
Amadan's answer
original.rb
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted
gives an error original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)
You can't use
noko = Nokogiri::HTML('china.html')
Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse. The documentation says:
.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
... string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...
While 'china.html' is a String, it's not HTML. It appears you're thinking that a filename will suffice, but Nokogiri doesn't open anything. It only understands strings containing markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these:
require 'nokogiri'
doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"
versus:
doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
and:
doc = Nokogiri::HTML(open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\">\n <met"
The last works because OpenURI adds the ability to read URLs to open, which responds to read:
open('http://www.example.org').respond_to?(:read) # => true
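For a local file such as china.html, the same idea applies: read the file first, or hand over an open File. The following is only a sketch using the standard library; the temporary file and its one-line contents are stand-ins invented for illustration, and the Nokogiri calls are left as comments so the snippet runs without the gem.

```ruby
require 'tempfile'

# Stand-in for china.html so the sketch runs on its own (contents are made up).
file = Tempfile.new(['china', '.html'])
file.write('<p>From Date: 02/14/1936</p>')
file.close

filename = file.path
markup = File.read(filename) # the file's contents, which is what a parser needs

puts filename.include?('From Date') # false: the path is only a name
puts markup.include?('From Date')   # true: this string holds the markup

# With Nokogiri loaded, either of these hands it real markup:
#   doc = Nokogiri::HTML(File.read(filename))
#   doc = File.open(filename) { |f| Nokogiri::HTML(f) }
```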
Moving on to the question:
require 'nokogiri'
require 'open-uri'
html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
EOT
doc = Nokogiri::HTML(html)
Once the document is parsed, it's easy to find a particular <p> tag using the
<table cellspacing="0" cellpadding="0" class="resultsTypes">
as a placemarker:
from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"
It looks like it's going to be tougher pulling the title = "Meeting in China" and link = "bing.com", since they are on the same line.
I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS and XPath specific versions of search and at, described in Nokogiri::XML::Node.
title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"
You're trying to use the XPath:
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
however, it's not valid for the HTML you're working with.
Notice tbody in the selector. Look at the HTML: neither <table> tag is immediately followed by a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which fixes up the HTML by adding <tbody> according to the specification. Nokogiri doesn't add <tbody>, so the HTML doesn't match and the search fails. So don't rely on the selector defined by the browser, and don't trust the browser's idea of the actual HTML source.
Instead of using an explicit selector, it's better, easier, and smarter, to look for specific way-points in the markup, and use those to navigate to the node(s) you want. Here's an example of doing everything above, only using a placeholder, and a mix of XPath and CSS:
doc.at('//p[starts-with(., "Title:")]').text # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"
So, it's fine to mix and match CSS and XPath.
from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
EDIT:
Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) text content (text()) starts with (starts-with(string, stringStart)) "From Date" ("From Date:"), and take its text content (#text()), storing it (=) into the variable from_date (from_date). Then, extract the first group (#[regexp, 1]) from that text (from_date) by using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), that will be captured ((...)) in the first capture group to be extracted by #[regexp, 1].
Also,
Amadan's answer [...] gives an error
I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a single node document that only has the text "china.html" in it, and no <p> nodes at all.
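Since the goal is to parse a date, the captured string can then be turned into a Date object. This is a small sketch rather than part of either answer; it assumes the page always uses month/day/year order (the sample value 02/14/1936 can only be read that way) and starts from the node text as a plain string.

```ruby
require 'date'

# Text of the matching <p> node, as it appears in the sample page.
text = 'From Date: 02/14/1936'

# Same capture-group extraction as in the answer above.
raw = text[/From Date: (.*)/, 1]

# Assumption: month/day/year. Date.strptime raises on a mismatch
# instead of guessing, unlike Date.parse.
from_date = Date.strptime(raw, '%m/%d/%Y')

puts from_date.iso8601 # 1936-02-14
```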

Is there a better way to get this element/node with Nokogiri?

I think the best way to explain this is via some code. Basically the only way to identify the TR I need inside the table (I've already reached the table itself and named it annual_income_statement) is by the text of the first TD in the TR, like this:
this may be helpful to know, too:
actual html:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
html snippet:
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
original xpath
irb(main):161:0> annual_income_statement = doc.xpath("//div[@id='incannualdiv']/table[@id='fs-table']/tbody")
irb(main):121:0> a = nil
=> nil
irb(main):122:0> annual_income_statement.children.each { |e| if e.text.include? "Net Income" and e.text.exclude? "Ex"
irb(main):123:2> a = e.text
irb(main):124:2> end }
=> 0
irb(main):125:0> a
=> "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"
irb(main):127:0> a.split "\n"
=> ["Net Income", "", "191.00", "611.00", "254.00", "-1,151.00"]
but is there a better way?
more details:
doc = Nokogiri::HTML(open('https://www.google.com/finance?q=NYSE%3AAA&fstype=iii'))
div = doc.at "div[@id='incannualdiv']" #div containing the table i want
table = div.at 'table' #table containing tbody i want
tbody = table.at 'tbody' #tbody containing tr's I want
trs = tbody.at 'tr' #SHOULD be all tr's of that table/tbody - but it's only the first TR?
I expect that last bit to give me ALL the TR's (which would include the TD i'm looking for)
but in fact it only gives me the first TR
Best is probably:
table.at 'tr:has(td[1][text()="Net Income"])'
Edit
More info:
doc = Nokogiri::HTML <<EOF
<div id="incannualdiv">
<table id="fs-table">
<tbody>
<tr>..</tr>
...
<tr>
<td>Net Income</td>
<td>100</td>
</tr>
<tr>..</tr>
</tbody>
</table>
</div>
EOF
table = doc.at 'table'
table.at('tr:has(td[1][text()="Net Income"])').to_s
#=> "<tr>\n<td>Net Income</td>\n <td>100</td>\n </tr>\n"
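Once the row is found, its text still has to be split into a label and numbers. The sketch below uses only plain Ruby and starts from the row text shown in the question's irb session; the variable names are illustrative, and it assumes the figures use a comma as the thousands separator.

```ruby
# Text of the matched row, copied from the irb session in the question.
row_text = "Net Income\n\n191.00\n611.00\n254.00\n-1,151.00\n"

# Drop the blank entries, then separate the label from the figures.
label, *cells = row_text.split("\n").reject(&:empty?)

# Remove thousands separators before converting; Float raises on junk input.
values = cells.map { |cell| Float(cell.delete(',')) }

p label  # "Net Income"
p values # [191.0, 611.0, 254.0, -1151.0]
```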

CodeIgniter HTML table generator with custom layout inside

CI table->generate($data1, $data2, $data3) will output my data in a form of simple table like:
<table>
<tr>
<td>data1</td>
<td>data2</td>
<td>data3</td>
</tr>
</table>
What if I need a complex cell layout with multiple $vars within each cell:
$data1 = array('one', 'two', 'three');
and I want something like this:
<table>
<tr>
<td>
<div class="caption">$data1[0]</div>
<span class="span1">$data1[1] and here goes <strong>$data1[2]</strong></span>
</td>
<td>...</td>
<td>...</td>
</tr>
</table>
How should I code that piece?
For now I just generate the content of td in a model and then call generate(). But this means that my HTML for the cell is in the model but I would like to keep it in views.
I would suggest having a view that you pass the data to and that generates the td structure. Capture the output of that view and pass it to the table generator. This keeps your structure in a view, albeit a different one.
Hailwood's answer isn't the best way to do it.
The HTML table class accepts a data element in the cells passed to the add_row method, so the code would be:
$row = array();
$row[] = array('data' => "<div class='caption'>{$data1[0]}</div><span class='span1'>{$data1[1]} and here goes <strong>{$data1[2]}</strong></span>");
$row[] = $col2;
$row[] = $col3;
$this->table->add_row($row);
echo $this->table->generate();
As an aside, having a class named caption in a table is semantically confusing because a table has its own caption tag.

How to read the values of a table in HTML file and Store them in Perl?

I read many questions and many answers but I couldn't find a straight answer to my question. All the answers were either very general or different from what I want to do. I got as far as learning that I need to use HTML::TableExtract or HTML::TreeBuilder::XPath, but I couldn't really use them to store the values. I could somehow get table row values and show them with Dumper.
Something like this:
foreach my $ts ($tree->table_states) {
foreach my $row ($ts->rows) {
push (@fir , (Dumper $row));
} }
print @sec;
But this is not really doing what I'm looking for. I will add the structure of the HTML table that I want to store the values:
<table><caption><b>Table 1 </b>bla bla bla</caption>
<tbody>
<tr>
<th ><p>Foo</p>
</th>
<td ><p>Bar</p>
</td>
</tr>
<tr>
<th ><p>Foo-1</p>
</th>
<td ><p>Bar-1</p>
</td>
</tr>
<tr>
<th ><p>Formula</p>
</th>
<td><p>Formula1-1</p>
<p>Formula1-2</p>
<p>Formula1-3</p>
<p>Formula1-4</p>
<p>Formula1-5</p>
</td>
</tr>
<tr>
<th><p>Foo-2</p>
</th>
<td ><p>Bar-2</p>
</td>
</tr>
<tr>
<th ><p>Foo-3</p>
</th>
<td ><p>Bar-3</p>
<p>Bar-3-1</p>
</td>
</tr>
</tbody>
</table>
It would be convenient if I can store the row values as pairs together.
expected output would be something like an array with values of:
(Foo , Bar , Foo-1 , Bar-1 , Formula , Formula-1 Formula-2 Formula-3 Formula-4 Formula-5 , ....)
The important thing for me is to learn how to store the values of each tag and how to move around in the tag tree.
Learn XPath and DOM manipulation.
use strictures;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new;
$dom->parse_file('10280979.html');
my %extract;
@extract{$dom->findnodes_as_strings('//th')} =
map {[$_->findvalues('p')]} $dom->findnodes('//td');
__END__
# %extract = (
# Foo => [qw(Bar)],
# 'Foo-1' => [qw(Bar-1)],
# 'Foo-2' => [qw(Bar-2)],
# 'Foo-3' => [qw(Bar-3 Bar-3-1)],
# Formula => [qw(Formula1-1 Formula1-2 Formula1-3 Formula1-4 Formula1-5)],
# )