How to parse a date using Nokogiri in Ruby - html

I am trying to parse this page and pull the date that begins after
<p>From Date:
I get the error
Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)
The xpath from "inspect element" is
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
This is an example of the code:
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end
This is file://china.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
Amadan's answer
original.rb
#/usr/bin/ruby
require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()
puts date
formatted = date[/From Date: (.*)/, 1]
puts formatted
gives an error original.rb:5:in '<main>': undefined method 'text' for nil:NilClass (NoMethodError)

You can't use
noko = Nokogiri::HTML('china.html')
Nokogiri::HTML is a shortcut to Nokogiri::HTML::Document.parse. The documentation says:
.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
... string_or_io may be a String, or any object that responds to read and close such as an IO, or StringIO. ...
While 'china.html' is a String, it's not HTML. It appears you're thinking that a filename will suffice. However, Nokogiri doesn't open anything; it only understands strings containing markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these:
require 'nokogiri'
doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"
versus:
doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
and:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.org'))
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset=\"utf-8\">\n <met"
The last works because OpenURI provides URI.open, which returns an IO-like object that responds to read. (On Ruby versions before 3.0, plain open also accepted URLs once open-uri was loaded.)
URI.open('http://www.example.org').respond_to?(:read) # => true
Moving on to the question:
require 'nokogiri'
require 'open-uri'
html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>File </title>
</head>
<body>
<div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
<p><table cellspacing="0">
<tr>
<td width="2%"> </td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide"> </td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <span class="bidi">Meeting in China</span></p>
<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p> Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות 1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions"> </td>
</tr>
</table>
</p>
</div>
</body></html>
EOT
doc = Nokogiri::HTML(html)
Once the document is parsed, it's easy to find a particular <p> tag using the
<table cellspacing="0" cellpadding="0" class="resultsTypes">
as a placemarker:
from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"
It looks like it's going to be tougher pulling the title = "Meeting in China" and link = "bing.com", since they are on the same line.
I'm using CSS selectors to define the path to the desired text. CSS is more easily read than XPath, though XPath is more powerful and descriptive. Nokogiri allows us to use either, and lets us use search or at with either. at is equivalent to search('some selector').first. There are also CSS and XPath specific versions of search and at, described in Nokogiri::XML::Node.
title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"
You're trying to use the XPath:
/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p
however, it's not valid for the HTML you're working with.
Notice tbody in the selector. Look at the HTML: neither <table> tag is followed by a <tbody> tag, so the XPath is wrong. I suspect the path was generated by your browser, which fixes up the HTML by adding <tbody> according to the specification. Nokogiri doesn't do that fix-up, so the path doesn't match the document and the search fails. So, don't rely on the selector defined by the browser, and don't trust the browser's idea of the actual HTML source.
Instead of using an explicit selector, it's better, easier, and smarter, to look for specific way-points in the markup, and use those to navigate to the node(s) you want. Here's an example of doing everything above, only using a placeholder, and a mix of XPath and CSS:
doc.at('//p[starts-with(., "Title:")]').text # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"
So, it's fine to mix and match CSS and XPath.

from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"
EDIT:
Explanation: Get the first node (#at_xpath) anywhere in the document (//) such that ([...]) text content (text()) starts with (starts-with(string, stringStart)) "From Date" ("From Date:"), and take its text content (#text()), storing it (=) into the variable from_date (from_date). Then, extract the first group (#[regexp, 1]) from that text (from_date) by using the regular expression (/.../) that matches the literal characters "From Date: ", followed by any number (*) of any characters (.), that will be captured ((...)) in the first capture group to be extracted by #[regexp, 1].
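If an actual date object is wanted rather than a string, the captured text can be handed to Date.strptime from Ruby's standard library. This is a sketch that assumes the page always uses month/day/year (14 cannot be a month, so the sample suggests that order); the from_date string is hard-coded here in place of the Nokogiri lookup.

```ruby
require 'date'

# The text the XPath lookup returns, hard-coded for illustration.
from_date = 'From Date: 02/14/1936'

# Same extraction as above: capture group 1 holds everything after the label.
raw = from_date[/From Date: (.*)/, 1]

# Assumption: month/day/year order.
date = Date.strptime(raw, '%m/%d/%Y')

puts raw           # => 02/14/1936
puts date.iso8601  # => 1936-02-14
```

Date.strptime raises an error on input that does not fit the format, which is usually preferable to silently accepting a malformed date.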
Also,
Amadan's answer [...] gives an error
I did not notice that your Nokogiri construction is broken, as explained by the Tin Man. The line noko = Nokogiri::HTML('china.html') (which was not a part of my answer) will give you a single node document that only has the text "china.html" in it, and no <p> nodes at all.

Related

How can I render html in a field in Blazor

I have a field saved in the DB with HTML. I am using TinyMCE for my text editor and it is correctly saving the HTML tags in the DB. However, when I render the field, it still shows the tags. Initially I had this:
<td>
@objInv.Notes
</td>
My latest attempt to resolve this is:
<td>
@(new HtmlString(objInv.Notes))
</td>
Either way, it still renders as:
<p>New laptops 09/07/2022 <strong>test</strong></p>
What I desire is:
New laptops 09/07/2022 test
Raw HTML can be rendered in Blazor by using MarkupString. Keep the raw HTML in a string and cast it to MarkupString where you render it.
You can render it like this:
<table class="table table-striped">
<thead>
<tr>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>@((MarkupString)myNote)</td>
</tr>
</tbody>
</table>
@code {
string myNote = "<p>New laptops 09/07/2022 <strong>test</strong></p>";
}

How do I handle the MailChimp API response for the HTML variable in VBA?

A string variable oldHTMLContent contains a text string from a MailChimp API request response that represents the current content of an email campaign. Here is the string but it includes a bunch of \r\n that you can't see in the display below:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<style type="text/css">
@media only screen and (max-width: 480px) {
table#canspamBar td {
font-size:14px !important;
}
table#canspamBar td a {
display:block !important;
margin-top:10px !important;
}
}
</style>
</head>
<body>
<p> </p>
<div class="userBot">
<img src="http://dev.mydev.org/wp-content/uploads/2018/07/CynthiaNixon.jpg" width="1012" height="592" alt="CynthiaNixon.jpg">
<p>When we ask ourselves why so many people are signing up for Cynthia For New York volunteer events this weekend, this is what ... (click for more)</p>
</div> <center>
<br>
<br>
<br>
<br>
<br>
<br>
<table border="0" cellpadding="0" cellspacing="0" width="100%" id="canspamBarWrapper" style="background-color:#FFFFFF;border-top:1px solid #E5E5E5;">
<tr>
<td align="center" valign="top" style="padding-top:20px;padding-bottom:20px;">
<table border="0" cellpadding="0" cellspacing="0" id="canspamBar">
<tr>
<td align="center" valign="top" style="color:#606060;font-family:Helvetica, Arial, sans-serif;font-size:11px;line-height:150%;padding-right:20px;padding-bottom:5px;padding-left:20px;text-align:center;">
This email was sent to *|EMAIL|*
<br><em>why did I get this?</em> unsubscribe from this list update subscription preferences
<br>*|LIST:ADDRESSLINE|*
<br>
<br>
</td>
</tr>
</table>
</td>
</tr>
</table>
</center>
</body>
</html>
I want to extract just the "userBot" class but I can't seem to access it with getElementsByClassName.
When this code executes, the result is always zero.
Dim oldHTMLContent As String
Dim oldHtmlDoc As MSHTML.HTMLDocument
Set oldHtmlDoc = New HTMLDocument
oldHtmlDoc.body.innerText=oldHTMLContent
debug.Print oldHtmlDoc.getElementsByClassName("userBot").length
How do I define the right object and load it with the HTML string so I can work with the userBot class? I can see I'm loading the whole DOM, including
Transfer as .innerHTML to the new HTMLDocument then use a CSS class selector, ".", as shown below. Also, your naming seems a little confusing. IMO it would be clearer if you were transferring oldInnerHTML to newHTMLDoc, or something like that.
Option Explicit
Public Sub test()
Dim html As New HTMLDocument
html.body.innerHTML = [A1] '<= This is your oldHTMLContent. I am reading from a cell.
Debug.Print html.querySelector(".userBot").innerText
End Sub
This is the same as saying:
Debug.Print html.getElementsByClassName("userBot")(0).innerText

Generate HTML file with both quote ' and " from R

I need to create an HTML file from R. The problem is that the JavaScript uses single quotes and the style attributes use double quotes within the generated string.
The cat() function prints the text quite well, removing the backslashes in front of ". But I have not found how to write it like this to an HTML file using write.table(Text, "index.html", sep="\t")
Thanks in advance for any help.
NB : I removed a "<" character in front of /script in order to be able to post it =)
For example:
Text=paste0('<html>
<script type="text/javascript">',
"function lang1(event) {
var iframe = document.getElementById('id1');
var target = event.target || event.srcElement;
iframe.src = event.target.innerHTML + '.html';
}
/script>",
'<body style="overflow:hidden; margin:0">
<div id="main">
<div id="content">
<table style="border: 0; height:100%;width:100%;">
<tr style="height:5%;">
<td colspan="2" style="text-align:center;">
<h2>',paste0("some text"),'</h2>
</td>
</tr>
<tr>
<td style="width: 10%;font-size:14px;">
<ul onclick="lang1(event);">',
paste('<li>',c("link1","link2"),'</li>',collapse=""),
'</ul>
</td>
<td style="width: 90%;">
<iframe id="id1" width="99%" height="99%"></iframe>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>')
Text=gsub("\n","",Text)
I'm not sure I fully understand the question but whenever I generate html files from text in R, I use the \ character to escape quote marks that are needed in the JavaScript.
Then I open a file connection and use the writeLines function to correctly write my text to the file
Text<-"<!doctype html>
<html lang=\"en\">
<head>
<meta charset=\"utf-8\">
<style>
body {
font-size : 16px;
font-family: \"Helvetica Neue\",Helvetica,Arial,sans-serif;
}
</style>
</head>
</html>
"
fileConn<-file("mywebpage.html")
writeLines(Text, fileConn)
close(fileConn)
Maybe that will help you.

Nokogiri XML to node

I'm reading a local HTML document with Nokogiri like so:
f = File.open(local_xml)
@doc = Nokogiri::XML(f)
f.close
@doc contains a Nokogiri XML object that I can parse using at_css.
I want to modify it using Nokogiri's XML::Node, and I'm absolutely stuck. How do I take this Nokogiri XML document and work with it using node methods?
For example:
@doc.at_css('rates tr').add_next_sibling(element)
returns:
undefined method `add_next_sibling' for nil:NilClass (NoMethodError)
despite the fact that @doc.class is Nokogiri::XML::Document.
For completeness, here is the markup I'm trying to edit.
<html>
<head>
<title>Exchange Rates</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<table class="rates">
<tr>
<td class="up"><div></div></td>
<td class="date">Saturday, Jan 12</td>
<td class="rate up">3.83</td>
</tr>
<tr>
<td class="up"><div></div></td>
<td class="date">Friday, Jan 11</td>
<td class="rate up">3.70</td>
</tr>
<tr>
<td class="down"><div></div></td>
<td class="date">Thursday, Jan 10</td>
<td class="rate down">3.68</td>
</tr>
<tr>
<td class="down"><div></div></td>
<td class="date">Wedensday, Jan 9</td>
<td class="rate down">3.70</td>
</tr>
<tr>
<td class="up"><div></div></td>
<td class="date">Tuesday, Jan 8</td>
<td class="rate up">3.66</td>
</tr>
</table>
</body>
</html>
This is an example of how to do what you are trying to do. Starting with f containing a shortened version of the HTML you want to parse:
require 'nokogiri'
f = '
<html>
<head>
<title>Exchange Rates</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<table class="rates">
<tr>
<td class="up"><div></div></td>
<td class="date">Saturday, Jan 12</td>
<td class="rate up">3.83</td>
</tr>
</table>
</body>
</html>
'
doc = Nokogiri::HTML(f)
doc.at('.rates tr').add_next_sibling('<p>foobar</p>')
puts doc.to_html
Your code is incorrectly trying to find the class="rates" attribute of <table>: the selector rates tr looks for a <rates> element. In CSS a class is selected with a leading dot, so we'd use .rates. An alternate way to do it using CSS is table[class="rates"].
Your example didn't define the node you were trying to add to the HTML, so I appended <p>foobar</p>. Nokogiri will let you build a node from scratch and append it, or use markup and add that, or you could find a node from one place in the HTML, remove it, and then insert it somewhere else.
That code outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Exchange Rates</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<table class="rates">
<tr>
<td class="up"><div></div></td>
<td class="date">Saturday, Jan 12</td>
<td class="rate up">3.83</td>
</tr>
<p>foobar</p>
</table>
</body>
</html>
It's not necessary to use at_css or at_xpath instead of at. Nokogiri senses what type of accessor you're using and handles it. The same applies using xpath or css instead of search. Also, at is equivalent to search('some accessor').first, so it finds the first occurrence of the matching node.
Try to load it as HTML instead of XML: Nokogiri::HTML(f)
Without going into much detail on how Nokogiri works: XML does not have CSS, right? So the method at_css may not make sense there (maybe it does, I don't know). It should work when loading as HTML.
Update
Just noticed one thing. You want to do at_css('.rates tr') instead of at_css('rates tr') because that's how you select a class in CSS. Maybe it works with XML now.

How do I parse an HTML table with Nokogiri?

I installed Ruby and Mechanize. It seems to me that it is possible in Nokogiri to do what I want to do but I do not know how to do it.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and the id in the code?
<table >
<tbody>
<tr> <!-- table header --> </tr>
</tbody>
<!-- show threads -->
<tbody id="threadbits_forum_251">
<tr>
<td></td>
<td></td>
<td>
<div>
<a href="showthread.php?t=230708" >Vb4 Gold Released</a>
</div>
<div>
<span><a>Paul M</a></span>
</div>
</td>
<td>
06 Jan 2010 <span class="time">23:35</span><br />
by shane943
</div>
</td>
<td>24</td>
<td>1,320</td>
</tr>
</tbody>
</table>
#!/usr/bin/ruby1.8
require 'nokogiri'
require 'pp'
html = <<-EOS
(The HTML from the question goes here)
EOS
doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
detail = {}
[
[:title, 'td[3]/div[1]/a/text()'],
[:name, 'td[3]/div[2]/span/a/text()'],
[:date, 'td[4]/text()'],
[:time, 'td[4]/span/text()'],
[:number, 'td[5]//text()'], # matches with or without an <a> wrapper around the count
[:views, 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
pp details
# => [{:time=>"23:35",
# => :title=>"Vb4 Gold Released",
# => :number=>"24",
# => :date=>"06 Jan 2010",
# => :views=>"1,320",
# => :name=>"Paul M"}]
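The values above are all strings. Turning them into numbers and a timestamp can be done with the standard library, as in this sketch. The row is copied from the output above, and the date format is an assumption based on that single sample.

```ruby
require 'date'

# One scraped row, copied from the output above.
detail = { title: 'Vb4 Gold Released', date: '06 Jan 2010', time: '23:35',
           number: '24', views: '1,320' }

# Strip thousands separators before converting counts to integers.
replies = detail[:number].delete(',').to_i
views   = detail[:views].delete(',').to_i

# Combine the date and time strings; the format is assumed from this sample.
posted_at = DateTime.strptime("#{detail[:date]} #{detail[:time]}", '%d %b %Y %H:%M')

puts replies                               # => 24
puts views                                 # => 1320
puts posted_at.strftime('%Y-%m-%d %H:%M')  # => 2010-01-06 23:35
```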