Parsing HTML with regex in Ruby

Hello everyone, I have the HTML code below. I want to get the text inside each <a>(.*)</a>.
I want to get this result:
data 1 : hello1
data 2 : hello2
data 3 : hello3
from that input:
<a>
hello1
</a>
<a>
hello2
</a>
<a>
hello3
</a>

To expand on the two comments, the following Nokogiri code will work for your example. You can use either xpath or CSS. A dedicated parser is much more powerful than rolling your own regex.
> require 'nokogiri'
=> true
> doc = Nokogiri::HTML("<a>hello1</a><a>hello2</a><a>hello3</a>")
=> #<Nokogiri::HTML::Document:0x3ffec2494f48 name="document" children=[#<Nokogiri::XML::DTD:0x3ffec2494bd8 name="html">, #<Nokogiri::XML::Element:0x3ffec2494458 name="html" children=[#<Nokogiri::XML::Element:0x3ffec2494250 name="body" children=[#<Nokogiri::XML::Element:0x3ffec2494048 name="a" children=[#<Nokogiri::XML::Text:0x3ffec2493e40 "hello1">]>, #<Nokogiri::XML::Element:0x3ffec249dc88 name="a" children=[#<Nokogiri::XML::Text:0x3ffec249da80 "hello2">]>, #<Nokogiri::XML::Element:0x3ffec249d878 name="a" children=[#<Nokogiri::XML::Text:0x3ffec249d670 "hello3">]>]>]>]>
> doc.css('a').each { |node| p node.text }
"hello1"
"hello2"
"hello3"
=> 0
Update: You'll need the nokogiri gem if you don't have it installed already.
sudo gem install nokogiri
Depending on your setup, you may also need to prepend
require 'rubygems'
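For input as regular as the example in the question, plain String#scan can also produce the requested output without any gem. This is only a sketch: it assumes bare <a> tags with no attributes and no nesting, as in the sample, and it will break on real-world HTML, which is why the parser remains the safer choice.

```ruby
html = <<~HTML
  <a>
  hello1
  </a>
  <a>
  hello2
  </a>
  <a>
  hello3
  </a>
HTML

# The m flag lets `.` match newlines; `.*?` is non-greedy, so each <a> is matched separately.
texts = html.scan(%r{<a>(.*?)</a>}m).flatten.map(&:strip)

texts.each_with_index do |text, index|
  puts "data #{index + 1} : #{text}"
end
```

Each captured group still contains the surrounding newlines, hence the strip.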

Related

HTML: How to refer to span.title inside a class?

I am building a webscraper and I have this block of HTML code:
<div class = 'example-1'>
<ul class = 'example-2'>
<li>
<span title = 'data1' > 155 </span>
/
<span title = 'data2' > 155 </span>
I want to scrape the numbers 155 and 145 inside the span elements that have a title attribute.
In my Scrapy code, I select them like this:
'size': detail.css('ul.example-2 ::text').get(),
but it does not return anything. How do I fix this?
The correct CSS selectors are:
span[title="data1"]
span[title="data2"]
Alternatively, you can select both at the same time with:
span[title^="data"]
I am unfamiliar with scrapy syntax, but I believe your scrapy selector should look something like this:
response.css('span[title^="data"]::text').getall()
Further info:
In CSS, square brackets denote the attribute selector.
You can select:
an element with an attribute : span[title]
an element with a specific attribute-value : span[title="data1"]
an element with the start pattern of an attribute-value : span[title^="data"]
an element with the end pattern of an attribute-value : span[title$="1"]
and more.

Nokogiri not parsing exported bookmark html from Delicious correctly

I cannot figure out why Nokogiri is not parsing this HTML file correctly. The file is a bookmark export from Delicious. It has 400 links in it, but Nokogiri always parses out only 254. I have other Delicious export files with different numbers of links that also yield only 254, and one that parses correctly (over 2000 links), so specific links may be causing the issue, but I'm really not sure. I'm linking to the HTML here, since it puts the body of this post over the character limit. This is an example of the HTML (the actual file has over 400 tags):
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT>Le Cartel | Le Cartel Clothing Inc.
<DT>Parkdale Project Read
<DT>Dark mp3
<DT>The Family Law: Watch the series | Programs
<DT>Asians Doing Everything
</DL></p>
I'm uploading the HTML file with the Carrierwave gem and parsing it. The code I've been using is (where html_upload is a model instance using Carrierwave):
doc = Nokogiri::HTML.parse html_upload.file.read
puts doc.css('a').count
When Nokogiri does not parse a document as you'd expect, always check doc.errors.
Here's what I get when I try to parse the raw content from your gist:
require 'nokogiri'
doc = Nokogiri.HTML(DATA.read)
puts doc.errors.last
#=> Excessive depth in document: 256 use XML_PARSE_HUGE option
The problem here is that the HTML file has tons of unclosed tags (mostly <DT>), which Nokogiri (or rather, libxml2) tries to nest within one another. Illustrated:
doc = Nokogiri.XML(html,&:noblanks)
puts doc.to_xhtml(indent:2)
#=> <TITLE>Bookmarks</TITLE>
#=> <H1>Bookmarks</H1>
#=> <DL>
#=> <p>
#=> <DT>
#=> <A HREF="http://boomjacak.com/" ...>BOOM JACAK</A>
#=> <DT>
#=> <A HREF="http://tropicaliainfursnyc.com/" ...>Tropicalia in Furs Baby!</A>
#=> <DT>
#=> <A HREF="https://uptimerobot.com/" ...>Uptime Robot</A>
#=> <DT>
#=> <A HREF="http://yagphotovoice.tumblr.com/" ...>EYE SPY</A>
#=> <DT>
#=> <A HREF="http://glitterbeat.com/" ...>Glitterbeat – Vibrant Global Sounds</A>
#=> <DT>
#=> <A HREF="http://www.puzz.com/stickelsframegames.html" ...>Stickels Frame Games</A>
#=> <DT>
#=> <A HREF="http://silentdiscosquad.com/" ...>Silent Disco Squad</A>
#=> <DT>
#=> <A HREF="http://innerfire.ca/" ...>None</A>
#=> <DT>
#=> <A HREF="http://lidopepper.tumblr.com/" ...>Lido Pimienta - La Papessa</A>
#=> <DT>
#=> <A HREF="http://cabaretdiaspora.wordpress.com/" ...>Radio Cabaret Diaspora | Musiques urbaines</A>
#=> <DT>
You can tell Nokogiri to forge on using the 'huge' config option:
doc = Nokogiri.HTML( myhtml, &:huge )
I'd personally just lightly fix up the HTML in question using gsub:
html = DATA.read
# Close each <DT> right after the link that ends its line.
html.gsub!(/<DT>.+?<\/A>$/, '\\0</DT>')
doc = Nokogiri.HTML(html)
p doc.css('a').length
#=> 399
(I checked: there are only 399 links in the file, not 400.)
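To see what that substitution does on its own, here is a small standalone sketch. The two bookmark lines and their example.com URLs are made up for illustration; only the regex comes from the answer.

```ruby
html = <<~HTML
  <DL><p>
  <DT><A HREF="http://example.com/one">One</A>
  <DT><A HREF="http://example.com/two">Two</A>
  </DL><p>
HTML

# In Ruby, `$` matches at the end of every line and `.` does not cross newlines,
# so each <DT>...</A> line is matched separately. '\\0' re-inserts the whole match.
fixed = html.gsub(/<DT>.+?<\/A>$/, '\\0</DT>')

puts fixed
```

With every <DT> closed on its own line, the parser no longer nests each entry inside the previous one, so the depth limit is not reached.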

Sanitizing HTML using Nokogiri

I'm trying to clean up some CMS entered HTML that has extraneous paragraph tags and br tags everywhere. The Sanitize gem has proved very useful to do this but I am stuck with a particular issue.
The problem is when there is a br tag directly after or before a paragraph tag, e.g.:
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
I would like to strip out the extraneous first and last br tags, but not the middle one.
I'm very much hoping I can use a sanitize transformer to do this but can't seem to find the right matcher to achieve this.
Any help would be much appreciated.
Here's how to locate the particular <br> nodes that are contained by <p>:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
EOT
doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]
Once we know we can find them, it's easy to remove specific ones:
br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n \n Some text here\n <br>\n Some more text\n \n</p>\n"
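Notice that Nokogiri removed them, but the neighbouring Text nodes that hold their "\n" are left behind. A browser will collapse that whitespace and not display the line ends, but if you want tidy output you can remove those too (starting from a freshly parsed doc). Only remove siblings that are blank, otherwise real content such as "Some text here" is deleted along with the whitespace:
br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  # Drop whitespace-only text nodes on either side of the <br>, then the <br> itself.
  [br.previous_sibling, br.next_sibling].each do |sibling|
    sibling.remove if sibling && sibling.text? && sibling.blank?
  end
  br.remove
end
doc.to_html
# => the first and last <br> and the blank text next to them are gone; both text lines and the middle <br> remain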
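# Sanitize transformer that strips <br> elements from the start of each <p>.
initial_linebreak_transformer = lambda { |options|
  node = options[:node]
  # present? comes from ActiveSupport (Rails).
  if node.present? && node.element? && node.name.downcase == 'p'
    first_child = node.children.first
    # Guard against empty paragraphs, where first_child is nil.
    if first_child && first_child.name.downcase == 'br'
      first_child.unlink
      # Call again in case another <br> is now the first child.
      initial_linebreak_transformer.call options
    end
  end
}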

how to retrieve data from html between <span> and </span>

I want to get the rating (from 1 to 5) from Amazon customer reviews.
I checked the page source and found that the relevant part looks like this:
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
How can I use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I do not know R well, but I can give you the XPath string. It seems you want the text of the first span that has no attributes, which would be:
//span[not(@*)][1]/text()
You can pass this string to xpathSApply.

Indenting generated markup in Jekyll/Ruby

Well this is probably kind of a silly question but I'm wondering if there's any way to have the generated markup in Jekyll to preserve the indentation of the Liquid-tag. World doesn't end if it isn't solvable. I'm just curious since I like my code to look tidy, even if compiled. :)
For example I have these two:
base.html:
<body>
<div id="page">
{{content}}
</div>
</body>
index.md:
---
layout: base
---
<div id="recent_articles">
{% for post in site.posts %}
<div class="article_puff">
<img src="/resources/images/fancyi.jpg" alt="" />
<h2>{{post.title}}</h2>
<p>{{post.description}}</p>
Read more
</div>
{% endfor %}
</div>
Problem is that the imported {{content}}-tag is rendered without the indentation used above.
So instead of
<body>
  <div id="page">
    <div id="recent_articles">
      <div class="article_puff">
        <img src="/resources/images/fancyimage.jpg" alt="" />
        <h2>Gettin' down with responsive web design</h2>
        <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
        Read more
      </div>
    </div>
  </div>
</body>
I get
<body>
  <div id="page">
    <div id="recent_articles">
  <div class="article_puff">
    <img src="/resources/images/fancyimage.jpg" alt="" />
    <h2>Gettin' down with responsive web design</h2>
    <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
    Read more
  </div>
</div>
  </div>
</body>
Seems like only the first line is indented correctly. The rest starts at the beginning of the line... So, multiline liquid-templating import? :)
Using a Liquid Filter
I managed to make this work using a liquid filter. There are a few caveats:
Your input must be clean. I had some curly quotes and non-printable chars that looked like whitespace in a few files (copypasta from Word or some such) and was seeing "Invalid byte sequence in UTF-8" as a Jekyll error.
It could break some things. I was using <i class="icon-file"></i> icons from twitter bootstrap. It replaced the empty tag with <i class="icon-file"/> and bootstrap did not like that. Additionally, it screws up the octopress {% codeblock %}s in my content. I didn't really look into why.
While this will clean the output of a liquid variable such as {{ content }} it does not actually solve the problem in the original post, which is to indent the html in context of the surrounding html. This will provide well formatted html, but as a fragment that will not be indented relative to tags above the fragment. If you want to format everything in context, use the Rake task instead of the filter.
require 'rubygems'
require 'json'
require 'nokogiri'
require 'nokogiri-pretty'
module Jekyll
module PrettyPrintFilter
def pretty_print(input)
#seeing some ASCII-8 come in
input = input.encode("UTF-8")
#Parsing with nokogiri first cleans up some things the XSLT can't handle
content = Nokogiri::HTML::DocumentFragment.parse input
parsed_content = content.to_html
#Unfortunately nokogiri-pretty can't use DocumentFragments...
html = Nokogiri::HTML parsed_content
pretty = html.human
#...so now we need to remove the stuff it added to make valid HTML
output = PrettyPrintFilter.strip_extra_html(pretty)
output
end
def PrettyPrintFilter.strip_extra_html(html)
#type declaration
html = html.sub('<?xml version="1.0" encoding="ISO-8859-1"?>','')
#second <html> tag
first = true
html = html.gsub('<html>') do |match|
if first == true
first = false
next
else
''
end
end
#first </html> tag
html = html.sub('</html>','')
#second <head> tag
first = true
html = html.gsub('<head>') do |match|
if first == true
first = false
next
else
''
end
end
#first </head> tag
html = html.sub('</head>','')
#second <body> tag
first = true
html = html.gsub('<body>') do |match|
if first == true
first = false
next
else
''
end
end
#first </body> tag
html = html.sub('</body>','')
html
end
end
end
Liquid::Template.register_filter(Jekyll::PrettyPrintFilter)
Using a Rake task
I use a task in my rakefile to pretty print the output after the jekyll site has been generated.
require 'nokogiri'
require 'nokogiri-pretty'
desc "Pretty print HTML output from Jekyll"
task :pretty_print do
  # Change public to _site or wherever your output goes.
  html_files = File.join("**", "public", "**", "*.html")
  Dir.glob html_files do |html_file|
    puts "Cleaning #{html_file}"
    contents = File.read(html_file)
    begin
      # We're going to parse it as XML so we can apply an XSLT.
      html = Nokogiri::XML(contents)
      # The human() method is from nokogiri-pretty. Just an XSL transform on the XML.
      pretty_html = html.human
    rescue StandardError => msg
      puts "Failed to pretty print #{html_file}: #{msg}"
      # Skip the write so a failed run does not overwrite the file with nothing.
      next
    end
    # Yep, we're overwriting the file. Potentially destructive.
    File.write(html_file, pretty_html)
  end
end
We can accomplish this by writing a custom Liquid filter to tidy the html, and then doing {{content | tidy }} to include the html.
A quick search suggests that the ruby tidy gem may not be maintained but that nokogiri is the way to go. This will of course mean installing the nokogiri gem.
See advice on writing liquid filters, and Jekyll example filters.
An example might look something like this: in _plugins, add a script called tidy-html.rb containing:
require 'nokogiri'
module TextFilter
  # Re-serialize the fragment through Nokogiri to normalize the markup.
  def tidy(input)
    Nokogiri::HTML::DocumentFragment.parse(input).to_html
  end
end
Liquid::Template.register_filter(TextFilter)
(Untested)
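Tidying the markup does not by itself indent the fragment relative to the layout around it, which is what the question asks for. A minimal sketch of a filter that does this pads every line after the first with a fixed number of spaces. The module name IndentFilter and the filter name indent are hypothetical, not part of Jekyll or Liquid, and the width has to be chosen by hand to match the layout.

```ruby
# Hypothetical plugin, e.g. saved as _plugins/indent-filter.rb.
module IndentFilter
  # Pad every line except the first with `width` spaces.
  # The first line is left alone because the layout already indents the {{ content }} tag.
  def indent(input, width = 2)
    padding = ' ' * width.to_i
    first, *rest = input.to_s.lines
    padded = rest.map { |line| line.strip.empty? ? line : padding + line }
    ([first] + padded).join
  end
end

# Register with Liquid only when it is loaded, so the file also runs standalone.
Liquid::Template.register_filter(IndentFilter) if defined?(Liquid)

sample = "<div id=\"recent_articles\">\n  <p>Hello</p>\n</div>\n"
indented = Object.new.extend(IndentFilter).indent(sample, 4)
puts indented
```

In base.html this would be used as {{ content | indent: 4 }}. Note that it also pads lines inside <pre> blocks, which changes how they render.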