Extract specific nodes in HTML using Nokogiri


I want to extract a few values from HTML using Nokogiri in this Ruby script:
#!/usr/bin/ruby
require 'nokogiri'
doc = Nokogiri::HTML(<<-END_OF_HTML)
<html>
<meta content="text/html; charset=UTF-8"/>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b> </li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b> </li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p>
<img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Specifically, I want to get the values of some of the list items, such as "Identifier:" and "User name:", and store them in strings.
I'm sure I need to use XPath, but that's about all I know. My understanding is that XPath does node selection.
What do I need to specify with XPath, and how do I then get the selection into variables?
Full Solution
Ultimately I was really asking two questions.
Question 1 (implicit): How can I see the results of a search using xpath?
doc.xpath("SPECIFY_SEARCH_HERE").each do |node|
puts node
end
This works because xpath returns a Nokogiri::XML::NodeSet, which behaves much like an array: you can iterate over it and do what you want with each result (in my case, print it).
Question 2: How do I get the value of a particular list item?
str = doc.xpath("//ul/li[contains(b, 'Identifier')]/text()").to_s.strip
My analysis of this line is limited, but it appears to work as follows:
1. //ul/li finds the li elements that are children of a ul.
2. [contains(b, 'Identifier')] keeps only the li whose bold label (b) contains 'Identifier'.
3. /text() extracts the text nodes of the li selected in #2, which hold the value.
4. .to_s.strip converts the selection to a string and removes leading/trailing whitespace.
For anyone better versed in HTML/Ruby/XPath, feel free to update the explanation for precision.

This will return both values you asked for:
//ul/li[contains(b, 'Identifier') or contains(b, 'User name')]/text()
Of course, you can modify the XPath to get only one value at a time:
//ul/li[contains(b, 'Identifier')]/text()

Related

Scraping for a specific title using Nokogiri in Ruby

I'm currently practicing web scraping using the NYT Best Sellers website. I want to get the title of the #1 book on the list and found the HTML element:
<div class="book-body">
<p class="freshness">12 weeks on the list</p>
<h3 class="title" itemprop="name">CRAZY RICH ASIANS</h3>
<p class="author" itemprop="author">by Kevin Kwan</p>
<p itemprop="description" class="description">A New Yorker gets a surprise when she spends the summer with her boyfriend in Singapore.</p>
</div>
I'm using the following code to grab the specific text:
doc.css(".title").text
However, it returns the titles of every book on the list. How would I go about getting just the specific book title, "CRAZY RICH ASIANS"?

If you look at the return value of doc.css(".title"), you will see that it is a collection of all the titles, as Nokogiri::XML::Element objects.
CSS, to my knowledge, does not have a selector for targeting the first element of a given class (someone may certainly correct me if I am wrong), but getting just the first element from a Nokogiri::XML::NodeSet is still very simple, as it acts like an Array in many cases. For example:
doc.css(".title")[0].text
You could also use xpath to select just the first one (since XPath does support index based selection) like so:
doc.xpath("(//h3[@class='title'])[1]").text
Please Note:
Ruby indexes start at 0 as in the first example;
XPath indexes start at 1 as in the second example.

How to get all text within node on same line using XPath 1.0

<a href="/company/10676229"
onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions']);"
title="View company">
<strong>RECRUIT</strong>
" ZONE "
<strong>RECRUITMENT</strong>
" LIMITED "
</a>
I'm trying to extract the text from the above a node in the form "RECRUIT ZONE RECRUITMENT LIMITED", all on one line, but so far I can only get the pieces on separate lines. Since I'm running over a few hundred of these records, all with different patterns of bold and regular text, it would be good if I could use an XPath expression to extract all the text on one line straight away, rather than having to use a lot of logic afterwards to concatenate the pieces. I'm stuck with XPath 1.0.
I feel like there would be an expression to do this but struggled with research so far and not sure what else to try.
So far I've tried:
//a[@title="View company"]//text()[normalize-space()]
which returns a list, but the text has been split up, so for each a node the bold text appears on different lines from the rest.

XPath 1.0
As already answered by @Andersson (+1), this XPath,
normalize-space(//a[@title="View company"])
will return
RECRUIT " ZONE " RECRUITMENT " LIMITED "
for the markup shown in your question.
In the comments, you've said that your actual markup will include multiple such a elements and that you'd like to select and similarly obtain the text for each. This is not possible with XPath 1.0 alone; you'll have to iterate over selected nodes and process them in the hosting language. In XPath 1.0, only the first of all such a elements will be processed by normalize-space().
XPath 2.0
XPath 2.0 can handle the task with this XPath,
//a[@title="View company"]/normalize-space()
which will apply normalize-space(), which first takes the string value and then trims leading and trailing space and consolidates interior space, for each node selected in the previous step.

Try below to get text content of link as single string:
normalize-space(//a[@title="View company"])

What is regex expression for a string after I use nokogiri to scrape

I have this string and it is in an html document of 100 other names that are formatted the same:
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
I want to save 'Physical education sed' under a name column and '15184745922' under a number column.
I was wondering how to do this in Ruby.
In Nokogiri I can get only the li's by doing this:
puts page.css("ul li").text
but then it all comes out as one word: "Physical education sed15184745922"
I was thinking regex is the way to go but I am stumped with that.
I did split it on the li
full_contact = page.css("ul li")[22]
split_contact_on_li = full_contact.to_s.split(/(\W|^)li(\W|$)/).map(&:to_sym)
puts split_contact_on_li
and I get this
<
>
Physical education sed<span class="meta"><ul>
<
>
15184745922<
/
>
</ul></span>
<
/
>
The same number of lines is shown for each contact_info; the name is always on the third line, before the span class, and the number is always on the 6th line.
Occasionally there is an email address on the 6th line instead, but not often.
So should I match the second and third angle brackets, pull the information up to the third and fourth brackets, and then put it into arrays called name and number?

You shouldn't use a regex to parse XHTML, since regexes cope poorly with nested markup; you should use an HTML parser instead. However, if you want to use a regex, you can use one like this:
<li>(.*?)<.*?<li>(.*?)<
The idea behind this regex is to use capturing groups (using parentheses) to capture the content you want. So, for your sample input the match information is:
MATCH 1
Group 1. [4-26] `Physical education sed`
Group 2. [53-64] `15184745922`
For example:
#!/usr/bin/env ruby
string = "<li>Physical education sed<span class=\"meta\"><ul><li>15184745922</li></ul></span></li>"
one, two = string.match(/<li>(.*?)<.*?<li>(.*?)</i).captures
p one #=> "Physical education sed"
p two #=> "15184745922"

Why don't you just do a regex on the string "physical education sed15184745922"? You can match on the first digit, and get back the number and the preceding text.
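A minimal sketch of that idea in plain Ruby follows. The method name split_contact is hypothetical, and the pattern assumes the entry is some non-digit text followed only by digits; an entry that ends in an email address instead would not match and returns nil.

```ruby
# Hypothetical helper: splits "name text" + "digits" into [name, number].
# Returns nil when the text does not end in a run of digits.
def split_contact(text)
  # \D*? takes the non-digit prefix, \d+ takes the digits that follow;
  # the \s* parts tolerate stray whitespace such as a trailing newline.
  match = text.match(/\A(\D*?)\s*(\d+)\s*\z/)
  return nil unless match

  name, number = match.captures
  [name.strip, number]
end

name, number = split_contact("Physical education sed15184745922\n")
puts name
puts number
```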

I don't know how to use Ruby, but if I understand your question correctly I would take advantage of the gsub function (or Ruby's equivalent). It might not be the prettiest approach, but since we just want the text in one variable and the numbers in another, we can just replace the characters we don't want with empty values.
v1 = page.css('ul li').text
v2 = gsub('\d*', '', v1)
v3 = gsub('[^\d]', '', v1)
v1 gets the full text value, v2 replaces all numeric characters with '', and v3 replaces all non-numeric characters with '', giving us two new variables to put wherever we please.
Again, I don't know how to use Ruby, but in R I know that I could get all of the values from the page using the xpath you provided ("ul li") into a vector and then loop across the vector performing the above steps on each element. I'm not sure if that adequately answers your question, but hopefully the gsub function gets you closer to what you want.
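For reference, a rough Ruby equivalent of the steps above might look like the following sketch. It uses String#gsub with regular-expression literals; the sample string stands in for the text of a single li, and the variable names are only illustrative.

```ruby
# Text of one <li>, as it comes back from Nokogiri's .text in the question.
v1 = "Physical education sed15184745922"

name   = v1.gsub(/\d/, '')   # remove every digit, keeping the text
number = v1.gsub(/\D/, '')   # remove every non-digit, keeping the number

puts name
puts number
```

One caveat: any digit that is part of the name itself would end up in number, so this only suits entries where digits appear solely in the trailing number.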

You need to use your HTML parser (Nokogiri) and regular expressions together. First, use Nokogiri to traverse down to the first parent node that contains all the text you need, then regex the text to get what you need.
Also, consider using .xpath instead of .css; it provides much more functionality for searching and scraping just what you want. Given your example, you could do it like so:
page.xpath("//span[@class='meta']/parent::li").map do |i|
i.text.scan(/^([a-z\s]+)(\d+)$/i).flatten
end
#=> [['Physical education sed', '15184745922'], ['the next string', '1234567890'], ...]
And now you have a two-dimensional array you can iterate over and save each pair.
This bit of xpath business: "//span[@class='meta']/parent::li" is doing what .css can't do, returning the parent node that has the text and specific children nodes you want to scrape against.

What is the Xpath query for my XML?

I have an XML document with this specific structure:
<ul>
<li>
the
<a>dog</a>
is black
</li>
<li>
the
<a>cat</a>
is white
</li>
</ul>
But I also have this:
<ul>
<li>
the bird is blue
</li>
<li>
the
<a>frog</a>
</li>
</ul>
I don't know whether there is an <a> in my <li>, or where it is.
I would like an XPath query that returns sentences like "the dog is black", "the cat is white", "the bird is blue" and "the frog".
Thanks!

If you're bound to XPath 1.0, you cannot get the sentences as separate tokens. You can get all text in all list elements using
//ul//text()
but for the first snippet this will return something like "the dog is black the cat is white".
If you need the sentences separated, retrieve the list items and put the sentences together outside XPath (e.g. in PHP, Java, or whatever you're using). How to do this differs from language to language; have a look at the reference, refine the question, or ask another question.
//ul/li
With XPath 2.0 you're in luck and can use one of these queries:
//ul/li/data(.)
//ul/li/string-join(.//text(), ' ')
If the first one returns what you need, use it. If there are problems with whitespace (whitespace handling differs between implementations, but can usually be configured), go for the more flexible second query and adjust it as needed.

Thanks for your reply. I use XPath in an iOS application with an HTML parser: hpple (https://github.com/topfunky/hpple)
I think it uses XPath 1.0, because the log tells me the string-join function isn't recognized.
//ul//text()
works, but it returns the text word by word rather than line by line.

Find the character index of a node within its parent node with Hpricot

Suppose I have the following HTML:
html = "Four score and seven <b>years ago</b>"
I want to parse this with Hpricot:
doc = Hpricot(html)
Find the <b> node:
node = doc.at('b')
and then get the character index of the <b> node within its parent:
node.character_index
=> 22
How can I do this (i.e., what's the real version of the character_index() function I just made up)?

I don't think Hpricot works like that. Here is what I get doing a "node.inspect" based on your example
node.inspect
"{elem <b> \"years\" </b>}"
So, the position in the overall text that you are asking for just isn't there.
However, there are a limited number of things you'd probably want to use the index for, and you may be able to do these through the standard Hpricot methods.
