Parsing with Xpath - html

Consider the following HTML:
<div class='data'>
<div class='user_name'>Lankesh</div>
<div class='user_details'>
<div class='country'>Srilanka</div>
<div class='age'>9</div>
</div>
<div class='user_name'>Bob</div>
<div class='user_details'>
<div class='country'>US</div>
<div class='age'>54</div>
</div>
<div class='user_name'>Deiter</div>
<div class='user_details'>
<div class='country'>Germany</div>
<div class='age'>34</div>
</div>
<div class='user_name'>Yakob</div>
<div class='user_details'>
<div class='country'>Syria</div>
<div class='age'>90</div>
</div>
<div class='user_name'>Qureshi</div>
<div class='user_details'>
<div class='country'>Afgan</div>
<div class='age'>56</div>
</div>
<div class='user_name'>Smith George</div>
<div class='user_details'>
<div class='country'>India</div>
<div class='age'>23</div>
</div>
</div>
And the following Ruby code:
require 'nokogiri'
sample_html = File.open("r.htm", "r").read
n = Nokogiri::HTML::parse sample_html
xpaths = {}
xpaths[:name] = "//div[@class = 'user_name']/text()"
xpaths[:country] = "//div[@class = 'country']/text()"
xpaths[:age] = "//div[@class = 'age']/text()"
full_path = xpaths.values.join(" | ")
n.xpath(full_path).each do |i|
  puts i
end
This works to extract the data, but how can I chunk the results (name, age and country) so that I can load the parsed data into a structure more easily?
Since the name is outside the user_details block, I cannot write a query like //div[@class = 'user_details'] and extract each attribute from it.
I know I can chunk the array into groups of 3, but I am looking for an XPath-based solution, because my actual data has a varying number of child properties.
Silly, but: is there any way to inject characters into the extracted text during parsing?
Any ideas?

Let me start out by saying it would be better to adjust the HTML to wrap each user block in its own containing div:
<div class='user'>
<div class='name'>John</div>
<div class='details'>
<div class='country'>US</div>
...
</div>
</div>
Then you could simply query each user block separately using "//div[@class = 'user']". You are probably not in control of the HTML, though.
Given the current situation, I would propose to simply obtain the user_name divs as well as the user_details divs and zip them together. Then you can create a Hash from each user_details block based on its child divs (.xpath("div")). This works for any number of child properties and uses each child's class attribute as the Hash key and its text as the value. Note that this implementation only handles single-level user_details, and that zip pairs the two lists by position, so it assumes every user_name is followed by exactly one user_details block. It will also have to be adjusted if some user_details child divs lack a class attribute, but judging from your example input they all have one.
require 'pp'
require 'nokogiri'
sample_html = File.open("r.htm", "r").read
n = Nokogiri::HTML::parse sample_html
user_names = n.xpath("//div[@class = 'user_name']")
user_details = n.xpath("//div[@class = 'user_details']")
users = user_names.zip(user_details).map do |name, details|
  {
    name: name.text,
    details: Hash[details.xpath("div").map { |d| [d['class'].to_sym, d.text] }]
  }
end
pp users
# [{:name=>"Lankesh", :details=>{:country=>"Srilanka", :age=>"9"}},
# {:name=>"Bob", :details=>{:country=>"US", :age=>"54"}},
# {:name=>"Deiter", :details=>{:country=>"Germany", :age=>"34"}},
# {:name=>"Yakob", :details=>{:country=>"Syria", :age=>"90"}},
# {:name=>"Qureshi", :details=>{:country=>"Afgan", :age=>"56"}},
# {:name=>"Smith George", :details=>{:country=>"India", :age=>"23"}}]
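For comparison, here is a minimal sketch of a pairing that does not depend on the two lists lining up by position. It is written in Python with the standard library's xml.etree.ElementTree rather than Nokogiri, and it assumes the markup is well-formed enough to parse as XML (the sample is). group_users is a hypothetical helper name. It walks the children of the data div in document order, starts a new record at each user_name, and attaches whatever child properties the next user_details block contains:

```python
import xml.etree.ElementTree as ET

# First two users from the sample markup in the question.
SAMPLE = """
<div class='data'>
  <div class='user_name'>Lankesh</div>
  <div class='user_details'>
    <div class='country'>Srilanka</div>
    <div class='age'>9</div>
  </div>
  <div class='user_name'>Bob</div>
  <div class='user_details'>
    <div class='country'>US</div>
    <div class='age'>54</div>
  </div>
</div>
"""


def group_users(markup):
    """Group each user_name with the user_details block that follows it."""
    root = ET.fromstring(markup.strip())  # root is the <div class='data'> element
    users = []
    for child in root:  # direct children, in document order
        css_class = child.get("class")
        if css_class == "user_name":
            users.append({"name": (child.text or "").strip(), "details": {}})
        elif css_class == "user_details" and users:
            # Any number of child properties, keyed by their class attribute.
            for prop in child:
                users[-1]["details"][prop.get("class")] = (prop.text or "").strip()
    return users


users = group_users(SAMPLE)
print(users)
```

A user_name without a following user_details block simply ends up with an empty details dict, instead of shifting every later pair by one as a positional zip would.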

Related

Beautiful Soup: extracting from deeply nested <div>'s

Trying to extract Message text from:
<div class="Item ItemDiscussion Role_Member" id="Discussion_2318">
<div class="Discussion">
<div class="Item-BodyWrap">
<div class="Item-Body">
<div class="Message">
Hello<br/>I have a very interesting observation on nature of birds in Alaska ... <br/>
Was there 10/19/18 has anyone heard of this </div>
<div class="ReactionRecord"></div><div class="Reactions"></div> </div>
</div>
</div>
</div>
I have got this bit with:
tag = soup.find('div', {'class' : 'ItemDiscussion'})
Next I am trying to go down with:
s = str((tag.contents)[1])
sp = BeautifulSoup(s)
sp.contents
But this does not help much. How to get message text from <div class="Message"> ?
You can find the element directly from the soup:
discussion_div = soup.find("div", {"class": "ItemDiscussion"})
message_text = discussion_div.find("div", {"class": "Message"}).text
You can select any element with select_one() by passing it a CSS selector. select_one() returns only the first matching element; if you want more than one, use select(), which returns a list of all matches. Here is an example:
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("div.Item div.Discussion div.Item-BodyWrap div.Item-Body div.Message").text)
You can also select the element by a single class if that class is unique:
print(soup.select_one("div.Message").text)

html merge 2 columns in 1 with 100% width bootstrap 2

I have the following HTML markup in my .NET MVC project:
<div class="row">
<div class="span6">@Model.Data</div>
<div class="span6">@Model.OtherData</div>
</div>
I'm getting data from server. So I want to do the following:
If data is empty then show other data with width = 100%.
Just to clarify I want to do something like that:
<div class="row">
<div class="span12">@Model.OtherData</div>
</div>
Or vice versa.
Is there a way to do that? Maybe with using different HTML tags / CSS classes.
Essentially, you just want to conditionally display @Model.Data only if it isn't null. You can also put the column class in a variable and change that variable depending on whether @Model.Data exists. Try something like this:
@{
    // Default to half width; use full width when there is no Data column.
    var colClass = "span6";
    if (Model.Data == null) {
        colClass = "span12";
    }
}
<div class="row">
    @if (Model.Data != null) {
        <div class="@colClass">@Model.Data</div>
    }
    <div class="@colClass">@Model.OtherData</div>
</div>

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing and which only returns children that are elements:
children = divs.element_children

How can I get list of elements or data which are on same level with same attributes?

I have a web application with one HTML page.
The page structure looks like this:
<div class = 'abc'>
<div class = 'pqr'>test1</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>test2</div>
</div>
<div class = 'abc'>
<div class = 'pqr'>-</div>
</div>
Here I want to take the data from test1 to test2.
I tried XPath with [node number], but every node is found at position [1].
Is there any way to get all the data, or a list of elements, from test1 to test2, including the "-" entries?
I have seen this kind of issue before.
You have to use following-sibling here.
First I use this type of xpath :
//div[text()='test1']/../following-sibling::div/div[@class='pqr' and not(contains(text(),'test'))]
Then you need to adjust your script (note: I wrote my code in Java).
Logic:
while(element found text = '-')
{
//get data here
}
Please try this approach.
I guess you want the following xpath :
(//div[@class='pqr'])[position()<=4]
Notice the brackets () before position() predicate.
output in xpath tester :
Element='<div class="pqr">test1</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">-</div>'
Element='<div class="pqr">test2</div>'
I don't think you can use the test1 and test2 elements as identifiers, because they are on the same level as the nodes you want to collect. Otherwise, you can use findElements(By.xpath("pattern_to_search")), which returns a collection of the elements matching your pattern.
One more way, without using XPath:
List<WebElement> element = driver.findElements(By.className("pqr"));
// size() - 1 skips the last element, which in the sample is the "-" after test2
for (int i = 0; i < element.size() - 1; i++) {
    System.out.println(element.get(i).getText());
}
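If the number of "-" rows between the two markers is not fixed, the range can be computed instead of hard-coded. The following is a minimal sketch in Python with the standard library (not the Selenium/Java setup used in the answers above). It assumes the fragment is wrapped in a single root element so that it parses as XML; texts_between is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# The structure from the question, wrapped in one root element.
PAGE = """
<body>
  <div class='abc'><div class='pqr'>test1</div></div>
  <div class='abc'><div class='pqr'>-</div></div>
  <div class='abc'><div class='pqr'>-</div></div>
  <div class='abc'><div class='pqr'>test2</div></div>
  <div class='abc'><div class='pqr'>-</div></div>
</body>
"""


def texts_between(markup, start, end):
    """Return the pqr texts from `start` through `end`, inclusive."""
    root = ET.fromstring(markup.strip())
    # All pqr divs in document order, regardless of their parent.
    texts = [(d.text or "").strip() for d in root.findall(".//div[@class='pqr']")]
    first = texts.index(start)      # raises ValueError if the marker is missing
    last = texts.index(end, first)  # first occurrence of `end` at or after `start`
    return texts[first:last + 1]


print(texts_between(PAGE, "test1", "test2"))
```

The same index-and-slice idea carries over to a list of WebElement texts collected with findElements.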

How to select div by text content using Beautiful Soup?

Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.
Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.
I've already scraped all this HTML into a text, now how to fish out the Biology grades?
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Biology B</div>
<div class = "score">Chemistry A</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
</div>
<div class = "student">
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
<div class = "score">Chemistry C+</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Bangladeshi History C</div>
<div class = "score">Biology B</div>
</div>
I'm using beautiful soup, and I think I'm going to have to find divs where Text includes "Biology"?
This is only for a quick scrape and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, and I don't know a thing about WSDL.
Short version: http://www.legis.ga.gov/Legislation/en-US/Search.aspx, to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, instead of a div containing nothing, they just don't have a div there at all. So sometimes the timeline is in div 3, sometimes 2, etc.
(1) To get just the Biology grade, it is almost a one-liner:
import bs4, re
soup = bs4.BeautifulSoup(html, "html.parser")
# Find every text node that contains "Biology"
scores_string = soup.find_all(text=re.compile('Biology'))
# The grade is the last whitespace-separated token of each match
scores = [score_string.split()[-1] for score_string in scores_string]
print(scores_string)
print(scores)
The output looks like this (Python 2 prints each string with a u prefix):
['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']
(2) If you need the tags themselves for further tasks, take the parent of each matched string:
import bs4, re
soup = bs4.BeautifulSoup(html, "html.parser")
scores = soup.find_all(text=re.compile('Biology'))
divs = [score.parent for score in scores]
print(divs)
Output looks like this:
[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]
In conclusion, you can use attributes and methods such as .parent, find_parent() and find_next_siblings() to move around the HTML tree.
The Beautiful Soup documentation has more information about how to navigate the tree.
And good luck with your work.
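Another way (using a CSS selector) is:
divs = soup.select('div.score:-soup-contains("Biology")')
Restricting the selector to div.score matters here, because a plain div would also match each enclosing student div, whose text contains "Biology" as well.
EDIT:
BeautifulSoup4 4.7.0+ (SoupSieve) is required. The :-soup-contains() spelling needs Soup Sieve 2.1+; older versions use :contains().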
You can extract them searching for any <div> element that has score as class attribute value, and use a regular expression to extract its biology score:
from bs4 import BeautifulSoup
import sys
import re
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html')
for div in soup.find_all('div', attrs={'class': 'score'}):
    t = re.search(r'Biology\s+(\S+)', div.string)
    if t:
        print(t.group(1))
Run it like:
python3 script.py htmlfile
That yields:
A+
B
B
B
B
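Since the HTML has already been scraped into text, a quick standard-library alternative is to run the regular expression over the raw string. This is only a sketch for one-off scrapes: it assumes a grade is a letter A-F with an optional + or -, and it ignores the document structure, so it cannot tell which student a grade belongs to.

```python
import re

# Two of the student blocks from the question, as raw text.
html = """
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Bangladeshi History C</div>
<div class = "score">Biology B</div>
</div>
"""

# Capture the grade token that follows the word "Biology".
grades = re.findall(r"Biology\s+([A-F][+-]?)", html)
print(grades)
```

If a student could be missing the Biology row, the parser-based answers above are the safer choice, because they keep each score tied to its student div.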