Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.
Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.
I've already scraped all this HTML into text; now how do I fish out the Biology grades?
<div class = "student">
<div class = "score">Algebra C-</div>
<div class = "score">Biology A+</div>
<div class = "score">Chemistry B</div>
</div>
<div class = "student">
<div class = "score">Biology B</div>
<div class = "score">Chemistry A</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
</div>
<div class = "student">
<div class = "score">Algebra A</div>
<div class = "score">Biology B</div>
<div class = "score">Chemistry C+</div>
</div>
<div class = "student">
<div class = "score">Alchemy D</div>
<div class = "score">Algebra A</div>
<div class = "score">Bangladeshi History C</div>
<div class = "score">Biology B</div>
</div>
I'm using Beautiful Soup, and I think I'm going to have to find divs whose text includes "Biology"?
This is only for a quick scrape and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, and I don't know a thing about WSDL.
Short version: http://www.legis.ga.gov/Legislation/en-US/Search.aspx , to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, instead of a div containing nothing, they just don't have a div there at all. So sometimes the timeline is in div 3, sometimes 2, etc.
(1) To get just the Biology grades, it is almost a one-liner.
import bs4, re
soup = bs4.BeautifulSoup(html, 'html.parser')
score_strings = soup.find_all(string=re.compile('Biology'))
scores = [score_string.split()[-1] for score_string in score_strings]
print(score_strings)
print(scores)
The output looks like this:
['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']
(2) To locate the tags themselves, which you may need for further tasks, find the parent of each matched string:
import bs4, re
soup = bs4.BeautifulSoup(html, 'html.parser')
scores = soup.find_all(string=re.compile('Biology'))
divs = [score.parent for score in scores]
print(divs)
Output looks like this:
[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]
In conclusion, you can use find_next_siblings()/parent/etc. to move around the HTML tree.
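For instance, here is a small navigation sketch that walks from a matched score div up to its enclosing student div (it reuses the soup and the re import from the snippets above; the variable names are just for illustration):
# builds on the soup from the snippets above
biology_div = soup.find("div", class_="score", string=re.compile("Biology"))
student_div = biology_div.parent  # the enclosing <div class="student">
other_scores = [d.get_text() for d in student_div.find_all("div", class_="score") if d is not biology_div]
print(other_scores)  # e.g. ['Algebra C-', 'Chemistry B']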
The BeautifulSoup documentation has more information about how to navigate the tree.
Good luck with your work.
Another way, using a CSS selector, is:
divs = soup.select('div.score:-soup-contains("Biology")')
EDIT:
BeautifulSoup4 4.7.0+ (SoupSieve) is required
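For completeness, a minimal sketch of pulling just the grades out with that selector (this assumes the scraped page source is already in a string named html):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# :-soup-contains needs beautifulsoup4 4.7.0+ (SoupSieve)
for div in soup.select('div.score:-soup-contains("Biology")'):
    grade = div.get_text(strip=True).split()[-1]  # last token, e.g. 'A+'
    print(grade)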
You can extract them by searching for every <div> element that has score as its class attribute value, and then use a regular expression to extract the Biology score:
from bs4 import BeautifulSoup
import sys
import re
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html.parser')
for div in soup.find_all('div', attrs={'class': 'score'}):
    t = re.search(r'Biology\s+(\S+)', div.string)
    if t: print(t.group(1))
Run it like:
python3 script.py htmlfile
That yields:
A+
B
B
B
B
Please help: I don't know how to select a specific div using BeautifulSoup when multiple divs have the same class name and there is no id attribute.
Web page that I am trying to scrape: https://www.helpmefind.com/rose/l.php?l=2.65689.
I want to select the contents of specific divs independently and then pass them to a CSV file. I got stuck since find_all returns multiple divs and I don't know how to restrict it further.
rose_div = rose.find_all("div", class_="hdg")
Returns:
[<div class="hdg">HMF Ratings:</div>, <div class="hdg">Origin:</div>, <div class="hdg">Class:</div>, <div class="hdg">Bloom:</div>, <div class="hdg">Parentage:</div>, <div class="hdg">Notes:</div>, <div class="hdg"> </div>]
I want to individually select the divs below:
<div class="hdg">Origin:</div>
<div class="hdg">Class:</div>
<div class="hdg">Bloom:</div>
<div class="hdg">Parentage:</div>
You can use the CSS selector div.hdg:contains("Origin:") to select the <div> with class="hdg" that contains the word "Origin:". To get the next element with class grp, you can add + .grp.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.helpmefind.com/rose/l.php?l=2.65689'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
origin = soup.select_one('div.hdg:contains("Origin:") + .grp').text
class_ = soup.select_one('div.hdg:contains("Class:") + .grp').text
bloom = soup.select_one('div.hdg:contains("Bloom:") + .grp').text
parentage = soup.select_one('div.hdg:contains("Parentage:") + .grp').text
print(origin)
print(class_)
print(bloom)
print(parentage)
Prints:
Bred by Arai (Japan, before 2009).
Floribunda.
Light pink and white, yellow stamens. Single (4-8 petals), cluster-flowered bloom form. Blooms in flushes throughout the season.
If you know the parentage of this rose, or other details, please contact us.
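If you eventually want these fields in a CSV, here is a small sketch along the same lines (the rose.csv file name and the list of labels are just assumptions for illustration):
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.helpmefind.com/rose/l.php?l=2.65689'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

labels = ['Origin:', 'Class:', 'Bloom:', 'Parentage:']
row = {}
for label in labels:
    node = soup.select_one(f'div.hdg:contains("{label}") + .grp')
    row[label.rstrip(':')] = node.text.strip() if node else ''

# rose.csv is a hypothetical output file name
with open('rose.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)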
Trying to extract Message text from:
<div class="Item ItemDiscussion Role_Member" id="Discussion_2318">
<div class="Discussion">
<div class="Item-BodyWrap">
<div class="Item-Body">
<div class="Message">
Hello<br/>I have a very interesting observation on nature of birds in Alaska ... <br/>
Was there 10/19/18 has anyone heard of this </div>
<div class="ReactionRecord"></div><div class="Reactions"></div> </div>
</div>
</div>
</div>
I have got this bit with:
tag = soup.find('div', {'class' : 'ItemDiscussion'})
Next I am trying to go down with:
s = str((tag.contents)[1])
sp = BeautifulSoup(s)
sp.contents
But this does not help much. How do I get the message text from <div class="Message">?
You can find the element from the soup directly.
discussion_div = soup.find("div", {"class": "ItemDiscussion"})
message_text = discussion_div.find("div", {"class": "Message"}).text
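Put together with the HTML fragment from the question, a minimal runnable sketch could look like this (it assumes the fragment is in a string named html; get_text with a separator collapses the <br/> tags into spaces):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the fragment above
discussion_div = soup.find("div", {"class": "ItemDiscussion"})
message_div = discussion_div.find("div", {"class": "Message"})
print(message_div.get_text(" ", strip=True))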
You can select any element using the select_one() function by passing it a CSS selector for that element. select_one() will only return one element; if you want more than one element, use select(), which will return a list of the found elements. Here is an example for you:
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("div.Item div.Discussion div.Item-BodyWrap div.Item-Body div.Message").text)
You can also select your element using a single class, if it is unique:
print(soup.select_one("div.Message").text)
I'm trying to get the text inside class="hardfact", but I'm also getting the text of class="hardfactlabel color_f_03" because that class is nested inside hardfact.
.text.strip() gets the text of both classes because they are nested.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import requests
import lxml
my_url = 'https://www.immowelt.de/expose/2QC5D4A?npv=52'
page = requests.get(my_url)
ct = soup(page.text, 'lxml')
specs = ct.find('div', class_="hardfacts clear").findAll('div', class_="hardfact")
for items in specs:
    e = items.text.strip()
    print(e)
I'm getting this
82.500 €
Kaufpreis
47 m²
Wohnfläche (ca.)
1
Zimmer
and I want this:
82.500 €
47 m²
1
Here is the html content you are trying to crawl:
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
What you want is to remove the inner div tags, so you can just decompose them:
for items in specs:
    items.div.decompose()
    e = items.text.strip()
    print(e)
If your first "hardfact" class doesn't contain the "strong" tag, you can just find the first element like so
e = items.find().text.strip()
but we can't do this so you have to decompose the div tag.
You can use stripped strings. You probably want to add a condition to ensure the list has a length of at least 3 before attempting to slice it.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.immowelt.de/expose/2QC5D4A?npv=52')
soup = bs(r.content, 'lxml')
items = soup.select('.hardfact')[:3]
for item in items:
    strings = [string for string in item.stripped_strings]
    print(strings[0])
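For example, the length check could look like this (a small sketch reusing the soup from above; the check itself is my addition, not part of the original snippet):
items = soup.select('.hardfact')
if len(items) >= 3:
    for item in items[:3]:
        print(next(item.stripped_strings))  # first stripped string only, e.g. '82.500 €'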
I am new to HTML scraping and R, so this is a tricky problem for me. I have an HTML structure specified like below (only the body part). I have two separate sections, each with x number of paragraphs. What I want is to pick out all paragraphs in section 1 into one object, and all paragraphs in section 2 into another object.
My current code looks like this:
docx <- read_html("Page.html")
sections = html_nodes(docx, xpath="//div[@class='sections']/*")
This gives me an xml_nodes object, List of 2, that has the paragraphs within. My problem then is that I cannot use xpathApply to a nodeset because it throws an error. But I want to pick out all the paragraphs like this:
subsparagraphs1 = html_nodes(sections[[1]], xpath="//p")
but it then picks out all paragraphs from the WHOLE html page, not the first section.
I tried to be more specific:
subsections = html_nodes(sections[[1]], xpath="./div/div/p")
then it picks out nothing, or this:
subsections = html_nodes(sections[[1]], xpath="/p[@class = 'pwrapper']")
which also results in nothing. Can anyone help me get around this problem?
best, Mia
This is the HTML structure I have, where I want Text 1, Text 2 and Text 3 saved in one object and Texts 4, 5 and 6 saved in another object.
<div class = "content">
<div class = "title"> ... </div>
<div class = "sections">
<div> ... >/div
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 1 </p>
<p class = "pwrapper"> Text 2 </p>
<p class = "pwrapper"> Text 3 </p>
</div>
<div> ... </div>
<div> ... </div>
<div> ... >/div
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 4 </p>
<p class = "pwrapper"> Text 5 </p>
<p class = "pwrapper"> Text 6 </p>
</div>
<div> ... </div>
<div> ... </div>
</div>
</div>
Even if your input XML contains syntax errors, I will presume that the sectionHeader elements are siblings, i.e. they are on the same level under the same parent (sections).
In that case, your XPaths will be:
//div[@class = 'sections']//div[@class='sectionHeader'][1]//p[@class = 'pwrapper']/text()
//div[@class = 'sections']//div[@class='sectionHeader'][2]//p[@class = 'pwrapper']/text()
All that varies is the index into the //div[@class='sectionHeader'] sequence (1 and 2 – XPath starts with 1, not 0).
Please let me know if the structure of the input XML is different than what I observed/assumed.
P.S.: You may simplify the XPaths by removing the first path portion: //div[@class = 'sections'].
Consider the following HTML:
<div class='data'>
<div class='user_name'>Lankesh</div>
<div class='user_details'>
<div class='country'>Srilanka</div>
<div class='age'>9</div>
</div>
<div class='user_name'>Bob</div>
<div class='user_details'>
<div class='country'>US</div>
<div class='age'>54</div>
</div>
<div class='user_name'>Deiter</div>
<div class='user_details'>
<div class='country'>Germany</div>
<div class='age'>34</div>
</div>
<div class='user_name'>Yakob</div>
<div class='user_details'>
<div class='country'>Syria</div>
<div class='age'>90</div>
</div>
<div class='user_name'>Qureshi</div>
<div class='user_details'>
<div class='country'>Afgan</div>
<div class='age'>56</div>
</div>
<div class='user_name'>Smith George</div>
<div class='user_details'>
<div class='country'>India</div>
<div class='age'>23</div>
</div>
</div>
And the following Ruby code:
require 'nokogiri'
sample_html = File.open("r.htm", "r").read
n = Nokogiri::HTML::parse sample_html
xpaths = {}
xpaths[:name] = "//div[@class = 'user_name']/text()"
xpaths[:country] = "//div[@class = 'country']/text()"
xpaths[:age] = "//div[@class = 'age']/text()"
full_path = xpaths.values.join(" | ")
n.xpath(full_path).each do |i|
puts i
end
This works to extract the data, but how can I chunk (name, age, and country) together so that I can get the parsed data into a structure more easily?
Since name is outside the user_details block, I am unable to write a query like //div[@class = 'user_details'] and extract each attribute.
I know I can chunk the array into groups of 3, but I am looking for an XPath-based solution, because my actual need has a varying number of child properties.
Silly, but: is there any way to somehow inject characters into the extracted text during parsing?
Any ideas?
Let me start out by saying it would be better to adjust the HTML to wrap each user block in its own containing div:
<div class='user'>
<div class='name'>John</div>
<div class='details'>
<div class='country'>US</div>
...
</div>
</div>
Then you could simply query each user block separately using "//div[@class = 'user']". You are probably not in control of the HTML, though.
Given the current situation, I would propose simply obtaining the user_name divs as well as the user_details divs and zipping them together. Then you can create a Hash from the user details based on their child divs (.xpath("div")); this works for any number of detail fields, using each child's class attribute as the Hash key and its text as the value. Note this implementation only works on single-level user_details. Of course it would have to be adjusted if not all user_details child divs had a class attribute, but judging from your example input they do.
require 'pp'
require 'nokogiri'
sample_html = File.open("r.htm", "r").read
n = Nokogiri::HTML::parse sample_html
user_names = n.xpath("//div[@class = 'user_name']")
user_details = n.xpath("//div[@class = 'user_details']")
users = user_names.zip(user_details).map do |name, details|
  {
    name: name.text,
    details: Hash[details.xpath("div").map { |d| [d['class'].to_sym, d.text] }]
  }
end
pp users
# [{:name=>"Lankesh", :details=>{:country=>"Srilanka", :age=>"9"}},
# {:name=>"Bob", :details=>{:country=>"US", :age=>"54"}},
# {:name=>"Deiter", :details=>{:country=>"Germany", :age=>"34"}},
# {:name=>"Yakob", :details=>{:country=>"Syria", :age=>"90"}},
# {:name=>"Qureshi", :details=>{:country=>"Afgan", :age=>"56"}},
# {:name=>"Smith George", :details=>{:country=>"India", :age=>"23"}}]