I am new to HTML scraping and R, so this is a tricky problem for me. I have an HTML structure specified like below (only the body part). It has two separate sections, each with x number of paragraphs. What I want is to pick out all paragraphs in section 1 into one object, and all paragraphs in section 2 into another object.
My current code looks like this:
docx <- read_html("Page.html")
sections = html_nodes(docx, xpath="//div[@class='sections']/*")
This gives me an xml_nodes object, List of 2, that has the paragraphs within. My problem then is that I cannot use xpathApply on a nodeset because it throws an error. But I want to pick out all the paragraphs like this:
subsparagraphs1 = html_nodes(sections[[1]], xpath="//p")
but it then picks out all paragraphs from the WHOLE html page, not the first section.
I tried to be more specific:
subsections = html_nodes(sections[[1]], xpath="./div/div/p")
then it picks out nothing, or this:
subsections = html_nodes(sections[[1]], xpath="/p[@class = 'pwrapper']")
which also results in nothing. Can anyone help me get around this problem?
best, Mia
This is the HTML structure I have, where I want Text 1, Text 2 and Text 3 saved in one object and Text 4, 5 and 6 saved in another object.
<div class = "content">
<div class = "title"> ... </div>
<div class = "sections">
<div> ... </div>
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 1 </p>
<p class = "pwrapper"> Text 2 </p>
<p class = "pwrapper"> Text 3 </p>
</div>
<div> ... </div>
<div> ... </div>
<div> ... </div>
<div class = "sectionHeader">
<div>
<p class = "pwrapper"> Text 4 </p>
<p class = "pwrapper"> Text 5 </p>
<p class = "pwrapper"> Text 6 </p>
</div>
<div> ... </div>
<div> ... </div>
</div>
</div>
Even if your input XML contains syntax errors, I will presume that the sectionHeader elements are siblings, i.e. they are on the same level under the same parent (sections).
In that case, your XPaths will be:
//div[@class = 'sections']//div[@class='sectionHeader'][1]//p[@class = 'pwrapper']/text()
//div[@class = 'sections']//div[@class='sectionHeader'][2]//p[@class = 'pwrapper']/text()
All that varies is the index into the //div[@class='sectionHeader'] sequence (1 and 2 – XPath indexing starts at 1, not 0).
Please let me know if the structure of the input XML is different than what I observed/assumed.
P.S.: You may simplify the XPaths by removing the first path portion: //div[@class = 'sections'].
I've found the lowest class, <span class="pill css-1a10nyx e1pqc3131">, on multiple elements of a website, but now I want to find the related/linked upper class, for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class. Any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT - Search all elements of a page with class <span class="pill css-1a10nyx e1pqc3131"> (lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements, but that doesn't work consistently. Something like "if class = (searchable class), then get (desired higher class)" should work.
I can add more details if needed; please let me know. Thanks in advance!
EDIT: picture for clarification, where the title (highest class) = "Wooferland Festival 2022" and the number (lowest class) = 253.
As mentioned, the question needs some more information to give a concrete answer.
Assuming you would like to scrape the information in the picture, based on your example HTML you select your pill and use .find_previous() to locate your elements:
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
Assuming there is a container tag in the HTML structure, like <a> or some other element, you would select it based on the condition that it contains a <span> with class pill:
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Note: Instead of using CSS classes, which can be highly dynamic, try to use more static attributes or the HTML structure.
Example
See both options; for the first one, the <a> does not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
print('---------')
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Output
some date information
some title
some number
---------
some date information
some date information
some number
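Picking up the note above about dynamic class names: a structure-based selection against the same example HTML could avoid the hashed classes entirely. A sketch, assuming the title div always sits directly after the header:
# walk by structure instead of the generated class names
for header in soup.select('header'):
    title = header.find_next_sibling('div')  # the div following the header holds the title
    number = header.parent.select_one('footer span')
    print(header.text, title.text, number.text)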
#Example 1
<span class="levelone">
<span class="leveltwo" dir="auto">
::before
"Blue"
::after
</span>
</span>
#Example 2
<div class="itemlist">
<div dir="auto" style="text-align: start;">
"mobile"
</div>
</div>
#Example 3
<div class="quantity">
<div class="color">...</div>
<span class="num">10</span>
</div>
Hi, I am trying to use Selenium to extract content from HTML. I managed to extract the content for examples 1 & 2; the code that I have used is
example1 = driver.find_elements_by_css_selector("span[class='leveltwo']")
example2 = driver.find_elements_by_css_selector("div[class='itemlist']")
and printed out as text with
data = [dt.text for dt in example1]
print(data)
I got "Blue" for example 1 & "mobile" for example 2. For simplicity purposes, the html given above is for one iteration, I have scraped all elements with the class mentioned above
However, for the 3rd example, I tried to use
example3a = driver.find_elements_by_css_selector("div[class='quantity']")
and
example3b = driver.find_elements_by_css_selector("div[class='num']")
and
example3c = driver.find_element_by_class_name("num")
but all of them returned an empty list. I'm not sure: is it because there is no dir in example 3? What method should I use to extract the "10"?
For the 3rd example, you can try the below CSS:
div.quantity span.num
In code you can write it like this:
example3a = driver.find_element_by_css_selector("div.quantity span.num")
print(example3a.text)
or
print(example3a.get_attribute('innerHTML'))
To extract specifically the 10 you can use
example3a = driver.find_element_by_css_selector("div.quantity span.num")
To extract both elements inside <div class="quantity"> you can use
example3 = driver.find_elements_by_xpath("//div[@class='quantity']//*")
for el in example3:
    print(el.text)
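Note: the find_element(s)_by_* helpers used in this thread are deprecated and were later removed in Selenium 4, so on a current Selenium the equivalents would be roughly this (a sketch, assuming the same driver and page):
from selenium.webdriver.common.by import By
# Selenium 4 style: a By strategy plus the selector string
num = driver.find_element(By.CSS_SELECTOR, "div.quantity span.num")
print(num.text)
for el in driver.find_elements(By.XPATH, "//div[@class='quantity']//*"):
    print(el.text)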
I'm trying to obtain the nested unnamed div inside:
div class="pod three columns closable content-sizes clearfix">
The nested unnamed div is also the first div inside the div above (see image)
I have tried the following:
for div in soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'}):
    print(div.text)
The length of
soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'})
is just one despite this div having many nested divs. So, the for-loop runs only once and prints everything.
I need all the text inside only the first nested div (see image):
Project...
Reference Number...
Other text
Try:
from bs4 import BeautifulSoup
html_doc = """
<div class="pod three columns closable content-sizes clearfix">
<div>
<b>Project: ...</b>
<br>
<br>
<b>Reference Number: ...</b>
<br>
<br>
<b>Other text ...</b>
<br>
<br>
</div>
<div>
Other text I don't want
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(
    soup.select_one("div.pod.three.columns > div").get_text(
        strip=True, separator="\n"
    )
)
Prints:
Project: ...
Reference Number: ...
Other text ...
Or without CSS selector:
print(
    soup.find(
        "div",
        attrs={"class": "pod three columns closable content-sizes clearfix"},
    )
    .find("div")
    .get_text(strip=True, separator="\n")
)
Try this:
result = soup.find('div', class_ = "pod three columns closable content-sizes clearfix").find("div")
print(result.text)
Output:
Project: .............
Reference Number: ....
Other text ...........
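A small caveat for both answers: find() and select_one() return None when nothing matches, so the chained .find("div") / .get_text() calls would raise AttributeError on a page where that div is missing. A guarded sketch:
first = soup.select_one("div.pod.three.columns > div")
if first is not None:
    print(first.get_text(strip=True, separator="\n"))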
Trying to extract Message text from:
<div class="Item ItemDiscussion Role_Member" id="Discussion_2318">
<div class="Discussion">
<div class="Item-BodyWrap">
<div class="Item-Body">
<div class="Message">
Hello<br/>I have a very interesting observation on nature of birds in Alaska ... <br/>
Was there 10/19/18 has anyone heard of this </div>
<div class="ReactionRecord"></div><div class="Reactions"></div> </div>
</div>
</div>
</div>
I have got this bit with:
tag = soup.find('div', {'class' : 'ItemDiscussion'})
Next I am trying to go down with:
s = str((tag.contents)[1])
sp = BeautifulSoup(s)
sp.contents
But this does not help much. How do I get the message text from <div class="Message">?
You can find the element from the soup directly:
discussion_div = soup.find("div", {"class": "ItemDiscussion"})
message_text = discussion_div.find("div", {"class": "Message"}).text
You can select any element using the select_one() function by passing the CSS selector for the element. select_one() will only return one element; if you want more than one element, you can use select(), which will return a list of the found elements. Here is an example for you:
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("div.Item div.Discussion div.Item-BodyWrap div.Item-Body div.Message").text)
You can also select your element using a single class, if it is unique:
print(soup.select_one("div.Message").text)
I want to pull the text inside the span tags, but when I try to use .text or get_text() I get errors (either after print(spans) or in the for loop). What am I missing? I have it set just now to do this only for the first div of class col, just to test if it is working, but I will want it to work for the 2nd as well.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
    spans = x.find('span')
    print(spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    spans = div.find_all('span')
    for span in spans:
        print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times, as you said happened, you need to change the logic a little bit: first parse and add all matches you find to a list, and THEN loop through the list of matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
    <div class=col>
        <span/>
    </div>
</div>
In the above HTML example you can see how the span could be found twice with the matching logic in the answer suggested by Kyle. By using a list instead of printing right away, you can skip any matches that are identical to already-collected records. It's all about making sure you create logic that will only find the matches you need. How you do it is often dependent on how the HTML is formatted, but it's also important to be creative!
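For illustration, a minimal sketch of that collect-then-print approach against a nested col case (the html string here is made up):
from bs4 import BeautifulSoup
html = '''
<div class=col>
    <div class=col>
        <span>only once</span>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# collect all matches first ...
matches = []
for div in soup.find_all('div', {'class': 'col'}):
    for span in div.find_all('span'):
        if span not in matches:  # Tag equality also skips identical duplicates
            matches.append(span)
# ... then loop over the list and print
for span in matches:
    print(span.text)  # the nested span prints once, not twice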
Good luck.