selenium, xpath: How to select a node within a node?

I have a webpage with a structure like this:
<div class="l_post j_l_post l_post_bright "...>
...
<div class="j_lzl_c_b_a core_reply_content">
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
<div class="lzl_cnt">
content
</div>
</li>
<li class="lzl_single_post j_lzl_s_p first_no_border" ...>
...
</li>
</div>
</div>
<div class="l_post j_l_post l_post_bright "...>
...(contain content, same as above)
</div>
...
Currently I can select all of the content in one step like this:
for i in driver.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
    print(i.text)
But as you can see, the webpage consists of repetitive blocks that contain the content I need. I want to extract that content separately, along with other information that differs between the repetitive blocks (<div class="l_post j_l_post l_post_bright "...>...</div>). I also want the content within each <li class="lzl_single_post"...> to be kept separate so it is easier to process later. I tried this:
items = []
# get each block
for sel in driver.find_elements_by_xpath('//div[@class="l_post j_l_post l_post_bright "]'):
    name = sel.find_element_by_css_selector('.d_name').text
    try: content = sel.find_element_by_css_selector('.j_d_post_content').text
    except: content = ''
    try:
        reply = []
        # get each post within a specific block
        for i in sel.find_elements_by_xpath('//*[@class="lzl_cnt"]'):
            reply.append(i.text)
    except: reply = []
    items.append({'name': name, 'content': content, 'reply': reply})
But the result shows that I get every reply on the webpage on each iteration of the outer for-loop, instead of only the replies belonging to that particular block.
Any suggestions?

Just add . (the context pointer) to the XPath:
sel.find_elements_by_xpath('.//*[#class="lzl_cnt"]')
Note that //*[@class="lzl_cnt"] selects all nodes in the DOM with the "lzl_cnt" class, while .//*[@class="lzl_cnt"] selects only the nodes with that class that are descendants of sel.
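Applied to the loop from the question, the only change needed is the leading dot on the inner XPath. A minimal sketch using the same selectors as above (this keeps the older find_elements_by_* API from the question; newer Selenium releases replace it with find_elements(By.XPATH, ...)):
items = []
for sel in driver.find_elements_by_xpath('//div[@class="l_post j_l_post l_post_bright "]'):
    name = sel.find_element_by_css_selector('.d_name').text
    try:
        content = sel.find_element_by_css_selector('.j_d_post_content').text
    except:
        content = ''
    # the leading "." scopes the search to the current block only
    reply = [i.text for i in sel.find_elements_by_xpath('.//*[@class="lzl_cnt"]')]
    items.append({'name': name, 'content': content, 'reply': reply})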

Related

How to get a div or span class from a related span class?

I've found the lowest element, <span class="pill css-1a10nyx e1pqc3131">, for multiple elements of a website, but now I want to find the related/linked elements above it, for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT - Search all elements of a page with class <span class="pill css-1a10nyx e1pqc3131"> (lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something like "if class == (searchable class), then get (desired higher class)" should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture for clarification, where the title (highest class) = "Wooferland Festival 2022" and the number (lowest class) = 253
As mentioned, the question needs some more information to give a concrete answer.
Assuming you would like to scrape the information in the picture, based on your example HTML you can select the pill and use .find_previous() to locate the related elements:
for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)
Assuming there is a container tag in the HTML structure, like an <a> or similar, you could select it based on the condition that it contains a <span> with class pill:
for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Note: Instead of using CSS classes, which can be highly dynamic, try to use more static attributes or the HTML structure.
Example
See both options; for the first one the <a> does not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
</footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html)

for e in soup.select('span.pill'):
    print(e.find_previous('header').text)
    print(e.find_previous('div').text)
    print(e.text)

print('---------')

for e in soup.select('a:has(span.pill)'):
    print(e.header.text)
    print(e.header.next.text)
    print(e.footer.span.text)
Output
some date information
some title
some number
---------
some date information
some date information
some number

Cannot get tag even though it appears in the HTML

I am trying a scraping job using BeautifulSoup and its find methods. I fetch the HTML and parse it as follows:
import requests
from bs4 import BeautifulSoup

result = requests.get('https://wuzzuf.net/jobs/p/xgUqkfYngXZL-Senior-Python-Developer-Remote---Part-Time-Cairo-Egypt?o=2&l=sp&t=sj&a=python|search-v3|hpb')
#print(result.status_code)
soup1 = BeautifulSoup(result.content, "html5lib")
sections = soup1.find('section', class_="css-3kx5e2")
divs = sections.find_all('div')
spans = sections.find_all('span')
span = divs[3].find('span', class_='css-47jx3m')
divs[3]
I get the following
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span></div>
however, the original HTML is
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span>
<span class="css-47jx3m"><span class="css-8il94u">Confidential, Hourly Based</span>
</span>
</div>
I need to get the span with class="css-8il94u", which has the text 'Confidential, Hourly Based', but it does not appear.
thanks
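A quick way to narrow this down is to check whether that span is present anywhere in the fetched document at all. If it is not, the value is most likely rendered client-side (or only returned for some requests), so requests alone will never see it. A minimal check, reusing soup1 and result from the code above:
# is the span anywhere in the parsed tree, not just under divs[3]?
print(soup1.find('span', class_='css-8il94u'))
# is the class or the text present in the raw response body at all?
print('css-8il94u' in result.text)
print('Confidential' in result.text)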

How to select a specific div using BeautifulSoup when multiple divs have the same class name and no id tag?

Please help, I don't know how to select a specific div using BeautifulSoup when multiple divs have the same class name and no id tag.
Web page that I am trying to scrape: https://www.helpmefind.com/rose/l.php?l=2.65689.
I want to select the contents of specific divs independently and then pass them to a CSV file. I got stuck since find_all returns multiple divs and I don't know how to restrict it further.
rose_div = rose.find_all("div", class_="hdg")
Returns:
[<div class="hdg">HMF Ratings:</div>, <div class="hdg">Origin:</div>, <div class="hdg">Class:</div>, <div class="hdg">Bloom:</div>, <div class="hdg">Parentage:</div>, <div class="hdg">Notes:</div>, <div class="hdg"> </div>]
I want to select individually below divs:
<div class="hdg">Origin:</div>
<div class="hdg">Class:</div>
<div class="hdg">Bloom:</div>
<div class="hdg">Parentage:</div>
You can use the CSS selector div.hdg:contains("Origin:") to select a <div> with class="hdg" that contains the word "Origin:". To get the next element with class grp, you can add + .grp.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.helpmefind.com/rose/l.php?l=2.65689'
soup = BeautifulSoup( requests.get(url).content, 'html.parser' )
origin = soup.select_one('div.hdg:contains("Origin:") + .grp').text
class_ = soup.select_one('div.hdg:contains("Class:") + .grp').text
bloom = soup.select_one('div.hdg:contains("Bloom:") + .grp').text
parentage = soup.select_one('div.hdg:contains("Parentage:") + .grp').text
print(origin)
print(class_)
print(bloom)
print(parentage)
Prints:
Bred by Arai (Japan, before 2009).
Floribunda.  
Light pink and white, yellow stamens.  Single (4-8 petals), cluster-flowered bloom form.  Blooms in flushes throughout the season.  
If you know the parentage of this rose, or other details, please contact us.
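Since the question mentions passing the values to a CSV file, here is a minimal sketch of that last step with the csv module, reusing the variables from the snippet above (the file name rose.csv is just an example):
import csv

with open('rose.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Origin', 'Class', 'Bloom', 'Parentage'])
    writer.writerow([origin.strip(), class_.strip(), bloom.strip(), parentage.strip()])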

BeautifulSoup - Trying to get text inside span tags

I want to pull the text inside the span tags, but when I try to use .text or get_text() I get errors (either after printing the spans or in the for loop). What am I missing? For now I have it set to do this only for the first div of class col, just to test if it is working, but I will want it to work for the 2nd as well.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
    spans = x.find('span')
    print(spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    spans = div.find_all('span')
    for span in spans:
        print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times, as you said happened, you need to change the logic a little: first parse and add all the matches you find to a list, and THEN loop through the list of matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
    <div class=col>
        <span/>
    </div>
</div>
By using a list instead of printing right away, you can skip any matches that are identical to records you already have.
In the above HTML example you can see how the span could be added twice with the way matches are found in the answer suggested by Kyle. It's all about creating logic that only finds the matches you need. How you do that is often dependent on how the HTML is formatted, but it's also important to be creative!
Good luck.
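A minimal sketch of that idea, collecting the span texts into a list first and only keeping values that have not been seen yet (assuming the same div.col markup and the html variable from the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
matches = []
for div in soup.find_all('div', {'class': 'col'}):
    for span in div.find_all('span'):
        if span.text not in matches:  # skip duplicates, e.g. from nested div.col blocks
            matches.append(span.text)

# only print once everything has been collected
for text in matches:
    print(text)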

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{\"r12321\"=>\"Plains\", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", \"lknoijojkljl98799999\"=>\"Love\", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
But I'm ending up with a mess like this:
{nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"r12321\"=>\"Plains\", nil=>\" \", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"lknoijojkljl98799999\"=>\"Love\", nil=>\" \", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes, even when they contain nothing but whitespace.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children