Web data scraping: split HTML content

I'm scraping a website and I was able to reduce a variable called "gender" to this :
[<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>, <span style="text-decoration: none;">associé gérant </span>]
Now I'd like to keep only "associé" in the variable, but I can't find a way to split this HTML.
The reason is that I want to know whether it is "associé" (male) or "associée" (female).
Does anyone have any ideas?
Cheers
----- edit ----
Here is my code, which produces the HTML output above:
import requests
from bs4 import BeautifulSoup

url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.select_one("#adm").find_next("table")  # select_one returns only the first tag that matches a selector
table2 = soup.select_one("#adm").find_all_next("table")
output = table.select('td span[style^="text-decoration:"]', limit=2)  # .text.split(",", 1)[0].strip()
print(output)

Whatever the parent of the two elements is, you can use the CSS selector span:nth-of-type(2) to get the second span, then just check the text:
from bs4 import BeautifulSoup
html = """<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
<span style="text-decoration: none;">associé gérant </span>"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
text = soup.select_one("span:nth-of-type(2)").text
Or, if it is not always the second span, you can search for the span by the partial text associé:
import re
text = soup.find("span", string=re.compile(r"associé")).text  # first span whose text contains "associé"
For your edit, all you need is to take the text of the last element and split it on whitespace; the first word is the part that tells you the gender:
text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[0]  # "associé" (index 1 would give the rest, "gérant ")
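Once the role text has been extracted, deciding the gender is plain string handling. Here is a minimal sketch, assuming the first word of the second span is always the role and that only the two spellings from the question occur. The name `gender_from_role` and the `"male"`/`"female"` labels are hypothetical, not part of the original answer:

```python
def gender_from_role(role_text):
    """Return "male", "female", or None based on the first word of role_text."""
    words = role_text.split()  # split on any whitespace, ignoring surrounding blanks
    if not words:
        return None
    first = words[0].lower()
    # Assumption: only these two spellings matter (taken from the question).
    if first == "associée":
        return "female"
    if first == "associé":
        return "male"
    return None


print(gender_from_role("associé gérant "))
```

Comparing the whole first word, rather than testing whether the text contains "associé", avoids classifying "associée" as male by accident.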

Related

Getting specific span tag text in python (BeautifulSoup)

I'm scraping some information from MyAnimeList using BeautifulSoup with Python 3 and am trying to get a show's 'Status', but I am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
    if 'Status:' in span_html[i].getText():
        pass  # not sure where to go from here
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(text=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", text="Status:").find_next_sibling(text=True)
print(txt.strip())
Another solution (maybe):
f = soup.find_all('span', attrs={'class': 'dark_text'})
for i in f:
    if i.text == 'Status:':
        print(i.parent.text)  # prints the whole <div> text, label included
And change 'Status:' to whatever other thing you want to find.
Hope I helped!
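If several fields are needed, the same sibling lookup can be applied to every label at once. Here is a sketch, assuming each label is a `span.dark_text` directly followed by its value as a text node, as in the snippet from the question. `parse_info` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<div><span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit"><span class="dark_text">Episodes:</span>
1
</div>
<div><span class="dark_text">Status:</span>
Finished Airing
</div>
"""


def parse_info(soup):
    """Map each label (without the trailing colon) to the text that follows it."""
    info = {}
    for span in soup.find_all("span", class_="dark_text"):
        value = span.next_sibling
        # Keep only plain text nodes; skip labels followed by a tag or by nothing.
        if isinstance(value, NavigableString):
            info[span.get_text(strip=True).rstrip(":")] = value.strip()
    return info


info = parse_info(BeautifulSoup(html_doc, "html.parser"))
print(info["Status"])
```

On the real page some values may be wrapped in links or other tags; those would be skipped by this sketch and need separate handling.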

Cannot get a tag even though it appears in the HTML

I am working on a scraping job using BeautifulSoup and its find methods. I fetch the page and parse the HTML with the html5lib parser as follows:
result = requests.get('https://wuzzuf.net/jobs/p/xgUqkfYngXZL-Senior-Python-Developer-Remote---Part-Time-Cairo-Egypt?o=2&l=sp&t=sj&a=python|search-v3|hpb')
#print(result.status_code)
soup1 =BeautifulSoup(result.content , "html5lib")
sections = soup1.find( 'section' ,class_="css-3kx5e2")
divs = sections.find_all('div')
spans = sections.find_all('span')
span = divs[3].find('span' , class_ ='css-47jx3m')
divs[3]
I get the following:
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span></div>
However, the original HTML is:
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span>
<span class="css-47jx3m"><span class="css-8il94u">Confidential, Hourly Based</span>
</span>
</div>
I need to get the span with class "css-8il94u", which contains the text "Confidential, Hourly Based", but it does not appear in the parsed result.
Thanks

How to find a specific span within the results of find_all using BeautifulSoup

I am trying to first find all the spans, then keep those whose get_text() contains "Número de documento", and finally extract from those the ones that have the attribute nombcampo="it_ndoc_tom".
This is the structure of the spans that contain "Número de documento". Each is followed by a value span whose attributes differ, and I only want to extract the ones with "it_ndoc_tom":
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152723\"' nombcampo="it_ndoc_tom" paso="" vidc0='\"890102999\"'>890102999</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152725\"' nombcampo="ia_ndoc_ase" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1152726\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"52865608\"'>52865608</span>]
[<span class='\"rSpnTitulo\"'>Número de documento<div campo="" class='\"glyphicon' editable.\"="" es="" glyphicon-edit\"="" style='\"float:right;\"' title='\"Éste'></div></span>, <span cetxt\"="" class='\"rSpnValor' idc0='\"1120863\"' nombcampo="ib_ndoc_ben" paso="" vidc0='\"860002082\"'>860002082</span>]
Currently I can already access the spans that contain "Número de documento"; the problem is searching within those results for the ones that have the attribute nombcampo="it_ndoc_tom".
Here is part of the code:
dataArray = soup.find_all('div', {'class': '\\\"rDivDatosAseg'})
for data in dataArray:
    data_container = data
    ndoc_tom = data_container.find_all('span')
    if ndoc_tom[0].get_text() == "Número de documento":
        for span in ndoc_tom:
            filt_ndoc_tom = span.find_all('span', {'nombcampo': 'it_ndoc_tom'})
            print(filt_ndoc_tom)
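In the output shown above, the value spans are siblings of the title spans, not children of them, so calling find_all inside each span finds nothing. One possible way around this is to select the spans by attribute and then check the preceding title span. The sketch below assumes simplified, cleanly quoted markup; the structure is a reconstruction modelled on the question's output, not the real page:

```python
from bs4 import BeautifulSoup

# Assumption: simplified markup with clean quoting, modelled on the question's output.
html = """
<div class="rDivDatosAseg">
  <span class="rSpnTitulo">Número de documento</span>
  <span class="rSpnValor" nombcampo="it_ndoc_tom">890102999</span>
</div>
<div class="rDivDatosAseg">
  <span class="rSpnTitulo">Número de documento</span>
  <span class="rSpnValor" nombcampo="ia_ndoc_ase">52865608</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

values = []
# Select the value spans directly by attribute...
for span in soup.find_all("span", attrs={"nombcampo": "it_ndoc_tom"}):
    # ...then confirm that the span just before it is the expected title.
    title = span.find_previous_sibling("span")
    if title is not None and title.get_text(strip=True) == "Número de documento":
        values.append(span.get_text(strip=True))

print(values)
```

The escaped quotes in the real class names (for example `\"rDivDatosAseg`) suggest the HTML was embedded in a JSON or JavaScript string; if so, unescaping it before parsing would probably make the attribute lookups simpler, but that is a guess based on the output alone.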

Need to extract the contents of two html classes with same name

I want to extract the contents of the following html code
<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>
I need all the contents of both card__position divs. In the second one, there is text after the closing </span> tag, and I want to extract that text as well.
So far I am able to extract only the card__title and the contents of the first card__position div, with the following code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = '<url>'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "card__content"})
container = containers[0]
title = container.findAll("div", {"class": "card__title"})
#print(name[0].text)
position = container.findAll("div", {"class": "card__position"})
#print(position[0].text)
However, I want to print the results as in the following manner:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
The code takes the right approach. However, I'd recommend separating it into fetching the HTML and scraping the results. The scraping code should return its results in some sort of response object. My example below uses a namedtuple (for ease of use).
Use find_all() when you want more than one HTML element; otherwise use find() to get the first match. Also, findAll() is the legacy camelCase alias of find_all(); prefer find_all().
This question only provides a snippet of the HTML, which makes it difficult to generalize. I assume the HTML always contains two divs with class card__position, and I use assert to raise an exception if that is not true.
Anyway below is an example, which achieves the desired output:
from collections import namedtuple
from bs4 import BeautifulSoup
ExtractResponse = namedtuple('ExtractResponse', ['Name', 'Role', 'Course', 'MiscInfoTitle', 'MiscInfoDetails'])  # Stores the data from the extract; the field names are a best guess
html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''
def extract(soup):
    card_container_soup = soup.find("div", {"class": "card__content"})  # use find() if you know there is one instance (e.g. <div class="card__content">)
    title_soup = card_container_soup.find("div", {"class": "card__title"})
    card_positions_soup = card_container_soup.find_all("div", {"class": "card__position"})
    assert len(card_positions_soup) == 2  # Validate that there are exactly 2 card__position divs. If there are not, the scrape results will be wrong.
    # This code assumes the first card__position contains course data and the second contains interests
    card_text_1_soup = card_positions_soup[0].find('span', {'class': 'card__text'})
    card_text_1 = card_text_1_soup.text.strip()
    card_text_1_soup.extract()  # remove the span so that the remaining non-span text can be scraped
    card_text_2_soup = card_positions_soup[1].find('span', {'class': 'card__text'})
    card_text_2 = card_text_2_soup.text.strip()
    card_text_2_soup.extract()
    response = ExtractResponse(title_soup.text.strip(), card_positions_soup[0].text.strip().replace(',', ''), card_text_1, card_text_2, card_positions_soup[1].text.strip())
    return response
soup = BeautifulSoup(html, "html.parser")
results = extract(soup)
print(results.Name)
print(results.Role)
print(results.Course)
print(f"{results.MiscInfoTitle} {results.MiscInfoDetails}")
Output:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
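When the exact split between role, course, and interests does not matter, a shorter variant is possible. Here is a sketch, assuming one joined string per card__position div is acceptable; it uses `stripped_strings` instead of removing the spans with `extract()`, and trims the sample HTML to the two relevant divs:

```python
from bs4 import BeautifulSoup

html = """<div class="card__content">
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# One string per card__position: every text fragment, stripped and space-joined.
positions = [
    " ".join(div.stripped_strings)
    for div in soup.find_all("div", class_="card__position")
]
for line in positions:
    print(line)
```

Unlike the namedtuple version, this keeps "Student," and the course on one line, so it only fits when the fields do not need to be stored separately.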

Python web scraping using BeautifulSoup, how to merge two <p> text into one element of list

I use BeautifulSoup to do the web scraping, then put the result into a list.
The HTML looks like this:
<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>
my code is:
title = []
text = []
for newpage in list:
    webpage = urlopen(newpage).read()
    soup = BeautifulSoup(webpage, 'html.parser')
    header = soup.find_all("span", attrs={"id": "titletextonly"})
    info = soup.find_all("p", attrs={"class": "attrgroup"})
    for h in header:
        title.append(h.get_text())
    for m in info:
        text.append(m.get_text())
the text list result is:
["2013 Volkswagen Passat","condition:excellent"]
But I want the result to look like this:
["2013 Volkswagen Passat condition:excellent"]
How do I merge the two texts when putting them into a list? Please help!
Use the str.join() method on the list of strings.
title = []
for h in header:
    title.append(h.get_text())
title = ' '.join(title)  # join() takes the list itself, not a list wrapped in another list
Alternatively, append the elements themselves to the list instead of their text, and use a list comprehension to join the texts.
title = []
for h in header:
    title.append(h)
title = ' '.join([i.text for i in title])
Hope this helps! Cheers!
You can use stripped_strings:
from bs4 import BeautifulSoup
html = """<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>"""
tag = BeautifulSoup(html, 'html.parser')
data = ' '.join(tag.stripped_strings)
print(data)
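To end up with a one-element list as asked in the question, the collected strings can be joined and then wrapped in a list. Here is a sketch using only the standard library; `merge_texts` is a hypothetical helper, and collapsing runs of whitespace is an assumption about the desired output:

```python
def merge_texts(parts):
    """Join text fragments into one string, collapsing whitespace inside each."""
    cleaned = (" ".join(part.split()) for part in parts)
    return " ".join(piece for piece in cleaned if piece)


# Sample fragments as get_text() might return them, with stray newlines.
text = ["2013 Volkswagen Passat", "\ncondition: excellent\n"]
merged = [merge_texts(text)]
print(merged)
```

When looping over several pages, appending `merge_texts(...)` once per page would give one merged entry per listing.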