Finding product name in a html text - html

I am trying to scrape a website: www.gall.nl in order to create a database of all wines that are sold on this platform. I have the following code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
mydivs = soup.find_all("div", {"class": "c-product-tile"})
print(len(mydivs))
first_wijn = mydivs[0]
print(first_wijn)
result = first_wijn.find()
So, this provides 12 results, which is correct.
Printing the first result provides the following:
<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">
<meta content="143561" itemprop="sku">
<meta content="8410441412065" itemprop="gtin8">
<meta content="Faustino" itemprop="brand">
<div class="product-tile__header">
<div class="product-tile__category-label">
<div class="m-product-taste-tooltip">
<span aria-label="Classic Red" class="a-tooltip-trigger" data-content="Stevig & Ferm" data-placement="bottom-start" js-hook-tooltip="">
<div class="tooltip-trigger__icon product-taste-tooltip__icon u-taste-profile-icon classic-red-red
....
<input class="add-to-cart-url" type="hidden" value="/on/demandware.store/Sites-gall-nl-Site/nl_NL/Cart-AddProduct"/>
</div>
</meta></meta></meta></div>
And I'm interested in getting the data from the first line:
<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">
In order to get the name, price and brand.
Can somebody help me with retrieving these data?

Use BeautifulSoup's .attrs.get() to read the data-product attribute from each div.
Then parse that string as JSON to read the values you need.
import json
import requests
from bs4 import BeautifulSoup
URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Get all products
mydivs = soup.find_all("div", {"class": "c-product-tile"})
# Loop through each product
for div in mydivs:
    # Get the raw data-product attribute (a JSON string)
    product = div.attrs.get("data-product", None)
    # Parse the string as JSON. The encode/decode round-trip drops every
    # non-ASCII character, so accented letters vanish from the output.
    jsonProduct = json.loads(product.encode('utf-8').decode('ascii', 'ignore'))
    # Show name - brand - price
    print('{0:<40} {1:<20} {2:>10}'.format(
        jsonProduct['name'],
        jsonProduct['brand'],
        jsonProduct['price']
    ))
Using the format() to create 3 columns, the above code produces the following output:
Faustino V Rioja Reserva                 Faustino                  13.99
Mucho Ms Tinto                           Mucho Mas                  5.99
Cantina di Verona Valpolicella Ripasso   Terre Di Verona           11.99
Villa Jeantel                            Villa Jeantel              8.99
Ondarre Rioja Reserva                    Ondarre                   13.59
Valdivieso Chardonnay                    Valdivieso                 5.99
Domaine Lamourie Ros                     Domaine Lamourie           7.99
Oveja Negra Chardonnay Viognier          Oveja Negra                6.59
La Palma Merlot                          La Palma                   6.59
Alamos Chardonnay                        Alamos                     8.99
Les Hautes Pentes ros                    Les Hautes Pentes          7.99
Piccini Memoro Rosso                     Piccini                    7.29
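The ASCII round-trip in the answer silently drops accented characters, which is why some names in the output look truncated. Here is a minimal sketch of the same idea that keeps them, assuming the attribute holds valid JSON. The sample HTML and its product values are made up for illustration and do not come from the live site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample tiles; the product values are invented for illustration.
sample_html = """
<div class="c-product-tile"
     data-product='{"name":"Château Exemple Rosé","brand":"Exemple","price":7.99}'>
</div>
<div class="c-product-tile"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

products = []
for tile in soup.find_all("div", class_="c-product-tile"):
    raw = tile.get("data-product")  # None when the attribute is missing
    if raw is None:
        continue  # skip tiles without product data instead of crashing
    products.append(json.loads(raw))  # str input, so accents are preserved

for product in products:
    print(f"{product['name']:<40} {product['brand']:<20} {product['price']:>10}")
```

Since json.loads() accepts a str directly, no encode/decode step is needed here, and the None check avoids an AttributeError on tiles that lack the attribute.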

Related

Need to extract the contents of two html classes with same name

I want to extract the contents of the following html code
<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>
I need the full contents of both card__position divs. In the second card__position div, there is a bare text node after the closing span tag, and I want to extract that text as well.
I am able to extract only the card__title and the contents of the first card__position div with the following code.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = '<url>'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "card__content"})
container = containers[0]
title = container.findAll("div", {"class": "card__title"})
#print(name[0].text)
position = container.findAll("div", {"class": "card__position"})
#print(position[0].text)
However, I want to print the results as in the following manner:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
Your code has the right approach. However, I'd recommend separating the code that fetches the HTML from the code that scrapes the results. The scraping code should return its results in some kind of response object; my example below uses a namedtuple for ease of use.
Use find_all() when you want to find more than one HTML element; otherwise use find() to get the first match. Also, findAll() is the legacy spelling of find_all(); it still works, but find_all() is preferred.
This question only provides a snippet of the HTML, which makes it hard to generalize. I assume the HTML always contains exactly 2 divs with class card__position, and I use assert to raise an exception if that is not true.
Below is an example that achieves the desired output:
from collections import namedtuple
from bs4 import BeautifulSoup
# Stores the data from the extract. The field names are a best guess.
ExtractResponse = namedtuple('ExtractResponse', ['Name', 'Role', 'Course', 'MiscInfoTitle', 'MiscInfoDetails'])
html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''
def extract(soup):
    # Use find() if you know there is 1 instance (e.g. <div class="card__content">)
    card_container_soup = soup.find("div", {"class": "card__content"})
    title_soup = card_container_soup.find("div", {"class": "card__title"})
    card_positions_soup = card_container_soup.find_all("div", {"class": "card__position"})
    # Validate that there are always 2 card__position divs. If there are not, the scrape results will be wrong.
    assert len(card_positions_soup) == 2
    # This code assumes the first card__position contains course data and the second contains interests
    card_text_1_soup = card_positions_soup[0].find('span', {'class': 'card__text'})
    card_text_1 = card_text_1_soup.text.strip()
    card_text_1_soup.extract()  # Remove the span so that the non-span text can be scraped.
    card_text_2_soup = card_positions_soup[1].find('span', {'class': 'card__text'})
    card_text_2 = card_text_2_soup.text.strip()
    card_text_2_soup.extract()
    response = ExtractResponse(
        title_soup.text.strip(),
        card_positions_soup[0].text.strip().replace(',', ''),
        card_text_1,
        card_text_2,
        card_positions_soup[1].text.strip(),
    )
    return response
soup = BeautifulSoup(html, "html.parser")
results = extract(soup)
print(results.Name)
print(results.Role)
print(results.Course)
print(f"{results.MiscInfoTitle} {results.MiscInfoDetails}")
Output:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
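If the number of card__position divs is not guaranteed, a looser variant is possible. The sketch below keeps parsing separate from fetching, as recommended above, and joins the stripped strings of each div instead of asserting an exact count. The function name parse_card is hypothetical, the sample is a trimmed copy of the question's HTML, and this version keeps "Student," and the course on one line rather than splitting them.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (the svg icon is omitted).
sample_html = """<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>"""


def parse_card(card_html):
    """Return the title followed by one line per card__position div."""
    soup = BeautifulSoup(card_html, "html.parser")
    card = soup.find("div", class_="card__content")
    if card is None:
        return []
    lines = []
    title = card.find("div", class_="card__title")
    if title is not None:
        lines.append(" ".join(title.stripped_strings))
    # Works for any number of card__position divs, including zero.
    for position in card.find_all("div", class_="card__position"):
        lines.append(" ".join(position.stripped_strings))
    return lines


card_lines = parse_card(sample_html)
for line in card_lines:
    print(line)
```

Because the tree is only read and never modified, the same soup can be reused afterwards, unlike the extract() approach, which removes the spans.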

Scraping from specific div without attribute where tags can have the same name

I am trying to scrape a table from wowhead. The issue is that the span classes are the same for 2 different types of data (Sells for: and Buy for:).
The divs that contain the spans have no class, just the strings I wrote in the brackets.
I've tried
import requests
from bs4 import BeautifulSoup
import urllib.request
import re
import lxml
session = requests.session()
url1 = 'https://classicdb.ch/?item=4291'
response = session.get(url1)
soup = BeautifulSoup(response.text, 'lxml')
x=(soup.find('table', attrs={'class': "infobox"}))
y=x.find('td')
y=y.find('ul')
sell_silver = soup.select_one('div:contains("Sells for:") .moneysilver').text
buy_silver = y.select_one('div:contains("Buy for:") .moneysilver').text
print(sell_silver)
print(buy_silver)
but then I only get the first span.
The relevant HTML after i get the table looks like this
<div>
Buy for:
<span class="moneysilver">5</span>
</div>
</li>
<li>
<div>
Sells for:
<span class="moneysilver">1</span> <span class="moneycopper">25</span>
</div>
....
The end result should allow me to sort the data into
Buy_silver=5
Sell_silver=1
Edited to clarify the question; shoutout to @QHarr.
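With BeautifulSoup 4.7.1+ you can use the :contains pseudo-class to target the li that holds "Buy for" or "Sells for" (newer Soup Sieve releases prefer the spelling :-soup-contains):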
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://classicdb.ch/?item=4291')
soup = bs(r.content, 'lxml')
buy_silver = soup.select_one('li:contains("Buy for") .moneysilver').text
sell_silver = soup.select_one('li:contains("Sells for") .moneysilver').text
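One likely reason the div-based selector in the question returned the first span for both labels is that an enclosing element containing both labels also matches the text test. The sketch below illustrates this on inline HTML rather than the live page. It assumes a Soup Sieve version that supports :-soup-contains (2.1 or later), and the outer wrapper div is an assumption about the page structure.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the question's HTML; the outer <div> wrapper is assumed.
sample_html = """
<div>
  <ul>
    <li><div>Buy for: <span class="moneysilver">5</span></div></li>
    <li><div>Sells for: <span class="moneysilver">1</span> <span class="moneycopper">25</span></div></li>
  </ul>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# The outer <div> also contains the text "Sells for", so the first
# .moneysilver span in document order (the buy price) is returned.
too_broad = soup.select_one('div:-soup-contains("Sells for") .moneysilver').text

# Anchoring on <li> restricts the text test to a single list item.
buy_silver = soup.select_one('li:-soup-contains("Buy for") .moneysilver').text
sell_silver = soup.select_one('li:-soup-contains("Sells for") .moneysilver').text

print(too_broad, buy_silver, sell_silver)
```

Choosing the narrowest element that wraps exactly one label and its spans is what makes the text filter reliable here.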
Assuming that the buy and sell quotes are always in the same order and distance from each other, you can try to use this:
metal = """
<li>
<div>
Buy for:
<span class="moneysilver">5</span>
</div>
</li>
<li>
<div>
Sells for:
<span class="moneysilver">1</span> <span class="moneycopper">25</span>
</div>
</li>
"""
from bs4 import BeautifulSoup as bs
soup = bs(metal, 'lxml')
silver=soup.find_all('div')
print("buy silver =",silver[0].find("span", class_="moneysilver").text)
print("sell silver =",silver[1].find("span", class_="moneysilver").text)
Output:
buy silver = 5
sell silver = 1
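If the order of the two entries cannot be relied on, another option is to key each silver value by the label text that precedes it. This is a sketch on inline sample HTML, assuming every moneysilver span sits in a div whose first string is the label; the name silver_by_label is hypothetical.

```python
from bs4 import BeautifulSoup

# Inline sample modelled on the question's HTML.
sample_html = """
<li><div>Buy for: <span class="moneysilver">5</span></div></li>
<li><div>Sells for: <span class="moneysilver">1</span> <span class="moneycopper">25</span></div></li>
"""
soup = BeautifulSoup(sample_html, "html.parser")

silver_by_label = {}
for span in soup.find_all("span", class_="moneysilver"):
    # The label is the first non-empty string of the enclosing <div>.
    label = next(span.parent.stripped_strings).rstrip(":")
    silver_by_label[label] = int(span.text)

print(silver_by_label)
```

Looking values up by label, for example silver_by_label["Sells for"], no longer depends on position in the page.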

How do i get the text inside a class while ignoring the text of the next class that is inside

I'm trying to get the text inside the div with class="hardfact", but I am also getting the text of class="hardfactlabel color_f_03" because that div is nested inside hardfact.
.text.strip() returns the text of both classes because they are nested.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import requests
import lxml
my_url = 'https://www.immowelt.de/expose/2QC5D4A?npv=52'
page = requests.get(my_url)
ct = soup(page.text, 'lxml')
specs = ct.find('div', class_="hardfacts clear").findAll('div', class_="hardfact")
for items in specs:
    e = items.text.strip()
    print(e)
I'm getting this
82.500 € 
Kaufpreis
47 m²
Wohnfläche (ca.)
1
Zimmer
and i want this
82.500 €
47 m²
1
Here is the html content you are trying to crawl:
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
What you want to achieve is to remove the div tags within, so you can just decompose the div:
for items in specs:
    items.div.decompose()
    e = items.text.strip()
    print(e)
If every "hardfact" div wrapped its value in a tag such as "strong", you could simply take the first child element like so
e = items.find().text.strip()
but only the first one does; in the others, find() would return the label div instead, so you have to decompose the div tag.
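The decompose approach can be sketched on the HTML from the question without fetching the page. This version looks up the label div explicitly and skips the removal when it is missing, which is an added safeguard and not part of the answer above. Note that decompose() modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Inline copy of the HTML shown above, lightly condensed.
sample_html = """
<div class="hardfacts clear">
  <div class="hardfact "><strong>82.500 € </strong>
    <div class="hardfactlabel color_f_03">Kaufpreis</div></div>
  <div class="hardfact ">47 m²
    <div class="hardfactlabel color_f_03">Wohnfläche (ca.)</div></div>
  <div class="hardfact rooms">1
    <div class="hardfactlabel color_f_03">Zimmer</div></div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

values = []
for fact in soup.find_all("div", class_="hardfact"):
    label = fact.find("div", class_="hardfactlabel")
    if label is not None:
        label.decompose()  # remove the nested label from the tree
    values.append(fact.get_text(strip=True))

print(values)
```

Matching on class_="hardfact" compares individual class names, so the hardfactlabel and hardfacts divs are not picked up by the loop.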
You can use stripped_strings. You probably want to add a check that the expected items were found, and that each one yields at least one string, before indexing into the results.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.immowelt.de/expose/2QC5D4A?npv=52')
soup = bs(r.content, 'lxml')
items = soup.select('.hardfact')[:3]
for item in items:
    strings = [string for string in item.stripped_strings]
    print(strings[0])
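A small sketch of the guard mentioned above: next() with a default returns None for an element that has no text at all, instead of raising an IndexError. The sample HTML is a shortened, partly made-up variant of the question's markup; the empty third div exists only to show the fallback.

```python
from bs4 import BeautifulSoup

# Shortened sample; the last, empty div is hypothetical.
sample_html = """
<div class="hardfact "><strong>82.500 € </strong><div class="hardfactlabel color_f_03">Kaufpreis</div></div>
<div class="hardfact rooms">1<div class="hardfactlabel color_f_03">Zimmer</div></div>
<div class="hardfact "></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# First non-empty string of each item, or None when the item has no text.
first_strings = [next(item.stripped_strings, None) for item in soup.select(".hardfact")]

print(first_strings)
```

This leaves the tree untouched, so the labels remain available if they are needed later.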

Python web scraping using BeautifulSoup, how to merge two <p> text into one element of list

I use BeautifulSoup for web scraping, then put the results into a list.
The HTML looks like this:
<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>
my code is:
title=[]
text=[]
for newpage in list:
    webpage = urlopen(newpage).read()
    soup = BeautifulSoup(webpage,'html.parser')
    header=soup.find_all("span",attrs={"id":"titletextonly"})
    info = soup.find_all("p",attrs={"class":"attrgroup"})
    for h in header:
        title.append(h.get_text())
    for m in info:
        text.append(m.get_text())
the text list result is:
["2013 Volkswagen Passat","condition:excellent"]
But i want the result like this:
["2013 Volkswagen Passat condition:excellent"]
How to merge the two text when put into a list? please help!!!
Use the str.join() method on the list of strings.
title = []
for h in header:
    title.append(h.get_text())
title = ' '.join(title)  # join takes the list itself, not [title]
Alternatively, add the elements themselves to the list instead of their text, and use a list comprehension to join the texts.
title = []
for h in header:
    title.append(h)
title = ' '.join([i.text for i in title])
Hope this helps! Cheers!
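Applied to the attrgroup paragraphs from the question, the join idea can be sketched as follows. It runs on the inline HTML rather than a fetched page, and the variable name merged is hypothetical.

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
html = """<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>"""

soup = BeautifulSoup(html, "html.parser")
info = soup.find_all("p", class_="attrgroup")

# One stripped text per <p>, merged into a single space-separated string.
merged = " ".join(p.get_text(strip=True) for p in info)
text = [merged]

print(text)
```

Appending merged once per page, instead of appending each paragraph separately, yields one list element per listing.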
You can also use stripped_strings:
from bs4 import BeautifulSoup
html = """<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>"""
tag = BeautifulSoup(html, 'html.parser')
data = ' '.join(tag.stripped_strings)
print(data)

web data scraping : split html content

I'm scraping a website and I was able to reduce a variable called "gender" to this :
[<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>, <span style="text-decoration: none;">associé gérant </span>]
And now I'd like to have only "associé" in the variable but I can't find a way to split this html code.
The reason is that I want to know if it's "associé" (male) or "associée" (female).
does anyone have any ideas ?
Cheers
----- edit ----
here my code which gets me the html output
url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"
r = requests.get(url)
soup = BeautifulSoup(r.content,"lxml")
table = soup.select_one("#adm").find_next("table") #select_one finds only the first tag that matches a selector:
table2 = soup.select_one("#adm").find_all_next("table")
output = table.select("td span[style^=text-decoration:]", limit=2) #.text.split(",", 1)[0].strip()
print(output)
Whatever the parent of the two elements is, you can select span:nth-of-type(2) to get the second span, then just check the text:
html = """<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
<span style="text-decoration: none;">associé gérant </span>"""
soup = BeautifulSoup(html, "html.parser")
text = soup.select_one("span:nth-of-type(2)").text
Or, if it is not always the second span, you can search for the span by the partial text associé:
import re
text = soup.find("span", string=re.compile("associé")).text
For your edit, all you need is to take the text of the last element and use .split(None, 1)[0] to get its first word, which carries the gender:
text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[0] # > associé
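Once the role text is extracted, the male/female decision from the question comes down to comparing its first word. A minimal sketch, assuming the role always starts with "associé" or "associée"; the helper name classify_associate and the returned labels are hypothetical.

```python
def classify_associate(role_text):
    """Return 'female', 'male', or None based on the first word of role_text."""
    words = role_text.split(None, 1)  # split on any whitespace, at most once
    if not words:
        return None  # empty or whitespace-only text
    first_word = words[0].lower()
    if first_word == "associée":
        return "female"
    if first_word == "associé":
        return "male"
    return None  # some other role


print(classify_associate("associé gérant "))
print(classify_associate("associée gérante"))
```

Comparing the whole first word, rather than searching for the substring "associé", avoids classifying "associée" as male.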