How to get "275,47" value (BeautifulSoup,Python3) - html

I am new to web scraping, so I need your help.
I have this HTML code (see the picture) and I want to get this specific value: "275,47".
I wrote the code below, but something is going wrong... Please help me! :)
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page = requests.get(url)
soup = bs(page.text, "html.parser")
D = {"href": "/products/show/30871132",
     "rel": "nofollow",
     "class": "js-product-link",
     "data-func": "trigger_shop_uservoice",
     "data-uservoice-pid": "30871132",
     "data-append-element": ".shop-details",
     "data-uservoice-shopid": "1913",
     "data-type": "final_price"}
value = soup.find_all("a", attrs=D)
print(value.string)

So, you are close! The error is thrown because the variable value does not have an attribute called string: find_all returns a list of matching tags, not a single tag. You want to iterate over all of the anchors and find the one you are looking for.
My suggestion:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page = requests.get(url)
soup = bs(page.text, "html.parser")
value = soup.find_all("a", href=True)  # only keep anchors that actually have an href
for item in value:
    if '30871132' in item.get('href'):
        print(item.text)
item is the current anchor tag we are iterating over in the loop.
We can get its href attribute (or any attribute) by using the .get method.
We then check whether '30871132' is in the href, and if so, print its text.
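One caveat with .get: for anchors without an href it returns None, and '30871132' in None raises a TypeError. A minimal defensive sketch (the HTML string here is just illustrative, not from the real page):

from bs4 import BeautifulSoup as bs

html = '<a>no href</a><a href="/products/show/30871132">275,47</a>'
soup = bs(html, "html.parser")
for item in soup.find_all("a"):
    # .get with a default of '' avoids the TypeError on anchors without an href
    if '30871132' in item.get('href', ''):
        print(item.text)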

Related

How to extract text from aria-label attribute?

So basically I am trying to do some web scraping. I need to scrape the work-life balance rating from the Indeed website. But the challenge I am facing is that I do not know how to extract the text from the aria-label attribute, so that I can get the output 4.0 out of 5 stars.
<div role="img" aria-label="4.0 out of 5 stars."><div class="css-eub7j6 eu4oa1w0"><div data-testid="filledStar" style="width:42.68px" class="css-i84nrz eu4oa1w0"></div></div></div>
You need to identify the element and read its aria-label attribute to get the value.
If you are using Python, the code will be:
print(driver.find_element(By.XPATH, "//div[@role='img']").get_attribute("aria-label"))
Update:
print(driver.find_element(By.XPATH, "//div[@role='img' and @aria-label]").get_attribute("aria-label"))
Or
print(driver.find_element(By.XPATH, "//div[@role='img' and @aria-label][.//div[@data-testid='filledStar']]").get_attribute("aria-label"))
If you can locate that element, its attribute value can be retrieved with Selenium's get_attribute() method.
Let's say you are using By.CSS_SELECTOR and the locator is css_selector.
The Python syntax is:
aria_label_value = driver.find_element(By.CSS_SELECTOR, css_selector).get_attribute("aria-label")
The same can be done in other programming languages with slight syntax changes.
To retrieve the value of the aria-label attribute, i.e. "4.0 out of 5 stars.", you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and role="img":
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[role='img'][aria-label]"))).get_attribute("aria-label"))
Using XPATH and data-testid="filledStar":
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-testid='filledStar']//ancestor::div[@role='img' and @aria-label]"))).get_attribute("aria-label"))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in Python Selenium - get href value

How to load more elements from web pages using BeautifulSoup and/or Selenium

I would like to get the link that each box contains; the page is https://www.quattroruote.it/listino/audi
This webpage lists all the models the brand produces, and each model is a box that links to another page (the one I should work with).
My problem is that the initial page does not load all the boxes at first: you have to scroll down and press the red button "Carica altri modelli" (which means "Load more models").
Is there a way to automatically store all the links I need in one variable? For example, the first link of the first box is "/listino/audi/a1".
Thanks in advance to anyone who tries to help me!!
Not sure exactly which links you want, but you can make the requests by iterating through the itemStart parameter.
import requests
from bs4 import BeautifulSoup

for i in range(1, 100):
    print('\t\tList start %s' % i)
    url = 'https://www.quattroruote.it/listino/ricerca-more-desktop.html'
    payload = {
        'area': 'NEW',
        'itemStart': '%s' % (i * 8),
        '_': '1634219611449'}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    for link in links:
        print(link['href'])
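Since the goal was to store all the links in one variable, a hedged variant of the same idea collects the hrefs into a list and stops when a page comes back empty (assuming the endpoint simply returns an empty fragment once itemStart runs past the catalogue):

import requests
from bs4 import BeautifulSoup

url = 'https://www.quattroruote.it/listino/ricerca-more-desktop.html'
all_links = []
i = 1
while True:
    payload = {'area': 'NEW', 'itemStart': '%s' % (i * 8)}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)]
    if not links:  # assumed: an empty fragment means no more models
        break
    all_links.extend(links)
    i += 1
print(all_links)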

How to extract link from a specific division on a webpage using beautifulsoup

I am new to web scraping and a bit confused by my current situation. Is there a way to extract the links for all the sectors from this website (where I circled in red)? From the HTML inspector, it seems to be under the "performance-section" class and also under the "heading" class. My idea was to start from "performance-section" and then reach the "a" tag's href at the end to get the link.
I tried the following code, but it gives me "None" as a result. I stopped here because if I am already getting None before reaching the "a" tag, I think there is no point in going further.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
results_page = BeautifulSoup(response.content,'lxml')
heading =results_page.find('performance-section',{'class':"heading"})
Thanks in advance!
You are on the right track with your reasoning.
Problem
You should take another look at the documentation, because currently you don't select tags at all, but rather a mix of classes. That can also work, but to learn you should start step by step.
Solution to get the <a> and its href
This selects every <a> inside a <div> with class heading that is, in turn, inside a <div> with class performance-section:
soup.select('div.performance-section div.heading a')
import requests
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')
[link['href'] for link in soup.select('div.performance-section div.heading a')]
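Note that the selected hrefs may be relative. If you need absolute URLs, a small follow-up using urllib.parse.urljoin from the standard library (building on the url and soup variables above) would be:

from urllib.parse import urljoin

absolute_links = [urljoin(url, link['href'])
                  for link in soup.select('div.performance-section div.heading a')]
print(absolute_links)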

BeautifulSoup doesn't find elements

I started to use BeautifulSoup and unfortunately it doesn't work as expected.
The following link https://www.globes.co.il/news/article.aspx?did=1001285059 includes this element:
<div class="sppre_message-data-wrapper">... </div>
I tried to get this element by writing the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285059")
bsObj = BeautifulSoup(html.read(), features="html.parser")
comments = bsObj.find_all('div', {'class': ["sppre_message-data-wrapper"]})
print(comments)
comments came back as an empty list.
It's in an iframe. Make your request to the iframe src
https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true')
soup= bs(r.content,'html.parser')
comments = [item.text for item in soup.select('.sppre_message-data-wrapper')]
print(comments)
BeautifulSoup doesn't support the deep combinator (which is now retired, I think, anyway), but you can see this in the browser (Chrome) using:
*/deep/.sppre_message-data-wrapper
It wouldn't have mattered ultimately, as the content is not present in the requests response from the original URL.
You could alternatively use Selenium and switch to the iframe. While there is an id of 401bccf8039377de3e9873905037a855-iframe, i.e. #401bccf8039377de3e9873905037a855-iframe for find_element_by_css_selector, a more robust selector to switch to (in case the id is dynamic) would be .sppre_frame-container iframe
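A minimal sketch of that Selenium route, assuming the .sppre_frame-container iframe selector still matches the page (written with the current find_element API rather than the old find_element_by_css_selector):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.globes.co.il/news/article.aspx?did=1001285059')

# Switch into the comments iframe, then query elements inside it
frame = driver.find_element(By.CSS_SELECTOR, '.sppre_frame-container iframe')
driver.switch_to.frame(frame)
comments = [el.text for el in
            driver.find_elements(By.CSS_SELECTOR, '.sppre_message-data-wrapper')]
print(comments)
driver.quit()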

Python - How to use finditer regex?

I would like to find every instance of img src="([^"]+)" that is preceded by div class="grid" and followed by div class="orderplacebut" in some HTML code, i.e. I want to find all the images in the div container called "grid".
If I use findall, it only returns one image, because div class="grid" appears just once on the webpage, so it only returns one of the image URLs (makes sense). So I would like the regex to run again and return the second instance of the image URL, then the third, and so forth. Is this possible using finditer, and how would I use it in the code?
The code below is my findall regex that only returns the one URL.
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
# Print all images between div class="grid" and div class="orderplacebut"
# Because the regex spans over several lines, use DOTALL flag to include
# every character between, including new lines
watch_image_urls = findall('<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)
print watch_image_urls
Really, use another approach with a parser (not tested because the .ru domain is blocked here):
import requests
from bs4 import BeautifulSoup

dennisov_url = 'https://denissov.ru/en/'
dennisov_html = requests.get(dennisov_url)
soup = BeautifulSoup(dennisov_html.text, 'lxml')
images = soup.select('div.grid > img')
watch_image_urls = [img['src'] for img in images]
print(watch_image_urls)
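For completeness, since the question asked about finditer specifically: a hedged regex sketch (assuming the page structure described in the question) first captures the grid block with a non-greedy match, then iterates every <img> inside it:

import re
import requests

dennisov_html = requests.get('https://denissov.ru/en/').text

# Non-greedy .*? stops at the first "orderplacebut" div instead of the last
grid = re.search(r'<div class="grid"(.*?)<div class="orderplacebut"',
                 dennisov_html, flags=re.DOTALL)
if grid:
    # finditer yields one match per <img>, so every URL inside the grid is returned
    for m in re.finditer(r'<img src="([^"]+)"', grid.group(1)):
        print(m.group(1))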