I started using BeautifulSoup, and unfortunately it doesn't work as expected.
The following page https://www.globes.co.il/news/article.aspx?did=1001285059 includes this element:
<div class="sppre_message-data-wrapper">... </div>
I tried to get this element by writing the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285059")
bsObj = BeautifulSoup(html.read(), features="html.parser")
# Look for every div carrying the comment wrapper class
comments = bsObj.find_all('div', {'class': ["sppre_message-data-wrapper"]})
print(comments)
But 'comments' came back as an empty list.
It's in an iframe. Make your request to the iframe's src:
https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true
from bs4 import BeautifulSoup as bs
import requests

# Request the iframe's src directly; the comments live inside that document
r = requests.get('https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true')
soup = bs(r.content, 'html.parser')
comments = [item.text for item in soup.select('.sppre_message-data-wrapper')]
print(comments)
BeautifulSoup doesn't support the deep combinator (which is now retired, I think, anyway), but you can see this in the browser (Chrome) using:
*/deep/.sppre_message-data-wrapper
It wouldn't have mattered ultimately, as the content is not present in the requests response from the original URL.
You could alternatively use selenium, I guess, and switch to the iframe. While there is an id of 401bccf8039377de3e9873905037a855-iframe, i.e. #401bccf8039377de3e9873905037a855-iframe for find_element_by_css_selector to switch to, a more robust selector (in case the id is dynamic) would be .sppre_frame-container iframe, as sketched below.
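For illustration, a minimal selenium sketch along those lines (hypothetical: it assumes a local chromedriver on PATH and reuses the container selector from above):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.globes.co.il/news/article.aspx?did=1001285059")
# Locate the comments iframe via its container rather than its (possibly dynamic) id
frame = driver.find_element_by_css_selector(".sppre_frame-container iframe")
driver.switch_to.frame(frame)
# Parse the iframe's document and pull the comment wrappers
soup = BeautifulSoup(driver.page_source, "html.parser")
print([item.text for item in soup.select(".sppre_message-data-wrapper")])
driver.quit()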
I would like to get the link that each box contains; the page is https://www.quattroruote.it/listino/audi
This webpage lists all the models that this brand produces, and each model is a box that links to another page (the one I need to work with).
My problem is that the initial page does not load all the boxes at first; you have to scroll down and press the red button "Carica altri modelli" (which means "Load more models").
Is there a way to automatically store all the links I need in one variable? For example, the link of the first box is "/listino/audi/a1"
Thanks in advance to anyone who tries to help me!!
Not sure exactly which links you want, but you can make the requests by iterating through the itemStart parameter.
import requests
from bs4 import BeautifulSoup

for i in range(1, 100):
    print('\t\tList start %s' % i)
    url = 'https://www.quattroruote.it/listino/ricerca-more-desktop.html'
    payload = {
        'area': 'NEW',
        'itemStart': '%s' % (i * 8),
        '_': '1634219611449'}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    for link in links:
        print(link['href'])
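The upper bound of 100 is arbitrary; as a hypothetical refinement, you could keep going until a page comes back with no anchors (this assumes the endpoint simply returns an empty fragment once the listing is exhausted):

import requests
from bs4 import BeautifulSoup

url = 'https://www.quattroruote.it/listino/ricerca-more-desktop.html'
i = 1
while True:
    payload = {'area': 'NEW', 'itemStart': '%s' % (i * 8)}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    if not links:
        # No more anchors: assume the listing is exhausted
        break
    for link in links:
        print(link['href'])
    i += 1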
I am new to web scraping and a bit confused by my current situation. Is there a way to extract the links for all the sectors from this website (where I circled in red)? From the html inspector, it seems like it is under the "performance-section" class and also under the "heading" class. My idea was to start from "performance-section" and then reach the "a" tag's href at the end to get the link.
I tried the following code, but it gives me "None" as a result. I stopped here because if I am already getting None before reaching the "a" tag, then I think there is no point in continuing.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
results_page = BeautifulSoup(response.content,'lxml')
heading = results_page.find('performance-section', {'class': "heading"})
Thanks in advance!
You are on the right track with your thinking.
Problem
You should take another look at the documentation, because currently you are not selecting a tag at all: find() treats its first argument as a tag name, and "performance-section" is a class, not a tag. Mixing approaches is also possible, but to learn you should go step by step.
Solution to get the <a> and its href
This will select all <a> in a <div> with class heading whose parent is a <div> with class performance-section:
soup.select('div.performance-section div.heading a')
import requests
from bs4 import BeautifulSoup

url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Collect the href of every anchor inside the heading divs
print([link['href'] for link in soup.select('div.performance-section div.heading a')])
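Continuing the snippet above: the hrefs may well be relative, in which case urljoin can absolutize them (a small sketch, assuming relative paths):

from urllib.parse import urljoin

links = [urljoin(url, link['href'])
         for link in soup.select('div.performance-section div.heading a')]
print(links)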
In the attached html screenshot, I want to get the text summary in the 'lemma-summary' section. It's usually the first sentence of an entry. This is a Chinese encyclopedia (Baidu Baike) entry. I used this code with BeautifulSoup:
summaries = doc.getElements('div', attr='label-module', value='para').text
But this returns all text sections of the html page, not just the 'lemma-summary'. If I do this:
summary = soup.select(".lemma-summary")
This does give the right section (only the summary section), but it returns a ResultSet object, and I don't know how to get down to the exact text part.
How to extract the text part from this tag?
The URL of the page is here:
https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3
I want to extract this summary text:
"ika是深圳缇卡基因美容生物科技有限公司的一个化妆品品牌。"
I had to use selenium to get the page to load. If you can get the right html without selenium, that works too.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
url = 'https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3'
driver.get(url)
time.sleep(5)  # give the page time to render its content
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
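As an aside, an explicit wait is usually more reliable than a fixed sleep; a sketch, assuming the .lemma-summary selector used below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the summary block instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".lemma-summary")))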
This
soup.find('div', attrs={'class': 'para', 'label-module': 'para'}).text
gets you
'TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。\n[1]\xa0\n'
and this
summary = soup.select(".lemma-summary")
for s in summary:
    print(s.text)
gets you
TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。
[1]
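If you want the sentence without the trailing [1] citation marker, you could drop the reference tags before extracting the text; a sketch, assuming the markers are <sup> elements:

for s in summary:
    for sup in s.select('sup'):
        sup.decompose()  # remove citation superscripts such as [1] (assumed to be <sup> tags)
    print(s.get_text(strip=True))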
I am new to web scraping, so I need your help.
I have this html code (look at the picture) and I want to get this specific value --> "275,47".
I wrote this code, but something is going wrong... Please help me! :)
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page = requests.get(url)
soup = bs(page.text, "html.parser")
D = {"href": "/products/show/30871132", "rel": "nofollow",
     "class": "js-product-link", "data-func": "trigger_shop_uservoice",
     "data-uservoice-pid": "30871132", "data-append-element": ".shop-details",
     "data-uservoice-shopid": "1913", "data-type": "final_price"}
value = soup.find_all("a", attrs=D)
print(value.string)
So, you are close! Your error is thrown because the variable "value" does not have an attribute called string: value is currently a list of items. You want to iterate over all of the anchors and find the one you are looking for.
My suggestion:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.skroutz.gr/s/11504255/Apple-iPhone-SE-32GB.html"
page = requests.get(url)
soup = bs(page.text, "html.parser")
value = soup.find_all("a")
# Find the anchor whose href points at the product we want;
# the '' default guards against anchors that have no href at all
for item in value:
    if '30871132' in item.get('href', ''):
        print(item.text)
item will be the current anchor tag we are iterating over in the loop.
We can get its href attribute (or any attribute) by using the .get method.
We then check whether '30871132' is in the href and, if so, print out its text.
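Alternatively, since find can match on attributes directly, a single stable attribute may be enough; a sketch, assuming data-type="final_price" uniquely identifies that anchor on the page:

link = soup.find("a", attrs={"data-type": "final_price"})
if link is not None:
    print(link.text)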
I would like to find every instance of img src="([^"]+)" that is preceded by div class="grid" and followed by div class="orderplacebut" in some HTML code, i.e. I want to find all the images in the div container called "grid".
If I use findall, it will only return one image, because div class="grid" appears just once on the webpage, and therefore it will only return one of the following image URLs (makes sense). So I would like to iterate the findall regex so that it runs again and returns the second instance of the image URL, then the third, and so forth. Is this possible using finditer, and how would I use it in the code?
The code below is my findall regex that only returns the one URL.
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
# Print all images between div class="grid" and div class="orderplacebut"
# Because the regex spans over several lines, use DOTALL flag to include
# every character between, including new lines
watch_image_urls = findall('<div class="grid".*<img src="([^"]+)".*<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)
print watch_image_urls
Really, use another approach with a parser (not tested, since the .ru domain is blocked here):
import requests
from bs4 import BeautifulSoup

dennisov_url = 'https://denissov.ru/en/'
dennisov_html = requests.get(dennisov_url)
soup = BeautifulSoup(dennisov_html.text, 'lxml')
# Select every <img> that is a direct child of the div with class "grid"
images = soup.select('div.grid > img')
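To mirror what the regex was trying to capture, you could then pull each src (the urljoin step is an assumption, in case the paths are relative):

from urllib.parse import urljoin

srcs = [urljoin(dennisov_url, img['src']) for img in images]
print(srcs)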