How to load more elements from web pages using BeautifulSoup and/or selenium

I would like to get the link that each box contains; the page is https://www.quattroruote.it/listino/audi
This webpage lists all the models that this brand produces, and each model is a box that links to another page (the one I actually need to work with).
My problem is that the initial page does not load all the boxes at first: you have to scroll down and press the red button "Carica altri modelli" (which means "Load other models").
Is there a way to automatically store all the links I need in one variable? For example, the link of the first box is "/listino/audi/a1".
Thanks in advance to anyone who tries to help me!

Not sure exactly which links you want, but you can make the requests yourself, iterating through the itemStart parameter:
import requests
from bs4 import BeautifulSoup

# XHR endpoint the page calls to load more models; itemStart advances in steps of 8
url = 'https://www.quattroruote.it/listino/ricerca-more-desktop.html'
for i in range(1, 100):
    print('\t\tList start %s' % (i * 8))
    payload = {
        'area': 'NEW',
        'itemStart': '%s' % (i * 8),
        '_': '1634219611449'}
    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    for link in links:
        print(link['href'])
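If you would rather let the browser do the work, a selenium sketch along these lines could click the button until it disappears and then collect the links. The button locator, the wait, and the href filter are assumptions; inspect the page and adjust them.
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.quattroruote.it/listino/audi')

while True:
    try:
        # the locator for the red button is a guess - adjust after inspecting the page
        button = driver.find_element_by_xpath("//*[contains(text(), 'Carica altri modelli')]")
    except NoSuchElementException:
        break  # no button left, so all models should be loaded
    driver.execute_script("arguments[0].click();", button)
    time.sleep(2)  # crude wait for the new boxes to load

soup = BeautifulSoup(driver.page_source, 'html.parser')
# keep only the model links, e.g. "/listino/audi/a1"
links = [a['href'] for a in soup.find_all('a', href=True)
         if a['href'].startswith('/listino/audi')]
driver.quit()
print(links)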

Related

How to extract a link from a specific division on a webpage using BeautifulSoup

I am new to web scraping and a bit confused by my current situation. Is there a way to extract the links for all the sectors from this website (where I circled in red)? From the HTML inspector, it seems like it is under the "performance-section" class and also under the "heading" class. My idea was to start from "performance-section" and then reach the "a" tag's href at the end to get the link.
I tried the following code, but it gives me "None" as a result. I stopped here because if I am already getting None before reaching the "a" tag, I think there is no point in continuing.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
results_page = BeautifulSoup(response.content,'lxml')
heading =results_page.find('performance-section',{'class':"heading"})
Thanks in advance!
You are on the right track with your reasoning.
Problem
You should take another look at the documentation, because at the moment you don't try to select tags at all, but a mix of classes. That is also possible, but to learn you should go step by step.
Solution to get the <a> and its href
This will select all <a> in a <div> with class heading
whose parents are a <div> with class performance-section:
soup.select('div.performance-section div.heading a')
import requests
from bs4 import BeautifulSoup
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response = requests.get(url)
soup = BeautifulSoup(response.content,'lxml')
[link['href'] for link in soup.select('div.performance-section div.heading a')]

Get links from summary section of wikipedia page

I am trying to extract links from the summary section of a Wikipedia page. I tried the methods below.
This URL extracts all the links of the Deep learning page:
https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Deep%20learning
And for extracting the links associated with any particular section, I can filter based on the section id. For example:
for the Definition section of the same page I can use this URL: https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=1
for the Overview section of the same page I can use this URL: https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=2
But I am unable to figure out how to extract only the links from the summary section.
I even tried using pywikibot to extract linkedpages and adjusting the plnamespace variable, but couldn't get links for just the summary section.
You need to use https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=0
Note, however, that this also includes links from the
{{machine learning bar}} and {{Artificial intelligence|Approaches}} templates (the boxes to the right of the screen).
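For example, a minimal requests sketch against that endpoint (format=json is added here so the response is easy to parse; the 'parse'/'links'/'*' keys follow the standard MediaWiki JSON layout):
import requests

params = {
    'action': 'parse',
    'prop': 'links',
    'page': 'Deep learning',
    'section': 0,   # section 0 is the lead/summary section
    'format': 'json',
}
r = requests.get('https://en.wikipedia.org/w/api.php', params=params)
# each entry carries the namespace in 'ns' and the title under the '*' key;
# ns == 0 keeps only ordinary article links
links = [l['*'] for l in r.json()['parse']['links'] if l['ns'] == 0]
print(links)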
You can use Pywikibot with the following commands:
>>> import pywikibot
>>> from pywikibot import textlib
>>> site = pywikibot.Site('wikipedia:en') # create a Site object
>>> page = pywikibot.Page(site, 'Deep learning') # create a Page object
>>> sect = textlib.extract_sections(page.text, site) # divide content into sections
>>> links = sorted(link.group('title') for link in pywikibot.link_regex.finditer(sect.head))
Now links is a list containing all link titles in alphabetical order. If you prefer Page objects as the result, you may create them with
>>> pages = [pywikibot.Page(site, title) for title in links]
It's up to you to create a script from these code snippets; a possible assembly is sketched below.
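Put together, the snippets above could become a script like this:
# the interactive snippets above, combined into one script
import pywikibot
from pywikibot import textlib

site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Deep learning')
sect = textlib.extract_sections(page.text, site)   # split content into sections
links = sorted(link.group('title')
               for link in pywikibot.link_regex.finditer(sect.head))
pages = [pywikibot.Page(site, title) for title in links]
for p in pages:
    print(p.title())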

How to extract a text summary from a wikipedia term entry in html tags?

In the attached HTML screenshot, I want to get the text summary in the 'lemma-summary' section. It's usually the first sentence of a wikipedia entry. This is a Chinese wikipedia entry. I used this code with BeautifulSoup:
summaries = doc.getElements('div', attr='label-module', value='para').text
But this returns all the text sections of the HTML page, not just the 'lemma-summary'. If I do this:
summary = soup.select(".lemma-summary")
this does give the right section (only the summary section), but it returns a ResultSet object, and I don't know how to get down to the exact text part.
How do I extract the text part from this tag?
The URL of the page is here:
https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3
I want to extract this summary text:
"ika是深圳缇卡基因美容生物科技有限公司的一个化妆品品牌。"
I had to use selenium to get the page to load. If you can get the right HTML without selenium, that works too.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
url = 'https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3'
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
This
soup.find('div', attrs={'class': 'para', 'label-module': 'para'}).text
gets you
'TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。\n[1]\xa0\n'
and this
summary = soup.select(".lemma-summary")
for s in summary:
print(s.text)
gets you
TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。
[1]

BeautifulSoup doesn't find elements

I started to use BeautifulSoup and unfortunately it doesn't work as expected.
The page at https://www.globes.co.il/news/article.aspx?did=1001285059 includes the following element:
<div class="sppre_message-data-wrapper">... </div>
I tried to get this element by writing the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285059")
bsObj = BeautifulSoup(html.read(), features="html.parser")
comments = bsObj.find_all('div', {'class': ["sppre_message-data-wrapper"]})
print(comments)
But 'comments' turns out to be an empty list.
It's in an iframe. Make your request to the iframe src:
https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://spoxy-shard2.spot.im/v2/spot/sp_8BE2orzs/post/1001285059/?elementId=6a97624752c75d958352037d2b36df77&spot_im_platform=desktop&host_url=https%3A%2F%2Fwww.globes.co.il%2Fnews%2Farticle.aspx%3Fdid%3D1001285059&host_url_64=aHR0cHM6Ly93d3cuZ2xvYmVzLmNvLmlsL25ld3MvYXJ0aWNsZS5hc3B4P2RpZD0xMDAxMjg1MDU5&pageSize=1&count=1&spot_im_ph__prerender_deferred=true&prerenderDeferred=true&sort_by=newest&conversationSkin=light&isStarsRatingEnabled=false&enableMessageShare=true&enableAnonymize=true&isConversationLiveBlog=false&enableSeeMoreButton=true')
soup = bs(r.content, 'html.parser')
comments = [item.text for item in soup.select('.sppre_message-data-wrapper')]
print(comments)
BeautifulSoup doesn't support the deep combinator (which I think is retired now anyway), but you can see this in the browser (Chrome) using:
*/deep/.sppre_message-data-wrapper
It wouldn't have mattered ultimately, as the content is not present in the requests response from the original URL.
You could alternatively use selenium and switch to the iframe. While there is an id of 401bccf8039377de3e9873905037a855-iframe, i.e. #401bccf8039377de3e9873905037a855-iframe for find_element_by_css_selector, a more robust selector (in case the id is dynamic) would be .sppre_frame-container iframe, as sketched below.
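A rough sketch of that selenium route, using the more robust selector from above (the sleep is a crude wait and an assumption):
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.globes.co.il/news/article.aspx?did=1001285059')
time.sleep(5)  # give the embedded comments widget time to load

# switch into the comments iframe, then query inside it
frame = driver.find_element_by_css_selector('.sppre_frame-container iframe')
driver.switch_to.frame(frame)
comments = [item.text for item in
            driver.find_elements_by_css_selector('.sppre_message-data-wrapper')]
print(comments)
driver.quit()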

create side panels in bokeh for displaying details of hovered data point

I have seen great examples of how Bokeh allows you to hover over a data point and display pop-up details for it. In some cases the details are so voluminous that they really require a side panel to display them all. Is Bokeh a complete enough widget toolkit that I can create a side panel next to the main display and show the details of the data point under the cursor?
Can someone point out some sample code, or at least the relevant APIs?
If you prefer a higher-level API for building and linking Bokeh-based plots, you can use HoloViews; see linking examples at http://holoviews.org/reference/index.html#streams and instructions at http://holoviews.org/user_guide/Custom_Interactivity.html . For example:
import param, numpy as np, holoviews as hv
from holoviews import opts, streams
hv.extension('bokeh')
xvals = np.linspace(0,4,202)
ys,xs = np.meshgrid(xvals, -xvals[::-1])
img = hv.Image(np.sin(((ys)**3)*xs))
pointer = streams.PointerXY(x=0,y=0, source=img)
dmap = hv.DynamicMap(lambda x, y: hv.Points([(x, y)]), streams=[pointer])
dmap = dmap.redim.range(x=(-0.5,0.5), y=(-0.5,0.5))
img + dmap.opts(size=10)
You can find many examples at https://docs.bokeh.org. What you want is possible by adding a callback and updating the relevant part. In the example below, the div is what you call a side panel in your question.
# for bokeh 1.0.4
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Div, Row
from bokeh.io import curdoc
from bokeh.events import Tap

# the data
d = {'x': [1, 2], 'y': [3, 4], 'info': ['some information on a first datapoint', 'some information on a second datapoint']}
source = ColumnDataSource(d)

tooltips = [("x", "$x"), ("y", "$y"), ("info", "@info")]
fig = figure(tools="tap,reset", tooltips=tooltips)
c = fig.circle('x', 'y', source=source, size=15)

def callback(event):
    indexActive = source.selected.indices[0]
    layout.children[1] = Div(text=d['info'][indexActive])  # adjust the info on the right

fig.on_event(Tap, callback)

div = Div(text=d['info'][0])
layout = Row(fig, div)
curdoc().add_root(layout)
To run this code, save it as code.py, open a cmd and type "bokeh serve code.py --show".