scraping specific data from html using BeautifulSoup - html

I was trying to get the best search result position of a few products on the page below:
https://www.purplle.com/search?q=hair%20fall%20shamboo
I used the tools below to get the HTML from the page:
++
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
++
Now I am confused about how to get the product names and their positions (to find the best rank in the search results) from this HTML.
I used the method below to get the product details, but the output contains a lot of unwanted content too:
details = soup.find('div', attrs={'class': 'pr'})
Any idea how to solve this?

I don't know what you meant by position. However, the script below fetches the title of each product and its position (allegedly) from that page:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
soup = BeautifulSoup(driver.page_source, 'html.parser')
for item in soup.find_all(class_="prd-lstng pr"):
    name = item.find_all(class_="pro-name el2")[0].text
    position = item.find_all(class_="mrl5 tx-std30")[0].text
    print(name, position)
driver.quit()
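If "position" simply means the order in which products appear in the listing, a minimal sketch along these lines may help. It assumes the product cards keep the class names used above and that page order equals search rank; the sample HTML, product names, and helper names (`ranked_products`, `best_rank`) are hypothetical stand-ins, and `SAMPLE_HTML` takes the place of `driver.page_source`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; only the class names
# ("prd-lstng pr", "pro-name el2") come from the script above.
SAMPLE_HTML = """
<div class="prd-lstng pr"><p class="pro-name el2">Brand A Hair Fall Shampoo</p></div>
<div class="prd-lstng pr"><p class="pro-name el2">Brand B Conditioner</p></div>
<div class="prd-lstng pr"><p class="pro-name el2">Brand C Hair Fall Shampoo</p></div>
"""


def ranked_products(html):
    """Return (rank, name) pairs in page order, with rank starting at 1."""
    soup = BeautifulSoup(html, "html.parser")
    ranked = []
    for rank, item in enumerate(soup.select(".prd-lstng.pr"), start=1):
        name_tag = item.select_one(".pro-name.el2")
        if name_tag is not None:  # skip cards that have no name element
            ranked.append((rank, name_tag.get_text(strip=True)))
    return ranked


def best_rank(html, keyword):
    """Return the lowest rank whose name contains keyword, or None."""
    keyword = keyword.lower()
    for rank, name in ranked_products(html):
        if keyword in name.lower():
            return rank
    return None


print(ranked_products(SAMPLE_HTML))
print(best_rank(SAMPLE_HTML, "brand c"))
```

Using `enumerate(..., start=1)` keeps the rank tied to the order of the cards in the HTML, so no separate position element is needed if that assumption holds.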

Related

How do i scrape the content that pops up when i move my mouse over another content? [closed]

I am new to web scraping and I have been trying to scrape the text content that appears when I hover my mouse over the 8g on the background image from this site: https://www.dietdoctor.com/recipes/sullivans-kedough-breakfast-pizza, but all my efforts with BeautifulSoup have been futile.
How do I go about this?
Thanks in advance.
This is the link to the image that shows where the mouse-hover content is located:
I have tried the following code suggested by @Andrej Kesely:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dietdoctor.com/recipes/sullivans-kedough-breakfast-pizza'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
soup = BeautifulSoup(soup.select_one('.recipe-energy-mark-wrapper[data-js-popup]')['data-js-popup'], 'html.parser')
# print some data to screen:
for t in soup.select('title'):
    print(t.text)
but it keeps giving TypeError: 'NoneType' object is not subscriptable
import requests
from bs4 import BeautifulSoup
url = 'https://www.dietdoctor.com/recipes/sullivans-kedough-breakfast-pizza'
response = requests.get(url)
# Takes the html source and feeds into BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# In the 'soup' object, we'll select the first element with class ".recipe-energy-mark-wrapper"
# Within that element, the data is the value of the attribute 'data-js-popup'
jsPopup = soup.select_one('.recipe-energy-mark-wrapper')['data-js-popup']
# Normally you probably wouldn't need this part, but that value is stored as the html string
# that we are after, so we'll feed that string into BeautifulSoup to parse that html
soup = BeautifulSoup(jsPopup, 'html.parser')
print(soup.text, '\n')
for t in soup.select('title'):
    print(t.text)
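The `TypeError: 'NoneType' object is not subscriptable` in the question means `select_one` returned `None`, i.e. no matching element was found in the HTML that was parsed. A minimal sketch of a guarded version follows; the helper name `popup_text` and both sample strings are hypothetical, and the real page's markup may differ from what is assumed here.

```python
from bs4 import BeautifulSoup

# Hypothetical samples: one element carries the popup HTML in an
# escaped attribute value, the other has no such attribute at all.
WITH_POPUP = (
    '<div class="recipe-energy-mark-wrapper" '
    'data-js-popup="&lt;div&gt;&lt;span class=&quot;title&quot;&gt;'
    'Net carbs&lt;/span&gt; 8 g&lt;/div&gt;"></div>'
)
WITHOUT_POPUP = '<div class="recipe-energy-mark-wrapper"></div>'


def popup_text(html):
    """Return the text stored in data-js-popup, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.select_one(".recipe-energy-mark-wrapper[data-js-popup]")
    if wrapper is None:
        # Nothing matched; subscripting None here would raise the TypeError.
        return None
    # The attribute value is itself an HTML string, so parse it again.
    inner = BeautifulSoup(wrapper["data-js-popup"], "html.parser")
    return inner.get_text(" ", strip=True)


print(popup_text(WITH_POPUP))
print(popup_text(WITHOUT_POPUP))
```

Checking for `None` before subscripting makes it clear whether the selector failed or the attribute content is the problem.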

Unable to scrape class information with Find_All

I am trying to extract class information from the website below: https://www.programmableweb.com/category/all/apis. My code works fine for all pages except https://www.programmableweb.com/category/all/apis?page=2092.
from bs4 import BeautifulSoup
import requests
url = 'https://www.programmableweb.com/category/all/apis?page=2092'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
apis = soup.find_all('tr',{'class':['odd views-row-first', 'odd','even','even views-row-last']})
print(apis)
On page 2092 I get only one matching row, as shown below:
[<tr class="odd views-row-first views-row-last"><td class="views-field views-field-pw-version-title"> Inkling API<br/></td><td class="views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8"> Our REST API allows you to replicate much of the functionality in our hosted marketplace solution to build custom widgets and stock tickers for your Intranet, create custom reports, add trading...</td><td class="views-field views-field-field-article-primary-category"> Financial</td><td class="views-field views-field-pw-version-links"> REST v0.0</td></tr>]
For any other page (like https://www.programmableweb.com/category/all/apis?page=2091), I get all the rows. The HTML structure seems similar on all pages.
This website is constantly adding new APIs to its database, so there are three scenarios that might have caused this:
1. The selectors you are using are not accurate.
2. The website has some kind of security measure against sending too many requests.
3. At the time of your scrape, this page truly had only one item on it.
Scenario 3 is the most likely.
from bs4 import BeautifulSoup
import requests
from time import sleep
for page in range(1, 2094):  # starting with 1 then the last page will be 2093
    url = f'https://www.programmableweb.com/category/all/apis?page={page}'
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    apis = soup.select('table[class="views-table cols-4 table"] tbody tr')  # better selector
    print(apis)  # page 2093 currently has 6 items on it.
    sleep(5)  # This will sleep for 5 secs
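To check scenario 3 without hitting the site, the selector can be tried against small inline samples. The sketch below assumes the table markup resembles the row printed in the question; the sample pages, the extra API names, and the helper names (`api_rows`, `first_cells`) are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical pages modelled on the row shown in the question.
SINGLE_ROW_PAGE = """
<table class="views-table cols-4 table"><tbody>
<tr class="odd views-row-first views-row-last"><td>Inkling API</td></tr>
</tbody></table>
"""
MULTI_ROW_PAGE = """
<table class="views-table cols-4 table"><tbody>
<tr class="odd views-row-first"><td>Example API A</td></tr>
<tr class="even"><td>Example API B</td></tr>
<tr class="odd views-row-last"><td>Example API C</td></tr>
</tbody></table>
"""


def api_rows(html):
    """Select every body row of the listing table, whatever its row classes."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select("table.views-table tbody tr")


def first_cells(html):
    """Return the text of the first cell of each row."""
    return [row.td.get_text(strip=True) for row in api_rows(html)]


print(first_cells(SINGLE_ROW_PAGE))
print(first_cells(MULTI_ROW_PAGE))
```

Selecting rows by their position inside the table avoids depending on the `odd`/`even`/`views-row-*` class combinations, so a page that yields one row really does contain one row.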

embed html images tkinter

Hi all. I am learning more about coding, tkinter, and Python in general, and I want to create a currency converter. I have the converter working the way I want, but I want to add the USD to EUR chart from this website to my GUI:
https://www.xe.com/currencyconverter/convert/?Amount=1&From=USD&To=EUR
Can I please get suggestions on how to go about doing that?
I have already used
import urllib.request
from bs4 import BeautifulSoup
html_code = urllib.request.urlopen(url).read()
self.soup = BeautifulSoup(html_code, 'html.parser')
self.result = self.soup.find('span', {'class': "uccResultAmount"}).string
to get some information from the website.
Thanks

How to read the html tag which is not reflecting in the beautiful soup object

I am trying to scrape the pages of anime TV series for my side project, and I am stuck on a small issue: I am not able to extract a complete div tag.
link: https://gogoanime.in/category/boruto-naruto-next-generations
I am trying to scrape the comment count shown on that page.
The code I am using is this:
from bs4 import BeautifulSoup
trialURL = 'https://gogoanime.in/category/boruto-naruto-next-generations'
#completeAnimeList_df.URL[0]
page = simple_get(trialURL)
soup = BeautifulSoup(page,'html.parser')
#working on number of comments
divComment = soup.find('span', attrs={"class" : "comment-count"})
print(divComment.get_text)
I am getting a NoneType output after running the code. Please let me know what could be done.
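Two things can be said with some confidence about the code above: `find` returns `None` when the element is not present in the HTML that was downloaded, and `get_text` is a method, so it needs parentheses to return the text. A minimal sketch of a guarded lookup follows; the helper name `comment_count` and the sample strings are hypothetical. If the real page fills in the count with JavaScript after loading, which is only a guess here, the downloaded HTML would not contain the element and the function would return `None`.

```python
from bs4 import BeautifulSoup


def comment_count(html):
    """Return the text of span.comment-count, or None if it is not in html."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", attrs={"class": "comment-count"})
    if span is None:
        # The element is missing from the HTML that was actually fetched.
        return None
    return span.get_text(strip=True)  # note the call: get_text()


# Hypothetical samples: one with the element, one without it.
print(comment_count('<span class="comment-count">12 Comments</span>'))
print(comment_count('<div id="comments"></div>'))
```

Printing a slice of the downloaded `page` and searching it for `comment-count` is a quick way to confirm which case applies.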

why can't find/parse this element in HTML?

Hi, I'm practicing extracting information from a website.
(I'm using Python, Selenium, and BeautifulSoup, which doesn't matter too much. The question is about finding an element in HTML.)
So (1) I want info in the table in graph. I located the table using Firefox Inspector: <table id='......'>
(2) but in my code I can't find it:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
url = 'http://corp.sec.state.ma.us/corpweb/UCCSearch/UCCSearch.aspx'
driver = webdriver.Firefox()
driver.get(url)
# navigate to the page I want using selenium
driver.find_element_by_id("MainContent_rdoSearchO").click()
driver.find_element_by_id("MainContent_txtName").send_keys("mcdonald")
Select(driver.find_element_by_id("MainContent_cboOState")).select_by_visible_text("Massachusetts")
Select(driver.find_element_by_id("MainContent_UCCSearchMethodO")).select_by_visible_text("Begins With")
driver.find_element_by_id("MainContent_btnSearch").click()
# now on next page, click link (selenium)
link_text = '95352026'
driver.find_element_by_link_text(link_text).click()
### real question starts here:
# now on the page I want
# in firefox inspector find: <table id="MainContent_tblFilingHistory">
table_id = 'MainContent_tblFilingHistory'
# try find it
table = driver.find_elements_by_id(table_id)
len(table) # length = 0, can't find it
html.find(table_id) # -1, HTML really doesn't have this string
The element you are having trouble locating is in another window. You need to tell the driver to switch its context to that window:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Firefox()
driver.get('http://corp.sec.state.ma.us/corpweb/UCCSearch/UCCSearch.aspx')
driver.find_element_by_id("MainContent_rdoSearchO").click()
driver.find_element_by_id("MainContent_txtName").send_keys("mcdonald")
Select(driver.find_element_by_id("MainContent_cboOState")).select_by_visible_text("Massachusetts")
Select(driver.find_element_by_id("MainContent_UCCSearchMethodO")).select_by_visible_text("Begins With")
driver.find_element_by_id("MainContent_btnSearch").click()
driver.find_element_by_link_text('95352026').click()
# switch to the next window (switch_to.window replaces the deprecated switch_to_window)
driver.switch_to.window(driver.window_handles[1])
table = driver.find_elements_by_id('MainContent_tblFilingHistory')