Python 3 beautifulsoup html issue not finding game names - html

not sure what I am doing wrong but here is my code:
from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'lxml')
game_names = html_soup.find_all("div", _class = "product_item product_title")
print(game_names)
I am try to find all the game name from this list and metascore but can't figure out the html for it, I think what I have should work but when I run it all I get is this:
[]
can someone explain?
I'm looking at the beautiuflsoup documentation but not sure where to look lol
Thank in advance for the help,

Related

Web-scraping a link from web-page

New to web-scraping here. I basically want to extract a link from a web page into my jupyter notebook as shown in the image below :
Following is the code that I tried out:
from flask import Flask, render_template, request, jsonify
from flask_cors import CORS, cross_origin
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq
flipkart_url = "https://www.flipkart.com/search?q=" + 'acer-aspire-7-core-i5'
uClient = uReq(flipkart_url)
flipkartPage = uClient.read()
flipkart_html = bs(flipkartPage, "html.parser")
#Since I am only interested in the class "_1AtVbE col-12-12"
bigboxes = flipkart_html.findAll("div", {"class": "_1AtVbE col-12-12"})
Now here's the thing, I don't exactly understand what bigboxes is storing. The type of bigboxes is bs4.element.ResultSet, the length is 16.
Now if I run:
box = bigboxes[0]
productlink = "https://www.flipkart.com" + box.div.div.div.a['href']
I am getting an error. However when I run:
box = bigboxes[2]
productlink = "https://www.flipkart.com" + box.div.div.div.a['href']
I am successfully able to extract the link. Can someone please explain to me why the third element was able to read the link? I have a basic knowledge of HTML (at least I thought so) and I don't understand the layers to it. What exactly is bigboxes storing? Clearly, the HTML script shows no layers as such.
Your class filter is not very specific.
The first and second elements are pointing to html nodes which do not contain the link. Thus you are getting error.
A more specific class to check could be: _13oc-S
bigboxes = flipkart_html.findAll("div", {"class": "_13oc-S"})

I want to extract the html from a particular div class named "se_component_wrap sect_dsc __se_component_area"

I have already got a working python script, but i want to automate the url fetching from a page..I just need all the html code inside the div class se_component_wrap sect_dsc __se_component_area but currently i'm getting the html of the whole page
from lxml import html
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *
my_url=('https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755')
#Opening Client
uClient = uReq(my_url)
#Opening the client
page_html = uClient.read()
#Closing connection
uClient.close()
page_soup = soup(page_html, "html.parser")
clear_file=page_soup.prettify()
with open("test.txt","w", encoding="utf-8") as outp:
outp.write(clear_file)
print (page_soup)
fetcher()
I expect the output to be the html code that's contained in that division instead of the complete page
A straight forward way would be:
soup = BeautifulSoup(page_html, "html.parser")
for div in soup.findAll('div',{'class':'se_component_wrap sect_dsc __se_component_area'}):
print div

parse a json object in a div while scraping using beautifulsoup python

I am learning scraping. I need to access the json string i encounter within a DIV. I am using beautifulsoup.
This is the json string i get in the DIV. I need the value (51.65) of the tag "lastprice". Please help. The JSON object is in json_d
import pip
import requests
import json
from bs4 import BeautifulSoup
print ('hi')
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=NBCC&illiquid=0&smeFlag=0&itpFlag=0')
soup = BeautifulSoup(page.text, 'html.parser')
json_d = soup.find(id='responseDiv')
print ('bye')
import bs4
import json
r= '''
<div id="responseDiv" style="display:none">{"tradedDate":"07DEC2018","data":[{"pricebandupper":"58.35","symbol":"NBCC","applicableMargin":"15.35","bcEndDate":"14-SEP-18","totalSellQuantity":"40,722","adhocMargin":"-","companyName":"NBCC (India) Limited","marketType":"N","exDate":"06-SEP-18","bcStartDate":"10-SEP-18","css_status_desc":"Listed","dayHigh":"53.55","basePrice":"53.05","securityVar":"10.35","pricebandlower":"47.75","sellQuantity5":"-","sellQuantity4":"-","sellQuantity3":"-","cm_adj_high_dt":"08-DEC-17","sellQuantity2":"-","dayLow":"51.55","sellQuantity1":"40,722","quantityTraded":"71,35,742","pChange":"-2.64","totalTradedValue":"3,714.15","deliveryToTradedQuantity":"40.23","totalBuyQuantity":"-","averagePrice":"52.05","indexVar":"-","cm_ffm":"2,424.24","purpose":"ANNUAL GENERAL MEETING\/DIVIDEND RE 0.56 PER SHARE","buyPrice2":"-","secDate":"7DEC2018","buyPrice1":"-","high52":"266.00","previousClose":"53.05","ndEndDate":"-","low52":"50.80","buyPrice4":"-","buyPrice3":"-","recordDate":"-","deliveryQuantity":"28,70,753","buyPrice5":"-","priceBand":"No Band","extremeLossMargin":"5.00","cm_adj_low_dt":"26-OCT-18","varMargin":"10.35","sellPrice1":"51.80","sellPrice2":"-","totalTradedVolume":"71,35,742","sellPrice3":"-","sellPrice4":"-","sellPrice5":"-","change":"-1.40","surv_indicator":"-","ndStartDate":"-","buyQuantity4":"-","isExDateFlag":false,"buyQuantity3":"-","buyQuantity2":"-","buyQuantity1":"-","series":"EQ","faceValue":"1.00","buyQuantity5":"-","closePrice":"51.80","open":"53.15","isinCode":"INE095N01031","lastPrice":"51.65"}],"optLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NBCC&instrument=-&date=-&segmentLink=17&symbolCount=2","otherSeries":["EQ"],"futLink":"\/live_market\/dynaContent\/live_watch\/get_quote\/GetQuoteFO.jsp?underlying=NBCC&instrument=FUTSTK&expiry=27DEC2018&type=-&strike=-","lastUpdateTime":"07-DEC-2018 15:59:59"}</div>'''
html = bs4.BeautifulSoup(r)
soup = html.find('div', {'id':'responseDiv'}).text
data = json.loads(soup)
last_price = data['data'][0]['lastPrice']
EDIT:
json_d = soup.find(id='responseDiv')
Try changing to
json_d = soup.find(‘div’, {‘id’:'responseDiv'})
Then you should be able to do
data = json.loads(json_d)
last_price = data['data'][0]['lastPrice']
See if that helps. I’m currently away from my computer until Tuesday so typing this up on my iPhone, so can’t test/play with it.
The other thing is the site might need to be read in after it’s loaded. In that case, I think you’d need to look into selenium package or html-requests packages.
Again, I can’t look until Tuesday when I get back home to my laptop.

beautiful soup findall not returning results

I am trying to webscrape and get the insurance dollars as listed in the below html.
Insurance
Insurance
Used the below code but it is not fetching anything. Can someone help? I am fairly new to python...
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/ford/escape/2017/s/?vehicleid=415933&intent=buy-new')
html_soup = BeautifulSoup(r.content, 'lxml')
test2 = html_soup.find_all('div',attrs={"class":"col-base-6"})
print(test2)
Not all the data you see on the page is actually the response to the get request to this URL. there are a lot of other requests the browser make in the background, which are initiated by javascript code.
Specifically, the request for the insurance data is made to this URL:
https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933
Here is a working code for what you need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933')
html_soup = BeautifulSoup(r.text, 'html.parser')
Insurance = html_soup.find('div',string="Insurance").find_next().text
print(Insurance)

BeautifulSoup and prettify() function

To parse html codes of a website, I decided to use BeautifulSoup class and prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I execute this code on Mac terminal, indentation of the codes are not set. On the other hand, If I execute this code on windows cmd or PyCharm, all codes are set.
Do you know the reason for this ?
try this code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())