To parse the HTML of a website, I decided to use the BeautifulSoup class and its prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I execute this code in the Mac terminal, the indentation is not applied. On the other hand, if I execute it in the Windows cmd or in PyCharm, the output is indented correctly.
Do you know the reason for this?
Try this code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
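If the Mac terminal output still looks flat, one way to rule out the terminal's own rendering is to write the prettified markup to a file and open it in an editor; a minimal check (the file name is just an example):
# Writing to a file with an explicit encoding takes terminal
# rendering and encoding differences out of the picture.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())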
I already have a working Python script, but I want to automate the URL fetching from a page. I just need the HTML inside the div with class se_component_wrap sect_dsc __se_component_area, but currently I am getting the HTML of the whole page.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *  # local helper module (provides fetcher())

my_url = 'https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755'

# Open the connection and read the page HTML
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
clear_file = page_soup.prettify()
with open("test.txt", "w", encoding="utf-8") as outp:
    outp.write(clear_file)
print(page_soup)
fetcher()
I expect the output to be the HTML contained in that div rather than the complete page.
A straightforward way would be:
soup = BeautifulSoup(page_html, "html.parser")
for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
    print(div)
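Since the original script writes the prettified page to test.txt, the same can be done with just the matched divs; a small sketch along those lines (reusing the file name from the question):
# Write only the matched div(s), prettified, instead of the whole page.
with open("test.txt", "w", encoding="utf-8") as outp:
    for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
        outp.write(div.prettify())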
I have created the script below to extract the text from the posts of an Instagram user profile.
It works perfectly for extracting the posts, but there is a problem once I start using the scroll function of Selenium, as the JSON data does not seem to be updating.
I have created a loop with 2 repetitions for test purposes, but there seems to be a problem in the line pageSource = driver.page_source.
I am expecting the script to load the new JSON data linked to the new page, but when I test it, pageSource is always the same, even though Selenium is correctly scrolling through the page.
from bs4 import BeautifulSoup
from selenium import webdriver  # needed for webdriver.Firefox()
import json
import time                     # needed for time.sleep()

url = "https://www.instagram.com/..."  # Instagram profile URL (placeholder)
driver = webdriver.Firefox()
driver.get(url)
for n in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    pageSource = driver.page_source
    soup = BeautifulSoup(pageSource, 'html.parser')
    body = soup.find('body')
    script = body.find('script')
    raw = script.text.strip().replace('window._sharedData =', '').replace(';', '')
    json_data = json.loads(raw)
    for post in json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
        text_src = post['node']['edge_media_to_caption']['edges'][0]['node']['text']
        print(text_src)
    time.sleep(5)
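Not offered as a definitive diagnosis, but two things stand out. First, the script reads driver.page_source before the time.sleep(5), so whatever the scroll triggers may not have loaded yet when the source is captured; a minimal reordering to test that:
for n in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)                       # wait for new content BEFORE reading the source
    pageSource = driver.page_source     # now reflects whatever the scroll loaded
    # ... parse pageSource as above ...
Second, window._sharedData is typically populated only once, for the initial page load; posts that arrive while scrolling usually come through separate background (GraphQL) requests and never show up in that script tag, which would explain identical page-source data no matter how long you wait.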
I am trying to web scrape and get the insurance dollar amount listed in the HTML below.
[HTML snippet from the page: a div labeled "Insurance", followed by the element that holds the dollar amount]
I used the code below, but it is not fetching anything. Can someone help? I am fairly new to Python...
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/ford/escape/2017/s/?vehicleid=415933&intent=buy-new')
html_soup = BeautifulSoup(r.content, 'lxml')
test2 = html_soup.find_all('div', attrs={"class": "col-base-6"})
print(test2)
Not all the data you see on the page is actually part of the response to the GET request to this URL. The browser makes a lot of other requests in the background, initiated by JavaScript code.
Specifically, the request for the insurance data is made to this URL:
https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933
Here is working code for what you need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933')
html_soup = BeautifulSoup(r.text, 'html.parser')
Insurance = html_soup.find('div', string="Insurance").find_next().text
print(Insurance)
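One caveat, as a defensive note rather than a required change: if no div with the string "Insurance" is found (layout change, blocked request), find() returns None and the chained .find_next() raises an AttributeError. A small guard, sketched under that assumption:
# Guard against the "Insurance" label not being present in the response.
label = html_soup.find('div', string="Insurance")
if label is not None:
    print(label.find_next().text)
else:
    print("Insurance label not found in the response")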
Not sure what I am doing wrong, but here is my code:
from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'lxml')
game_names = html_soup.find_all("div", _class = "product_item product_title")
print(game_names)
I am trying to find all the game names and metascores from this list but can't figure out the HTML for it. I think what I have should work, but when I run it, all I get is this:
[]
Can someone explain?
I'm looking at the BeautifulSoup documentation but am not sure where to look.
Thanks in advance for the help.
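A likely culprit, offered as a guess since it matches the empty result: find_all takes the keyword class_ (with a trailing underscore), not _class; an unknown keyword like _class is treated as a filter on an HTML attribute named _class, which no tag has, so the result is an empty list. A sketch of the corrected call (whether these class names still exist on the live page is an assumption taken from the question):
import requests
from bs4 import BeautifulSoup

url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html_soup = BeautifulSoup(r.text, 'lxml')

# class_ (trailing underscore) is BeautifulSoup's keyword for the HTML class
# attribute; matching two classes at once is most reliable with a CSS selector.
for div in html_soup.select("div.product_item.product_title"):
    print(div.get_text(strip=True))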
How can I get all the links present in an HTML file using bs4?
I am trying with this code, but I am not getting the URLs.
import urllib
import re
from bs4 import BeautifulSoup
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print (url+tag.get('href',None))
You could use urlparse.urljoin:
EDIT: To deduplicate, just put them in a set before display.
from bs4 import BeautifulSoup
import urllib
from urlparse import urljoin

urlInput = raw_input('enter - ')
html = urllib.urlopen(urlInput).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')

# Collect the joined URLs in a set to deduplicate them; urljoin resolves
# relative hrefs against the page URL the user entered.
urls = set()
for tag in tags:
    urls.add(urljoin(urlInput, tag.get('href')))
for url in urls:
    print url
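Since raw_input and the urlparse module are Python 2 only, here is a sketch of the same approach under Python 3 (the same functions live in urllib.request and urllib.parse there):
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin

url_input = input('enter - ')
html = urlopen(url_input).read()
soup = BeautifulSoup(html, "html.parser")

urls = set()
for tag in soup('a'):
    href = tag.get('href')
    if href:                      # skip <a> tags without an href
        urls.add(urljoin(url_input, href))
for u in urls:
    print(u)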