If I have these lines:
from bs4 import BeautifulSoup
soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
all_posts = soup.find_all('div',{'class': 'post'})
Is there an equivalent to this pseudocode:
for post in all_posts:
    user = post.find('a', {'class': 'user123'})
    content = post.find('span', {'class': 'content123'})
Thanks in advance!
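Yes: each element returned by find_all is itself a Tag, so you can call find on it exactly as in the pseudocode, and the search is scoped to that element. A minimal self-contained sketch, with made-up markup matching the class names above:

```python
from bs4 import BeautifulSoup

# Made-up markup matching the class names in the question.
html = '''
<div class="post"><a class="user123">alice</a><span class="content123">hello</span></div>
<div class="post"><a class="user123">bob</a><span class="content123">world</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')
for post in soup.find_all('div', {'class': 'post'}):
    # each post is a Tag, so find() searches only inside it
    user = post.find('a', {'class': 'user123'})
    content = post.find('span', {'class': 'content123'})
    print(user.text, content.text)
```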
I already have a working Python script, but I want to automate the URL fetching from a page. I just need all the HTML inside the div with class se_component_wrap sect_dsc __se_component_area, but currently I'm getting the HTML of the whole page:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *

my_url = 'https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755'

# Open the connection and read the page
uClient = uReq(my_url)
page_html = uClient.read()
# Close the connection
uClient.close()

page_soup = soup(page_html, "html.parser")
clear_file = page_soup.prettify()
with open("test.txt", "w", encoding="utf-8") as outp:
    outp.write(clear_file)
print(page_soup)
fetcher()
I expect the output to be the HTML contained in that div rather than the complete page.
A straightforward way would be:
soup = BeautifulSoup(page_html, "html.parser")
for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
    print(div)
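Note that the dict form {'class': 'se_component_wrap sect_dsc __se_component_area'} only matches when the class attribute is exactly that string, in that order. A CSS selector matches each class independently, which is more robust; a small sketch with made-up markup where the classes appear in a different order:

```python
from bs4 import BeautifulSoup

# Made-up markup with the three classes in a different order.
html = '<div class="sect_dsc se_component_wrap __se_component_area"><p>body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() matches each class independently of order
for div in soup.select('div.se_component_wrap.sect_dsc.__se_component_area'):
    print(div.p.text)
```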
I am trying to web-scrape the insurance dollar amount listed in the HTML below.
[HTML snippet showing the "Insurance" label and its dollar amount]
I used the code below, but it is not fetching anything. Can someone help? I am fairly new to Python...
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/ford/escape/2017/s/?vehicleid=415933&intent=buy-new')
html_soup = BeautifulSoup(r.content, 'lxml')
test2 = html_soup.find_all('div',attrs={"class":"col-base-6"})
print(test2)
Not all the data you see on the page is actually in the response to the GET request to this URL; the browser makes a lot of other requests in the background, initiated by JavaScript code.
Specifically, the request for the insurance data is made to this URL:
https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933
Here is working code for what you need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933')
html_soup = BeautifulSoup(r.text, 'html.parser')
Insurance = html_soup.find('div',string="Insurance").find_next().text
print(Insurance)
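The find(string=...).find_next() chain locates the tag whose text is exactly "Insurance" and then takes the tag that follows it in document order. An offline sketch of the same pattern, with made-up markup and a made-up amount:

```python
from bs4 import BeautifulSoup

# Made-up label/value markup imitating the page structure.
html = '<div>Insurance</div><div>$1,531</div>'
soup = BeautifulSoup(html, 'html.parser')

label = soup.find('div', string='Insurance')   # the tag whose text is exactly "Insurance"
print(label.find_next().text)                  # the next tag in the document holds the value
```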
To parse a website's HTML, I decided to use the BeautifulSoup class and its prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I execute this code in the macOS Terminal, the indentation is not applied. On the other hand, if I execute it in Windows cmd or PyCharm, everything is indented correctly.
Do you know the reason for this?
Try this code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
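The only substantive change here is passing response.text instead of response.content: .content is raw bytes, so BeautifulSoup has to guess the encoding, while .text has already been decoded by requests from the response headers. An offline sketch of the difference, with a byte string standing in for the HTTP response:

```python
from bs4 import BeautifulSoup

raw = '<p>döviz</p>'.encode('utf-8')   # stands in for response.content (bytes)
text = raw.decode('utf-8')             # stands in for response.text (str)

# BeautifulSoup accepts either, but the decoded str leaves no room for a wrong guess
print(BeautifulSoup(text, 'html.parser').p.text)
```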
Not sure what I am doing wrong, but here is my code:
from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'lxml')
game_names = html_soup.find_all("div", _class = "product_item product_title")
print(game_names)
I am trying to find all the game names and Metascores from this list but can't figure out the HTML for it. I think what I have should work, but when I run it, all I get is this:
[]
Can someone explain?
I'm looking at the BeautifulSoup documentation but not sure where to look, lol.
Thanks in advance for the help!
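One likely culprit is the keyword argument: find_all expects class_ (trailing underscore), while _class is silently treated as a filter on a nonexistent _class attribute, so every tag is rejected and the result is []. A minimal offline sketch of the corrected call, with made-up markup using the class names from the question:

```python
from bs4 import BeautifulSoup

# Made-up markup using the class names from the question.
html = '<div class="product_item product_title"><a>Some Game</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# class_ (trailing underscore) is the keyword BeautifulSoup recognizes
games = soup.find_all('div', class_='product_item product_title')
print(games[0].a.text)
```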
How can I get all the links present in an HTML file using bs4?
I am trying with this code, but I am not getting the URLs:
import urllib
import re
from bs4 import BeautifulSoup
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print(url + tag.get('href', None))
You could use urlparse.urljoin:
EDIT: To deduplicate, just put the URLs in a set before displaying them.
from bs4 import BeautifulSoup
import urllib
from urlparse import urljoin

urlInput = raw_input('enter - ')
html = urllib.urlopen(urlInput).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
urls = set()
for tag in tags:
    urls.add(urljoin(urlInput, tag.get('href')))
for url in urls:
    print(url)
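raw_input and urlparse are Python 2 names; the same approach in Python 3, sketched offline against a made-up base URL and markup, looks like this:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://example.com/docs/'   # made-up base URL
html = '<a href="a.html"></a><a href="a.html"></a><a href="/top"></a>'
soup = BeautifulSoup(html, 'html.parser')

# skip anchors without an href, resolve the rest against the base, deduplicate via a set
urls = {urljoin(base, a.get('href')) for a in soup('a') if a.get('href')}
for u in sorted(urls):
    print(u)
```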