If I have these lines:
from bs4 import BeautifulSoup
soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
all_posts = soup.find_all('div',{'class': 'post'})
Is there an equivalent to this pseudocode:
for post in all_posts:
    user = post.find('a', {'class': 'user123'})
    content = post.find('span', {'class': 'content123'})
Thanks in advance!
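Yes: each element returned by find_all is itself a Tag, so you can call find on it exactly as in the pseudocode, and the search is scoped to that element. A minimal self-contained sketch, with made-up markup matching the class names above:

```python
from bs4 import BeautifulSoup

# Made-up markup matching the class names in the question.
html = '''
<div class="post"><a class="user123">alice</a><span class="content123">hello</span></div>
<div class="post"><a class="user123">bob</a><span class="content123">world</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')
for post in soup.find_all('div', {'class': 'post'}):
    # each post is a Tag, so find() searches only inside it
    user = post.find('a', {'class': 'user123'})
    content = post.find('span', {'class': 'content123'})
    print(user.text, content.text)
```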
I already have a working Python script, but I want to automate the URL fetching from a page. I just need all the HTML inside the div with class se_component_wrap sect_dsc __se_component_area, but currently I'm getting the HTML of the whole page:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *

my_url = 'https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755'

# Open the connection and read the page
uClient = uReq(my_url)
page_html = uClient.read()
# Close the connection
uClient.close()

page_soup = soup(page_html, "html.parser")
clear_file = page_soup.prettify()
with open("test.txt", "w", encoding="utf-8") as outp:
    outp.write(clear_file)
print(page_soup)
fetcher()
I expect the output to be the HTML contained in that div rather than the complete page.
A straightforward way would be:
soup = BeautifulSoup(page_html, "html.parser")
for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
    print(div)
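Note that the dict form {'class': 'se_component_wrap sect_dsc __se_component_area'} only matches when the class attribute is exactly that string, in that order. A CSS selector matches each class independently, which is more robust; a small sketch with made-up markup where the classes appear in a different order:

```python
from bs4 import BeautifulSoup

# Made-up markup with the three classes in a different order.
html = '<div class="sect_dsc se_component_wrap __se_component_area"><p>body</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() matches each class independently of order
for div in soup.select('div.se_component_wrap.sect_dsc.__se_component_area'):
    print(div.p.text)
```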
I am trying to web-scrape the insurance dollar amount listed in the HTML below.
[HTML snippet showing the "Insurance" label and its dollar amount]
I used the code below, but it is not fetching anything. Can someone help? I am fairly new to Python...
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/ford/escape/2017/s/?vehicleid=415933&intent=buy-new')
html_soup = BeautifulSoup(r.content, 'lxml')
test2 = html_soup.find_all('div',attrs={"class":"col-base-6"})
print(test2)
Not all the data you see on the page is actually in the response to the GET request to this URL; the browser makes a lot of other requests in the background, initiated by JavaScript code.
Specifically, the request for the insurance data is made to this URL:
https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933
Here is working code for what you need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933')
html_soup = BeautifulSoup(r.text, 'html.parser')
Insurance = html_soup.find('div',string="Insurance").find_next().text
print(Insurance)
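The find(string=...).find_next() chain locates the tag whose text is exactly "Insurance" and then takes the tag that follows it in document order. An offline sketch of the same pattern, with made-up markup and a made-up amount:

```python
from bs4 import BeautifulSoup

# Made-up label/value markup imitating the page structure.
html = '<div>Insurance</div><div>$1,531</div>'
soup = BeautifulSoup(html, 'html.parser')

label = soup.find('div', string='Insurance')   # the tag whose text is exactly "Insurance"
print(label.find_next().text)                  # the next tag in the document holds the value
```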
To parse a website's HTML, I decided to use the BeautifulSoup class and its prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I execute this code in the macOS Terminal, the indentation is not applied. On the other hand, if I execute it in Windows cmd or PyCharm, everything is indented correctly.
Do you know the reason for this?
Try this code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
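The only substantive change here is passing response.text instead of response.content: .content is raw bytes, so BeautifulSoup has to guess the encoding, while .text has already been decoded by requests from the response headers. An offline sketch of the difference, with a byte string standing in for the HTTP response:

```python
from bs4 import BeautifulSoup

raw = '<p>döviz</p>'.encode('utf-8')   # stands in for response.content (bytes)
text = raw.decode('utf-8')             # stands in for response.text (str)

# BeautifulSoup accepts either, but the decoded str leaves no room for a wrong guess
print(BeautifulSoup(text, 'html.parser').p.text)
```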
Not sure what I am doing wrong, but here is my code:
from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'lxml')
game_names = html_soup.find_all("div", _class = "product_item product_title")
print(game_names)
I am trying to find all the game names and Metascores from this list but can't figure out the HTML for it. I think what I have should work, but when I run it, all I get is this:
[]
Can someone explain?
I'm looking at the BeautifulSoup documentation but not sure where to look, lol.
Thanks in advance for the help!
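One likely culprit is the keyword argument: find_all expects class_ (trailing underscore), while _class is silently treated as a filter on a nonexistent _class attribute, so every tag is rejected and the result is []. A minimal offline sketch of the corrected call, with made-up markup using the class names from the question:

```python
from bs4 import BeautifulSoup

# Made-up markup using the class names from the question.
html = '<div class="product_item product_title"><a>Some Game</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# class_ (trailing underscore) is the keyword BeautifulSoup recognizes
games = soup.find_all('div', class_='product_item product_title')
print(games[0].a.text)
```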
How can I get all the links present in an HTML file using bs4?
I am trying with this code, but I am not getting the URLs:
import urllib
import re
from bs4 import BeautifulSoup
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print(url + tag.get('href', None))
You could use urlparse.urljoin:
EDIT: To deduplicate, just put the URLs in a set before displaying them.
from bs4 import BeautifulSoup
import urllib
from urlparse import urljoin

urlInput = raw_input('enter - ')
html = urllib.urlopen(urlInput).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
urls = set()
for tag in tags:
    urls.add(urljoin(urlInput, tag.get('href')))
for url in urls:
    print(url)
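raw_input and urlparse are Python 2 names; the same approach in Python 3, sketched offline against a made-up base URL and markup, looks like this:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://example.com/docs/'   # made-up base URL
html = '<a href="a.html"></a><a href="a.html"></a><a href="/top"></a>'
soup = BeautifulSoup(html, 'html.parser')

# skip anchors without an href, resolve the rest against the base, deduplicate via a set
urls = {urljoin(base, a.get('href')) for a in soup('a') if a.get('href')}
for u in sorted(urls):
    print(u)
```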