How to get all links in an HTML document using Beautiful Soup?

How can I get all the links present in an HTML file using bs4?
I am trying with this code, but I am not getting the URLs:
import urllib
import re
from bs4 import BeautifulSoup
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print (url+tag.get('href',None))

You could use urlparse.urljoin:
EDIT: To deduplicate, just put them in a set before display.
from bs4 import BeautifulSoup
import urllib
from urlparse import urljoin
urlInput = raw_input('enter - ')
html = urllib.urlopen(urlInput).read()
soup = BeautifulSoup(html)
tags = soup('a')
urls = set()
for tag in tags:
    urls.add(urljoin(urlInput, tag.get('href')))
for url in urls:
    print url
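
For anyone on Python 3, the same approach can be sketched with urllib.request and urllib.parse from the standard library (a minimal sketch, not the original answer's code):
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin

url_input = input('enter - ')
html = urlopen(url_input).read()
soup = BeautifulSoup(html, 'html.parser')

# Collect absolute URLs in a set to deduplicate them
urls = set()
for tag in soup('a'):
    href = tag.get('href')
    if href:  # skip anchors without an href attribute
        urls.add(urljoin(url_input, href))

for url in urls:
    print(url)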

Related

I want to extract the HTML from a particular div with the class "se_component_wrap sect_dsc __se_component_area"

I already have a working Python script, but I want to automate the URL fetching from a page. I just need the HTML code inside the div with the class se_component_wrap sect_dsc __se_component_area, but currently I'm getting the HTML of the whole page.
from lxml import html
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *
my_url=('https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755')
#Opening the connection
uClient = uReq(my_url)
#Reading the page
page_html = uClient.read()
#Closing the connection
uClient.close()
page_soup = soup(page_html, "html.parser")
clear_file = page_soup.prettify()
with open("test.txt","w", encoding="utf-8") as outp:
    outp.write(clear_file)
print (page_soup)
fetcher()
I expect the output to be the HTML code contained in that div instead of the complete page.
A straightforward way would be:
soup = BeautifulSoup(page_html, "html.parser")
for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
    print(div)
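
If the goal is to write only those divs to test.txt instead of the whole page, a small sketch along the same lines (the class string and file name are taken from the question, not verified against the live page):
soup = BeautifulSoup(page_html, "html.parser")
with open("test.txt", "w", encoding="utf-8") as outp:
    for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
        # prettify() works on individual tags as well as on the whole soup
        outp.write(div.prettify())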

When I use the following script with Selenium and BeautifulSoup, the text is correctly extracted but the JSON file is always the same

I have created the script below to extract the text from the posts of an Instagram user profile.
It works perfectly for extracting the posts, but there is a problem once I start using the scroll function of Selenium, as the JSON data does not seem to be updating.
I have created a loop with 2 repetitions for test purposes,
but there seems to be a problem in the line pageSource=driver.page_source.
I am expecting the script to load the new JSON data linked to the new page, but when I test it, pageSource is always the same even though Selenium is correctly scrolling through the page.
import requests
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
from selenium import webdriver
import ssl
import json
import time
url = #instagram url
driver = webdriver.Firefox()
driver.get(url)
for n in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    pageSource=driver.page_source
    soup = BeautifulSoup(pageSource, 'html.parser')
    body = soup.find('body')
    script = body.find('script')
    raw = script.text.strip().replace('window._sharedData =', '').replace(';', '')
    json_data=json.loads(raw)
    for post in json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
        text_src = post['node']['edge_media_to_caption']['edges'][0]['node']['text']
        print (text_src)
    time.sleep(5)
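
One possibility worth checking (a guess, not verified against Instagram's current page) is that page_source is read immediately after the scroll, before the newly loaded content is in the DOM, while the sleep only happens after parsing. A minimal sketch of the loop with the wait moved between the scroll and the read:
for n in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the page time to load the newly scrolled content
    pageSource = driver.page_source  # re-read the page source after the wait
    soup = BeautifulSoup(pageSource, 'html.parser')
    # ... parse window._sharedData as above ...
Note that window._sharedData itself may not be refreshed by scrolling, so this only addresses the timing of the page_source read.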

BeautifulSoup and prettify() function

To parse the HTML code of a website, I decided to use the BeautifulSoup class and the prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I execute this code in the Mac terminal, the indentation of the output is not applied. On the other hand, if I execute this code in the Windows cmd or PyCharm, the output is properly indented.
Do you know the reason for this?
Try this code:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
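
The only change above is response.text (a decoded string) instead of response.content (raw bytes). If the difference really lies in how the terminal displays the output rather than in the parsing, a small sketch that writes the prettified HTML to a file with an explicit encoding (the file name output.html is just an example):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
# Writing to a file sidesteps any terminal encoding or display issues
with open("output.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())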

Python 3 beautifulsoup html issue not finding game names

Not sure what I am doing wrong, but here is my code:
from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'lxml')
game_names = html_soup.find_all("div", _class = "product_item product_title")
print(game_names)
I am trying to find all the game names and metascores from this list but can't figure out the HTML for it. I think what I have should work, but when I run it, all I get is this:
[]
Can someone explain?
I'm looking at the BeautifulSoup documentation but not sure where to look lol
Thanks in advance for the help.
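
One likely culprit (a guess, not verified against Metacritic's current markup): the keyword for filtering by class in BeautifulSoup is class_ with a trailing underscore, because class is a reserved word in Python; _class is treated as a filter on an attribute literally named _class, which matches nothing and yields []. A sketch under that assumption, keeping the class names from the question:
html_soup = BeautifulSoup(webpage, 'lxml')
# select() matches elements carrying both classes; class_="..." with find_all
# would require the class attribute to equal that exact string
game_names = html_soup.select("div.product_item.product_title")
print(game_names)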

Can I navigate through a find_all() Beautiful Soup object?

If I have the line:
from bs4 import BeautifulSoup
soup = BeautifulSoup('A LOT OF HTML HERE', 'html.parser')
all_posts = soup.find_all('div',{'class': 'post'})
Is there an equivalent to this pseudocode:
for post in all_posts:
    user = post.find('a', {'class': 'user123'})
    content = post.find('span', {'class': 'content123'})
Thanks in advance!
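
For what it's worth, find_all() returns a list-like ResultSet of Tag objects, and calling find() on each element works exactly as in the pseudocode above. A minimal self-contained sketch with a made-up HTML snippet (the class names are placeholders taken from the question):
from bs4 import BeautifulSoup

html = '<div class="post"><a class="user123">alice</a><span class="content123">hello</span></div>'
soup = BeautifulSoup(html, 'html.parser')

for post in soup.find_all('div', {'class': 'post'}):
    user = post.find('a', {'class': 'user123'})        # a Tag, or None if no match
    content = post.find('span', {'class': 'content123'})
    print(user.text, content.text)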