Python3.5 BeautifulSoup4 get text from 'p' in div - html

I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the raw HTML of the element rather than its text. What am I missing to get the text?
The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)

from bs4 import BeautifulSoup
import requests

url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

I've used the more reliable form of .find, passing the attributes as a dict (key: value):

whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})
the_title = whole_section.center.h2
# e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
# e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
# e.g.
the_date = number_text.next_sibling.next_sibling
# authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
# [1:] because we don't want the first paragraph, h3.p.
# We could also do findAll('p', recursive=False), which searches only
# direct children and so doesn't pick up nested paragraphs.

Basically, I've dissected this whole tree. As for the paragraphs (e.g. the main text, the var para), you'll have to loop over them (see the sketch below).

print(authors)
# You can add .text (e.g. print(authors.text)) to get the text without the tag,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
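A minimal sketch of that loop over para, repeating the setup so it runs on its own:

import requests
from bs4 import BeautifulSoup

url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})
para = whole_section.findAll('p')[1:]

# print each paragraph of the opinion as plain text, tags stripped
for p in para:
    print(p.text)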

Try printing doctext.text. This will get rid of all the HTML tags for you.

import requests
from bs4 import BeautifulSoup

cases = []
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for url in openfile1:
        # GET the HTML page as a string, with HTML tags
        rulingpage = requests.get(url).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        # find the part of the HTML page we want, as an HTML element
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext.text)  # now we have the inner text as a string, tags stripped
        cases.append(doctext.text)  # do something useful with this (see below)!
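For instance, a minimal sketch of that last step, writing every scraped case to a single output file (the cases.txt filename is an assumption):

# append each case's text to one file, separated by a divider line
with open('cases.txt', 'w') as out:
    for case in cases:
        out.write(case)
        out.write('\n' + '-' * 80 + '\n')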

Related

How to write a css/xpath for dynamically changing element?

I am using beautiful soup and below is my selector to scrape hrefs.

html = '''<a data-testid="Link" class="sc-pciXn eUevWj JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP" href="https://join.com/companies/talpasolutions/4978529-project-customer-success-manager-heavy-industries-d-f-m">'''

soup = BeautifulSoup(html, "lxml")
jobs = soup.find_all("a", class_="sc-pciXn eUevWj JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP")
for job in jobs:
    job_url = job.get("href")

I am using find_all because there are a total of 3 elements with hrefs.
The above method works, but the website keeps changing the classes on a daily basis, so I need a different way to design the CSS/XPath selector.
Try:

import requests
from bs4 import BeautifulSoup

url = "https://join.com/companies/talpasolutions"
soup = BeautifulSoup(requests.get(url).content, "lxml")

# select only the anchors that contain an <h3> (the job-title links),
# which sidesteps the auto-generated class names entirely
for a in soup.select("a:has(h3)"):
    print(a.get("href"))
Prints:
https://join.com/companies/talpasolutions/4978529-project-customer-success-manager-heavy-industries-d-f-m
https://join.com/companies/talpasolutions/4925936-senior-data-engineer-d-f-m
https://join.com/companies/talpasolutions/4926107-senior-data-scientist-d-f-m
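If the <h3> ever moves out of the link, an attribute selector keyed on the stable part of the href also avoids the generated class names. A hedged alternative, assuming the /companies/talpasolutions/ path prefix stays stable:

import requests
from bs4 import BeautifulSoup

url = "https://join.com/companies/talpasolutions"
soup = BeautifulSoup(requests.get(url).content, "lxml")

# ^= matches any href that starts with the given prefix
for a in soup.select('a[href^="https://join.com/companies/talpasolutions/"]'):
    print(a.get("href"))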

beautiful soup unable to find elements from website

It's my first time working with web scraping so cut me some slack. I'm trying to pull the "card_tag" from a website. I triple checked that the card tag is inside its respective tags as seen in the code.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for div_tag in soup.find_all('div id="siteContainer"'):
    ul_tag = div_tag.find("ul class")
    li_tag = ul_tag.find("li")
    card_tag = li_tag.find("h3")
    urls.append(card_tag)
print(urls)
When I go to print the url list it outputs nothing. You can see the thing I'm looking for by visiting the link as seen in the code and inspecting element on "Blood-C". As you can see it's listed in the tag I'm trying to find, yet my code can't seem to find it.
Any help would be much appreciated.
It's just minor syntax you need to change in the tags and attributes:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
containers = soup.find_all('div', {'id': 'siteContainer'})
for div_tag in containers:
    ul_tag = div_tag.find("ul", {'data-type': 'anime'})
    li_tag = ul_tag.find_all("li")
    for each in li_tag:
        card_tag = each.find("h3")
        urls.append(card_tag)
        print(card_tag)
Also, you could just skip all that and go straight to those <h3> tags with the class attribute cardName:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for card_tag in soup.find_all('h3', {'class': 'cardName'}):
    print(card_tag)
    urls.append(card_tag)
Output:
<h3 class="cardName">Black Butler</h3>
<h3 class="cardName">Blood-C</h3>
<h3 class="cardName">Place to Place</h3>
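If you want the titles as plain strings rather than tag objects, .text strips the markup; a small follow-up sketch:

import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
soup = BeautifulSoup(result.content, features="html.parser")

# collect just the text of each cardName heading
titles = [h3.text for h3 in soup.find_all('h3', {'class': 'cardName'})]
print(titles)  # e.g. ['Black Butler', 'Blood-C', 'Place to Place']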

Parsing HTML and writing PDF to disk (python)

My goal was to write a script that downloads all the pdf files from a user entered site.
Problem 1: the code does not return the anchor tags located inside the iframe. I tried explicitly using the iframe tag name and then .contents, but the command returns an empty list.
Question 1: How do I parse the iframe? Why doesn't iframe.contents return its children, i.e. the <a> tags?
Problem 2: Writing the PDFs to disk appears successful, but when I attempt to open the files I get the following error:
"....could not open...because it is either not a supported file type
or because the file has been damaged (for example, it was sent as an
email...and wasn't correctly decoded)."
Question 2: Has anybody encountered this before?
The code is split into two blocks, one for each problem; delete the set of quotes around a block to run it.
Lastly, if anyone can explain why the two urls don't match in the first block of code, that would be awesome. The code is commented and contains the urls for each question. Thanks!
PYTHON CODE
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# initializing counters
slide = 1
count = 0

# ignore SSL cert errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# get user url and create soup object
url = input("Enter the website name: ")
connect = urllib.request.urlopen(url, context=ctx)
soup = BeautifulSoup(connect, 'html.parser')

######## code block for question 1 revolving around parsing iframes and the issues with the
######## mismatching urls
# url used for code block 1: https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html
"""
# trying to retrieve all anchor tags; doesn't print the anchor tags within the iframe
tags = soup('a')
for tag in tags:
    print(tag)
    print('\n')

# explicitly asking for the iframe tag
iframe = soup.iframe
# the url printed on this line doesn't match the url printed once I get the src attribute
# navigating to the url listed here is what I use for the second block of code because it
# isn't an iframe
print(iframe)

iframe_src_url = iframe['src']
# this url doesn't match the one shown in the previous print statement and it leaves you dealing
# with another iframe
print(iframe_src_url)
"""

######## code block for question 2 where I enter the url found in the iframe src attribute
# url for block 2: https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false
"""
tags = soup('a')
# iterate through tags, retrieve href addresses, navigate to the document, write data to file
for tag in tags:
    doc_url = tag.get('href')
    file = urllib.request.urlopen(doc_url, context=ctx)
    file = open("Week " + str(slide) + " slides.pdf", 'wb')
    file.write(connect.read())
    file.close()
    print("Finished file: ", slide)
    count = count + 1
    slide = slide + 1
print("Total files downloaded: ", count)
"""
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html')
soup = BeautifulSoup(r.content, 'html.parser')
for item in soup.findAll('iframe'):
    print(item.get('src'))
Output:
https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false
And regarding the second question: the damaged PDFs come from the original loop writing connect.read() (the index page's HTML, already consumed after the first iteration) into every .pdf, and from rebinding file so the document actually fetched is discarded. Downloading each link's real target and writing that content fixes it:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false')
soup = BeautifulSoup(r.content, 'html.parser')
links = []
for item in soup.findAll('a', {'rel': 'noreferrer'}):
    links.append(item.get('href'))
for item in links:
    r = requests.get(item)
    source = r.headers.get('Location')  # the redirect target: the PDF's real url
    print(f"Saving File {source[56:]}")  # [56:] slices off the url prefix, leaving the filename
    r1 = requests.get(source)
    with open(f"{source[56:]}", 'wb') as f:
        f.write(r1.content)
print(f"\nTotal File Downloaded is {len(links)}")
Output (the files are saved to your local disk):
Saving File 01-intro-logistics.pdf
Saving File 02-data.pdf
Saving File 03-preprocessing.pdf
Saving File 03-preprocessing.pdf
Saving File 04-frequent-patterns.pdf
Saving File 05a-supervised.pdf
Saving File 05b-supervised.pdf
Saving File 05c-supervised.pdf
Saving File 06a-supervised-advanced.pdf
Saving File 06b-supervised-advanced.pdf
Saving File 07a-unsupervised.pdf
Saving File 07b-unsupervised.pdf
Saving File 07c-advanced-unsupervised.pdf
Saving File 08-graph-mining.pdf
Saving File 09-anomaly-detection.pdf
Saving File 10-time-series.pdf
Total File Downloaded is 16
Full Version:
import requests
from bs4 import BeautifulSoup
import html

def Get_Links():
    links = set()
    r = requests.get('https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html')
    soup = BeautifulSoup(r.text, 'html.parser')
    source = html.escape(soup.find('iframe').get('src'))
    r = requests.get(source)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', {'rel': 'noreferrer'}):
        links.add(item.get('href'))
    return links, len(links)

def Save_Items():
    items, size = Get_Links()
    for item in items:
        r = requests.get(item)
        source = r.headers.get('Location')
        print(f"Saving File {source[56:]}")
        r = requests.get(source)
        with open(f"{source[56:]}", 'wb') as f:
            f.write(r.content)
    print(f"\nTotal File Downloaded is {size}")

Save_Items()

Having trouble finding Span tag (Python 3)

I'm trying to strip out the span tags from an HTML file.
I am using a page which has a lot of span tags in it. I need to extract some numbers and add them together. However, I can't even get the lines I need out, so I am hoping someone can offer some advice.
My code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
# print(soup)
spans = soup.findAll('span')
for span in spans:
    print(span)
Thanks
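For the extract-and-add-them-together part, a minimal sketch, assuming each relevant span's text is a plain integer (spans holding anything else are skipped):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
soup = BeautifulSoup(urlopen(url, context=ctx).read(), 'html.parser')

total = 0
for span in soup.findAll('span'):
    try:
        total += int(span.text)  # keep only spans whose text parses as a number
    except ValueError:
        pass
print(total)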

BeautifulSoup cannot find all <p> tags in html

I'm trying to pull articles' titles, text, and user comments from websites using BeautifulSoup. I have managed to pull the first two, but I'm having problems pulling the user comments. The code I have right now:
def extract_text(url):
    url_to_extract = url
    html = urllib.urlopen(url_to_extract).read()
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    print 'Title and Publisher is:\n' + soup.title.string
    body_text = ''
    article = soup.findAll('p')
    for element in article:
        body_text += '\n' + ''.join(element.findAll(text=True))
    print body_text

def main():
    url_title = 'https://www.theguardian.com/politics/2016/oct/24/nicola-sturgeon-says-brexit-meeting-was-deeply-frustrating'
    extract_text(url_title)
I've checked the source code for this specific article in the main method; the user comments are available in <p> tags, which should make them parsed by BeautifulSoup along with the article text, but they don't show. I tried to print all the BeautifulSoup content with
print soup
It doesn't show the div where the user comments are supposed to be. I've tried this on the BBC and Guardian websites so far. I would be glad for any kind of help here.
import requests
from bs4 import BeautifulSoup

def getArticle(url):
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c, 'html.parser')
    article_text = ''
    article = soup.findAll('p')
    for element in article:
        article_text += '\n' + ''.join(element.findAll(text=True))
    return article_text
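One note the thread never states outright: on sites like the Guardian, the comment thread is injected by JavaScript after the page loads, so it is never present in the HTML that urllib or requests receives, no matter which parser you use. A quick way to confirm this (the 'discussion' class name below is a hypothetical placeholder; inspect the live page for the real container):

import requests
from bs4 import BeautifulSoup

url = 'https://www.theguardian.com/politics/2016/oct/24/nicola-sturgeon-says-brexit-meeting-was-deeply-frustrating'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# 'discussion' is a hypothetical class name used for illustration only;
# if the comments container is absent from the server-rendered HTML,
# this prints None, confirming the comments are loaded by JavaScript
print(soup.find('div', class_='discussion'))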