Extracting contents from a PDF to display on web pages - html

I'm trying to display the contents of a PDF by converting the PDF into HTML with Adobe Acrobat 2021, extracting the paragraph structure, and post-processing it. I saw a website whose only source is judgment PDFs from the Supreme Court website, and it displays them flawlessly. Does anybody have any idea how it's done?
My current flow is to convert the PDF into HTML to preserve the page layout and then extract the text using BeautifulSoup.
Issues I'm currently facing:
The paragraph/bullet numbers are generated dynamically in the exported HTML and only appear as ::before pseudo-elements in the browser, so bs4 won't pick them up.
Some paragraphs in between are missed because paragraph boundaries are detected incorrectly.
Tables are detected as tables, but with some imperfections.
PDF example : drive link
HTML from Adobe Acrobat : HTML file of the above PDF
This is my goal : Advocatekhoj
This is how accurate I'm expecting it to be.
Could someone please shed light on this? Any how-tos or suggestions would be appreciated.
Note: I tried various PDF-to-HTML tools, and Adobe Acrobat was the best at detecting paragraph layout and preserving structure.
from bs4 import BeautifulSoup
from pprint import pprint
from os import listdir
from os.path import isfile, join

mypath = "sup_del_htmls/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

counter = 0
for f in onlyfiles:
    print(counter)
    counter += 1  # progress counter over the input files
    with open("output_txt/" + f + ".txt", 'w', encoding='utf-8') as txtfile:
        with open(mypath + f, encoding='utf-8') as fp:
            soup = BeautifulSoup(fp, "html.parser")
            para_counter = 1
            for li in soup.select("li"):
                # only process top-level <li>; nested ones are handled with their parent
                if li.find_parent("li"):
                    continue
                full_para = ""
                for para in li.select("p"):
                    # unwrap <span> tags so get_text() returns contiguous text
                    for match in para.find_all('span'):
                        match.unwrap()
                    para_txt = para.get_text().replace("¶", "")
                    para_txt = para_txt.strip()
                    # keep a line break only after sentence-like endings
                    if para_txt.endswith((".", ":", ";", ",", '"', "'")):
                        full_para += para_txt + "\n"
                    else:
                        full_para += para_txt + " "
                txtfile.write(full_para)
                txtfile.write("\n" + "--sep--" + "\n")
                if li.find("table"):
                    tables = li.find_all("table")
                    for table in tables:
                        txtfile.write("--table--" + "\n")
                        txtfile.write(str(table) + "\n")
                        txtfile.write("--sep--" + "\n")
            # collect any trailing paragraphs that appear after the last list
            reversed_end = []
            for p in reversed(soup.select("p")):
                if p.find_parent('li') or p.find_parent('ol'):
                    break
                reversed_end.append(" ".join(p.text.split()))
            if reversed_end != []:
                for final_end in reversed(reversed_end):
                    txtfile.write(final_end + "\n")
                txtfile.write("--sep--" + "\n")
The Result : output.txt

For the numbering generated with ::before in CSS, you can try to extract the selector(s) for the numbered items with a function like this:
def getLctrSelectors(stsh):
    stsh = stsh.get_text() if stsh else ''
    # find the id selectors of lists whose "> li" rules use counter-increment
    ll_ids = list(set([
        l.replace('>li', '> li').split('> li')[0].strip()
        for l in stsh.splitlines() if l.strip()[:1] == '#'
        and '> li' in l.replace('>li', '> li') and
        'counter-increment' in l.split('{')[-1].split(':')[0]
    ]))
    for i, l in enumerate(ll_ids):
        sel = f'{l} > li > *:first-child'
        ll_ids[i] = (sel, 1)
        # look for a counter-reset rule on the same id to get the starting number
        crl = [
            ll for ll in stsh.splitlines() if ll.strip().startswith(l)
            and 'counter-reset' in ll.split('{')[-1].split(':')[-2:][0]
        ][:1]
        if not crl: continue
        crl = crl[0].split('{')[-1].split('counter-reset')[-1].split(':')[-1]
        crl = [w for w in crl.split(';')[0].split() if w.isdigit()]
        ll_ids[i] = (sel, int(crl[-1]) if crl else 1)
    return ll_ids
(It should take a style tag as input and return a list of selectors and starting counts - like [('#l1 > li > *:first-child', 3)] for your sample html.)
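If you want to sanity-check it in isolation, here is a small self-contained run against a made-up style fragment (the CSS below is only my guess at the kind of counter rules Acrobat emits, not copied from your actual export):
from bs4 import BeautifulSoup

# hypothetical style block imitating Acrobat-style CSS counters (assumption, for illustration only)
sample = """<html><head><style>
#l1 {counter-reset: d0 3;}
#l1 > li {counter-increment: d0;}
#l1 > li:before {content: counter(d0, decimal) ". ";}
</style></head><body>
<ol id="l1"><li><p>first numbered paragraph</p></li><li><p>second</p></li></ol>
</body></html>"""

soup = BeautifulSoup(sample, "html.parser")
print(getLctrSelectors(soup.select_one('style')))  # [('#l1 > li > *:first-child', 3)]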
You can use it in your code to insert the numbers into the text in the bs4 tree:
soup = BeautifulSoup(fp, "html.parser")
for sel, ctStart in getLctrSelectors(soup.select_one('style')):
    for i, lif in enumerate(soup.select(sel)):
        lif.insert(0, f'{i + ctStart}. ')
para_counter = 1
### REST OF CODE ###
I'm not sure I can help you with the paragraph and table issues... Are you sure the site uses the same PDFs as you have? (Or that they use PDFs at all, rather than something closer to the original/raw data?) Your PDF itself looked rather different from its corresponding page on the site.

Related

How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example, I want to omit the "1." and then the first "5" (as in the example below).
Additionally, I would like to add a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate the ResultSet and decompose() the unwanted tags:
e.select_one('.wld--tbd').decompose()
Extract the texts with stripped_strings and join() them into your expected string:
data.append(';'.join(e.stripped_strings))
Example
I also make some replacements based on a dict, just to demonstrate how this would work, since I don't know what R or P should become.
...
soup = BS2(page,'html.parser')

data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To only get result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
Output
This is based on the actual (current) data, so it may differ from the example in the question.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
You had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text; as you can see, that's not going to work. Here is a simple way to hack it and get what you want. I keep grabbing divs from the table div you found, then grab the text of the next eight divs after finding Ajax. Then I do some dirty string manipulation because the WWWWW is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content; you should put in a lot of error checking. I also replaced the sleep with a wait call and added webdriver_manager's ChromeDriverManager, which lets me use Selenium with Chrome.
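If you prefer an explicit wait over the implicit one, something like this should also work (a sketch, assuming the standings table keeps its ui-table__body class):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
# block until the table is actually present instead of sleeping a fixed time
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "ui-table__body")))
page = driver.page_source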

Parsing HTML and writing PDF to disk (python)

My goal was to write a script that downloads all the PDF files from a user-entered site.
Problem 1: the code does not return the anchor tags located inside the iframe. I tried explicitly using the iframe tag name and then using .contents, but the command returns an empty list.
Question 1: How do I parse the iframe? Why doesn't iframe.contents return its children, i.e. the <a> tags?
Problem 2: Writing the PDFs to disk appears successful, however when I attempt to open the files I get the following error:
"....could not open...because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email...and wasn't correctly decoded)."
Question 2: Has anybody encountered this before?
The code is split into two blocks, one for each problem; delete the set of quotes around a block to run it.
Lastly, if anyone can explain why the two URLs don't match in the first block of code, that would be awesome. The code is commented and contains the URLs for each question. Thanks!
PYTHON CODE
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
#initializing counters
slide = 1
count = 0
#ignore SSL cert errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#get user url and create soup object
url = input("Enter the website name: ")
connect = urllib.request.urlopen(url, context=ctx)
soup = BeautifulSoup(connect, 'html.parser')
######## code block for question 1 revolving around parsing iframes and the issues with the
######## mismatching urls
#url used for code block 1: https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html
"""
#trying to retrieve all anchor tags; doesn't print the anchor tags within the iframe
tags = soup('a')
for tag in tags:
    print(tag)
    print('\n')
#explictly asking for the iframe tag
iframe = soup.iframe
#the url printed on this line doesn't match the url printed once I get the src attribute
#navigating to the url listed here is what I use for the second block of code because it
#isn't an iframe
print(iframe)
iframe_src_url = iframe['src']
#this url doesn't match the one shown in the previous print statement and it leaves you dealing
#with another iframe
print(iframe_src_url)
"""
#########code block for question 2 where I enter the url found in the iframe src attribute
#url for block 2: https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false
"""
tags = soup('a')
#iterate through tags, retrieve href addresses, navigate to the document, write data to file
for tag in tags:
    doc_url = tag.get('href')
    file = urllib.request.urlopen(doc_url, context=ctx)
    file = open("Week " + str(slide) + " slides.pdf", 'wb')
    file.write(connect.read())
    file.close()
    print("Finished file: ", slide)
    count = count + 1
    slide = slide + 1
print("Total files downloaded: ", count)"""
import requests
from bs4 import BeautifulSoup

r = requests.get(
    'https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html')
soup = BeautifulSoup(r.content, 'html.parser')
for item in soup.findAll('iframe'):
    print(item.get('src'))
Output:
https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false
And regarding the second question:
import requests
from bs4 import BeautifulSoup

r = requests.get(
    'https://docs.google.com/spreadsheets/d/e/2PACX-1vRF408HaDlR6Q9fx6WF6YzeNrZIkXZBqwz_qyN8hz8N4rhIrcpc_GWNMrCODVmucMEUhXIElxcXyDpY/pubhtml?gid=0&single=true&widget=true&headers=false')
soup = BeautifulSoup(r.content, 'html.parser')

links = []
for item in soup.findAll('a', {'rel': 'noreferrer'}):
    links.append(item.get('href'))

for item in links:
    r = requests.get(item)
    source = r.headers.get('Location')
    print(f"Saving File {source[56:]}")
    r1 = requests.get(source)
    with open(f"{source[56:]}", 'wb') as f:
        f.write(r1.content)

print(f"\nTotal File Downloaded is {len(links)}")
Output (the files will be saved to your local disk):
Saving File 01-intro-logistics.pdf
Saving File 02-data.pdf
Saving File 03-preprocessing.pdf
Saving File 03-preprocessing.pdf
Saving File 04-frequent-patterns.pdf
Saving File 05a-supervised.pdf
Saving File 05b-supervised.pdf
Saving File 05c-supervised.pdf
Saving File 06a-supervised-advanced.pdf
Saving File 06b-supervised-advanced.pdf
Saving File 07a-unsupervised.pdf
Saving File 07b-unsupervised.pdf
Saving File 07c-advanced-unsupervised.pdf
Saving File 08-graph-mining.pdf
Saving File 09-anomaly-detection.pdf
Saving File 10-time-series.pdf
Total File Downloaded is 16
Full Version:
import requests
from bs4 import BeautifulSoup
import html

def Get_Links():
    links = set()
    r = requests.get(
        'https://www.cs.ucr.edu/~epapalex/teaching/235_F19/index.html')
    soup = BeautifulSoup(r.text, 'html.parser')
    source = html.escape(soup.find('iframe').get('src'))
    r = requests.get(source)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', {'rel': 'noreferrer'}):
        links.add(item.get('href'))
    return links, len(links)

def Save_Items():
    items, size = Get_Links()
    for item in items:
        r = requests.get(item)
        source = r.headers.get('Location')
        print(f"Saving File {source[56:]}")
        r = requests.get(source)
        with open(f"{source[56:]}", 'wb') as f:
            f.write(r.content)
    print(f"\nTotal File Downloaded is {size}")

Save_Items()

Turn a file name into a clickable list

I need to turn the name of each file in the folder into clickable text. As of now, the file name is on one line and the link is on another.
What is this called? Which keywords should I use to search for it?
html = '<html><body>'
subset = []
lastFile = None
for file in os.listdir():
    if file.endswith(".html"):
        subset.append(file)
for r in subset:
    if not lastFile:
        html += '<h3>%s</h3>' % r
        html += 'r' % r
You can just wrap the <h3> tag in an anchor tag. Using your code, do something like this:
import os

html = '<html><body>'
subset = []
lastFile = None
for file in os.listdir():
    if file.endswith(".html"):
        subset.append(file)
for r in subset:
    if not lastFile:
        html += '<a href="%s">' % r
        html += '<h3>%s</h3></a>' % r
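After the loop you would presumably still close the document and write it somewhere; a minimal sketch (assuming an index.html next to the listed files is what you want, which is my guess rather than something from your question):
html += '</body></html>'
with open('index.html', 'w') as out:
    out.write(html)  # write the generated page of clickable file names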

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4 and requests to scrape some information from the supremenewyork.com UK site. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping this information automatically, so they have decided to scramble the text, which I think makes it unusable.
My question: is there a way to get the text without using .text, and/or is there a way to get the script to read the text so that when it sees a special character like # it skips over it, or when it sees & it skips until it sees a ;?
Basically, this is how the website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they don't use letters to scramble, only numbers and special keys).
This is highlighted in a box automatically when you inspect the element using a VPN on the UK supreme website, and it is different from the visible text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine (but only because the code is not scrambled on my local site; I want to pull this info from the UK site). Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0

#use new proxy every so often for testing (will add something that pulls proxys and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1 + '',
    'https': 'https://' + UK_Proxy1 + '',
}

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + (' --- ') + name + (' --- ') + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So what I did was un-comment the commented-out lines above (and comment out the similar ones), and I get some kind of str error, so I tried print(str(name)). I am able to print the alt fine (with every script, the alt is not scrambled), but when it comes to printing the name and style, all that prints is None under every alt that is printed.
I have been working on fixing this for days and have come up with no solutions. Can anyone help me solve this?
I have since solved it myself using this solution:
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)
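On the "scrambling" itself: what you describe (numbers between & and ;) sounds like ordinary HTML numeric character references, which BeautifulSoup already decodes when you read .text. If you ever need to decode such a string by hand, the standard library can do it; the entity string below is a made-up sample for illustration, not copied from the site:
import html

encoded = "&#115;&#117;&#112;&#114;&#101;&#109;&#101; &#116;-&#115;&#104;&#105;&#114;&#116;"  # hypothetical sample
print(html.unescape(encoded))  # supreme t-shirt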

BeautifulSoup cannot find all <p> tags in html

I'm trying to pull article titles, text, and user comments from websites using BeautifulSoup. I have managed to get the first two, but I'm having problems pulling the user comments. This is the code I have right now:
def extract_text(url):
    url_to_extract = url
    html = urllib.urlopen(url_to_extract).read()
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    print 'Title and Publisher is:\n' + soup.title.string
    body_text = ''
    article = soup.findAll('p')
    for element in article:
        body_text += '\n' + ''.join(element.findAll(text=True))
    print body_text

def main():
    url_title = 'https://www.theguardian.com/politics/2016/oct/24/nicola-sturgeon-says-brexit-meeting-was-deeply-frustrating'
    extract_text(url_title)
I've checked the source code for the specific article used in the main method; the user comments are inside <p> tags, so they should be parsed by BeautifulSoup along with the article text, but they don't show up. I tried to print the whole BeautifulSoup content with
print soup
It doesn't show the div where the user comments are supposed to be. I've tried this on the BBC and Guardian websites so far. I would be glad for any kind of help here.
import requests
from bs4 import BeautifulSoup

def getArticle(url):
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c)
    article_text = ''
    article = soup.findAll('p')
    for element in article:
        article_text += '\n' + ''.join(element.findAll(text=True))
    return article_text
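As for why the user comments never show up in soup: on the Guardian and BBC pages the comment section is injected by JavaScript after the page loads (on the Guardian it lives in a separate discussion frame), so it simply is not in the HTML that urllib or requests receives. A rough workaround, assuming Selenium and Chrome are available, is to let a real browser render the page first, though you may still need to switch into the comments frame to reach the actual comment text:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.theguardian.com/politics/2016/oct/24/nicola-sturgeon-says-brexit-meeting-was-deeply-frustrating')
driver.implicitly_wait(10)  # give the page's scripts a chance to inject the comments
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.find_all('p')))  # compare against the count from the static HTML
driver.quit()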