Extracting text from plain HTML and writing to a new file - html

I'm extracting a certain part of an HTML document (to be precise: the basis is an iXBRL document, which means there is a lot of formatting code inside) and writing my output, the original file without the extracted part, to a .txt file. My aim is to measure the difference in document size (how many KB of the original document the extracted part accounts for). As far as I know there shouldn't be any difference between HTML and text format, so the difference should be reliable even though I am comparing two different document formats. My code so far is:
import glob
import os
import contextlib
import re

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def extractor():
    os.chdir(r"F:\Test")
    with stdout2file("FileShortened.txt"):
        for file in glob.iglob('*.html', recursive=True):
            with open(file) as f:
                contents = f.read()
                extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
                cut = extract.sub('', contents)
                print(file.split(os.path.sep)[-1], end="| ")
                print(cut, end="\n")

extractor()
Note: I am NOT using BS4 or lxml because I am not only interested in the HTML text but in ALL lines between my start and end regex, including all formatting code lines.
My code is working without problems, but as I have a lot of files, my FileShortened.txt document quickly becomes massive. My problem is not with the extraction itself but with redirecting my output to separate txt files. At the moment everything goes into one file; what I would need is some kind of "for each file processed, create a new txt file with the same name as the original document" behaviour (arcpy module?!).
Something like:
File1.html --> File1Short.txt
File2.html --> File2Short.txt
...
Is there also an easy way (without changing my code too much) to invert the logic, i.e. print the regex match to a new .txt file instead of everything except the match?
Any help appreciated!

Ok, I figured it out.
Final Code is:
import glob
import os
import re

def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html files
    for file in glob.glob("*.html"):  # iterate over all files in the directory ending in .html
        with open(file) as f, open(file.rsplit(".", 1)[0] + ".txt", "w") as out:
            contents = f.read()
            extract = re.compile(r'Start.*?End', re.I | re.S)
            cut = extract.sub('', contents)
            out.write(cut)  # the with statement closes both files automatically

extractor()
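For the second part of the question (keeping only the matched block instead of removing it), a minimal sketch could use re.findall and write the matches to a separate file. The Start.*?End pattern and the "Match.txt" suffix below are just placeholders, not part of the original code:

import glob
import os
import re

def match_extractor():
    os.chdir(r"F:\Test")
    pattern = re.compile(r'Start.*?End', re.I | re.S)
    for file in glob.glob("*.html"):
        with open(file) as f, open(file.rsplit(".", 1)[0] + "Match.txt", "w") as out:
            contents = f.read()
            # write every matched block instead of the rest of the document
            for match in pattern.findall(contents):
                out.write(match + "\n")

match_extractor()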

Related

Python: Creating PDF from PNG images and CSV tables using reportlab

I am trying to create a PDF document from a series of PNG images and a series of CSV tables using the python package reportlab. The tables are giving me a little bit of grief.
This is my code so far:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import *
from reportlab.platypus.tables import Table
from PIL import Image
from matplotlib.backends.backend_pdf import PdfPages

# Set the path to the folder containing the images and tables
folder_path = 'Files'

# Create a new PDF document
pdf_filename = 'testassessment.pdf'
canvas = Canvas(pdf_filename)

# Iterate through the files in the folder
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    # If the file is an image, draw it on the PDF
    if file.endswith('.png'):
        canvas.drawImage(file_path, 105, 148.5, width=450, height=400)
        canvas.showPage()  # ends page
    # If the file is a table, draw it on the PDF
    elif file.endswith('.csv'):
        df = pd.read_csv(file_path)
        table = df.to_html()
        canvas.drawString(10, 10, table)
        canvas.showPage()

# Save the PDF
canvas.save()
The tables are not working. When I use .drawString the raw HTML string is drawn as a single line of text rather than a formatted table.
Does anyone know how I can get the table to be properly inserted into the PDF?
According to the reportlab docs, page 14, "The draw string methods draw single lines of text on the canvas." You might want to have a look at "The text object methods" on the same page.
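If you want to stay with reportlab, a platypus Table flowable is the usual way to lay out tabular data. A minimal sketch, where the CSV path and output file name are just placeholders (not from the question), might look like this:

import pandas as pd
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

df = pd.read_csv('example.csv')  # placeholder CSV path

# Convert the DataFrame to a list of rows, with the header as the first row
data = [df.columns.tolist()] + df.values.tolist()

table = Table(data)
table.setStyle(TableStyle([
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),       # draw cell borders
    ('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),  # shade the header row
]))

# SimpleDocTemplate handles page layout for flowables automatically
doc = SimpleDocTemplate('table_example.pdf', pagesize=letter)
doc.build([table])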
You might want to consider using PyMuPDF with Stories it allows for more flexibility of layout from a data input. For an example of something very similar to what you are trying to achieve see: https://pymupdf.readthedocs.io/en/latest/recipes-stories.html#how-to-display-a-list-from-json-data

Writing a utils.py function to zip up a csv in Code Repo

I had (maybe two years ago) written a tool to zip up several csv files written out to disk in a Code Repo from a dataframe, for someone working in a platform that works best with a zipped csv file they can download and work with elsewhere (more user friendly for some).
I can't remember if I got it to work at the time, but here's my recent stab at it (and yes, before you ask, I'm aware there's an easy way to gzip a file using the df.write_dataframe() option... I'm doing this to have more control over the name of the zip, and this is for a Windows user with a locked-down system and a select set of tools).
Here's what I've got so far in utils.py:
import tempfile
import zipfile

def zipit(source_df, out, zipfile_name, internal_prefix, fileSuffix):
    file_list = list(source_df.filesystem().ls())
    fs = source_df.filesystem()
    zipf = zipfile.ZipFile("zipfile.zip", 'w', zipfile.ZIP_DEFLATED)
    for files in file_list:
        temp = tempfile.NamedTemporaryFile(prefix=internal_prefix, suffix=fileSuffix)
        with fs.open(files.path, 'rb') as f:
            w = open(temp.name, 'wb')
            w.write(f.read())
            w.close()
        zipf.write(temp.name)
    zipf.close()
    with open("zipfile.zip", 'rb') as f:
        with out.filesystem().open(zipfile_name, 'wb') as w:
            w.write(f.read())
My issue is that this zips up the .csv file(s) and I can name the zip, but inside the archive the files sit under a long, crazy series of temp folder names, and the .csv crashes when you try to open it.
I'm sure I can figure this out, but I feel like I'm making this far more complicated than it needs to be and would appreciate the community wisdom on this.
The other (much less important) problem is that the file itself has a prefix and a suffix, which is nice, but it'd be nicer if I could just name the whole file instead of getting the temp file's random characters in the middle of the name.
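One likely culprit: zipfile.ZipFile.write() stores the full path you pass it unless you give an explicit arcname, which would explain the temp-folder hierarchy inside the archive. A minimal sketch of the same loop that names entries explicitly and skips the temp-file detour via writestr (assuming the same Foundry-style filesystem API as in the code above; the naming scheme is just an illustration):

import os
import tempfile
import zipfile

def zipit(source_df, out, zipfile_name, internal_prefix, file_suffix):
    fs = source_df.filesystem()
    file_list = list(fs.ls())
    with tempfile.TemporaryDirectory() as tmpdir:
        zip_path = os.path.join(tmpdir, "bundle.zip")
        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for i, entry in enumerate(file_list):
                # choose the name the file should have *inside* the zip
                arcname = f"{internal_prefix}_{i}{file_suffix}"
                with fs.open(entry.path, 'rb') as f:
                    # writestr writes the bytes under arcname, no temp files needed
                    zipf.writestr(arcname, f.read())
        # copy the finished zip into the output dataset
        with open(zip_path, 'rb') as f, out.filesystem().open(zipfile_name, 'wb') as w:
            w.write(f.read())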

Scrape for Absolute URL with html.parse and remove duplicates

I am trying to make sure that relative links are saved as absolute links in this CSV (URL parsing). I am also trying to remove duplicates, which is why I created the variable "ddupe".
I keep getting all the relative URLs saved when I open the csv on the desktop.
Can someone please help me figure this out? I thought about calling set() as on this page: How do you remove duplicates from a list whilst preserving order?
# Importing the requests library to make HTTP requests
# Importing the bs4 library to extract / parse html and xml files
# Utilize urlparse to change relative URLs to absolute URLs
# Import csv (built-in package) to read / write to Microsoft Excel
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
import csv

# create the page variable
# associate page to request to obtain the information from raw_html
# store the html information in a text
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
parsed = urlparse(page)
raw_html = page.text  # declare the raw_html variable
soup = BeautifulSoup(raw_html, 'html.parser')  # parse the html

# remove duplicate htmls
ddupe = open('page.text', 'r').readlines()
ddupe_set = set(ddupe)
out = open('page.text', 'w')
for ddupe in ddupe_set:
    out.write(ddupe)

T = [["US Census Bureau Links"]]  # Title

# Finds all the links
links = map(lambda link: link['href'], soup.find_all('a', href=True))

with open("US_Census_Bureau_links.csv", "w", newline="") as f:
    cw = csv.writer(f, quoting=csv.QUOTE_ALL)  # Create a file handle for csv writer
    cw.writerows(T)  # Creates the Title
    for link in links:  # Writes the links to the csv
        cw.writerow([link])
f.close()  # closes the file
The function you're looking for is urljoin, not urlparse (both from the same package urllib.parse). It should be used somewhere after this line:
links = map(lambda link: link['href'], soup.find_all('a', href=True))
Use a list comprehension, or map + lambda like you did here, to join the relative URLs with the base URL (see the sketch below).
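A minimal sketch of that idea, which also drops duplicates while preserving order; the base URL here is just the page requested in the code above:

from urllib.parse import urljoin

base_url = 'https://www.census.gov/programs-surveys/popest.html'
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

seen = set()
absolute_links = []
for href in hrefs:
    absolute = urljoin(base_url, href)  # relative hrefs become absolute, absolute ones pass through
    if absolute not in seen:            # keep only the first occurrence, preserving order
        seen.add(absolute)
        absolute_links.append(absolute)

# absolute_links can now be written to the CSV exactly as before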

Batch with docx

I am trying to write a few lines to look for a string in the paragraphs of several docx files in a single folder. I have managed to open the docx files in the folder one by one, but not yet to find and print the paragraph containing a specific string; any hint is highly appreciated.
import docx
import glob
from docx import Document

for document in glob.iglob("*.docx"):
    document = Document()
    for paragraph in document.paragraphs:
        if 'String' in paragraph.text:
            print(paragraph.text)
        else:
            print('not found')
I think you're confusing a filename with a python-docx Document object.
What you need is something like this:
import glob
from docx import Document

for filename in glob.iglob('*.docx'):
    document = Document(filename)
    for paragraph in document.paragraphs:
        if 'String' in paragraph.text:
            print(paragraph.text)
        else:
            print('not found')
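As a small variation (not part of the original answer), you could report the file name with each hit and only print one line per file when nothing matches, which is usually less noisy when batching over a folder; the search term below is just the placeholder from the question:

import glob
from docx import Document

search_term = 'String'  # placeholder search string

for filename in glob.iglob('*.docx'):
    document = Document(filename)
    matches = [p.text for p in document.paragraphs if search_term in p.text]
    for text in matches:
        print(filename, '|', text)
    if not matches:
        print(filename, '| no match found')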

What is the proper method for reading and writing HTML/XML (byte string) with Python and lxml and etree?

EDIT: Now that the problem is solved, I realize that it had more to do with properly reading/writing byte-strings, rather than HTML. Hopefully, that will make it easier for someone else to find this answer.
I have an HTML file that's poorly formatted. I want to use a Python lib to just make it tidy.
It seems like it should be as simple as the following:
import sys
from lxml import etree, html

# read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    file_text = ''.join(file.readlines())

# format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)

# write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    # write the pretty XML to a file
    file.write(document)
But this chunk of code complains that file_lines is not a string or bytes-like object. Okay, it makes sense that the function can't take a list, I suppose.
But then, it's 'bytes' not a string. No problem, str(document)
But then I get HTML that's full of '\n' sequences that are not newlines... they're a literal backslash followed by an 'n'. And there are no actual line breaks in the result; it's just one long line.
I've tried a number of other weird things like specifying the encoding, trying to decode, etc. None of which produce the desired result.
What's the right way to read and write this kind of (is non-ASCII the right term?) text?
You are missing that etree's tostring method returns bytes, and you need to take that into account when writing (a bytestring) to a file. Use the 'b' mode flag in the open call like this and forget about the str() conversion:
with open('Pretty.html', 'wb') as file:
    # write the pretty XML to a file
    file.write(document)
Addendum
Even though this answer solves the immediate problem at hand and teaches about bytestrings, the solution by Padraic Cunningham is the cleaner and faster way to write lxml etrees to a file.
This can all be done using lxml in a couple of lines of code without ever needing to use open; the .write method is exactly for what you are trying to do:
# parse using the file name, which is also the recommended way
tree = html.parse("C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html")
# call write on the tree
tree.write("C:/Users/mhurley/Portable_Python/notebooks/Pretty.html", pretty_print=True, encoding="utf-8")
Also, file_text = ''.join(file.readlines()) is exactly the same as file_text = file.read().