How can I load a single text file as a corpus using the PlaintextCorpusReader module - nltk

I can only do something like this:
`from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()`
If I have just a single file as my corpus, is there an efficient way to select that file directly, rather than this method, which is meant for a corpus of many text files?
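One possible approach, offered as a sketch rather than a confirmed answer: the second argument of PlaintextCorpusReader accepts either a regular expression or an explicit list of file IDs, so a one-element list selects a single file. The temporary directory and the file name my_corpus.txt below are made up for illustration.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical corpus: one text file in a throwaway directory.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "my_corpus.txt"), "w", encoding="utf8") as f:
    f.write("Hello world. This is a one-file corpus.")

# Passing a list instead of a pattern such as '.*' names the file directly.
corpus = PlaintextCorpusReader(corpus_root, ["my_corpus.txt"])

file_ids = corpus.fileids()
raw_text = corpus.raw()        # whole file as one string
words = list(corpus.words())   # tokens from the default word tokenizer
print(file_ids)
print(words[:3])
```

Only raw() and words() are used here because sents() additionally needs the Punkt sentence tokenizer data to be downloaded.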

Related

Python: Creating PDF from PNG images and CSV tables using reportlab

I am trying to create a PDF document from a series of PNG images and a series of CSV tables using the Python package reportlab. The tables are giving me a little bit of grief.
This is my code so far:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import *
from reportlab.platypus.tables import Table
from PIL import Image
from matplotlib.backends.backend_pdf import PdfPages
# Set the path to the folder containing the images and tables
folder_path = 'Files'
# Create a new PDF document
pdf_filename = 'testassessment.pdf'
canvas = Canvas(pdf_filename)
# Iterate through the files in the folder
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    # If the file is an image, draw it on the PDF
    if file.endswith('.png'):
        canvas.drawImage(file_path, 105, 148.5, width=450, height=400)
        canvas.showPage() #ends page
    # If the file is a table, draw it on the PDF
    elif file.endswith('.csv'):
        df = pd.read_csv(file_path)
        table = df.to_html()
        canvas.drawString(10, 10, table)
        canvas.showPage()
# Save the PDF
canvas.save()
The tables are not working. When I use .drawString it ends up looking like this:
Does anyone know how I can get the table to be properly inserted into the PDF?
According to the reportlab docs, page 14, "The draw string methods draw single lines of text on the canvas.". You might want to have a look at "The text object methods" on the same page.
You might want to consider using PyMuPDF with Stories, which allows more flexible layout from input data. For an example very similar to what you are trying to achieve, see: https://pymupdf.readthedocs.io/en/latest/recipes-stories.html#how-to-display-a-list-from-json-data

Splitting sentences from a .txt file to .csv using NLTK

I have a corpus of newspaper articles in a .txt file, and I'm trying to split the sentences from it to a .csv in order to annotate each sentence.
I was told to use NLTK for this purpose, and I found the following code for sentence splitting:
import nltk
from nltk.tokenize import sent_tokenize
sent_tokenize("Here is my first sentence. And that's a second one.")
However, I'm wondering:
How does one use a .txt file as an input for the tokenizer (so that I don't have to just copy and paste everything), and
How does one output a .csv file instead of just printing the sentences in my terminal.
Reading a .txt file & tokenizing its sentences
Assuming the .txt file is located in the same folder as your Python script, you can read a .txt file and tokenize the sentences using NLTK as shown below:
from nltk import sent_tokenize
with open("myfile.txt") as file:
    textFile = file.read()
tokenTextList = sent_tokenize(textFile)
print(tokenTextList)
# Output: ['Here is my first sentence.', "And that's a second one."]
Writing a list of sentence tokens to .csv file
There are a number of options for writing a .csv file. Pick whichever is more convenient (e.g. if you already have pandas loaded, use the pandas option).
To write a .csv file using the pandas module:
import pandas as pd
df = pd.DataFrame(tokenTextList)
df.to_csv("myCSVfile.csv", index=False, header=False)
To write a .csv file using the numpy module:
import numpy as np
np.savetxt("myCSVfile.csv", tokenTextList, delimiter=",", fmt="%s")
To write a .csv file using the csv module:
import csv
with open('myCSVfile.csv', 'w', newline='') as file:
    write = csv.writer(file, lineterminator='\n')
    # write.writerows([tokenTextList])
    write.writerows([[token] for token in tokenTextList]) # For pandas style output
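Sentences often contain commas and quotation marks, so it can help to check that each one still lands in a single column. The sketch below uses only the csv module; the function names, the sample sentences, and the temporary file path are made up for illustration and stand in for the sent_tokenize output.

```python
import csv
import os
import tempfile

def write_sentences(sentences, csv_path):
    """Write one sentence per row; csv.writer quotes fields that need it."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows([s] for s in sentences)

def read_sentences(csv_path):
    """Read the single-column file back into a list of sentences."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f)]

# Made-up sample data standing in for the tokenizer output.
sample = ["Here is my first sentence.", 'She said "no", then left.']
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")
write_sentences(sample, path)
round_trip = read_sentences(path)
print(round_trip == sample)
```

Reading the file back with csv.reader is a quick way to confirm that a sentence with an embedded comma was not split into two fields.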

Batch with docx

I am trying to write a few lines to look for a string in the paragraphs of several docx files in a single folder. I have managed to open the docx files in the folder one by one, but not yet to find and print the paragraph containing a specific string. Any hint is highly appreciated.
import docx
import glob
from docx import Document
for document in glob.iglob("*.docx"):
    document=Document()
    for paragraph in document.paragraphs:
        if 'String' in paragraph.text:
            print paragraph.text
        else:
            print ('not found')
I think you're confusing a filename with a python-docx Document object.
What you need is something like this:
import glob
from docx import Document
for filename in glob.iglob('*.docx'):
    # Pass the filename so the existing file is opened, not a blank document
    document = Document(filename)
    for paragraph in document.paragraphs:
        if 'String' in paragraph.text:
            print(paragraph.text)
        else:
            print('not found')

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark suggest, this works only for small dataframes that fit into the driver's memory. It is suitable for real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv
def spark_to_csv(df, file_path):
    """ Converts spark dataframe to CSV file """
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()  # header row built from df.columns
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more databricks'y' solution is here:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
# Pick the part file by name rather than by its position in the listing
part_name = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
temporary_csv = TEMPORARY_TARGET + "/" + part_name
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note if you are working from Koalas data frame you can replace spark_df with koalas_df.to_spark()
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
The Spark DataFrame API has no method that writes a single file instead of a directory as the result of a write operation.
Both options below create one data file inside the directory, along with the standard files (_SUCCESS, _committed, _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need to write a function in the driver that combines all the data file parts into a single file (cat part-00000* > singlefilename) once the write operation is done.
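A minimal sketch of such a combining function, assuming the part files sit on a filesystem the driver can read with ordinary Python file operations and that each part starts with the same one-line header. The function name merge_part_files and the simulated output directory are made up for illustration.

```python
import glob
import os
import tempfile

def merge_part_files(part_dir, target_path, has_header=True):
    """Concatenate part-* CSV files into one file, keeping a single header."""
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(target_path, "w", encoding="utf-8", newline="") as out:
        for index, part_path in enumerate(part_paths):
            with open(part_path, encoding="utf-8", newline="") as part:
                if has_header and index > 0:
                    part.readline()  # skip the repeated header line
                for line in part:
                    out.write(line)
    return len(part_paths)

# Simulated Spark output directory with two part files (made-up data).
part_dir = tempfile.mkdtemp()
parts = [
    ("part-00000-a.csv", "id,name\n1,x\n"),
    ("part-00001-b.csv", "id,name\n2,y\n"),
]
for name, body in parts:
    with open(os.path.join(part_dir, name), "w", encoding="utf-8", newline="") as f:
        f.write(body)
open(os.path.join(part_dir, "_SUCCESS"), "w").close()  # marker file, ignored

target = os.path.join(tempfile.mkdtemp(), "name.csv")
merged_count = merge_part_files(part_dir, target)
with open(target, encoding="utf-8", newline="") as f:
    merged_text = f.read()
print(merged_text)
```

The header skip assumes the header occupies exactly one line; a header with embedded newlines would need a real CSV parser.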
I had the same problem and used NamedTemporaryFile from Python's tempfile module to solve this.
import boto3
from tempfile import NamedTemporaryFile
# S3_BUCKET and S3_FOLDER are assumed to be defined elsewhere
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create a temp folder inside the output folder. Copy the part-00000* file to the output folder under the desired file name. Delete the temp folder. Here is a Python code snippet that does this in Databricks.
fpath=output+'/'+'temp'
def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise
if file_exists(fpath):
    dbutils.fs.rm(fpath)
    df.coalesce(1).write.option("header", "true").csv(fpath)
else:
    df.coalesce(1).write.option("header", "true").csv(fpath)
fname=([x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')])
dbutils.fs.cp(fpath+"/"+fname[0], output+"/"+"name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides a file pointer for the HDFS file system. You can write your content to that file pointer just as you would write to an ordinary file. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialise pyarrow's HDFS support, the CLASSPATH environment variable must be defined properly: set it to the output of hadoop classpath --glob.
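The CLASSPATH step can also be done from Python before the file system object is created. This is only a sketch: the helper name set_hadoop_classpath is made up, and it assumes the hadoop executable is on the PATH of the machine running the script.

```python
import os
import subprocess

def set_hadoop_classpath(command=("hadoop", "classpath", "--glob")):
    """Run the given command and store its output in the CLASSPATH variable."""
    output = subprocess.check_output(list(command), text=True)
    os.environ["CLASSPATH"] = output.strip()
    return os.environ["CLASSPATH"]

# Typical use (requires a Hadoop installation, so it is left commented out):
# set_hadoop_classpath()
# hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
```

The command is a parameter only so that a different path to the hadoop binary can be supplied if it is not on the PATH.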
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this. If you don't want to give the CSV name every time, you can write a UDF or create an array of CSV file names and pass it in.

Extracting text from plain HTML and write to new file

I'm extracting a certain part of a HTML document (to be fair: basis for this is an iXBRL document which means I do have a lot of written formatting code inside) and write my output, the original file without the extracted part, to a .txt file. My aim is to measure the difference in document size (how much KB of the original document refers to the extracted part). As far as I know there shouldn't be any difference in HTML to text format, so my difference should be reliable although I am comparing two different document formats. My code so far is:
import glob
import os
import contextlib
import re
@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()
def extractor():
    os.chdir(r"F:\Test")
    with stdout2file("FileShortened.txt"):
        for file in glob.iglob('*.html', recursive=True):
            with open(file) as f:
                contents = f.read()
            extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
            cut = extract.sub('', contents)
            print(file.split(os.path.sep)[-1], end="| ")
            print(cut, end="\n")
extractor()
Note: I am NOT using BS4 or lxml because I am not only interested in HTML text but actually in ALL lines between my start and end-RegEx incl. all formatting code lines.
My code is working without problems. However, as I have a lot of files, my FileShortened.txt document is quickly going to be massive in size. My problem is not with the file or the extraction, but with redirecting my output to separate .txt files. For now, I am getting everything into one file; what I would need is some kind of a "for each file searched, create new txt-file with the same name as the original document" condition (arcpy module?!)?
Something like:
File1.html --> File1Short.txt
File2.html --> File2Short.txt
...
Is there an easy way (without changing my code too much) to invert my code in the sense of printing the "RegEx Match" to a new .txt file instead of "everything except my RegEx match"?
Any help appreciated!
Ok, I figured it out.
Final Code is:
import glob
import os
import re
def extractor():
    os.chdir(r"F:\Test") # the directory containing my html
    for file in glob.glob("*.html"): # iterates over all files in the directory ending in .html
        with open(file) as f, open((file.rsplit(".", 1)[0]) + ".txt", "w") as out:
            contents = f.read()
            extract = re.compile(r'Start.*?End', re.I | re.S)
            cut = extract.sub('', contents)
            out.write(cut) # the with block closes both files
extractor()
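For the inverted case raised in the question (keeping the RegEx match instead of everything except it), a possible sketch is to collect the matches alongside the remainder. The helper name split_on_pattern and the sample string are made up, and the byte counts assume UTF-8 encoding.

```python
import re

def split_on_pattern(contents, pattern):
    """Return (matched_text, remainder) for a compiled pattern."""
    matched = "".join(m.group(0) for m in pattern.finditer(contents))
    remainder = pattern.sub("", contents)
    return matched, remainder

extract = re.compile(r"Start.*?End", re.I | re.S)
sample = "<p>intro</p>Start <b>kept\nformatting</b> End<p>outro</p>"
matched, remainder = split_on_pattern(sample, extract)

# Sizes in bytes, assuming the files are UTF-8 encoded.
matched_size = len(matched.encode("utf-8"))
remainder_size = len(remainder.encode("utf-8"))
total_size = len(sample.encode("utf-8"))
print(matched_size, remainder_size, total_size)
```

Writing matched rather than the remainder to the output file would give the inverted result, and the two sizes add up to the size of the original text.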