I am writing a script to print the output to an html file. I am stuck on the format of my output. Below is my code:
def printTohtml(Alist):
myfile = open('zip_files.html', 'w')
html = """<html>
<head></head>
<body><p></p>{htmlText}</body>
</html>"""
title = "Study - User - zip file - Last date modified"
myfile.write(html.format(htmlText = title))
for newL in Alist:
for j in newL:
if j == newL[-1]:
myfile.write(html.format(htmlText=j))
else:
message = j + ', '
myfile.write(html.format(htmlText = message))
myfile.close()
Alist = [['123', 'user1', 'New Compressed (zipped) Folder.zip', '05-24-17'],
['123', 'user2', 'Iam.zip', '05-19-17'], ['abcd', 'Letsee.zip', '05-22-17'],
['Here', 'whichTwo.zip', '06-01-17']]
printTohtml(Alist)
I want my output to be like this:
Study - User - zip file - Last date modified
123, user1, New Compressed (zipped) Folder.zip, 05-24-17
123, user2, Iam.zip, 05-19-17
abcd, Letsee.zip, 05-22-17
Here, whichTwo.zip, 06-01-17
But my code is giving me everything on its own line. Can anyone please help me?
Thanks in advance for your help!
My Output:
Study - User - zip file - Last date modified
123,
user1,
New Compressed (zipped) Folder.zip,
05-24-17
123,
user2,
Iam.zip,
05-19-17
abcd,
Letsee.zip,
05-22-17
Here,
whichTwo.zip,
06-01-17
You might want to try something like that. I haven't tested but this will create the string first and then write it to the file. Might be faster for avoiding multiple writes but I am not sure how python is handling that on the background.
def printTohtml(Alist):
myfile = open('zip_files.html', 'w')
html = """<html>
<head></head>
<body><p></p>{htmlText}</body>
</html>"""
title = "Study - User - zip file - Last date modified"
Alist = [title] + [", ".join(line) for line in Alist]
myfile.write(html.format(htmlText = "\n".join(Alist)))
myfile.close()
Alist = [['123', 'user1', 'New Compressed (zipped) Folder.zip', '05-24-17'],
['123', 'user2', 'Iam.zip', '05-19-17'], ['abcd', 'Letsee.zip', '05-22-17'],
['Here', 'whichTwo.zip', '06-01-17']]
printTohtml(Alist)
Your issue is that you're including the html, body and paragraph tags every time you write a line to your file.
Why don't you concatenate your string, separating the lines with <br> tags, and then load them into your file, like so:
def printTohtml(Alist):
myfile = open('zip_files.html', 'w')
html = """<html>
<head></head>
<body><p>{htmlText}</p></body>
</html>"""
complete_string = "Study - User - zip file - Last date modified"
for newL in Alist:
for j in newL:
if j == newL[-1]:
complete_string += j + "<br>"
else:
message = j + ', '
complete_string += message + "<br>"
myfile.write(html.format(htmlText = complete_string))
myfile.close()
Also, your template placeholder is in the wrong spot, it should be between your paragraph tags.
Related
#client.command()
async def show(ctx, player, *args): # General stats
rs = requests.get(apiLink + "/checkban?name=" + str(player))
if rs.status_code == 200: # HTTP OK
rs = rs.json()
joined_array = ','.join({str(rs["otherNames"]['usedNames'])})
embed = discord.Embed(title="Other users for" + str(player),
description="""User is known as:
""" +joined_array)
await ctx.send(embed=embed)
My goal here is to have every username on different lines after each comma, and preferably without the [] at the start and end. I have tried adding
joined_array = ','.join({str(rs["otherNames"]['usedNames'])}) but the response from the bot is the same as shown in the image.
Any answer or tip/suggestion is appreciated!
Try this:
array = ['user1', 'user2', 'user3', 'user4', 'user5', 'user6'] #your list
new = ",\n".join(array)
print(new)
Output:
user1,
user2,
user3,
user4,
user5,
user6
In your case I think array should be replaced with rs["otherNames"]['usedNames']
I was able to take a text file, read each line, create a dictionary per line, update(append) each line and store the json file. The issue is when reading the json file it will not read correctly. the error point to a storing file issue?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re
path = "C:\\...\\data\\"
books = {}
books_json = {}
final_book_json ={}
file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close() # used to clean each test
json_create = []
i = 0
for line in json_list:
line = line.replace('#', '')
line = line.replace('.txt','')
line = line.replace('\n','')
line = line.split(';', 4)
BookNumber = line[0]
BookTitle = line[1]
AuthorName = line[-1]
file
if BookNumber == ' 2701':
BookNumber = line[0]
BookTitle1 = line[1]
BookTitle2 = line[2]
AuthorName = line[3]
BookTitle = BookTitle1 + ';' + BookTitle2 # needed to combine title into one to fit dict format
books = json.dumps( {'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
books_json = json.loads(books)
final_book_json.update(books_json)
with open(path + 'books\\books_json.json', 'a'
) as out_put:
json.dump(books_json, out_put)
with open(path + 'books\\books_json.json', 'r'
) as out_put:
'books\\books_json.json', 'r')]
print(json.load(out_put))
The reported error is: JSONDecodeError: Extra data: line 1 column 133
(char 132) - adding this is right between the first "}{". Not sure
how json should look in a flat-file format? The output file as seen on
an editor looks like: {"AuthorName": " Mary Wollstonecraft (Godwin)
Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the
Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": "
98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach and used pandas to read the text and then spliting the single-cell input.
books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names =('r','t', 'a') )
#print(books.head(10))
# Function to clean the 'raw(r)' inoput data
def clean_line(cell):
...
return cell
books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
I am creating a python script that can read scanned, and tabular .pdfs and extract some important data and insert it into a JSON to later be implemented into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is I have never worked with any JSON files before but that was the format I was recommended to output to. The scraping script works, the pre-processing could be a lot cleaner, but for now it works. The issue I run into is the keys, and values are in the same list, and some of the values because they had a decimal point are two different list items. Not really sure where to even start.
I don't really know where to start, I suppose since I know what the indexes of the list are I can easily assign keys and values, but then it may not be applicable to any .pdf, that is the script cannot be coded explicitly.
import PyPDF2 as pdf2
import textract
with "TestSpec.pdf" as filename:
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.pdfFileReader(pdfFileObj)
num_pages = pdfReader.numpages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(0)
count += 1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
'''
This function takes the byte data extracted from scanned PDFs, and cleans it of all
unnessary data.
Requires re
'''
stringedText = str(x)
cleanText = stringedText.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
return clean
cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
here is a screenshot of the data from my sample pdf
So, i have figured out some of this. I am still having issues with grabbing the last 3rd of the data i need without explicitly programming it in. but here is what i have so far. Once i have everything working then i will worry about optimizing it and condensing.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json
# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json",'w') as file:
primerDataSlice = clean[clean.index("molecular"): -1]
primerData = re.split(": |\n", primerDataSlice)
primerKeys = primerData[0::2]
primerValues = primerData[1::2]
primerDict = {"Primer Data": dict(zip(primerKeys,primerValues))}
# Generatring the JSON array "Primer Data"
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall('(\d{2}[\/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[\/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code. A minimal working example with input would help. As for JSON handling, python dictionaries can dump to json easily. See examples here.
https://docs.python-guide.org/scenarios/json/
Get a json string from a dictionary and write to a file. Figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
So, I did figure this out, the problem was really just that because of the way my pre-processing was pulling all the data into a single list wasn't really that great of an idea considering that the keys for the dictionary never changed.
Here is the semi-finished result for making the Dictionary and JSON file.
# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") +
10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])
def gc_content(x):
count = 0
for i in x:
if i == 'G' or i == 'C':
count += 1
else:
count = count
return round((count / bases) * 100, 1)
gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") +
10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {"Primer Data": {
"Sequence": sequence,
"Bases": bases,
"TM (50mM NaCl)": tm,
"% GC content": gc,
"Molecular weight": moleWeight,
"ug/0D260": dilWeight,
"Dilution volume (uL)": dilution
},
"Shipment Info": {
"Ref. No.": refNo,
"Order No.": orderNo,
"Ordered by": ordered,
"Date of Order": dateOrder,
"Received By": received,
"Date Received": str(dateReceived.strftime("%d-%b-%Y"))
}}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)
I am writing a script to print the output to an html file. I am stuck on the format of my output. Below is my code:
def printTohtml(Alist):
myfile = open('zip_files.html', 'w')
html = """<html>
<head></head>
<body><p></p>{htmlText}</body>
</html>"""
title = "Study - User - zip file - Last date modified"
myfile.write(html.format(htmlText = title))
for newL in Alist:
for j in newL:
if j == newL[-1]:
myfile.write(html.format(htmlText=j))
else:
message = j + ', '
myfile.write(html.format(htmlText = message))
myfile.close()
Alist = [['123', 'user1', 'New Compressed (zipped) Folder.zip', '05-24-17'],
['123', 'user2', 'Iam.zip', '05-19-17'], ['abcd', 'Letsee.zip', '05-22-17'],
['Here', 'whichTwo.zip', '06-01-17']]
printTohtml(Alist)
I want my output to be like this:
Study - User - zip file - Last date modified
123, user1, New Compressed (zipped) Folder.zip, 05-24-17
123, user2, Iam.zip, 05-19-17
abcd, Letsee.zip, 05-22-17
Here, whichTwo.zip, 06-01-17
But my code is giving me everything on its own line. Can anyone please help me?
Thanks in advance for your help!
My Output:
Study - User - zip file - Last date modified
123,
user1,
New Compressed (zipped) Folder.zip,
05-24-17
123,
user2,
Iam.zip,
05-19-17
abcd,
Letsee.zip,
05-22-17
Here,
whichTwo.zip,
06-01-17
Try this instead:
for newL in Alist:
for j in newL:
if j == newL[-1]:
myfile.write(html.format(htmlText=j))
else:
message = message + ', ' + j
myfile.write(html.format(htmlText = message))
For each element in newL, you need to store them in 'message'. Once each newL is completely read, write the 'message' into the myfile.
With each iteration, your loop is creating new <head></head><body><p></p></body></html><html>.
Also, if you want to remove space between paragraphs, use <style>p { margin: 0, !important; } </style>.
You could try this function which gives the HTML output you wanted.
def printTohtml(Alist, htmlfile):
html = "<html>\n<head></head>\n<style>p { margin: 0 !important; }</style>\n<body>\n"
title = "Study - User - zip file - Last date modified"
html += '\n<p>' + title + '</p>\n'
for line in Alist:
para = '<p>' + ', '.join(line) + '</p>\n'
html += para
with open(htmlfile, 'w') as f:
f.write(html + "\n</body>\n</html>")
Alist = [['123', 'user1', 'New Compressed (zipped) Folder.zip', '05-24-17'],
['123', 'user2', 'Iam.zip', '05-19-17'], ['abcd', 'Letsee.zip', '05-22-17'],
['Here', 'whichTwo.zip', '06-01-17']]
printTohtml(Alist, 'zip_files.html')
The markup written to zip_files.html is:
<html>
<head></head>
<style>p { margin: 0 !important; }</style>
<body>
<p>Study - User - zip file - Last date modified</p>
<p>123, user1, New Compressed (zipped) Folder.zip, 05-24-17</p>
<p>123, user2, Iam.zip, 05-19-17</p>
<p>abcd, Letsee.zip, 05-22-17</p>
<p>Here, whichTwo.zip, 06-01-17</p>
</body>
</html>
The page displays the following output:
Study - User - zip file - Last date modified
123, user1, New Compressed (zipped) Folder.zip, 05-24-17
123, user2, Iam.zip, 05-19-17
abcd, Letsee.zip, 05-22-17
Here, whichTwo.zip, 06-01-17
I have a function takes a file as input and prints certain statistics and also copies the file into a file name provided by the user. Here is my current code:
def copy_file(option):
infile_name = input("Please enter the name of the file to copy: ")
infile = open(infile_name, 'r')
outfile_name = input("Please enter the name of the new copy: ")
outfile = open(outfile_name, 'w')
slist = infile.readlines()
if option == 'statistics':
for line in infile:
outfile.write(line)
infile.close()
outfile.close()
result = []
blank_count = slist.count('\n')
for item in slist:
result.append(len(item))
print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
It prints the statistics of the file correctly, however the copy it makes of the file is empty. If I remove the readline() part and the statistics part, the function seems to make a copy of the file correctly. How can I correct my code so that it does both. It's a minor problem but I can't seem to get it.
The reason the file is blank is that
slist = infile.readlines()
is reading the entire contents of the file, so when it gets to
for line in infile:
there is nothing left to read and it just closes the newly truncated (mode w) file leaving you with a blank file.
I think the answer here is to change your for line in infile: to for line in slist:
def copy_file(option):
infile_name= input("Please enter the name of the file to copy: ")
infile = open(infile_name, 'r')
outfile_name = input("Please enter the name of the new copy: ")
outfile = open(outfile_name, 'w')
slist = infile.readlines()
if option == 'statistics':
for line in slist:
outfile.write(line)
infile.close()
outfile.close()
result = []
blank_count = slist.count('\n')
for item in slist:
result.append(len(item))
print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
Having said all that, consider if it's worth using your own copy routine rather than shutil.copy - Always better to delegate the task to your OS as it will be quicker and probably safer (thanks to NightShadeQueen for the reminder)!