output data not printing quite as planned - csv

I have a piece of code that takes an entered number, searches a CSV file, and outputs the matching line. However, the entered number is also printed. What would I need to modify in my code to remove this?
def DoASearch():
    self.outputQty.delete(1.0, 'end')
    self.outputDesc.delete(1.0, 'end')
    self.dwgoutputbox.configure(state="normal")
    self.dwgoutputbox.delete(1.0, 'end')
    self.dwgoutputbox.configure(state="disabled")
    try:
        print(int(sonumber.get()))
    except ValueError:
        messagebox.showwarning("OPPS !!", "Please enter a valid Shop Order number.")
    with open("lesspreadsheettest.csv") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            result=(row['Shop Order'])
            if sonumber.get() == result:
                print(row['Part Number'])
                print(row['Description'])
                print(row['Quantity'])
                print(row['Drawings'])
                print(row['Issue'])
self.searchbutton = ttk.Button(rootWindow, text="Search", command=DoASearch)
self.searchbutton.grid(row=1, column=7, sticky=W, padx=3, pady=3)
Modified, and it works nicely with the code below. There are a few extra repeated fields that have been added, which can be ignored.
def checkcsv():
    with open("lesspreadsheettest.csv") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            result=(row['Shop Order'])
            if sonumber.get() == result:
                descQty=(row['Quantity'])
                descInfo=(row['Description'])
                descPN=(row['Part Number'])
                descDwg1=(row['Drawings1'])
                descIss1=(row['Issue1'])
                descDwg2=(row['Drawings2'])
                descIss2=(row['Issue2'])
                descDwg3=(row['Drawings3'])
                descIss3=(row['Issue3'])
                self.outputQty.insert(INSERT,descQty)
                self.outputDesc.insert(INSERT,descPN, END," ", END, descInfo)
                self.dwgoutputbox.insert(INSERT, descDwg1, END, " ", END, " Issue: ",END,descIss1,END, "\n")
                self.dwgoutputbox.insert(INSERT, descDwg2, END, " ", END, " Issue: ",END,descIss2,END, "\n")
                self.dwgoutputbox.insert(INSERT, descDwg3, END, " ", END, " Issue: ",END,descIss3,END, "\n")
    self.outputQty.configure(state="disabled")
    self.outputDesc.configure(state="disabled")
    self.dwgoutputbox.configure(state="disabled")
A slightly better version using a loop follows:
def checkcsv():
    with open("lesspreadsheettest.csv") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            result=(row['Shop Order'])
            if sonumber.get() == result:
                descQty=(row['Quantity'])
                descInfo=(row['Description'])
                descPN=(row['Part Number'])
                for i in range(1,4):
                    descDwg=(row['Drawings'+ str(i)])
                    descIss=(row['Issue'+ str(i)])
                    self.dwgoutputbox.insert(1.0, descDwg, "dwg", " Issue: ", "", descIss, "", "\n")
                self.outputQty.insert(1.0, descQty)
                self.outputDesc.insert(1.0, descPN, "", ": ", "", descInfo)
    self.outputQty.configure(state="disabled")
    self.outputDesc.configure(state="disabled")
    self.dwgoutputbox.configure(state="disabled")

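I suggest changing the following line
    print(int(sonumber.get()))
to
    int(sonumber.get())
The print() call is what echoes the entered number; the bare int() call still raises ValueError for invalid input, so the validation keeps working. However, you should really take a deeper look at your code to understand what it does and where it does it.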
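As a rough illustration of separating validation from output, the sketch below validates the entered text without printing it and then looks up the matching row. The helper names (parse_shop_order, find_row) and the sample data are hypothetical and not part of the original program; the CSV is read from a string here only so the example is self-contained.

```python
import csv
import io


def parse_shop_order(text):
    """Return the shop order number as an int, or None if text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def find_row(csv_text, shop_order):
    """Return the first row whose 'Shop Order' column matches, or None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["Shop Order"] == str(shop_order):
            return row
    return None


# Hypothetical sample data standing in for lesspreadsheettest.csv
SAMPLE = "Shop Order,Part Number,Description,Quantity\n1001,PN-1,Bracket,4\n1002,PN-2,Hinge,10\n"

order = parse_shop_order("1002")
if order is not None:
    row = find_row(SAMPLE, order)
    if row is not None:
        print(row["Part Number"], row["Description"], row["Quantity"])
```

Nothing is printed during validation, so only the matched row's fields reach the output.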

Why is isspace() returning false for strings from the docx python library that are empty?

My objective is to extract strings from numbered/bulleted lists in multiple Microsoft Word documents, then to organize those strings into a single, one-line string where each string is ordered in the following manner: 1.string1 2.string2 3.string3 etc. I refer to these one-line strings as procedures, consisting of 'steps' 1., 2., 3., etc.
The reason it has to be in this format is because the procedure strings are being put into a database, the database is used to create Excel spreadsheet outputs, a formatting macro is used on the spreadsheets, and the procedure strings in question have to be in this format in order for that macro to work properly.
The numbered/bulleted lists in MS Word are all similar in format, but some use numbers, some use bullets, and some have extra blank lines before the first point or after the last point.
The following text shows three different examples of how the Word documents are formatted:
Paragraph Keyword 1: arbitrary text
1. Step 1
2. Step 2
3. Step 3
Paragraph Keyword 2: arbitrary text
Paragraph Keyword 3: arbitrary text
• Step 1
• Step 2
• Step 3
Paragraph Keyword 4: arbitrary text
Paragraph Keyword 5: arbitrary text
Step 1
Step 2
Step 3
Paragraph Keyword 6: arbitrary text
(For some reason the first two lists didn't get indented in the formatting of the post, but in my word document all the indentation is the same)
When the numbered/bulleted list is formatted without extra blank lines, my code works fine, e.g. between "Paragraph Keyword 1:" and "Paragraph Keyword 2:".
I was trying to use isspace() to isolate the instances where there are extra line spaces that aren't part of the list that I want to include in my procedure strings.
Here is my code:
def extractStrings(file):
    doc = file
    for i in range(len(doc.paragraphs)):
        str1 = doc.paragraphs[i].text
        if "Paragraph Keyword 1:" in str1:
            start1=i
        if "Paragraph Keyword 2:" in str1:
            finish1=i
        if "Paragraph Keyword 3:" in str1:
            start2=i
        if "Paragraph Keyword 4:" in str1:
            finish2=i
        if "Paragraph Keyword 5:" in str1:
            start3=i
        if "Paragraph Keyword 6:" in str1:
            finish3=i
    print("----------------------------")
    procedure1 = ""
    y=1
    for x in range(start1 + 1, finish1):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure1 = (procedure1 + " " + str(y) + "." + temp)
            else:
                procedure1 = (procedure1 + str(y) + "." + temp)
            y=y+1
        print(procedure1)
    print("----------------------------")
    procedure2 = ""
    y=1
    for x in range(start2 + 1, finish2):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure2 = (procedure2 + " " + str(y) + "." + temp)
            else:
                procedure2 = (procedure2 + str(y) + "." + temp)
            y=y+1
        print(procedure2)
    print("----------------------------")
    procedure3 = ""
    y=1
    for x in range(start3 + 1, finish3):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure3 = (procedure3 + " " + str(y) + "." + temp)
            else:
                procedure3 = (procedure3 + str(y) + "." + temp)
            y=y+1
        print(procedure3)
    print("----------------------------")
    del doc
import docx
doc1 = docx.Document("docx_isspace_experiment_042420.docx")
extractStrings(doc1)
del doc1
Unfortunately I have no way of putting the output into this post, but the problem is that whenever there is a blank line in the word doc, isspace() returns false, and a number "x." is assigned to empty space, so I end up with something like: 1. 2.Step 1 3.Step 2 4.Step 3 5. 6. (that's the last iteration of print(procedure3) from the code)
The problem is that isspace() is returning false even when my python console output shows that the string is just a blank line.
Am I using isspace() incorrectly? Is there something in the string I am not detecting that is causing isspace() to return false? Is there a better way to accomplish this?
Use the test:
# --- for s a str value, like paragraph.text ---
if s.strip() == "":
    print("s is a blank line")
str.isspace() returns True only if the string contains at least one character and all of its characters are whitespace. An empty str contains no characters at all, so isspace() returns False for it.
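A minimal sketch of how the strip() test could be applied when building the procedure string. The function name build_procedure is hypothetical, and a plain list of strings stands in for the paragraph.text values that python-docx would return.

```python
def build_procedure(paragraph_texts):
    """Join non-blank paragraph texts into '1.step 2.step ...' form."""
    steps = []
    for text in paragraph_texts:
        if text.strip() == "":  # true for "" and for whitespace-only strings
            continue
        steps.append(f"{len(steps) + 1}.{text}")
    return " ".join(steps)


print("".isspace())      # False: an empty string has no characters at all
print("   ".isspace())   # True: only whitespace characters
print(build_procedure(["", "Step 1", "Step 2", "  ", "Step 3", ""]))
```

Because blank entries are skipped before numbering, the step counter only advances for real steps.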

If statement not returning the desired result

I'm new to Python, and I believe the issue with my code is caused by some theory or concept that I'm not familiar with yet.
Yes, similar questions have been asked before, but they are different from mine. Believe me, I tried everything I thought needed to be done.
Everything worked until I put everything inside the "if five in silos" statement.
After I enter the values for the 6 input functions, the program just finishes with exit code 0. Nothing else happens. The for loop is never entered.
I want the code to accept either 103 or 106 when prompting for the "five" variable.
I'm using PyCharm and Python 3.7.
import mysql.connector
try:
    db = mysql.connector.connect(
        host="",
        user="",
        passwd="",
        database=""
    )
    one = int(input("Number of requested telephone numbers: "))
    two = input("Enter the prefix (4 characters) with a leading 0: ")[:4]
    three = int(input("Enter the ccid: "))
    four = int(input("Enter the cid: "))
    six = input("Enter case number: ")
    five = int(input("Enter silo (103, 106 only): "))
    cursor = db.cursor()
    cursor.execute(f"SELECT * FROM n1 WHERE ddi LIKE '{two}%' AND silo = 1 AND ccid = 0 LIMIT {one}")
    cursor.fetchall()
    silos = (103, 106)
    if five in silos:
        if cursor.rowcount > 0:
            for row in cursor:
                seven = input(f"{row[1]} has been found on our system. Do you want to continue? Type either Y or N.")
                if seven == "Y":
                    cursor.execute(f"INSERT INTO n{five} (ddi, silo, ccid, campaign, assigned, allocated, "
                                   f"internal_notes, client_notes, agentid, carrier, alias) VALUES "
                                   f"('{row[1]}', 1, {three}, {four}, NOW(), NOW(), 'This is a test.', '', 0, "
                                   f"'{row[13]}', '') "
                                   f"ON DUPLICATE KEY UPDATE "
                                   f"silo = VALUES (silo), "
                                   f"ccid = VALUES (ccid), "
                                   f"campaign = VALUES (campaign);")
                    cursor.execute(f"UPDATE n1 SET silo = {five}, internal_notes = '{six}', allocated = NOW() WHERE "
                                   f"ddi = '{row[1]}'")
                else:
                    print("The operation has been canceled.")
            db.commit()
        else:
            print(f"No results for prefix {two}.")
    else:
        print("Enter either silo 103 or 106.")
    cursor.close()
    db.close()
except (ValueError, NameError):
    print("Please, enter an integer for all questions, except case number.")
Because it must be:
    for row in cursor.fetchall():
        pass  # do something
In your code, cursor is the object returned by db.cursor(); you need to call its fetchall() method to read the rows it holds.
You are actually calling cursor.fetchall() without doing anything with the result, and that call consumes the rows, so the later loop over cursor has nothing left to iterate. You can assign the result to a variable and then loop over it:
    result = cursor.fetchall()
    for row in result:
        pass  # do something
I found the problem: I had to store cursor.fetchall() into a variable.
After I put: eight = cursor.fetchall() before the "silos" tuple, everything worked perfectly.
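The effect can be reproduced in a small self-contained sketch. It uses the standard-library sqlite3 module as a stand-in for mysql.connector (an assumption made only so the example runs without a server), and the table contents are invented for illustration. It also passes values as query parameters instead of formatting them into the SQL string.

```python
import sqlite3


def demo_fetchall():
    """Show that fetchall() drains the cursor, so its result must be stored."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE n1 (ddi TEXT, silo INTEGER)")
    cur.executemany(
        "INSERT INTO n1 VALUES (?, ?)",
        [("0123456", 1), ("0123999", 1)],  # hypothetical sample rows
    )
    # Placeholders keep user input out of the SQL text itself.
    cur.execute(
        "SELECT ddi FROM n1 WHERE ddi LIKE ? AND silo = 1 LIMIT ?",
        ("0123%", 2),
    )
    rows = cur.fetchall()   # all rows are handed over here
    leftover = list(cur)    # nothing remains to iterate afterwards
    conn.close()
    return rows, leftover


rows, leftover = demo_fetchall()
print(len(rows), leftover)
```

Iterating over the stored rows list, rather than the cursor, is what makes the loop body run.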

Python Json creating dictionary from a text file, printing file issue

I was able to take a text file, read each line, create a dictionary per line, append each one, and store the JSON file. The issue is that the JSON file will not read back correctly. Does the error point to a problem with how the file is stored?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re
path = "C:\\...\\data\\"
books = {}
books_json = {}
final_book_json ={}
file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close() # used to clean each test
json_create = []
i = 0
for line in json_list:
    line = line.replace('#', '')
    line = line.replace('.txt','')
    line = line.replace('\n','')
    line = line.split(';', 4)
    BookNumber = line[0]
    BookTitle = line[1]
    AuthorName = line[-1]
    if BookNumber == ' 2701':
        BookNumber = line[0]
        BookTitle1 = line[1]
        BookTitle2 = line[2]
        AuthorName = line[3]
        BookTitle = BookTitle1 + ';' + BookTitle2 # needed to combine title into one to fit dict format
    books = json.dumps( {'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
    books_json = json.loads(books)
    final_book_json.update(books_json)
    with open(path + 'books\\books_json.json', 'a') as out_put:
        json.dump(books_json, out_put)
with open(path + 'books\\books_json.json', 'r') as out_put:
    print(json.load(out_put))
The reported error is: JSONDecodeError: Extra data: line 1 column 133 (char 132). That position is right between the first "}{". I'm not sure how JSON should look in a flat-file format. The output file, as seen in an editor, looks like:
{"AuthorName": " Mary Wollstonecraft (Godwin) Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": " 98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach and used pandas to read the text and then split the single-cell input.
import pandas as pd
books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names =('r','t', 'a') )
#print(books.head(10))
# Function to clean the raw ('r') input data
def clean_line(cell):
    ...
    return cell
books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
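For reference, the "Extra data" error arises because several JSON objects written back to back do not form a single JSON document. One possible alternative to appending per line, sketched below without pandas, is to collect the records in a list and serialize the list once. The function name parse_books is hypothetical, and the two sample lines are taken from the question.

```python
import json


def parse_books(lines):
    """Turn 'number.txt; title; author' lines into a list of dicts."""
    books = []
    for line in lines:
        parts = [part.strip() for part in line.strip().split(";")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        number = parts[0].replace(".txt", "")
        author = parts[-1]
        title = ";".join(parts[1:-1])  # keeps titles that themselves contain ';'
        books.append({"BookNumber": number, "BookTitle": title, "AuthorName": author})
    return books


lines = [
    "84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley\n",
    "98.txt; A Tale of Two Cities; Charles Dickens\n",
]
books = parse_books(lines)

# One json.dumps call over the whole list yields a single valid JSON document.
document = json.dumps(books)
print(json.loads(document)[1]["BookTitle"])

# Concatenating per-object dumps reproduces the "Extra data" error.
concatenated = "".join(json.dumps(book) for book in books)
try:
    json.loads(concatenated)
except json.JSONDecodeError as exc:
    print("JSONDecodeError:", exc.msg)
```

Writing one object per line (the JSON Lines convention) and parsing each line separately would be another workable layout if appending is required.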

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a Python script that reads scanned, tabular PDFs, extracts some important data, and inserts it into a JSON file that will later be loaded into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is that I have never worked with JSON files before, but that was the format I was recommended to output to. The scraping script works; the pre-processing could be a lot cleaner, but for now it works. The issue I run into is that the keys and values are in the same list, and some of the values are split into two list items because they contained a decimal point.
I don't really know where to start. Since I know the indexes of the list items, I could assign keys and values by position, but then the script would not be applicable to any other PDF; that is, the positions cannot be hard-coded.
import re
import PyPDF2 as pdf2  # legacy PyPDF2 (pre-3.0) API names are used below
import textract
filename = "TestSpec.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
# Read every page in turn and accumulate its text
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# If no text could be extracted, fall back to OCR through textract
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
    '''
    This function takes the byte data extracted from scanned PDFs, and cleans it of all
    unnecessary data.
    Requires re
    '''
    stringedText = str(x)
    cleanText = stringedText.replace('\n','')
    splitText = re.split(r'\W+', cleanText)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean
cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. I'm just not sure how to do that.
Here is a screenshot of the data from my sample PDF.
So, I have figured out some of this. I am still having issues grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working, I will worry about optimizing and condensing it.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json
# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json",'w') as file:
    primerDataSlice = clean[clean.index("molecular"): -1]
    primerData = re.split(": |\n", primerDataSlice)
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys,primerValues))}
    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall('(\d{2}[\/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[\/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code. A minimal working example with input would help. As for JSON handling, python dictionaries can dump to json easily. See examples here.
https://docs.python-guide.org/scenarios/json/
Get a json string from a dictionary and write to a file. Figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
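As a rough sketch of the two remaining steps (turning a flat list into a dictionary, then writing it to a file), the example below assumes the list has already been cleaned into alternating key/value items. That assumption does not hold for the raw token list in the question, which would need its own parsing first. The function name pairs_to_dict and the output file name are hypothetical.

```python
import json
import os
import tempfile


def pairs_to_dict(items):
    """Pair up a flat [key, value, key, value, ...] list into a dict."""
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of items (key/value pairs)")
    return dict(zip(items[0::2], items[1::2]))


# Hypothetical, already-cleaned input; real scraped text needs its own parsing.
items = ["Date", "21feb2019", "Sequence ID", "lacz-rp", "Sequence 5'-3'", "gat"]
record = pairs_to_dict(items)

# Write the dictionary to a JSON file, then read it back to check the round trip.
out_path = os.path.join(tempfile.gettempdir(), "primer_example.json")
with open(out_path, "w", encoding="utf-8") as handle:
    json.dump(record, handle, ensure_ascii=False, indent=2)

with open(out_path, encoding="utf-8") as handle:
    print(json.load(handle) == record)
```

Slicing with [0::2] and [1::2] takes the even-indexed items as keys and the odd-indexed items as values, the same pairing idea used with zip in the question's second attempt.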
So, I did figure this out. The problem was that my pre-processing pulled all the data into a single list, which wasn't a great idea considering that the keys for the dictionary never change.
Here is the semi-finished result for making the Dictionary and JSON file.
# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") +
10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])
def gc_content(x):
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
        else:
            count = count
    return round((count / bases) * 100, 1)
gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") +
10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {
    "Primer Data": {
        "Sequence": sequence,
        "Bases": bases,
        "TM (50mM NaCl)": tm,
        "% GC content": gc,
        "Molecular weight": moleWeight,
        "ug/0D260": dilWeight,
        "Dilution volume (uL)": dilution
    },
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": str(dateReceived.strftime("%d-%b-%Y"))
    }
}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

How to preserve CSV field whitespace for quoted field using Microsoft.VisualBasic.FileIO.TextFieldParser?

I'm parsing CSV data using Microsoft.VisualBasic.FileIO.TextFieldParser. It's very good compared to the freeware libraries I've found for parsing CSV. It does everything that I think it should WRT CSV except that it does not preserve the leading/trailing spaces of a field that is enclosed in quotes. Well, it does if I set TrimWhiteSpace to false, but then it doesn't trim the spaces from fields not enclosed in quotes. For CSV I want it to trim non-quoted fields and not trim the quoted fields.
This is how I'm using the class:
var parser = new TextFieldParser(textReader) {Delimiters = new[] {","}};
//TrimWhiteSpace is true by default
var row1 = parser.ReadFields();
var row2 = parser.ReadFields();
Consider this data:
1 , 2
" 1 ", " 2 "
For TrimWhiteSpace==true, both row1 and row2 are ["1", "2"].
For TrimWhiteSpace==false, both row1 and row2 are [" 1 ", " 2 "].
What I want is row1==["1", "2"] and row2==[" 1 ", " 2 "].
Although this answer is quite late, I found the question interesting and upvoted it because, IMO, it's surprising that there's no built-in way to keep whitespace under the described conditions.
So assuming the same input as the question, with an added line to also keep the double quote escape character (an immediately following double quote):
1 , 2
" 1 ", " 2 "
" a ""quoted"" word ", " hello world "
Set HasFieldsEnclosedInQuotes to false, and deal with any field that is enclosed in quotes using a simple Regex:
var separator = new string('=', 40);
Console.WriteLine(separator);
// demo only - show the input lines read from a text file
var text = File.ReadAllText(inputPath);
var lines = text.Split(
    new string[] { Environment.NewLine },
    StringSplitOptions.None
);
using (var textReader = new StringReader(text))
{
    using (var parser = new TextFieldParser(textReader))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.TrimWhiteSpace = true;
        parser.HasFieldsEnclosedInQuotes = false;
        // remove double quotes, since HasFieldsEnclosedInQuotes is false
        var regex = new Regex(@"
            # match double quote
            \""
            # if not immediately followed by a double quote
            (?!\"")
            ",
            RegexOptions.IgnorePatternWhitespace
        );
        var rowStart = 0;
        while (parser.PeekChars(1) != null)
        {
            Console.WriteLine(
                "row {0}: {1}", parser.LineNumber, lines[rowStart]
            );
            var fields = parser.ReadFields();
            for (int i = 0; i < fields.Length; ++i)
            {
                Console.WriteLine(
                    "parsed field[{0}] = [{1}]", i,
                    regex.Replace(fields[i], "")
                );
            }
            ++rowStart;
            Console.WriteLine(separator);
        }
    }
}
OUTPUT:
========================================
row 1: 1 , 2
parsed field[0] = [1]
parsed field[1] = [2]
========================================
row 2: " 1 ", " 2 "
parsed field[0] = [ 1 ]
parsed field[1] = [ 2 ]
========================================
row 3: " a ""quoted"" word ", " hello world "
parsed field[0] = [ a "quoted" word ]
parsed field[1] = [ hello world ]
========================================