I'm trying to use an xcom_pull inside a SQL statement executed by a SnowflakeOperator in Airflow.
I need the task_id name to come from a variable, since I want to support different tasks.
I tried the syntax below, but it seems it is not being rendered correctly.
Does anyone have an idea how to do this?
This is the Python code:
for product, val in PRODUCTS_TO_EXTRACT_INC.items():
    product_indicator, prefix = val
    params['product_prefix'] = prefix
    calculate_to_date = SnowflakeOperator(
        dag=dag,
        task_id=f'calculate_to_date_{prefix}',
        snowflake_conn_id=SF_CONNECTION_ID,
        warehouse=SF_WAREHOUSE,
        database=BI_DB,
        schema=STG_SCHEMA,
        role=SF_ROLE,
        sql=["""
        {SQL_FILE}
        """.format(SQL_FILE="{% include '" + QUERIES_DIR + ETL + "/calculate_to_date.sql'" + " %}")],
        params=params
    )
This is the SQL code for calculate_to_date.sql:
select '{{{{ (ti.xcom_pull(key="return_value", task_ids=["calculate_from_date_{}"])[0][0]).get("FROM_DATE") }}}}'.format(params.product_prefix) AS TO_DATE
This is the error message:
File "/home/airflow/gcs/dags/Test/queries/fact_subscriptions_events/calculate_to_date.sql", line 11, in template
select '{{{{ (ti.xcom_pull(key="return_value", task_ids=["calculate_from_date_{}"])[0][0]).get("FROM_DATE") }}}}'.format(params.product_prefix)
jinja2.exceptions.TemplateSyntaxError: expected token ':', got '}'
The correct syntax is:
select '{{ (ti.xcom_pull(key="return_value", task_ids="calculate_from_date_{}".format(params.product_prefix))[0]).get("FROM_DATE") }}' AS TO_DATE
The .format() call has to sit inside the Jinja expression so that Jinja evaluates it at render time; the quadruple braces in the original were parsed by Jinja as a nested dict literal, which is what produced the "expected token ':'" error. It works like a charm.
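For anyone who wants to see the render outside Airflow, here is a minimal sketch using plain jinja2, with a stand-in ti object and an illustrative XCom payload in place of Airflow's real template context:

from jinja2 import Template

class FakeTI:
    # Stand-in for Airflow's task instance; the payload below is illustrative.
    def xcom_pull(self, key, task_ids):
        return [{"FROM_DATE": "2020-01-01"}]

sql = ('select \'{{ (ti.xcom_pull(key="return_value", '
       'task_ids="calculate_from_date_{}".format(params.product_prefix))[0])'
       '.get("FROM_DATE") }}\' AS TO_DATE')

# Renders to: select '2020-01-01' AS TO_DATE
print(Template(sql).render(ti=FakeTI(), params={"product_prefix": "abc"}))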
My objective is to extract strings from numbered/bulleted lists in multiple Microsoft Word documents, then organize those strings into a single one-line string in which each item is ordered as follows: 1.string1 2.string2 3.string3 etc. I refer to these one-line strings as procedures, consisting of 'steps' 1., 2., 3., etc.
It has to be in this format because the procedure strings are put into a database, the database is used to create Excel spreadsheet outputs, a formatting macro is run on the spreadsheets, and the procedure strings have to be in this format for that macro to work properly.
The numbered/bulleted lists in MS Word are all similar in format, but some use numbers, some use bullets, and some have extra blank lines before the first point or after the last point.
The following text shows three different examples of how the Word documents are formatted:
Paragraph Keyword 1: arbitrary text
1. Step 1
2. Step 2
3. Step 3
Paragraph Keyword 2: arbitrary text
Paragraph Keyword 3: arbitrary text
• Step 1
• Step 2
• Step 3
Paragraph Keyword 4: arbitrary text
Paragraph Keyword 5: arbitrary text
Step 1
Step 2
Step 3
Paragraph Keyword 6: arbitrary text
(The first two lists didn't keep their indentation in the formatting of this post, but in my Word document all the indentation is the same.)
When the numbered/bulleted list is formatted without extra blank lines, my code works fine, e.g. between "Paragraph Keyword 1:" and "Paragraph Keyword 2:".
I was trying to use isspace() to detect the extra blank lines that aren't part of the list, so that I can keep them out of my procedure strings.
Here is my code:
def extractStrings(file):
    doc = file
    for i in range(len(doc.paragraphs)):
        str1 = doc.paragraphs[i].text
        if "Paragraph Keyword 1:" in str1:
            start1 = i
        if "Paragraph Keyword 2:" in str1:
            finish1 = i
        if "Paragraph Keyword 3:" in str1:
            start2 = i
        if "Paragraph Keyword 4:" in str1:
            finish2 = i
        if "Paragraph Keyword 5:" in str1:
            start3 = i
        if "Paragraph Keyword 6:" in str1:
            finish3 = i

    print("----------------------------")
    procedure1 = ""
    y = 1
    for x in range(start1 + 1, finish1):
        temp = str(doc.paragraphs[x].text)
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure1 = procedure1 + " " + str(y) + "." + temp
            else:
                procedure1 = procedure1 + str(y) + "." + temp
            y = y + 1
    print(procedure1)

    print("----------------------------")
    procedure2 = ""
    y = 1
    for x in range(start2 + 1, finish2):
        temp = str(doc.paragraphs[x].text)
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure2 = procedure2 + " " + str(y) + "." + temp
            else:
                procedure2 = procedure2 + str(y) + "." + temp
            y = y + 1
    print(procedure2)

    print("----------------------------")
    procedure3 = ""
    y = 1
    for x in range(start3 + 1, finish3):
        temp = str(doc.paragraphs[x].text)
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure3 = procedure3 + " " + str(y) + "." + temp
            else:
                procedure3 = procedure3 + str(y) + "." + temp
            y = y + 1
    print(procedure3)

    print("----------------------------")
    del doc
import docx
doc1 = docx.Document("docx_isspace_experiment_042420.docx")
extractStrings(doc1)
del doc1
Unfortunately I have no way of putting the output into this post, but the problem is that whenever there is a blank line in the Word doc, isspace() returns False, so a number "x." is assigned to empty space, and I end up with something like: 1. 2.Step 1 3.Step 2 4.Step 3 5. 6. (that's the last iteration of print(procedure3) from the code).
The problem is that isspace() returns False even though my Python console output shows that the string is just a blank line.
Am I using isspace() incorrectly? Is there something in the string I am not detecting that causes isspace() to return False? Is there a better way to accomplish this?
Use this test:
# --- for s a str value, like paragraph.text ---
if s.strip() == "":
    print("s is a blank line")
str.isspace() returns True only when the string is non-empty and contains nothing but whitespace. An empty str contains no characters at all, and therefore does not contain whitespace, so "".isspace() returns False.
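A quick way to see the difference in a plain Python session:

s_empty = ""      # typically what paragraph.text gives you for a blank paragraph
s_spaces = "   "

print(s_empty.isspace())        # False: no characters at all, hence no whitespace
print(s_spaces.isspace())       # True: at least one character, all of them whitespace
print(s_empty.strip() == "")    # True: strip() catches both cases
print(s_spaces.strip() == "")   # True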
I was able to take a text file, read each line, create a dictionary per line, update (append) each dictionary, and store it in a JSON file. The issue is that reading the JSON file back fails. Does the error point to a problem with how I am storing the file?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re

path = "C:\\...\\data\\"

books = {}
books_json = {}
final_book_json = {}

file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close()  # used to clean each test

json_create = []
i = 0
for line in json_list:
    line = line.replace('#', '')
    line = line.replace('.txt', '')
    line = line.replace('\n', '')
    line = line.split(';', 4)
    BookNumber = line[0]
    BookTitle = line[1]
    AuthorName = line[-1]
    if BookNumber == ' 2701':
        BookNumber = line[0]
        BookTitle1 = line[1]
        BookTitle2 = line[2]
        AuthorName = line[3]
        BookTitle = BookTitle1 + ';' + BookTitle2  # needed to combine the title into one to fit the dict format
    books = json.dumps({'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
    books_json = json.loads(books)
    final_book_json.update(books_json)
    with open(path + 'books\\books_json.json', 'a') as out_put:
        json.dump(books_json, out_put)

with open(path + 'books\\books_json.json', 'r') as out_put:
    print(json.load(out_put))
The reported error is: JSONDecodeError: Extra data: line 1 column 133 (char 132). That position is right between the first "}{" in the output file. I'm not sure how JSON should look in a flat-file format. The output file as seen in an editor looks like:
{"AuthorName": " Mary Wollstonecraft (Godwin) Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": " 98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach: I used pandas to read the text and then split the single-cell input.
import pandas as pd

books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names=('r', 't', 'a'))
# print(books.head(10))

# Function to clean the 'raw' (r) input data
def clean_line(cell):
    ...
    return cell

books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
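Here is a self-contained sketch of the same approach with the sample rows inlined (clean_line is elided above, so a simple ".txt" strip stands in for it, and the column names are illustrative):

import io
import pandas as pd

raw = io.StringIO(
    "84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley\n"
    "98.txt; A Tale of Two Cities; Charles Dickens\n"
)

books = pd.read_csv(raw, sep="\t", names=("r",))
books["r"] = books["r"].str.replace(".txt", "", regex=False)
books = books["r"].str.split(";", expand=True)
books.columns = ["BookNumber", "BookTitle", "AuthorName"]
print(books.to_dict(orient="records"))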
I am creating a Python script that reads scanned and tabular PDFs, extracts some important data, and inserts it into a JSON file to be loaded into a database later (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is that I have never worked with JSON files before, but that was the format recommended to me for the output. The scraping script works, and while the pre-processing could be a lot cleaner, for now it works. The problem I run into is that the keys and values end up in the same list, and some values that contained a decimal point are split into two list items. I'm not really sure where to start.
Since I know the indexes in the list, I could easily assign keys and values explicitly, but then the script may not be applicable to other PDFs; that is, the script cannot be coded explicitly for one document.
import re

import PyPDF2 as pdf2
import textract

filename = "TestSpec.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
# Read every page of embedded text.
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# If the PDF had no embedded text (i.e. it is a scan), fall back to OCR.
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
    '''
    Takes the byte data extracted from scanned PDFs and cleans it of all
    unnecessary data.
    Requires re.
    '''
    stringedText = str(x)
    cleanText = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', cleanText)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as JSON, set up like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. I'm just not sure how to do that.
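One idea for the split decimal values, as a sketch (the merge rule here is an assumption: a single-digit token that follows another digit-only token is treated as a lost decimal part):

tokens = ['gc', 'content', '52', '8', 'bases', '36']  # illustrative slice of the output above

merged = []
for tok in tokens:
    # Re-join number fragments that the \W+ split broke apart at the decimal point.
    if merged and merged[-1].isdigit() and tok.isdigit() and len(tok) == 1:
        merged[-1] = merged[-1] + "." + tok
    else:
        merged.append(tok)

print(merged)  # ['gc', 'content', '52.8', 'bases', '36']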
So, I have figured out some of this. I am still having issues grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working, I will worry about optimizing and condensing it.
# For PDF reading
import PyPDF2 as pdf2
import textract
# For data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json

# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()

# Checks if the extracted data is in string form or a picture; if a picture,
# textract reads the data. It then closes the pdf file.
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()

# Converts text to string from byte data for preprocessing
stringedText = str(text)

# Remove escaped lines and replace them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()

# Slice the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10):(formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)

# Creating the PrimerData dictionary
with open("PrimerData.json", 'w') as file:
    primerDataSlice = clean[clean.index("molecular"):-1]
    primerData = re.split(": |\n", primerDataSlice)
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys, primerValues))}

    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall(r'(\d{2}[/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code; a minimal working example with input would help. As for the JSON handling, Python dictionaries dump to JSON easily. See the examples here:
https://docs.python-guide.org/scenarios/json/
Get a JSON string from a dictionary and write it to a file, then figure out how to parse the text into a dictionary.
import json

d = {"Date": "21feb2019", "Sequence ID": "lacz-rp", "Sequence 5'-3'": "gat"}
json_data = json.dumps(d)
print(json_data)

# Write that data to a file (the filename is illustrative)
with open("primer.json", "w") as f:
    f.write(json_data)
So, I did figure this out. The problem was really that my pre-processing pulled all the data into a single list, which wasn't a great idea given that the keys for the dictionary never change.
Here is the semi-finished result for building the dictionary and the JSON file.
import re
import json
from datetime import date
# Assuming Biopython's helpers for melting temp and molecular weight:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import molecular_weight as mw

# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]

# Collecting shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
    r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
    clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") + 10: clean2.index("Order No.") + 18]

# Finding and grabbing the sequence data, then computing the
# GC content and melting temp (TM)
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])

def gc_content(x):
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
    return round((count / bases) * 100, 1)

gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)

dilWeight = float(clean2[clean2.index("ug/OD260:") + 10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10

primerDict = {
    "Primer Data": {
        "Sequence": sequence,
        "Bases": bases,
        "TM (50mM NaCl)": tm,
        "% GC content": gc,
        "Molecular weight": moleWeight,
        "ug/0D260": dilWeight,
        "Dilution volume (uL)": dilution
    },
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": str(dateReceived.strftime("%d-%b-%Y"))
    }
}

# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)
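A quick sanity check is to read the file back in (a sketch; the filename comes from name above):

# Read the generated JSON back and spot-check a couple of fields.
with open("".join(name) + ".json") as f:
    data = json.load(f)

print(data["Primer Data"]["Sequence"])
print(data["Shipment Info"]["Order No."])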
I'm using Ruby and the geocodio gem to do some reverse geocoding. The reverse geocoder returns an object of type Geocodio::Address, which per their website is JSON. I'm trying to use Ruby's JSON.parse to convert it to a hash that I can then map as needed.
JSON.parse returns this error:
C:/Ruby23-x64/lib/ruby/2.3.0/json/common.rb:156:in `parse': 784: unexpected token at '"Saipan, MP 96950"' (J
        from C:/Ruby23-x64/lib/ruby/2.3.0/json/common.rb:156:in `parse'
Here's my whole script
require 'geocodio'
require 'CSV'
require 'json'
filename = './for_fips.csv'
#fipslist = []
geocodio = Geocodio::Client.new('not real...d5a1557e2175d8ce265')
i = 1
CSV.foreach(filename) do |row|
lat = row[1]
long = row[2]
coord = lat + "," + long
#puts coord
add = geocodio.reverse_geocode([lat + "," + long],fields: %w[cd stateleg school timezone]).best
add_parsed = JSON.parse(add)
pp add_parsed
#puts add.each { |k,v| "#{k}=#{v}"}.join('~~')
i +=1
if i > 2 then break end
#fipslist << fcc.district_fips
end