Airflow: avoid top-level pull from SSM - Jinja2

I have the following DAG, which works just fine:
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.subdag import SubDagOperator
from subdags import my_subdag

data_sets = Variable.get("data_sets", deserialize_json=True).get("data")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 1, 1),
}
with DAG(
    'myDAG',
    default_args=default_args,
    schedule_interval='00 12 * * *'
) as dag:
    ...
    for data_set in data_sets:
        subdag = SubDagOperator(
            task_id=f'{data_set}_subdag',
            subdag=my_subdag(
                parent_dag_name='myDAG',
                child_dag_name=f'{data_set}_subdag',
            ),
            ...
            default_args=default_args,
        )
        start >> subdag >> end
But as you can see, I am calling Variable.get at the top level, which is not best practice: the scheduler re-parses the DAG file and queries the secrets backend every minute or so.
What can I do so that Airflow calls Variable.get only during execution? I was looking at the best practices; I can't use another file ('Generating Python code with embedded meta-data'), so I thought maybe Jinja templating could help, but I am not sure how to proceed.

No, you can't do it differently currently. If you want a dynamic DAG structure based on some external source, there is no way around top-level code (while following the best practices as closely as you can).
However, if what you actually want is N identical tasks (differing only by some index) that are created dynamically from external data when the DAG runs (which seems to be your case), there is a change coming in Airflow 2.3.0 that will allow exactly that: dynamic task mapping.
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-42+Dynamic+Task+Mapping
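For illustration only, here is a minimal sketch of what this could look like with dynamic task mapping once 2.3.0 is out (the task names are hypothetical; the point is that Variable.get runs inside a task at execution time, not at parse time):

from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable


@dag(schedule_interval='00 12 * * *', start_date=datetime(2020, 1, 1), catchup=False)
def my_dag():

    @task
    def get_data_sets():
        # Executed at run time, so the secrets backend is not hit during DAG parsing
        return Variable.get("data_sets", deserialize_json=True).get("data")

    @task
    def process_data_set(data_set):
        # Placeholder for whatever each subdag used to do for one data set
        print(f"processing {data_set}")

    # One mapped task instance is created per element returned by get_data_sets()
    process_data_set.expand(data_set=get_data_sets())


my_dag()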

Related

How can I process large files in Code Repositories?

I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema, how can I do this efficiently?
The biggest problem you need to solve is iteration speed while recovering your schemas, which can be challenging for a file of this scale.
Your best tactic here will be to get an example 'notional' file that contains one line for each of the schemas you want to recover, and to add it as a file within your repository. Once this file is in your repo (alongside your transformation logic), you will be able to read it into a DataFrame, much as you would the raw files in your dataset, which gives you a fast iteration loop for testing.
First, make sure you specify txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under 'Read a file from a Python repository'):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
import os

from setuptools import setup

setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have added the file itself to your repository, you can include the following code. This code does a couple of important things:
It keeps the raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame, so that how the DataFrame is constructed can be left to the test infrastructure or to the runtime, depending on where you are running.
Keeping the logic separate ensures that the actual row-by-row parsing you want to do is its own testable function, instead of living purely inside my_compute_function.
It uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file in Python. It ensures you operate on a parallelized DataFrame rather than on a single file processed line by line inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    # Row-by-row schema recovery goes here; for now this is a pass-through
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.
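As a purely illustrative sketch (not part of the original answer), if your schemas can be told apart by their number of comma-separated fields, raw_parsing_logic could grow into something like this; the schema names and the two/three-field rule are assumptions:

from pyspark.sql import functions as F


def raw_parsing_logic(raw_df):
    # Split each raw line into an array of fields and record how many fields it has
    fields = F.split(F.col("value"), ",")
    parsed = raw_df.withColumn("fields", fields).withColumn("num_fields", F.size(fields))

    # Hypothetical rule: 2 fields means schema_a, 3 fields means schema_b.
    # Tagging every row lets a downstream transform filter or partition by schema.
    return parsed.withColumn(
        "schema",
        F.when(F.col("num_fields") == 2, F.lit("schema_a"))
         .when(F.col("num_fields") == 3, F.lit("schema_b"))
         .otherwise(F.lit("unknown")),
    )

Because the function still returns a single DataFrame, it drops straight into my_compute_function and test_my_compute_function above.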

How to run a Julia function on a specific processor using remotecall(), when the function itself does not have a return value

I tried to use remotecall() in Julia to distribute work to a specific processor. The function I'd like to run does not return anything, but it does write output to a file. I cannot make it work: there is no output file after running the code.
This is the test code I am creating:
using DelimitedFiles

addprocs(4)  # add 4 worker processes

@everywhere function test(x)  # define the function
    print("hi")
    writedlm(string("test", string(x), ".csv"), [x], ',')
end

remotecall(test, 2, 2)  # run the function on process 2
remotecall(test, 3, 3)  # run the function on process 3
This is the output I am getting:
Future(3, 1, 67, nothing)
And there is no output file (csv), nor is "hi" shown.
I wonder if anyone can help me with this, or tell me whether I did anything wrong. I am fairly new to Julia and have never used parallel processing.
The background is that I need to run a big simulation (a big function with a bunch of includes, but no direct return value) lots of times, and I'd like to split the work across different processors.
Thanks a lot
If you want to use a module function on a worker, you need to import that module on that worker first, just like you have to do in your 'root' process. Therefore your using DelimitedFiles directive needs to be run @everywhere, rather than just on the 'root' process. In other words:
@everywhere using DelimitedFiles
Btw, I am assuming you're using a relatively recent version of Julia and simply forgot to add the using Distributed directive in your example.
Furthermore, when you perform a remote call, what you get back is a "Future" object, which is a way of allowing you to obtain the 'future results of that computation' from that worker, once they're finished. To get the results of that 'future computation', use fetch.
This is all very simplistic and general information, since you haven't provided a specific example that can be copy / pasted and answered specifically. Have a look at the relevant section in the manual, it's fairly clearly written: https://docs.julialang.org/en/v1/manual/parallel-computing/#Multi-Core-or-Distributed-Processing-1

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I receive 60 or more JSON files per minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope someone can help me with my requirement.
If possible, can someone help me with pseudo code?
My idea:
Write a Python script that uploads all JSON files to CouchDB.
Each JSON file must become its own document, and the data present in
the JSON must be inserted into CouchDB exactly as it appears
(the specified format with values in a file).
Note:
These JSON files are transactional; one file is generated every second,
so I need to read each file, upload it in the same format to CouchDB, and on
successful upload archive the file into a different folder on the local system.
Python program to parse the JSON and insert it into CouchDB:
import sys
import glob
import errno, time, os
import couchdb, simplejson
import json
from pprint import pprint

couch = couchdb.Server()  # Assuming localhost:5984
# couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
# dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)
for file1 in files:
    # dirs = os.listdir(dirPath)
    file2 = glob.glob(file1)
    for name in file2:  # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
        try:
            with open(name) as f:  # No need to specify 'r': this is the default.
                # sys.stdout.write(f.read())
                json_data = f
                data = json.load(json_data)
                db.save(data)
                pprint(data)
                json_data.close()
            # time.sleep(2)
        except IOError as exc:
            if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
                raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send the documents to the db one by one. For example, implementing a simple queue that gets flushed every, say, 5-10 seconds via a bulk docs call will greatly increase the performance of your application.
There is one quirk: for GETs you need to know the IDs of the docs you want from the DB (though that is not entirely true either; you can fetch ranges of docs with a bulk operation if the IDs you are using for your docs sort nicely). For PUTs, however, the bulk API is perfect.
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of aggregate result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it depends heavily on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide more detail on that.
So in the end the app logic would go something like that:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got the first part wrong (why you are using transactional docs), all that would really change in the app logic above is the part where you fetch those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside of the db engine can do).
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
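As a rough illustration (the database name, credentials, and batch size below are assumptions, not from the original answer), a bulk upload through the _bulk_docs endpoint could look something like this:

import glob
import json

import requests

BULK_URL = "http://localhost:5984/mydb/_bulk_docs"  # assumed database name
AUTH = ("admin", "password")                         # assumed credentials

queue = []
for path in glob.glob("Json_files/*.json"):
    with open(path) as f:
        queue.append(json.load(f))

    # Flush the queue in batches instead of issuing one request per document;
    # in a long-running process you would flush on a timer (every 5-10 s) instead.
    if len(queue) >= 50:
        requests.post(BULK_URL, json={"docs": queue}, auth=AUTH).raise_for_status()
        queue = []

# Send whatever is left over
if queue:
    requests.post(BULK_URL, json={"docs": queue}, auth=AUTH).raise_for_status()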
EDIT:
Since you are starting out, I strongly recommend having a look at the CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one thing that is not obvious to a newcomer: when you write a reduce function, make sure it does not produce too much output for an unbounded query. Otherwise it will slow the entire view down dramatically, even when you pass reduce=false while querying it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
    get new JSON doc from server
    send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
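For example, a minimal sketch with requests (the database name and document are placeholders; CouchDB generates the _id when you POST without one):

import requests

doc = {"type": "transaction", "amount": 42}  # placeholder document

# POST to the database; CouchDB assigns the _id.
# Use PUT http://localhost:5984/mydb/<doc_id> instead if you want to choose the _id yourself.
resp = requests.post("http://localhost:5984/mydb", json=doc)
resp.raise_for_status()
print(resp.json())  # {'ok': True, 'id': '...', 'rev': '...'}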
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same thing as you describe, so you can take a look at my project Pyro on GitHub for details.
My class looks like this:
import sys

import pycouchdb


class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception as inst:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
Let me know in case of any questions - I will be more than glad to help if you find some of my code usable.

How to commit model instances and remove them from working memory a few at a time

I have a pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing then creates some model instances and adds them to the session. This works fine except when the files are big. For large files the view slowly eats up all my ram until everything effectively grinds to a halt.
So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.
def commit_line(lTitles, lLine, oStartDate, oEndDate, iDS, dSettings):
    from sqlalchemy.orm import (
        scoped_session,
        sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction

    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine

    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()
My requirements for this function are as follows:
create a session (check)
make a bunch of model instances (check)
add those instances to the session (check)
commit those models to the database
get rid of the session (so that it and the objects created in 2 are garbage collected)
I've made sure that the newly created session is passed as an argument whenever necessary in order to stop errors to do with multiple sessions blah blah. But alas! I can't get database connections to go away and stuff isn't being committed.
I tried separating the function out into a celery task so the view executes to completion and does what it needs to, but I'm getting an error in celery about having too many MySQL connections no matter what I try in terms of committing and closing and disposing, and I'm not sure why. And yes, I restart the celery server when I make changes.
Surely there is a simple way to do this? All I want to do is make a session commit then go away and leave me alone.
Creating a new session for each line of your large file is going to be quite slow, I would imagine.
What I would try is to commit the session and expunge all objects from it every 1000 rows or so:
counter = 0
for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    if counter > 1000:
        counter = 0
        transaction.commit()  # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all()  # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
    else:
        counter += 1
If there are no references to DataEntry instances from anywhere, they should be garbage collected by the Python interpreter at some point.
However, if all you're doing in that view is inserting new records to the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert data. This would also get rid of the problem with your ORM instances eating up your RAM. See I’m inserting 400,000 rows with the ORM and it’s really slow! for details.
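As a rough sketch of that Core approach (assuming DataEntry is a declarative model and that the column names below stand in for your real ones):

from sqlalchemy import engine_from_config

from pyramidapp.models import DataEntry

engine = engine_from_config(dSettings, 'sqlalchemy.')  # engine setup as in your view

# Hypothetical rows built from the file, as plain dicts keyed by column name
rows_as_dicts = [
    {"title": "a", "value": 1},
    {"title": "b", "value": 2},
]

# A single multi-row INSERT with no ORM instances, so nothing accumulates
# in the session's identity map and memory stays flat.
with engine.begin() as conn:
    conn.execute(DataEntry.__table__.insert(), rows_as_dicts)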
So I tried a bunch of things and, although solving this with SQLAlchemy's built-in functionality was probably possible, I could not find any way of pulling that off.
So here's an outline of what I did:
separate the lines to be processed into batches
for each batch of lines queue up a celery task to deal with those lines
in the celery task a separate process is launched that does the necessary stuff with the lines.
Reasoning:
The batch stuff is obvious
Celery was used because it took a heck of a long time to process an entire file so queuing just made sense
the task launched a separate process because if it didn't then I had the same problem that I had with the pyramid application
Some code:
Celery task:
def commit_lines(lLineData, dSettings, cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'a')  # in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()
    # now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
        'python', os.path.join(cwd, 'commit_line_file.py'),
        '-c', sConnectionString,
        '-f', sFileName
    ]
    # open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs, shell=False)
    # and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'w')
        f.close()
External process:
"""
this script goes through all lines in a file and creates data entries from the lines
"""
def main():
from optparse import OptionParser
from sqlalchemy import create_engine
from pyramidapp.models import Base,DBSession
import ast
import transaction
#get options
oParser = OptionParser()
oParser.add_option('-c','--connection_string',dest='connection_string')
oParser.add_option('-f','--input_file',dest='input_file')
(oOptions, lArgs) = oParser.parse_args()
#set up connection
#engine = engine_from_config(dSettings, 'sqlalchemy.')
engine = create_engine(
oOptions.connection_string,
echo=False)
DBSession.configure(bind=engine)
Base.metadata.bind = engine
#commit stuffs
import lockfile
lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
with lock:
for sLine in open(oOptions.input_file,'r'):
dLine = ast.literal_eval(sLine)
create_entry(**dLine)
transaction.commit()
def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
#import stuff
oEntry = DataEntry()
#do some other stuff, make more model instances...
DBSession.add(oEntry)
if __name__ == "__main__":
main()
in the view:
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff': 'lots'})

if lLineData:
    lLineSets = [lLineData[i:i + iBatchSize] for i in range(0, len(lLineData), iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l, dSettings, sCWD)  # queue it for celery
You are just doing it wrong. Period.
Quoted from SQLAlchemy docs
The advanced developer will try to keep the details of session,
transaction and exception management as far as possible from the
details of the program doing its work.
Quoted from Pyramid docs
We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.
Why?
Pyramid has a strong orientation towards support for transactions.
Specifically, you can install a transaction manager into your app, either as middleware or a Pyramid "tween". Then, just
before you return the response, all transaction-aware parts of your
application are executed. This means Pyramid view code usually doesn't
manage transactions.
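To make the quoted behaviour concrete, here is a minimal hypothetical sketch (model and route names are placeholders; pyramid_tm and zope.sqlalchemy are assumed to be configured as in the docs) of a view that never touches the transaction machinery itself:

from pyramid.view import view_config

from .models import DBSession, DataEntry  # assumed import path


@view_config(route_name='load_data', renderer='json')
def load_data(request):
    # pyramid_tm opens a transaction for this request; if the view returns
    # normally the transaction manager commits, if it raises it rolls back.
    # The view never calls transaction.commit() itself.
    for _ in range(10):          # stand-in for "process some lines"
        DBSession.add(DataEntry())
    return {'status': 'ok'}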
My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
References
Big picture - Using Thread-Local Scope with Web Applications
Typical error message when doing it wrong
Databases using SQLAlchemy
How to use scoped_session
Encapsulate CSV reading and creation of SQLAlchemy model instances into something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances; the collection size depends on the batch size. If the model changes over time, you do not need to change the celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the celery task holds on to huge amounts of intermediate data. This example also shows that using celery is only an option. I added links to code samples of a Pyramid application I am actually refactoring in a GitHub fork.
BatchingModelReader - encapsulates csv.reader and uses existing models from your pyramid application
get inspired by source code of csv.DictReader
could be run as a celery task - use appropriate task decorator
from .models import DBSession
import transaction


def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and batchsize iterate over batches of model instances and import them to database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)
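The answer leaves BatchingModelReader itself to the reader; as a purely illustrative sketch (the assumption that CSV column names match the DataEntry constructor arguments is mine), it could look something like this:

import csv

from pyramidapp.models import DataEntry  # your existing model


class BatchingModelReader:
    """Iterates over a CSV file and yields lists of model instances, batchsize at a time."""

    def __init__(self, path_to_csv, batchsize):
        self.path_to_csv = path_to_csv
        self.batchsize = batchsize

    def __iter__(self):
        batch = []
        with open(self.path_to_csv, newline='') as f:
            for row in csv.DictReader(f):
                # Assumption: CSV column names map directly onto DataEntry keyword arguments
                batch.append(DataEntry(**row))
                if len(batch) >= self.batchsize:
                    yield batch
                    batch = []
        if batch:
            yield batch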
pyramid view - just save big giant CSV file, start task, return response immediately
@view_config(...)
def view(request):
    """gets file from request, save it to filesystem and start celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)
    # start task with parameters
    import_from_csv.delay(path_to_csv, 1000)
Code samples
ToDoPyramid - commit transaction from commandline
ToDoPyramid - commit transaction from request
Pyramid using SQLAlchemy
Databases using SQLAlchemy
SQLAlchemy internals
Big picture - Using Thread-Local Scope with Web Applications
How to use scoped_session

"ResourceClosedError: The transaction is closed" error with celery beat and sqlalchemy + pyramid app

I have a pyramid app called mainsite.
The site works in a pretty asynchronous manner mostly through threads being launched from the view to carry out the backend operations.
It connects to MySQL with SQLAlchemy and uses ZopeTransactionExtension for session management.
So far the application has been running great.
I need to run periodic jobs on it and it needs to use some of the same asynchronous functions that are being launched from the view.
I used apscheduler but ran into issues with that. So I thought of using celery beat as a separate process that treats mainapp as a library and imports the functions to be used.
My celery config looks like this:
from datetime import timedelta
from api.apiconst import RERUN_CHECK_INTERVAL, AUTOMATION_CHECK_INTERVAL, \
    AUTH_DELETE_TIME

BROKER_URL = 'sqla+mysql://em:em@localhost/edgem'
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = 'mysql://em:em@localhost/edgem'

CELERYBEAT_SCHEDULE = {
    'rerun': {
        'task': 'tasks.rerun_scheduler',
        'schedule': timedelta(seconds=RERUN_CHECK_INTERVAL)
    },
    'automate': {
        'task': 'tasks.automation_scheduler',
        'schedule': timedelta(seconds=20)
    },
    'remove-tokens': {
        'task': 'tasks.token_remover_scheduler',
        'schedule': timedelta(seconds=2 * 24 * 3600)
    },
}

CELERY_TIMEZONE = 'UTC'
The tasks.py is:
from celery import Celery

celery = Celery('tasks')
celery.config_from_object('celeryconfig')


@celery.task
def rerun_scheduler():
    from mainsite.task import check_update_rerun_tasks
    check_update_rerun_tasks()


@celery.task
def automation_scheduler():
    from mainsite.task import automate
    automate()


@celery.task
def token_remover_scheduler():
    from mainsite.auth_service import delete_old_tokens
    delete_old_tokens()
Keep in mind that all the above functions return immediately but launch threads if required.
The threads save objects into the db by doing transaction.commit() after session.add(object).
The problem is that the whole thing works like a gem for only about 30 minutes. After that, ResourceClosedError: The transaction is closed errors start happening wherever there is a transaction.commit(). I am not sure what the problem is and I need help troubleshooting.
The reason I do the imports inside the tasks was to try to get rid of this error. I thought importing every time the task needed to run was a good idea and that I might get a new transaction each time, but it looks like that is not the case.
In my experience, trying to reuse a session configured to be used with Pyramid (with ZopeTransactionExtension etc.) in a Celery worker results in a terrible, hard-to-debug mess.
ZopeTransactionExtension binds the SQLAlchemy session to Pyramid's request-response cycle: a transaction is started and committed or rolled back automatically, and you're generally not supposed to use transaction.commit() within your code. If everything is ok, ZTE will commit everything; if your code raises an exception, your transaction will be rolled back.
With Celery you need to manage SQLAlchemy sessions manually, which ZTE prevents you from doing, so you need to configure your DBSession differently.
Something simple like this would work:
DBSession = None

def set_dbsession(session):
    global DBSession
    if DBSession is not None:
        raise AttributeError("DBSession has been already set to %s!" % DBSession)
    DBSession = session
And then from Pyramid startup code you do
def main(global_config, **settings):
    ...
    set_dbsession(scoped_session(sessionmaker(extension=ZopeTransactionExtension())))
With Celery it's a bit trickier - I ended up creating a custom start script for Celery, in which I configure the session.
In setup.py of the worker egg:
entry_points="""
# -*- Entry points: -*-
[console_scripts]
custom_celery = worker.celeryd:start_celery
custom_celerybeat = worker.celeryd:start_celerybeat
""",
)
in worker/celeryd.py:
def initialize_async_session(db_string, db_echo):
    import sqlalchemy as sa
    from db import Base, set_dbsession

    session = sa.orm.scoped_session(sa.orm.sessionmaker(autoflush=True, autocommit=True))
    engine = sa.create_engine(db_string, echo=db_echo)
    session.configure(bind=engine)
    set_dbsession(session)
    Base.metadata.bind = engine


def start_celery():
    initialize_async_session(DB_STRING, DB_ECHO)
    import celery.bin.celeryd
    celery.bin.celeryd.main()
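As an illustrative sketch (not from the original answer) of what a task can then do with that manually configured session, assuming the placeholder model below and that the Celery app from tasks.py is importable:

from db import DBSession
from mainsite.models import SomeModel  # hypothetical model
from tasks import celery               # the Celery app defined in tasks.py


@celery.task
def do_periodic_work():
    session = DBSession()
    try:
        # The session was created with autocommit=True, so open an explicit transaction
        with session.begin():
            session.add(SomeModel(name="example"))  # hypothetical columns
    finally:
        # Drop the thread-local session so the worker does not leak connections
        DBSession.remove()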
The general approach you're using with "threads being launched from the view to carry out the backend operations" feels a bit dangerous to me if you ever plan to deploy the application to a production server - a web server often recycles, kills or creates new "workers" so generally there are no guarantees each particular process would survive beyond the current request-response cycle. I never tried doing this though, so maybe you'll be ok :)