I have 2 json configuration files to read and want to assign there values to variables. I am creating a data flow job using apache beam but unable to parse those files and assign there values to a variable.
config1.json - { "bucket_name": "mybucket"}
config2.json - { "dataset_name": "mydataset"}
This is the pipeline statements ---- I tried with one JSON file first but even that is not working
with beam.Pipeline(options=pipeline_options) as pipeline:
steps = (pipeline
| "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
| "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser(custom_options.configfile))
| "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
)
result = pipeline.run()
result.wait_until_finish()
I also tried creating a function to parse atleast one file. This is a sample method I created but it did not work.
class custom_json_parser(beam.DoFn):
import apache_beam as beam
from apache_beam.io.gcp import gcsio
import logging
def __init__(self, configfile):
self.configfile = configfile
def process(self, configfile):
logging.info("JSON PARSING STARTED")
with beam.io.gcp.gcsio.GcsIO().open(self.configfile, 'r') as f:
for line in f:
data = json.loads(line)
bucket = data.get('bucket_name')
dataset = data.get('dataset_name') ```
Can someone please suggest the best method to resolve this issue in apache beam?
Thanks in Advance
If you need to read only once your files in the pipeline, don't read them in the pipeline, but before running it.
Read the files from GCS
Parse the file and put the useful content in the pipeline options map
Run your pipeline and use the data from the options
EDIT 1
You can use this piece of code to load the file and read it, before your pipeline. Simple Python, standard GCS libraries.
from google.cloud import storage
import json
client = storage.Client()
bucket = client.get_bucket('your-bucket')
blob = bucket.get_blob("name.json")
json_data = blob.download_as_string().decode('UTF-8')
print(json_data) # print -> {"name": "works!!"}
print(json.loads(json_data)["name"]) # print -> works!!
You can try following code snippet: -
Function to Parse File
class custom_json_parser(beam.DoFn):
def process(self, element):
logging.info(element)
data = json.loads(element)
bucket = data.get('bucket_name')
dataset = data.get('dataset_name')
return [{"bucket": bucket , "dataset": dataset }]
Over Pipeline you can call function
with beam.Pipeline(options=pipeline_options) as pipeline:
steps = (pipeline
| "Getdata" >> beam.io.ReadFromText(custom_options.configfile)
| "CUSTOM JSON PARSE" >> beam.ParDo(custom_json_parser())
| "write to GCS" >> beam.io.WriteToText('gs://mynewbucket/outputfile.txt')
)
result = pipeline.run()
result.wait_until_finish()
It will work.
In Python, how can we find out the command line arguments that were provided for a script, and process them?
For some more specific examples, see Implementing a "[command] [action] [parameter]" style command-line interfaces? and How do I format positional argument help using Python's optparse?.
import sys
print("\n".join(sys.argv))
sys.argv is a list that contains all the arguments passed to the script on the command line. sys.argv[0] is the script name.
Basically,
import sys
print(sys.argv[1:])
The canonical solution in the standard library is argparse (docs):
Here is an example:
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument("-f", "--file", dest="filename",
help="write report to FILE", metavar="FILE")
parser.add_argument("-q", "--quiet",
action="store_false", dest="verbose", default=True,
help="don't print status messages to stdout")
args = parser.parse_args()
argparse supports (among other things):
Multiple options in any order.
Short and long options.
Default values.
Generation of a usage help message.
Just going around evangelizing for argparse which is better for these reasons.. essentially:
(copied from the link)
argparse module can handle positional
and optional arguments, while
optparse can handle only optional
arguments
argparse isn’t dogmatic about
what your command line interface
should look like - options like -file
or /file are supported, as are
required options. Optparse refuses to
support these features, preferring
purity over practicality
argparse produces more
informative usage messages, including
command-line usage determined from
your arguments, and help messages for
both positional and optional
arguments. The optparse module
requires you to write your own usage
string, and has no way to display
help for positional arguments.
argparse supports action that
consume a variable number of
command-line args, while optparse
requires that the exact number of
arguments (e.g. 1, 2, or 3) be known
in advance
argparse supports parsers that
dispatch to sub-commands, while
optparse requires setting
allow_interspersed_args and doing the
parser dispatch manually
And my personal favorite:
argparse allows the type and
action parameters to add_argument()
to be specified with simple
callables, while optparse requires
hacking class attributes like
STORE_ACTIONS or CHECK_METHODS to get
proper argument checking
There is also argparse stdlib module (an "impovement" on stdlib's optparse module). Example from the introduction to argparse:
# script.py
import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'integers', metavar='int', type=int, choices=range(10),
nargs='+', help='an integer in the range 0..9')
parser.add_argument(
'--sum', dest='accumulate', action='store_const', const=sum,
default=max, help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
Usage:
$ script.py 1 2 3 4
4
$ script.py --sum 1 2 3 4
10
If you need something fast and not very flexible
main.py:
import sys
first_name = sys.argv[1]
last_name = sys.argv[2]
print("Hello " + first_name + " " + last_name)
Then run python main.py James Smith
to produce the following output:
Hello James Smith
The docopt library is really slick. It builds an argument dict from the usage string for your app.
Eg from the docopt readme:
"""Naval Fate.
Usage:
naval_fate.py ship new <name>...
naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
naval_fate.py ship shoot <x> <y>
naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
naval_fate.py (-h | --help)
naval_fate.py --version
Options:
-h --help Show this screen.
--version Show version.
--speed=<kn> Speed in knots [default: 10].
--moored Moored (anchored) mine.
--drifting Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
arguments = docopt(__doc__, version='Naval Fate 2.0')
print(arguments)
One way to do it is using sys.argv. This will print the script name as the first argument and all the other parameters that you pass to it.
import sys
for arg in sys.argv:
print arg
#set default args as -h , if no args:
if len(sys.argv) == 1: sys.argv[1:] = ["-h"]
I use optparse myself, but really like the direction Simon Willison is taking with his recently introduced optfunc library. It works by:
"introspecting a function
definition (including its arguments
and their default values) and using
that to construct a command line
argument parser."
So, for example, this function definition:
def geocode(s, api_key='', geocoder='google', list_geocoders=False):
is turned into this optparse help text:
Options:
-h, --help show this help message and exit
-l, --list-geocoders
-a API_KEY, --api-key=API_KEY
-g GEOCODER, --geocoder=GEOCODER
I like getopt from stdlib, eg:
try:
opts, args = getopt.getopt(sys.argv[1:], 'h', ['help'])
except getopt.GetoptError, err:
usage(err)
for opt, arg in opts:
if opt in ('-h', '--help'):
usage()
if len(args) != 1:
usage("specify thing...")
Lately I have been wrapping something similiar to this to make things less verbose (eg; making "-h" implicit).
As you can see optparse "The optparse module is deprecated with and will not be developed further; development will continue with the argparse module."
Pocoo's click is more intuitive, requires less boilerplate, and is at least as powerful as argparse.
The only weakness I've encountered so far is that you can't do much customization to help pages, but that usually isn't a requirement and docopt seems like the clear choice when it is.
import argparse
parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
const=sum, default=max,
help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
Assuming the Python code above is saved into a file called prog.py
$ python prog.py -h
Ref-link: https://docs.python.org/3.3/library/argparse.html
You may be interested in a little Python module I wrote to make handling of command line arguments even easier (open source and free to use) - Commando
Yet another option is argh. It builds on argparse, and lets you write things like:
import argh
# declaring:
def echo(text):
"Returns given word as is."
return text
def greet(name, greeting='Hello'):
"Greets the user with given name. The greeting is customizable."
return greeting + ', ' + name
# assembling:
parser = argh.ArghParser()
parser.add_commands([echo, greet])
# dispatching:
if __name__ == '__main__':
parser.dispatch()
It will automatically generate help and so on, and you can use decorators to provide extra guidance on how the arg-parsing should work.
I recommend looking at docopt as a simple alternative to these others.
docopt is a new project that works by parsing your --help usage message rather than requiring you to implement everything yourself. You just have to put your usage message in the POSIX format.
Also with python3 you might find convenient to use Extended Iterable Unpacking to handle optional positional arguments without additional dependencies:
try:
_, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
except ValueError:
print("Not enough arguments", file=sys.stderr) # unhandled exception traceback is meaningful enough also
exit(-1)
The above argv unpack makes arg2 and arg3 "optional" - if they are not specified in argv, they will be None, while if the first is not specified, ValueError will be thouwn:
Traceback (most recent call last):
File "test.py", line 3, in <module>
_, arg1, arg2, arg3, *_ = sys.argv + [None] * 2
ValueError: not enough values to unpack (expected at least 4, got 3)
My solution is entrypoint2. Example:
from entrypoint2 import entrypoint
#entrypoint
def add(file, quiet=True):
''' This function writes report.
:param file: write report to FILE
:param quiet: don't print status messages to stdout
'''
print file,quiet
help text:
usage: report.py [-h] [-q] [--debug] file
This function writes report.
positional arguments:
file write report to FILE
optional arguments:
-h, --help show this help message and exit
-q, --quiet don't print status messages to stdout
--debug set logging level to DEBUG
import sys
# Command line arguments are stored into sys.argv
# print(sys.argv[1:])
# I used the slice [1:] to print all the elements except the first
# This because the first element of sys.argv is the program name
# So the first argument is sys.argv[1], the second is sys.argv[2] ecc
print("File name: " + sys.argv[0])
print("Arguments:")
for i in sys.argv[1:]:
print(i)
Let's name this file command_line.py and let's run it:
C:\Users\simone> python command_line.py arg1 arg2 arg3 ecc
File name: command_line.py
Arguments:
arg1
arg2
arg3
ecc
Now let's write a simple program, sum.py:
import sys
try:
print(sum(map(float, sys.argv[1:])))
except:
print("An error has occurred")
Result:
C:\Users\simone> python sum.py 10 4 6 3
23
This handles simple switches, value switches with optional alternative flags.
import sys
# [IN] argv - array of args
# [IN] switch - switch to seek
# [IN] val - expecting value
# [IN] alt - switch alternative
# returns value or True if val not expected
def parse_cmd(argv,switch,val=None,alt=None):
for idx, x in enumerate(argv):
if x == switch or x == alt:
if val:
if len(argv) > (idx+1):
if not argv[idx+1].startswith('-'):
return argv[idx+1]
else:
return True
//expecting a value for -i
i = parse_cmd(sys.argv[1:],"-i", True, "--input")
//no value needed for -p
p = parse_cmd(sys.argv[1:],"-p")
Several of our biotechnology clients have posed these two questions recently:
How can we execute a Python script as a command?
How can we pass input values to a Python script when it is executed as a command?
I have included a Python script below which I believe answers both questions. Let's assume the following Python script is saved in the file test.py:
#
#----------------------------------------------------------------------
#
# file name: test.py
#
# input values: data - location of data to be processed
# date - date data were delivered for processing
# study - name of the study where data originated
# logs - location where log files should be written
#
# macOS usage:
#
# python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
#
# Windows usage:
#
# python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
#
#----------------------------------------------------------------------
#
# import needed modules...
#
import sys
import datetime
def main(argv):
#
# print message that process is starting...
#
print("test process starting at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
#
# set local values from input values...
#
data = sys.argv[1]
date = sys.argv[2]
study = sys.argv[3]
logs = sys.argv[4]
#
# print input arguments...
#
print("data value is", data)
print("date value is", date)
print("study value is", study)
print("logs value is", logs)
#
# print message that process is ending...
#
print("test process ending at", datetime.datetime.now().strftime("%Y%m%d %H:%M"))
#
# call main() to begin processing...
#
if __name__ == '__main__':
main(sys.argv)
The script can be executed on a macOS computer in a Terminal shell as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
$ python3 test.py "/Users/lawrence/data" "20220518" "XYZ123" "/Users/lawrence/logs"
test process starting at 20220518 16:51
data value is /Users/lawrence/data
date value is 20220518
study value is XYZ123
logs value is /Users/lawrence/logs
test process ending at 20220518 16:51
The script can also be executed on a Windows computer in a Command Prompt as shown below and the results will be printed to standard output (be sure the current directory includes the test.py file):
D:\scripts>python test.py "D:\data" "20220518" "XYZ123" "D:\logs"
test process starting at 20220518 17:20
data value is D:\data
date value is 20220518
study value is XYZ123
logs value is D:\logs
test process ending at 20220518 17:20
This script answers both questions posed above and is a good starting point for developing scripts that will be executed as commands with input values.
Reason for the new answer:
Existing answers specify multiple options.
Standard option is to use argparse, a few answers provided examples from the documentation, and one answer suggested the advantage of it. But all fail to explain the answer adequately/clearly to the actual question by OP, at least for newbies.
An example of argparse:
import argparse
def load_config(conf_file):
pass
if __name__ == '__main__':
parser = argparse.ArgumentParser()
//Specifies one argument from the command line
//You can have any number of arguments like this
parser.add_argument("conf_file", help="configuration file for the application")
args = parser.parse_args()
config = load_config(args.conf_file)
Above program expects a config file as an argument. If you provide it, it will execute happily. If not, it will print the following
usage: test.py [-h] conf_file
test.py: error: the following arguments are required: conf_file
You can have the option to specify if the argument is optional.
You can specify the expected type for the argument using type key
parser.add_argument("age", type=int, help="age of the person")
You can specify default value for the arguments by specifying default key
This document will help you to understand it to an extent.
I have a Custom operator that inherits baseoperator. I am trying to templatize 'queue' name so that task can be picked up by different Celery worker.
But it uses raw template string (un rendered jinja string) as the queue name instead of rendered string.
The same flow works if I give the intended queue name directly as a simple string.
from airflow import DAG
from operators.check_operator import CheckQueueOperator
from datetime import datetime, timedelta
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.dates import days_ago
default_args = {
'schedule_interval': None, # exclusively “externally triggered” DAG
'owner': 'admin',
'description': 'This helps to quickly check queue templatization',
'start_date': days_ago(1),
'retries': 0,
'retry_delay': timedelta(minutes=5),
'provide_context': True
}
# this goes to wrong queue --> {{ dag_run.conf["queue"]}}
with DAG('test_queue', default_args=default_args) as dag:
t1 = CheckQueueOperator(task_id='check_q',
queue='{{ dag_run.conf["queue"]}}'
)
In the above this scenario :
In RabbitMQ, I see the task being queued under queue name '{{ dag_run.conf["queue"]}}' (raw template string )
In Airflow, under Rendered template I am able to see properly rendered value for queue field
In the screenshot, we see docker-desktop as queue name. It's my test queue and also my default airflow queue. It works perfectly if I give this queue name as a direct string.
#this goes to right queue --> my_target_queue
with DAG('test_queue', default_args=default_args) as dag:
t1 = CheckQueueOperator(task_id='check_q',
queue='my_target_queue'
)
CheckQueueOperator code :
from airflow.models.baseoperator import BaseOperator
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
'''
Validate if queue can be templatized in base operator
'''
class CheckQueueOperator(BaseOperator):
template_fields = ['queue']
#apply_defaults
def __init__(
self,
*args,
**kwargs
):
super(CheckQueueOperator, self).__init__(*args, **kwargs)
def execute(self, context):
self.log.info('*******************************')
self.log.info('Queue name %s', self.queue)
return
Stack details:
Apache Airflow version - 1.10.12
Using CeleryExecutor
Using RabbitMQ
The queue attribute is reserved (perhaps not officially, but in practice) by the BaseOperator and while you may be able to hoodwink the webserver into rendering the attribute, the parts of Airflow that handle scheduling and task execution don't perform rendering prior to reading the queue attribute.
Can ogr2ogr or arcpy do a direct csv to shapefile conversion?
I'm trying to automate some processes with a small script and was hoping I can do it easily with ogr2ogr or arcpy which I'm new to.
Any input would be appreciated.
It can be done with ogr2ogr easily.
Assuming you have a csv-file containing coordinates, like for example (has to be comma seperated):
coord.csv
x,y,z
48.66080825,10.28323850,0
48.66074700,10.28292000,0
48.66075045,10.28249425,0
48.66075395,10.28249175,0
48.66077113,10.28233356,0
48.66080136,10.28213118,0
48.66079620,10.28196900,0
Then you need to create a sample file (name it according to your csv) in the same directory:
coord.vrt
<OGRVRTDataSource>
<OGRVRTLayer name="output">
<SrcDataSource relativeToVRT="1">.</SrcDataSource>
<SrcLayer>coord</SrcLayer>
<GeometryType>wkbPoint</GeometryType>
<LayerSRS>WGS84</LayerSRS>
<GeometryField encoding="PointFromColumns" x="x" y="y"/>
</OGRVRTLayer>
</OGRVRTDataSource>
Then run:
ogr2ogr -f "ESRI Shapefile" . coord.csv && ogr2ogr -f "ESRI Shapefile" . coord.vrt
This will give you "output.shp" in the coordinate system, you specified in the sample file.
Regards,
muxav
You need the following workflow to convert a .csv of coordinates to a feature class using the Python arcpy site-package:
Make XY Event Layer (Data Management) converts the tabular data to a temporary spatial layer
Feature Class To Feature Class (Conversion) converts the layer to a permanent feature class
This should get you started.
import arcpy
from arcpy import env
# Set environment settings
env.workspace = "C:/data"
ws = env.workspace
# Set the local variables
in_Table = "your_table.csv"
x_coords = "POINT_X"
y_coords = "POINT_Y"
z_coords = "POINT_Z"
out_Layer = "your_layer"
# Set the spatial reference--this is simply a path to a .prj file
spRef = r"Coordinate Systems\Projected Coordinate Systems\Utm\Nad 1983\NAD 1983 UTM Zone 11N.prj"
# Make the XY event layer...
arcpy.MakeXYEventLayer_management(in_Table, x_coords, y_coords, out_Layer, spRef, z_coords)
# Now convert to a feature class
arcpy.FeatureClassToFeatureClass_conversion (out_layer, ws, "out.shp")
I did not have success with any of the solutions here but I was able come up with a solution that worked using Python's shapely and fiona modules. It uses a tab-delineated .ascii file (my preference as opposed to .csv) but can easily be adapted to use a .csv as in the question posed. Hopefully this is helpful someone else trying to automate this same task.
# ------------------------------------------------------
# IMPORTS
# ------------------------------------------------------
import os
import pandas as pd
from shapely.geometry import Point, mapping
from fiona import collection
# ------------------------------------------------------
# INPUTS
# ------------------------------------------------------
# Define path
path = os.path.abspath(os.path.dirname(__file__))
# Set working directory
os.chdir(path)
# Define file to convert
file = 'points.ascii'
# Define shp file schema
schema = { 'geometry': 'Point', 'properties': { 'LocationID': 'str', 'Latitude': 'float', 'Longitude': 'float' } }
# Read in data
data = pd.read_csv(file, sep='\t')
# Define shp file to write to
shpOut = 'points.shp'
# Create shp file
with collection(shpOut, "w", "ESRI Shapefile", schema) as output:
# Loop through dataframe and populate shp file
for index, row in data.iterrows():
# Define point
point = Point(row['Longitude'], row['Latitude'])
# Write output
output.write({
'properties': {'LocationID': row['LocationID'], 'Latitude': row['Latitude'], 'Longitude': row['Longitude'] },
'geometry': mapping(point)
})