Macros in the Airflow Python operator - jinja2

Can I use macros with the PythonOperator? I tried the following, but I was unable to get the macros rendered:
dag = DAG(
    'temp',
    default_args=default_args,
    description='temp dag',
    schedule_interval=timedelta(days=1))

def temp_def(a, b, **kwargs):
    print '{{ds}}'
    print '{{execution_date}}'
    print 'a=%s, b=%s, kwargs=%s' % (str(a), str(b), str(kwargs))

ds = '{{ ds }}'
mm = '{{ execution_date }}'

t1 = PythonOperator(
    task_id='temp_task',
    python_callable=temp_def,
    op_args=[mm, ds],
    provide_context=False,
    dag=dag)

In my opinion, a more native Airflow way of approaching this is to use the included PythonOperator with the provide_context=True parameter, like so:
t1 = PythonOperator(
    task_id='temp_task',
    python_callable=temp_def,
    provide_context=True,
    dag=dag)
Now you have access to all of the macros, Airflow metadata and task parameters in the kwargs of your callable:
def temp_def(**kwargs):
    print 'ds={}, execution_date={}'.format(str(kwargs['ds']), str(kwargs['execution_date']))
If you have custom params defined on the task, you can access those as well via kwargs['params'].
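For illustration, here is a minimal sketch of that pattern (the task_id, the example param name greeting and the callable body are assumptions, not from the original post):
def temp_def(**kwargs):
    # Rendered macro values arrive in the context kwargs
    print 'ds={}, execution_date={}'.format(kwargs['ds'], kwargs['execution_date'])
    # Custom params passed to the operator show up under kwargs['params']
    print 'greeting={}'.format(kwargs['params'].get('greeting'))

t1 = PythonOperator(
    task_id='temp_task',
    python_callable=temp_def,
    provide_context=True,
    params={'greeting': 'hello'},  # arbitrary example params
    dag=dag)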

Macros only get processed for templated fields. To get Jinja to process op_args, extend the PythonOperator with your own subclass:
class MyPythonOperator(PythonOperator):
    template_fields = ('templates_dict', 'op_args')
I added 'templates_dict' to the template_fields because the PythonOperator itself already has that field templated (see the PythonOperator source).
Now you should be able to use a macro within that field:
ds = '{{ ds }}'
mm = '{{ execution_date }}'

t1 = MyPythonOperator(
    task_id='temp_task',
    python_callable=temp_def,
    op_args=[mm, ds],
    provide_context=False,
    dag=dag)
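Alternatively, since templates_dict is already a templated field on the stock PythonOperator, you can avoid the subclass and pass the macros through it; a minimal sketch (the callable body is illustrative):
def temp_def(**kwargs):
    # With provide_context=True, the rendered templates_dict is available in kwargs
    rendered = kwargs['templates_dict']
    print 'ds={}, execution_date={}'.format(rendered['ds'], rendered['execution_date'])

t1 = PythonOperator(
    task_id='temp_task',
    python_callable=temp_def,
    templates_dict={'ds': '{{ ds }}', 'execution_date': '{{ execution_date }}'},
    provide_context=True,
    dag=dag)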

Related

How to deserialize Xcom strings in Airflow?

Consider a DAG containing two tasks: Task A >> Task B (BashOperators or DockerOperators). They need to communicate through XComs.
Task A outputs its information as a one-line JSON string on stdout, which can then be retrieved from Task A's logs, and therefore from its return_value XCom key if xcom_push=True. For instance: {"key1":1,"key2":3}
Task B only needs the key2 value from Task A, so we need to deserialize Task A's return_value XCom to extract just that value and pass it directly to Task B, using the Jinja template {{ xcom_pull('task_a')['key2'] }}. Used as is, this results in jinja2.exceptions.UndefinedError: 'str object' has no attribute 'key2', because return_value is just a string.
By comparison, Airflow Variables can be deserialized in Jinja templates (e.g. {{ var.json.my_var.path }}). I would like to do the same thing with XComs.
Edit: a workaround is to convert the JSON string into a Python dictionary before sending it to XCom (see below).
You can add a post_execute function to the BashOperator that deserializes the result and pushes each key separately:
import json

def _post(context, result):
    ti = context["ti"]
    output = json.loads(result)
    for key, value in output.items():
        ti.xcom_push(key, value)

BashOperator(
    task_id="task_id",
    bash_command='bash command',
    post_execute=_post
)
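Task B can then pull the individual key directly in its own template; a sketch assuming the upstream task is called task_a and pushed key2 as above:
BashOperator(
    task_id="task_b",
    # key2 was pushed as a separate XCom key by the post_execute hook above
    bash_command="echo {{ ti.xcom_pull(task_ids='task_a', key='key2') }}",
)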
A workaround is to create a custom operator (inheriting from BashOperator or DockerOperator) and augment its execute method so that it:
- runs the original execute method,
- intercepts the last log line of the task,
- tries to json.loads() it into a Python dictionary,
- and finally returns the output (which is now a dictionary, not a string).
The previous Jinja template {{ xcom_pull('task_a')['key2'] }} now works in task B, since the XCom value is a Python dictionary.
import json

class BashOperatorExtended(BashOperator):
    def execute(self, context):
        output = BashOperator.execute(self, context)
        try:
            output = json.loads(output)
        except (ValueError, TypeError):
            # Not valid JSON, keep the raw string output
            pass
        return output

class DockerOperatorExtended(DockerOperator):
    def execute(self, context):
        output = DockerOperator.execute(self, context)
        try:
            output = json.loads(output)
        except (ValueError, TypeError):
            pass
        return output
But creating a new operator just for that purpose is not really satisfying.

AWS API re-deployment using ansible

I have an existing API in my AWS account. Now I am trying to use Ansible to redeploy the API after introducing resource policy changes.
According to AWS, I need to use the CLI command below to redeploy the API:
- name: deploy API
  command: >
    aws apigateway update-stage --region us-east-1 \
      --rest-api-id <rest-api-id> \
      --stage-name 'stage' \
      --patch-operations op='replace',path='/deploymentId',value='<deployment-id>'
Above, the 'deploymentId' from the previous deployment will be different after every deployment, which is why I am trying to capture it as a variable so the redeployment steps can be automated.
I can get the previous deployment information using the CLI below:
- name: Get deployment information
  command: >
    aws apigateway get-deployments \
      --rest-api-id 123454ne \
      --region us-east-1
  register: deployment_info
And the output looks like this:
deployment_info.stdout_lines:
- '{'
- ' "items": ['
- ' {'
- ' "id": "abcd",'
- ' "createdDate": 1228509116'
- ' }'
- ' ]'
- '}'
I was using deployment_info.items.id as the deploymentId and couldn't make it work. I am now stuck on what the right Ansible expression is to extract the id from this output. How can I use this id as the deploymentId in the deployment command?
I created a small Ansible module which you might find useful:
#!/usr/bin/python

# Creates a new deployment for an API GW stage
# See https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-deployments.html
# Based on https://github.com/ansible-collections/community.aws/blob/main/plugins/modules/aws_api_gateway.py

# TODO needed?
# from __future__ import absolute_import, division, print_function
# __metaclass__ = type

import json
import traceback

try:
    import botocore
except ImportError:
    pass  # Handled by AnsibleAWSModule

from ansible.module_utils.common.dict_transformations import camel_dict_to_snake_dict
from ansible_collections.amazon.aws.plugins.module_utils.core import AnsibleAWSModule
from ansible_collections.amazon.aws.plugins.module_utils.ec2 import AWSRetry


def main():
    argument_spec = dict(
        api_id=dict(type='str', required=True),
        stage=dict(type='str', required=True),
        deploy_desc=dict(type='str', required=False, default='')
    )

    module = AnsibleAWSModule(
        argument_spec=argument_spec,
        supports_check_mode=True
    )

    api_id = module.params.get('api_id')
    stage = module.params.get('stage')
    client = module.client('apigateway')

    # Update stage if not in check_mode
    deploy_response = None
    changed = False
    if not module.check_mode:
        try:
            deploy_response = create_deployment(client, api_id, **module.params)
            changed = True
        except (botocore.exceptions.ClientError, botocore.exceptions.EndpointConnectionError) as e:
            msg = "Updating api {0}, stage {1}".format(api_id, stage)
            module.fail_json_aws(e, msg)

    exit_args = {"changed": changed, "api_deployment_response": deploy_response}
    module.exit_json(**exit_args)


retry_params = {"retries": 10, "delay": 10, "catch_extra_error_codes": ['TooManyRequestsException']}


@AWSRetry.jittered_backoff(**retry_params)
def create_deployment(client, rest_api_id, **params):
    result = client.create_deployment(
        restApiId=rest_api_id,
        stageName=params.get('stage'),
        description=params.get('deploy_desc')
    )
    return result


if __name__ == '__main__':
    main()

JINA#4428[C]:Can not fetch the URL of Hubble from `api.jina.ai`

I was trying out the Semantic Wikipedia Search from jina-ai.
This is the error I am getting after running python app.py -t index.
app.py is used to index the data.
JINA#4489[C]:Can not fetch the URL of Hubble from api.jina.ai
HubIO#4489[E]:Error while pulling jinahub+docker://TransformerTorchEncoder:
JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
This is app.py:
__copyright__ = "Copyright (c) 2021 Jina AI Limited. All rights reserved."
__license__ = "Apache-2.0"

import os
import sys
import click
import random

from jina import Flow, Document, DocumentArray
from jina.logging.predefined import default_logger as logger

MAX_DOCS = int(os.environ.get('JINA_MAX_DOCS', 10000))


def config(dataset: str):
    if dataset == 'toy':
        os.environ['JINA_DATA_FILE'] = os.environ.get('JINA_DATA_FILE', 'data/toy-input.txt')
    elif dataset == 'full':
        os.environ['JINA_DATA_FILE'] = os.environ.get('JINA_DATA_FILE', 'data/input.txt')
    os.environ['JINA_PORT'] = os.environ.get('JINA_PORT', str(45678))
    cur_dir = os.path.dirname(os.path.abspath(__file__))
    os.environ.setdefault('JINA_WORKSPACE', os.path.join(cur_dir, 'workspace'))
    os.environ.setdefault('JINA_WORKSPACE_MOUNT',
                          f'{os.environ.get("JINA_WORKSPACE")}:/workspace/workspace')


def print_topk(resp, sentence):
    for doc in resp.data.docs:
        print(f"\n\n\nTa-Dah🔮, here's what we found for: {sentence}")
        for idx, match in enumerate(doc.matches):
            score = match.scores['cosine'].value
            print(f'> {idx:>2d}({score:.2f}). {match.text}')


def input_generator(num_docs: int, file_path: str):
    with open(file_path) as file:
        lines = file.readlines()
        num_lines = len(lines)
        random.shuffle(lines)
        for i in range(min(num_docs, num_lines)):
            yield Document(text=lines[i])


def index(num_docs):
    flow = Flow().load_config('flows/flow.yml')
    data_path = os.path.join(os.path.dirname(__file__), os.environ.get('JINA_DATA_FILE', None))
    with flow:
        flow.post(on='/index', inputs=input_generator(num_docs, data_path),
                  show_progress=True)


def query(top_k):
    flow = Flow().load_config('flows/flow.yml')
    with flow:
        text = input('Please type a sentence: ')
        doc = Document(content=text)
        result = flow.post(on='/search', inputs=DocumentArray([doc]),
                           parameters={'top_k': top_k},
                           line_format='text',
                           return_results=True,
                           )
        print_topk(result[0], text)


@click.command()
@click.option(
    '--task',
    '-t',
    type=click.Choice(['index', 'query'], case_sensitive=False),
)
@click.option('--num_docs', '-n', default=MAX_DOCS)
@click.option('--top_k', '-k', default=5)
@click.option('--dataset', '-d', type=click.Choice(['toy', 'full']), default='toy')
def main(task, num_docs, top_k, dataset):
    config(dataset)
    if task == 'index':
        if os.path.exists(os.environ.get("JINA_WORKSPACE")):
            logger.error(f'\n +---------------------------------------------------------------------------------+ \
                \n |                                   🤖🤖🤖                                        | \
                \n | The directory {os.environ.get("JINA_WORKSPACE")} already exists. Please remove it before indexing again. | \
                \n |                                   🤖🤖🤖                                        | \
                \n +---------------------------------------------------------------------------------+')
            sys.exit(1)
        index(num_docs)
    elif task == 'query':
        query(top_k)


if __name__ == '__main__':
    main()
This is flow.yml
version: '1'  # This is the yml file version
with:  # Additional arguments for the flow
  workspace: $JINA_WORKSPACE  # Workspace folder path
  port_expose: $JINA_PORT  # Network Port for the flow
executors:  # Now, define the executors that are run on this flow
  - name: transformer  # This executor computes an embedding based on the input text documents
    uses: 'jinahub+docker://TransformerTorchEncoder'  # We use a Transformer Torch Encoder from the hub as a docker container
  - name: indexer  # Now, index the text documents with the embeddings
    uses: 'jinahub://SimpleIndexer'  # We use the SimpleIndexer for this purpose
And when I try to execute app.py -t index, this is the error:
JINA#3803[C]:Can not fetch the URL of Hubble from `api.jina.ai` HubIO#3803[E]:Error while pulling jinahub+docker://TransformerTorchEncoder: JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
I think this just happened because the API was down. It should work now.

Using python argparse arguments as variable values within a json file

I've googled this quite a bit and am unable to find helpful insight. Basically, I need to take the user input from my argparse arguments in a Python script (shown below) and plug those values into a JSON file (packerfile.json) located in the same working directory. I have been experimenting with the subprocess, invoke and plumbum libraries without being able to "find the shoe that fits".
From the following code, I have removed everything except the arguments, to keep it clean:
#!/usr/bin/python
import os, sys, subprocess
import argparse
import json
from invoke import run
import packer

parser = argparse.ArgumentParser()
parser._positionals.title = 'Positional arguments'
parser._optionals.title = 'Optional arguments'
parser.add_argument("--access_key",
                    required=False,
                    action='store',
                    default=os.environ['AWS_ACCESS_KEY_ID'],
                    help="AWS access key id")
parser.add_argument("--secret_key",
                    required=False,
                    action='store',
                    default=os.environ['AWS_SECRET_ACCESS_KEY'],
                    help="AWS secret access key")
parser.add_argument("--region",
                    required=False,
                    action='store',
                    help="AWS region")
parser.add_argument("--guest_os_type",
                    required=True,
                    action='store',
                    help="Operating system to install on guest machine")
parser.add_argument("--ami_id",
                    required=False,
                    help="AMI ID for image base")
parser.add_argument("--instance_type",
                    required=False,
                    action='store',
                    help="Type of instance determines overall performance (e.g. t2.medium)")
parser.add_argument("--ssh_key_path",
                    required=False,
                    action='store',
                    default=os.environ['HOME'] + '/.ssh',
                    help="SSH key path (e.g. ~/.ssh)")
parser.add_argument("--ssh_key_name",
                    required=True,
                    action='store',
                    help="SSH key name (e.g. mykey)")
args = parser.parse_args()
print(vars(args))
JSON example code:
{
  "variables": {
    "aws_access_key": "{{ env `AWS_ACCESS_KEY_ID` }}",
    "aws_secret_key": "{{ env `AWS_SECRET_ACCESS_KEY` }}",
    "magic_reference_date": "{{ isotime \"2006-01-02\" }}",
    "aws_region": "{{ env 'AWS_REGION' }}",
    "aws_ami_id": "ami-036affea69a1101c9",
    "aws_instance_type": "t2.medium",
    "image_version": "0.1.0",
    "guest_os_type": "centos7",
    "home": "{{ env `HOME` }}"
  },
So, the user input for --region as shown in the Python script should get plugged into the value for aws_region in the JSON file.
I am aware of how to print the value of args. The full command that I am providing to the script is python packager.py --region us-west-2 --guest_os_type rhel7 --ssh_key_name test_key, and the printed result is {'access_key': 'REDACTED', 'secret_key': 'REDACTED', 'region': 'us-west-2', 'guest_os_type': 'rhel7', 'ami_id': None, 'instance_type': None, 'ssh_key_path': '/Users/REDACTED/.ssh', 'ssh_key_name': 'test_key'}. What I need is to import those values into the packerfile.json variables list, preferably in a way that I can reuse (so it mustn't overwrite the file).
Note: I have also been experimenting with using Python to export local environment variables and then having the JSON file pick them up, but that doesn't really seem like a viable solution.
I think the best solution might be to export all of these arguments to their own JSON file called variables.json, and then import those variables from variables.json into packerfile.json as a separate process. I could still use some guidance here, though :)
You might use the __dict__ attribute of the Namespace object returned by parser.parse_args(), like so:
import json

parsed = parser.parse_args()
with open('packerfile.json', 'w') as f:
    json.dump(parsed.__dict__, f)
If required, you could use add_argument(dest='attrib_name') to customise attribute names.
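Since the argparse attribute names (region, ami_id, ...) don't match the variable names packerfile.json expects (aws_region, aws_ami_id, ...), a small mapping step helps; a hedged sketch, with the mapping assumed from the variables shown above:
import json

args = parser.parse_args()

# Map argparse attribute names onto the packer variable names
variables = {
    'aws_region': args.region,
    'aws_ami_id': args.ami_id,
    'aws_instance_type': args.instance_type,
    'guest_os_type': args.guest_os_type,
}

# Write a separate var file so packerfile.json itself is never modified
with open('variables.json', 'w') as f:
    json.dump({k: v for k, v in variables.items() if v is not None}, f, indent=4)
Running packer build -var-file=variables.json packerfile.json then overrides the template's defaults without touching the template file, so it stays reusable.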
I was actually able to come up with a pretty simple solution.
args = parser.parse_args()
json_formatted = json.dumps(vars(args), indent=4)
print(json_formatted)
s.call("echo '%s' > variables.json && packer build -var-file=variables.json packerfile.json" % json_formatted, shell=True)
The arguments are captured under the variable args and dumped with json.dumps, while vars makes sure the arguments are dumped together with their key names. I currently have to run my code with >> vars.json, but I'll insert logic to have Python handle that.
Note: s == subprocess in s.call
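If you'd rather not shell out through echo, a variant of the same idea (same file names as above) writes the file from Python and calls packer with an argument list:
import json
import subprocess

args = parser.parse_args()

with open('variables.json', 'w') as f:
    json.dump(vars(args), f, indent=4)

# Passing a list avoids shell quoting issues around the JSON string
subprocess.call(['packer', 'build', '-var-file=variables.json', 'packerfile.json'])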

Pipeline doesn't write to MySQL but also gives no error

I've tried to implement this pipeline in my spider.
After installing the necessary dependencies, I am able to run the spider without any errors, but for some reason it doesn't write to my database.
I'm pretty sure something is going wrong when connecting to the database: when I enter a wrong password, I still don't get any error.
When the spider has scraped all the data, it takes a few minutes before it starts dumping the stats.
2017-08-31 13:17:12 [scrapy] INFO: Closing spider (finished)
2017-08-31 13:17:12 [scrapy] INFO: Stored csv feed (27 items) in: test.csv
2017-08-31 13:24:46 [scrapy] INFO: Dumping Scrapy stats:
Pipeline:
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy import log

SETTINGS = {}
SETTINGS['DB_HOST'] = 'mysql.domain.com'
SETTINGS['DB_USER'] = 'username'
SETTINGS['DB_PASSWD'] = 'password'
SETTINGS['DB_PORT'] = 3306
SETTINGS['DB_DB'] = 'database_name'


class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def __init__(self, stats):
        print "init"
        # Instantiate DB
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=SETTINGS['DB_HOST'],
            user=SETTINGS['DB_USER'],
            passwd=SETTINGS['DB_PASSWD'],
            port=SETTINGS['DB_PORT'],
            db=SETTINGS['DB_DB'],
            charset='utf8',
            use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor
        )
        self.stats = stats
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        print "close"
        """ Cleanup function, called after crawling has finished to close open
            objects.
            Close ConnectionPool. """
        self.dbpool.close()

    def process_item(self, item, spider):
        print "process"
        query = self.dbpool.runInteraction(self._insert_record, item)
        query.addErrback(self._handle_error)
        return item

    def _insert_record(self, tx, item):
        print "insert"
        result = tx.execute(
            " INSERT INTO matches(type,home,away,home_score,away_score) VALUES (soccer,"+item["home"]+","+item["away"]+","+item["score"].explode("-")[0]+","+item["score"].explode("-")[1]+")"
        )
        if result > 0:
            self.stats.inc_value('database/items_added')

    def _handle_error(self, e):
        print "error"
        log.err(e)
Spider:
import scrapy
import dateparser
from crawling.items import KNVBItem


class KNVBspider(scrapy.Spider):
    name = "knvb"
    start_urls = [
        'http://www.knvb.nl/competities/eredivisie/uitslagen',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawling.pipelines.MySQLPipeline': 301,
        }
    }

    def parse(self, response):
        # www.knvb.nl/competities/eredivisie/uitslagen
        for row in response.xpath('//div[@class="table"]'):
            for div in row.xpath('./div[@class="row"]'):
                match = KNVBItem()
                match['home'] = div.xpath('./div[@class="value home"]/div[@class="team"]/text()').extract_first()
                match['away'] = div.xpath('./div[@class="value away"]/div[@class="team"]/text()').extract_first()
                match['score'] = div.xpath('./div[@class="value center"]/text()').extract_first()
                match['date'] = dateparser.parse(div.xpath('./preceding-sibling::div[@class="header"]/span/span/text()').extract_first(), languages=['nl']).strftime("%d-%m-%Y")
                yield match
If there are better pipelines available to do what I'm trying to achieve that'd be welcome as well. Thanks!
Update:
With the link provided in the accepted answer, I eventually got to this function, which works (and thus solved my problem):
def process_item(self, item, spider):
    print "process"
    query = self.dbpool.runInteraction(self._insert_record, item)
    query.addErrback(self._handle_error)
    query.addBoth(lambda _: item)
    return query
Take a look at this for how to use adbapi with MySQL for saving scraped items. Note the difference between your process_item method and theirs. While you return the item immediately, they return the Deferred object that results from runInteraction, and which returns the item once it completes. I think this is the reason your _insert_record never gets called.
If you can see "insert" in your output, that's already a good sign.
I'd rewrite the insert function this way:
def _insert_record(self, tx, item):
    print "insert"
    raw_sql = "INSERT INTO matches(type,home,away,home_score,away_score) VALUES ('%s', '%s', '%s', '%s', '%s')"
    sql = raw_sql % ('soccer', item['home'], item['away'], item['score'].split('-')[0], item['score'].split('-')[1])
    print sql
    result = tx.execute(sql)
    if result > 0:
        self.stats.inc_value('database/items_added')
This lets you debug the SQL you're actually running. In your version you're not wrapping the string values in ', which is a syntax error in MySQL (and note that Python strings use split('-'), not explode).
I'm not sure about your last values (the score parts), so I treated them as strings.
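As a side note, letting the MySQLdb driver fill in the placeholders avoids the manual quoting entirely (and protects against broken or malicious values); a sketch of the same insert:
def _insert_record(self, tx, item):
    home_score, away_score = item['score'].split('-')
    # The driver quotes and escapes each parameter itself
    result = tx.execute(
        "INSERT INTO matches(type,home,away,home_score,away_score) "
        "VALUES (%s, %s, %s, %s, %s)",
        ('soccer', item['home'], item['away'], home_score.strip(), away_score.strip())
    )
    if result > 0:
        self.stats.inc_value('database/items_added')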