How to obtain and process MySQL records using Airflow?

I need to:
1. Run a SELECT query on a MySQL DB and fetch the records.
2. Process the records with a Python script.
I am unsure how to proceed. Is XCom the way to go here? Also, MySqlOperator only executes the query; it doesn't fetch the records. Is there a built-in transfer operator I can use? How can I use a MySQL hook here?
One suggestion I found: "you may want to use a PythonOperator that uses the hook to get the data, apply transformation and ship the (now scored) rows back some other place."
Can someone explain how to proceed with this?
Reference: http://markmail.org/message/x6nfeo6zhjfeakfe
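def do_work():
    # The schema goes to the hook; get_records() only takes the SQL and optional parameters.
    mysqlserver = MySqlHook(mysql_conn_id=connection_id, schema='testdb')
    sql = "SELECT * from table where col > 100"
    records = mysqlserver.get_records(sql)
    print(records[0][0])  # first column of the first row
callMYSQLHook = PythonOperator(
    task_id='fetch_from_testdb',
    python_callable=do_work,  # was mysqlHook, which is not defined anywhere
    dag=dag
)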
Is this the correct way to proceed?
Also, how do we use XComs to store the records for the following MySqlOperator?
t = MySqlOperator(
    mysql_conn_id='mysql_default',
    task_id='basic_mysql',
    sql="SELECT count(*) from table1 where id > 10",
    dag=dag)

I struggled with this for 90 minutes, so here is a more explicit walkthrough for newcomers:
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator
def fetch_records():
    request = "SELECT * FROM your_table"
    mysql_hook = MySqlHook(mysql_conn_id='the_connection_name_sourced_from_the_ui', schema='specific_db')
    connection = mysql_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall()
    print(sources)
...your DAG() as dag: code
    task = PythonOperator(
        task_id='fetch_records',
        python_callable=fetch_records
    )
This prints the results of your query to the task logs.
I hope this is of use to someone else.
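If a later task needs the rows rather than just a log line, the callable can return them: PythonOperator pushes the return value of its callable to XCom. The sketch below shows only the fetch-then-process shape of such a callable. FakeHook, fetch_and_process, and the column names are made up for illustration, and FakeHook stands in for MySqlHook so the snippet runs without Airflow or a database. XCom is meant for small payloads, so this suits a handful of rows, not a whole table.

```python
class FakeHook:
    """Hypothetical stand-in for MySqlHook; returns canned rows instead of querying."""

    def get_records(self, sql, parameters=None):
        # A real hook would run `sql` against the database here.
        return [(1, 150), (2, 320)]


def fetch_and_process(hook):
    """Fetch rows through the hook and return a small, JSON-friendly result."""
    rows = hook.get_records(
        "SELECT id, col FROM your_table WHERE col > %s", parameters=(100,)
    )
    # Whatever this function returns is what a PythonOperator would push to XCom.
    return [{"id": row[0], "col_doubled": row[1] * 2} for row in rows]


processed = fetch_and_process(FakeHook())
print(processed)
```

In a DAG, the same function would be wrapped in a small callable that builds the real hook and is passed as python_callable.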

Sure, just create a hook or operator and call the get_records() method: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/hooks/dbapi.html

Related

laravel to MySql query not displaying accurate results

I tried to do an insert and then update another table with the sum of the stored total and the new value, but I'm getting wrong results.
Please help me convert this Laravel code to SQL.
Thanks in advance.
Laravel:
$viaturas = Viaturas::firstWhere('matricula', $viaverde->matricula);
$viaturas->total_viaverde = $viaverde->custo + $viaturas->total_viaverde;
$viaturas->update();
My SQL attempt, which does not give the right result:
UPDATE `viaturas`
INNER JOIN `viaverde` ON `viaverde`.`matricula`=`viaturas`.`matricula`
SET `viaturas`.`total_viaverde` = (SELECT SUM(`viaverde`.`custo`));
The correct method is save()
$viaturas = Viaturas::firstWhere('matricula', $viaverde->matricula);
$viaturas->total_viaverde = $viaverde->custo + $viaturas->total_viaverde;
$viaturas->save();
If you use update(), you have two options
From the Model object
$viaturas = Viaturas::firstWhere('matricula', $viaverde->matricula);
# update this $viatura model
$viaturas->update(['total_viaverde' => $viaverde->custo + $viaturas->total_viaverde]);
From the Query Builder
# update every Viatura where matricula = $viaverde->matricula
Viaturas::query()
    ->where('matricula', $viaverde->matricula)
    ->update(['total_viaverde' => $viaverde->custo + $viaturas->total_viaverde]);
To run that specific SQL query from the query builder, I think this should do it
Viaturas::query()
    ->join('viaverde', 'viaverde.matricula', 'viaturas.matricula')
    ->update(['viaturas.total_viaverde' => DB::raw('(select sum(`viaverde`.`custo`))')]);

Hash a Select SQLAlchemy query

I have a SQLAlchemy query that I build, such as :
query_one = User.query.filter(User.id == 1) # Note that I don't call .first() or .all() as I want the "select" instance.
I want to store this Select query in such a way that I can retrieve it by having the same query :
stored_queries = {}
stored_queries[hash(query_one)] = query_one
# ... later on:
query_two = User.query.filter(User.id == 1)
if hash(query_two) in stored_queries:
# Execute custom code because it's the same query
Of course, hash in that case does not work, but is there a SQLAlchemy method that works in the same way?
I thought of str(query_one), but that only captures the SQL text, not the bound values. I need both.
Thank you in advance.
You can compile the query to get access to the parameters, and use those as part of your key:
def query_key(query):
    # Compile the argument that was passed in, not the global query_one
    statement = query.statement.compile()
    return str(statement), str(statement.params)
query_key(query_one)
('SELECT user.id, ... FROM user WHERE user.id = :id_1', "{'id_1': 1}")
See https://docs.sqlalchemy.org/en/14/core/selectable.html#sqlalchemy.sql.expression.TableClause.compile

Celery task inserts duplicate data into MySQL

I am using Celery to run asynchronous jobs in Python. My code flow is as follows:
1. A Celery task gets some data from a remote API.
2. Celery beat gets the task result from the Celery backend, which is Redis, and then inserts the result into MySQL.
In step 2, before I insert the result data into MySQL, I check whether the data already exists. Despite this check, duplicate data is still inserted.
My code is as follows:
def get_task_result(logger=None):
    db = MySQLdb.connect(host=MYSQL_HOST, port=MYSQL_PORT, user=MYSQL_USER, passwd=MYSQL_PASSWD, db=MYSQL_DB, cursorclass=MySQLdb.cursors.DictCursor, use_unicode=True, charset='utf8')
    cursor = db.cursor()
    ....
    ....
    store_subdomain_result(db, cursor, asset_id, celery_task_result)
    ....
    ....
    cursor.close()
    db.close()
def store_subdomain_result(db, cursor, top_domain_id, celery_task_result, logger=None):
    subdomain_list = celery_task_result.get('result').get('subdomain_list')
    source = celery_task_result.get('result').get('source')
    for domain in subdomain_list:
        query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
        cursor.execute(query_subdomain_sql)
        sub_domain_result = cursor.fetchone()
        if sub_domain_result:
            asset_id = sub_domain_result.get('id')
            existed_source = sub_domain_result.get('source')
            if source not in existed_source:
                new_source = f'{existed_source},{source}'
                update_domain_sql = f'UPDATE nw_asset SET source="{new_source}" WHERE id={asset_id}'
                cursor.execute(update_domain_sql)
                db.commit()
        else:
            insert_subdomain_sql = f'INSERT INTO nw_asset(domain) values("{domain}")'
            cursor.execute(insert_subdomain_sql)
            db.commit()
I first check whether the data exists, and only insert it if it does not. The check is:
query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
cursor.execute(query_subdomain_sql)
sub_domain_result = cursor.fetchone()
I do this, but duplicate data is still inserted, and I can't understand why.
I googled the problem, and some people suggest using INSERT IGNORE, REPLACE INTO, or a unique index, but I want to know why the code does not work as expected.
Also, could it be that MySQL has some kind of cache, so that when I run the SELECT the earlier row has not really been written yet (it is still waiting to be flushed) and the SELECT returns nothing?
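For reference, the unique-index suggestion mentioned above can be sketched as follows. This is only an illustration, not a diagnosis of the code in the question. It uses Python's built-in sqlite3 instead of MySQL so that it runs anywhere, and SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE. The table and column names follow the question; the index name, the store_domains helper, and the sample domains are made up. The point is that the database rejects the duplicate itself, so the result does not depend on the timing of a separate SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nw_asset (id INTEGER PRIMARY KEY, domain TEXT NOT NULL, source TEXT)"
)
# The unique index is what actually prevents duplicate domains.
conn.execute("CREATE UNIQUE INDEX idx_nw_asset_domain ON nw_asset (domain)")


def store_domains(conn, domains):
    """Insert each domain once; rows that already exist are silently skipped."""
    for domain in domains:
        # Parameter binding also avoids building SQL with f-strings.
        conn.execute("INSERT OR IGNORE INTO nw_asset (domain) VALUES (?)", (domain,))
    conn.commit()


# Two overlapping batches, as two beat runs might produce.
store_domains(conn, ["a.example.com", "b.example.com"])
store_domains(conn, ["b.example.com", "c.example.com"])

domain_count = conn.execute("SELECT COUNT(*) FROM nw_asset").fetchone()[0]
print(domain_count)
```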

Unable to retrieve data from my sql database using pymysql

I have retrieved data from my database successfully before, but this time the query runs inside an if statement and it does not work. The code looks like:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = """select x,y,z from table where type='sample' and code=%s"""
cur_msql.execute(select_query, code)
result2 = cur_msql.fetchone()
if(result2==None):
    insert_func(code)
    select_query = f"""select x,y,z from table where type='sample' and code='{code}'"""
    mycur = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
    print(select_query)
    mycur.execute(select_query)
    result3 = mycur.fetchone()
if(result2==None):
    result2=result3
Now I see that insert_func does successfully insert into the 'table'. However, on trying to fetch that row, immediately after the insertion, it returns None as if the row is absent. On debugging I find that result3 is also None. Nothing looks wrong to me but it's not working.
You are not calling execute() the right way. cur_msql.execute should receive the query and a tuple of values, and you are passing just a single value:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = "select learnpath_code,learnpath_id,learnpath_name from contentgrail.knowledge_vectors_test where Type='chapters' and code=%s"
cur_msql.execute(select_query, (meta['chapter_code'],))
result2 = cur_msql.fetchone()
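The trailing comma in (meta['chapter_code'],) is what makes the argument a tuple; parentheses alone do not. A small sketch of the difference, using the built-in sqlite3 module as a stand-in so it runs without a MySQL server (sqlite3 uses ? as the placeholder where pymysql uses %s). The samples table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (code TEXT, x INTEGER)")
conn.execute("INSERT INTO samples VALUES (?, ?)", ("ch01", 7))

code = "ch01"
not_a_tuple = (code)   # parentheses alone change nothing: this is still a str
one_tuple = (code,)    # the trailing comma creates a one-element tuple

# Pass the one-element tuple as the parameter sequence.
row = conn.execute("SELECT x FROM samples WHERE code = ?", one_tuple).fetchone()
print(type(not_a_tuple).__name__, type(one_tuple).__name__, row)
```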

Django bulk update setting each to different values? [duplicate]

I'd like to update a table with Django - something like this in raw SQL:
update tbl_name set name = 'foo' where name = 'bar'
My first result is something like this - but that's nasty, isn't it?
list = ModelClass.objects.filter(name = 'bar')
for obj in list:
    obj.name = 'foo'
    obj.save()
Is there a more elegant way?
Update:
Django 2.2 version now has a bulk_update.
Old answer:
Refer to the following django documentation section
Updating multiple objects at once
In short you should be able to use:
ModelClass.objects.filter(name='bar').update(name="foo")
You can also use F objects to do things like incrementing rows:
from django.db.models import F
Entry.objects.all().update(n_pingbacks=F('n_pingbacks') + 1)
See the documentation.
However, note that:
This won't use ModelClass.save method (so if you have some logic inside it won't be triggered).
No django signals will be emitted.
You can't perform an .update() on a sliced QuerySet, it must be on an original QuerySet so you'll need to lean on the .filter() and .exclude() methods.
Consider using django-bulk-update found here on GitHub.
Install: pip install django-bulk-update
Implement: (code taken directly from the project's README file)
import random
from bulk_update.helper import bulk_update
random_names = ['Walter', 'The Dude', 'Donny', 'Jesus']
people = Person.objects.all()
for person in people:
    r = random.randrange(4)
    person.name = random_names[r]
bulk_update(people) # updates all columns using the default db
Update: As Marc points out in the comments, this is not suitable for updating thousands of rows at once, though it is suitable for smaller batches of tens to hundreds. The batch size that is right for you depends on your CPU and query complexity. This tool is more like a wheelbarrow than a dump truck.
Django 2.2 version now has a bulk_update method (release notes).
https://docs.djangoproject.com/en/stable/ref/models/querysets/#bulk-update
Example:
# get a pk: record dictionary of existing records
updates = YourModel.objects.filter(...).in_bulk()
....
# do something with the updates dict
....
if hasattr(YourModel.objects, 'bulk_update') and updates:
    # Use the new method
    YourModel.objects.bulk_update(updates.values(), [list the fields to update], batch_size=100)
else:
    # The old & slow way
    with transaction.atomic():
        for obj in updates.values():
            obj.save(update_fields=[list the fields to update])
If you want to set the same value on a collection of rows, you can use the update() method combined with any query term to update all rows in one query:
some_list = ModelClass.objects.filter(some condition).values('id')
ModelClass.objects.filter(pk__in=some_list).update(foo=bar)
If you want to update a collection of rows with different values depending on some condition, the best you can do is batch the updates by value. Say you have 1000 rows where you want to set a column to one of X values: you could prepare the batches beforehand and then run only X update queries (each essentially of the form shown in the first example above), plus the initial SELECT query.
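The batching step described above can be sketched in plain Python. The desired mapping and the batch_by_value helper are hypothetical; the Django call in the comment is the same .filter(...).update(...) pattern from the first example, left commented out so the snippet runs without a Django project.

```python
from collections import defaultdict

# Hypothetical input: primary key -> value that row should end up with.
desired = {1: "red", 2: "blue", 3: "red", 4: "green", 5: "blue"}


def batch_by_value(desired):
    """Group primary keys by target value, so each value needs one UPDATE."""
    batches = defaultdict(list)
    for pk, value in desired.items():
        batches[value].append(pk)
    return dict(batches)


batches = batch_by_value(desired)
for value, pks in batches.items():
    # With Django, each group would become a single query, for example:
    # ModelClass.objects.filter(pk__in=pks).update(foo=value)
    print(value, pks)
```

Five rows with three distinct values need three update queries instead of five; the saving grows with the number of rows per value.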
If every row requires a unique value there is no way to avoid one query per update. Perhaps look into other architectures like CQRS/Event sourcing if you need performance in this latter case.
Here is some useful content I found online regarding the above question:
https://www.sankalpjonna.com/learn-django/running-a-bulk-update-with-django
The inefficient way
model_qs= ModelClass.objects.filter(name = 'bar')
for obj in model_qs:
    obj.name = 'foo'
    obj.save()
The efficient way
ModelClass.objects.filter(name = 'bar').update(name="foo") # for single value 'foo' or add loop
Using bulk_update
update_list = []
model_qs= ModelClass.objects.filter(name = 'bar')
for model_obj in model_qs:
    model_obj.name = "foo" # or whatever the value is; for simplicity I'm using foo only
    update_list.append(model_obj)
ModelClass.objects.bulk_update(update_list,['name'])
Using an atomic transaction
from django.db import transaction
with transaction.atomic():
    model_qs = ModelClass.objects.filter(name = 'bar')
    for obj in model_qs:
        # save each object individually; all saves commit together at the end of the block
        obj.name = 'foo'
        obj.save()
To update with same value we can simply use this
ModelClass.objects.filter(name = 'bar').update(name='foo')
To update with different values
obj_list = ModelClass.objects.filter(name = 'bar')
obj_to_be_update = []
for obj in obj_list:
    obj.name = "Dear "+obj.name
    obj_to_be_update.append(obj)
ModelClass.objects.bulk_update(obj_to_be_update, ['name'], batch_size=1000)
bulk_update() does not call save() on each object, so the pre_save and post_save signals are not sent; instead, we collect all the objects to be updated in a list and write them in one call.
update() returns the number of rows it matched:
update_counts = ModelClass.objects.filter(name='bar').update(name="foo")
You can refer to this link for more information on bulk update and create:
Bulk update and Create