I am using Celery to achieve async jobs in Python. My code flow is as follows:
1. a Celery task gets some data from a remote API
2. Celery Beat gets the task result from the Celery result backend (which is Redis) and then inserts the result into MySQL
But in step 2, before I insert the result data into MySQL, I check whether the data already exists. Although I do the check, duplicate data still gets inserted.
My code is as follows:
import MySQLdb
import MySQLdb.cursors

def get_task_result(logger=None):
    db = MySQLdb.connect(host=MYSQL_HOST, port=MYSQL_PORT, user=MYSQL_USER,
                         passwd=MYSQL_PASSWD, db=MYSQL_DB,
                         cursorclass=MySQLdb.cursors.DictCursor,
                         use_unicode=True, charset='utf8')
    cursor = db.cursor()
    ....
    ....
    store_subdomain_result(db, cursor, asset_id, celery_task_result)
    ....
    ....
    cursor.close()
    db.close()
def store_subdomain_result(db, cursor, top_domain_id, celery_task_result, logger=None):
    subdomain_list = celery_task_result.get('result').get('subdomain_list')
    source = celery_task_result.get('result').get('source')
    for domain in subdomain_list:
        query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
        cursor.execute(query_subdomain_sql)
        sub_domain_result = cursor.fetchone()
        if sub_domain_result:
            asset_id = sub_domain_result.get('id')
            existed_source = sub_domain_result.get('source')
            if source not in existed_source:
                new_source = f'{existed_source},{source}'
                update_domain_sql = f'UPDATE nw_asset SET source="{new_source}" WHERE id={asset_id}'
                cursor.execute(update_domain_sql)
                db.commit()
        else:
            insert_subdomain_sql = f'INSERT INTO nw_asset(domain) values("{domain}")'
            cursor.execute(insert_subdomain_sql)
            db.commit()
I first SELECT to check whether the data exists; if it does not exist, I do the INSERT. The code is as follows:
query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
cursor.execute(query_subdomain_sql)
sub_domain_result = cursor.fetchone()
I do this, but duplicate data still gets inserted, and I can't understand why.
I googled this question and some people suggest using INSERT IGNORE, REPLACE INTO, or a unique index, but I want to know why my code doesn't work as expected.
Also, is it possible there is some cache in MySQL, so that when I do the SELECT the data is not really in MySQL yet (it's still in some flush buffer), and the SELECT therefore returns nothing?
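For what it's worth, a common way to make this kind of check-and-insert atomic is the unique index people suggested, combined with INSERT ... ON DUPLICATE KEY UPDATE, so the database itself rejects the duplicate even when two workers race past the SELECT at the same time. A minimal sketch, assuming a UNIQUE index can be added on nw_asset.domain and that source is a comma-separated list as in the code above:

# Sketch only. Assumes this one-time migration has been applied:
#   ALTER TABLE nw_asset ADD UNIQUE INDEX uniq_domain (domain);
# Unlike the original INSERT, this also writes source on first insert.
upsert_sql = """
    INSERT INTO nw_asset (domain, source)
    VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE
        source = IF(FIND_IN_SET(VALUES(source), source),
                    source,
                    CONCAT_WS(',', source, VALUES(source)))
"""
for domain in subdomain_list:
    # parameterized values also avoid the SQL injection risk of the f-strings
    cursor.execute(upsert_sql, (domain, source))
db.commit()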
First of all, sorry for my lack of knowledge regarding databases; this is my first time working with them.
I am having some issues trying to get the data from an Excel file and put it into a database.
Using answers from this site, I managed to more or less connect to the database by doing this:
import pandas as pd
import pyodbc
server = 'XXXXX'
db = 'XXXXXdb'
# create Connection and Cursor objects
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Trusted_Connection=yes')
cursor = conn.cursor()
# read data from the file (it's a .csv, so read_csv rather than read_excel)
data = pd.read_csv('data.csv')
But I don't really know what to do now.
I have 3 tables, which are connected by a 'productID'. My Excel file mimics the database, meaning every column in the file has a place to go in the DB.
My plan was to read the file and build a list from each column, then insert each column value into the DB, but I have no idea how to write a query that can do this.
Once I get the query I think the data insertion can be done like this:
query = "xxxxxxxxxxxxxx"
for row in data:
    # The following is not the real code
    productID = productID
    name = name
    url = url
    values = (productID, name, url)
    cursor.execute(query, values)
conn.commit()
conn.close()
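If the file's columns line up with one table, a parameterized INSERT per row is probably the simplest starting point. A sketch with pyodbc, where Products, productID, name, and url are placeholders for your real table and column names:

# Placeholders: substitute your real table and column names.
# pyodbc uses ? as its parameter marker.
query = "INSERT INTO Products (productID, name, url) VALUES (?, ?, ?)"
for row in data.itertuples(index=False):
    cursor.execute(query, (row.productID, row.name, row.url))
conn.commit()
conn.close()

With three tables linked by productID, you would run one such INSERT per table, inserting the parent row first so the foreign key target exists.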
Database looks like this.
https://prnt.sc/n2d2fm
http://prntscr.com/n2d3sh
http://prntscr.com/n2d3yj
EDIT:
I tried doing something like this, but I'm getting a 'not all arguments converted during string formatting' TypeError.
import pymysql
import pandas as pd

connStr = pymysql.connect(host = 'xx.xxx.xx.xx', port = xxxx, user = 'xxxx', password = 'xxxxxxxxxxx')
df = pd.read_csv('GenericProducts.csv')
cursor = connStr.cursor()

query = "INSERT INTO [Productos]([ItemID],[Nombre])) values (?,?)"
for index, row in df.iterrows():
    #cursor.execute("INSERT INTO dbo.Productos([ItemID],[Nombre])) values (?,?,?)", row['codigoEspecificoProducto'], row['nombreProducto'])
    codigoEspecificoProducto = row['codigoEspecificoProducto']
    nombreProducto = row['nombreProducto']
    values = (codigoEspecificoProducto, nombreProducto)
    cursor.execute(query, values)
connStr.commit()
cursor.close()
connStr.close()
I think my problem is in how I'm defining the query; surely that's not the right way.
Try this. You seem to have changed the library from pyodbc to pymysql, which expects %s instead of ? as the parameter placeholder (also note that MySQL doesn't understand SQL Server's [bracket] identifier quoting; use backticks or plain names):
import pymysql
import pandas as pd

connStr = pymysql.connect(host = 'xx.xxx.xx.xx', port = xxxx, user = 'xxxx', password = 'xxxxxxxxxxx')
df = pd.read_csv('GenericProducts.csv')
cursor = connStr.cursor()

# %s placeholders for pymysql; backticks (not [brackets]) quote MySQL identifiers
query = "INSERT INTO `Productos` (`ItemID`, `Nombre`) VALUES (%s, %s)"
for index, row in df.iterrows():
    #cursor.execute("INSERT INTO `Productos` (`ItemID`, `Nombre`) VALUES (%s, %s)", (row['codigoEspecificoProducto'], row['nombreProducto']))
    codigoEspecificoProducto = row['codigoEspecificoProducto']
    nombreProducto = row['nombreProducto']
    values = (codigoEspecificoProducto, nombreProducto)
    cursor.execute(query, values)
connStr.commit()
cursor.close()
connStr.close()
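As a side note, committing once after the loop (as above) is already better than committing per row; if the CSV is large, executemany can batch the whole thing. A sketch against the same assumed table:

# Build a list of plain tuples and let the driver batch the insert.
params = df[['codigoEspecificoProducto', 'nombreProducto']].itertuples(index=False, name=None)
cursor.executemany(query, list(params))
connStr.commit()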
I have been trying to retrieve data from my database. I was successful before, but this time the fetch happens inside an if statement. The code looks like:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = """select x,y,z from table where type='sample' and code=%s"""
cur_msql.execute(select_query, code)
result2 = cur_msql.fetchone()
if(result2==None):
    insert_func(code)
    select_query = f"""select x,y,z from table where type='sample' and code='{code}'"""
    mycur = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
    print(select_query)
    mycur.execute(select_query)
    result3 = mycur.fetchone()
    if(result2==None):
        result2=result3
Now I see that insert_func does successfully insert into the table. However, trying to fetch that row immediately after the insertion returns None, as if the row were absent. On debugging I find that result3 is also None. Nothing looks wrong to me, but it's not working.
you donĀ“t execute it in the right way, in the cur_msql.execute, you the to send the query and a tuple of values, and you are sending just a value:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = "select learnpath_code,learnpath_id,learnpath_name from contentgrail.knowledge_vectors_test where Type='chapters' and code=%s"
cur_msql.execute(select_query, (meta['chapter_code'],))
result2 = cur_msql.fetchone()
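If fixing the parameters doesn't explain the None, one more thing worth checking, purely an assumption since insert_func's code isn't shown: if insert_func writes through a different connection, the new row stays invisible to this connection until that other connection commits (and under InnoDB's default REPEATABLE READ, this connection may also keep reading an older snapshot). A sketch of where the commit needs to happen:

# Hypothetical shape of insert_func -- adjust to the real code.
def insert_func(code):
    with conn_mysql.cursor() as cur:
        cur.execute(
            "insert into table (type, code) values ('sample', %s)",
            (code,),
        )
    conn_mysql.commit()  # without this commit, other connections never see the row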
I need to
1. run a SELECT query on a MySQL DB and fetch the records.
2. process the records with a Python script.
I am unsure how to proceed. Is XCom the way to go here? Also, MySqlOperator only executes the query and doesn't fetch the records. Is there any built-in transfer operator I can use? How can I use a MySQL hook here?
You may want to use a PythonOperator that uses the hook to get the data, apply the transformation, and ship the (now scored) rows back to some other place.
Can someone explain how to proceed with this?
Refer - http://markmail.org/message/x6nfeo6zhjfeakfe
def do_work():
    mysqlserver = MySqlHook(connection_id)
    sql = "SELECT * from table where col > 100"
    row_count = mysqlserver.get_records(sql, schema='testdb')
    print(row_count[0][0])

callMYSQLHook = PythonOperator(
    task_id='fetch_from_testdb',
    python_callable=do_work,
    dag=dag
)
Is this the correct way to proceed?
Also, how do we use XComs to store the records for the following MySqlOperator?
t = MySqlOperator(
    conn_id='mysql_default',
    task_id='basic_mysql',
    sql="SELECT count(*) from table1 where id > 10",
    dag=dag)
I was really struggling with this for the past 90 minutes; here is a more declarative way to follow for newcomers:
from airflow.hooks.mysql_hook import MySqlHook

def fetch_records():
    request = "SELECT * FROM your_table"
    mysql_hook = MySqlHook(mysql_conn_id='the_connection_name_sourced_from_the_ui', schema='specific_db')
    connection = mysql_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    sources = cursor.fetchall()
    print(sources)

# ... inside your DAG() as dag: block ...
task = PythonOperator(
    task_id='fetch_records',
    python_callable=fetch_records
)
This prints the contents of your DB query to the task logs.
I hope this is of use to someone else.
Sure, just create a hook or operator and call the get_records() method: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/hooks/dbapi.html
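To connect this back to the XCom part of the question: whatever a PythonOperator callable returns is pushed to XCom automatically, so returning the get_records() output makes the rows available to downstream tasks. A sketch, assuming Airflow 2.x import paths and a connection named mysql_default:

from airflow.operators.python import PythonOperator
from airflow.providers.mysql.hooks.mysql import MySqlHook

def fetch_records_to_xcom():
    hook = MySqlHook(mysql_conn_id='mysql_default')
    # the return value of the callable is pushed to XCom automatically
    return hook.get_records("SELECT count(*) from table1 where id > 10")

fetch = PythonOperator(
    task_id='fetch_records_to_xcom',
    python_callable=fetch_records_to_xcom,
    dag=dag,
)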
I'm trying to call a stored procedure on a multi-db Django installation, but am not having any luck getting results. The stored procedure (which is on the secondary database) always returns an empty array in Django, but the expected result does appear when executed in a mysql client.
My view.py file
from SomeDBModel import models
from django.db import connection

def index(request, someid):
    # Some related django-style query that works here
    loc = getLocationPath(someid, 1)
    print(loc)

def getLocationPath(id, someval):
    cursor = connection.cursor()
    cursor.callproc("SomeDB.spGetLocationPath", [id, someval])
    results = cursor.fetchall()
    cursor.close()
    return results
I have also tried:
from SomeDBModel import models
from django.db import connections

def index(request, someid):
    # Some related Django-style query that works here
    loc = getLocationPath(someid, 1)
    print(loc)

def getLocationPath(id, someval):
    cursor = connections["SomeDB"].cursor()
    cursor.callproc("spGetLocationPath", [id, someval])
    results = cursor.fetchall()
    cursor.close()
    return results
Each time I print out the results, I get:
[]
Example of data that should be retrieved:
{
Path: '/some/path/',
LocalPath: 'S:\Some\local\Path',
Folder: 'SomeFolderName',
Code: 'SomeCode'
}
One thing I also tried was to print the result of cursor.callproc. I get:
(id, someval)
Also, printing the result of cursor._executed gives:
b'SELECT #_SomeDB.spGetLocationPath_arg1, #_SomeDB.spGetLocationPath_arg2'
Which seems to not have any reference to the stored procedure I want to run at all. I have even tried this as a last resort:
cursor.execute("CALL spGetLocationPath("+str(id)+","+str(someval)+")")
but I get an error about needing multi=True; putting it in the execute() call doesn't seem to work the way some sites have suggested, and I don't know where else to put it in Django.
So...any ideas what I missed? How can I get stored procedures to work?
These are the steps I took:
1. Made the stored procedure dump its results into a temporary table, flattening multiple result sets into a single one. This got rid of the need for multi=True.
2. Made sure the user at my IP address had permission to call stored procedures in the database.
3. Kept researching the callproc function. Eventually someone on another site suggested the following code, which worked:
cur = connections["SomeDB"].cursor()
cur.callproc("spGetLocationPath", [id, someval])
res = next(cur.stored_results()).fetchall()
cur.close()
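One caveat: stored_results() is specific to the mysql-connector-python driver. With mysqlclient/MySQLdb (Django's default MySQL driver), a plain parameterized CALL works once the procedure returns a single result set, which the temporary-table change above guarantees. A sketch:

# Sketch for mysqlclient/MySQLdb: CALL the procedure directly and fetch.
cur = connections["SomeDB"].cursor()
cur.execute("CALL spGetLocationPath(%s, %s)", [id, someval])
res = cur.fetchall()
cur.close()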
I'm trying some DB schema changes, using the SQLAlchemy table.create and sqlalchemy-migrate table.rename methods, plus some INSERT INTO ... SELECT statements. I want to wrap all of this in a transaction, but I can't figure out how to do it. This is what I tried:
engine = sqlalchemy.engine_from_config(conf.local_conf, 'sqlalchemy.')
trans = engine.connect().begin()
try:
    old_metatadata.tables['address'].rename('address_migrate_tmp', connection=trans)
    new_metatadata.tables['address'].create(connection=trans)
except:
    trans.rollback()
    raise
else:
    trans.commit()
But it errors with:
AttributeError: 'RootTransaction' object has no attribute '_run_visitor'
(I tried using sqlalchemy-migrate's column.alter(name='newname'), but that errors out, does not work in a transaction, and so leaves my DB in a broken state. I also need to rename multiple columns, so I decided to roll my own code.)
Ah, I simply need to use the connection that the transaction was created on.
engine = sqlalchemy.engine_from_config(conf.local_conf, 'sqlalchemy.')
conn = engine.connect()
trans = conn.begin()
try:
    old_metatadata.tables['address'].rename('address_migrate_tmp', connection=conn)
    new_metatadata.tables['address'].create(bind=conn)
except:
    trans.rollback()
    raise
else:
    trans.commit()
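The same pattern can be written more compactly with engine.begin(), which commits on clean exit and rolls back if the block raises; behavior should match the explicit try/except version above:

engine = sqlalchemy.engine_from_config(conf.local_conf, 'sqlalchemy.')
# engine.begin() opens a connection, starts a transaction, and
# commits on success / rolls back on exception.
with engine.begin() as conn:
    old_metatadata.tables['address'].rename('address_migrate_tmp', connection=conn)
    new_metatadata.tables['address'].create(bind=conn)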