I'm using Python to export a large matrix (shape around 3000 × 3000) into MySQL.
Right now I'm using MySQLdb to insert those values, but it's too cumbersome and too inefficient. Here is my code:
# -*- coding:utf-8 -*-
import MySQLdb
import numpy as np
import pandas as pd
import time

def feature_to_sql_format(df):
    df = df.fillna(value='')
    columns = list(df.columns)
    index = list(df.index)
    index_sort = np.reshape([[int(i)] * len(columns) for i in index], (-1)).tolist()
    columns_sort = (columns * len(index))
    values_sort = df.values.reshape(-1).tolist()
    return str(zip(index_sort, columns_sort, values_sort))[1: -1].replace("'NULL'", 'NULL')

if __name__ == '__main__':
    t1 = time.clock()
    df = pd.read_csv('C:\\test.csv', header=0, index_col=0)
    output_string = feature_to_sql_format(df)
    sql_CreateTable = 'USE derivative_pool;DROP TABLE IF exists test1;' \
                      'CREATE TABLE test1(date INT NOT NULL, code VARCHAR(12) NOT NULL, value FLOAT NULL);'
    sql_Insert = 'INSERT INTO test (date,code,value) VALUES ' + output_string + ';'
    con = MySQLdb.connect(......)
    cur = con.cursor()
    cur.execute(sql_CreateTable)
    cur.close()
    cur = con.cursor()
    cur.execute(sql_Insert)
    cur.close()
    con.commit()
    con.close()
    t2 = time.clock()
    print t2 - t1
It takes around 274 seconds in total.
I was wondering if there is a simpler way to do this. I thought of exporting the matrix to CSV and then using LOAD DATA INFILE to import it, but that also seems too complicated.
I noticed in the pandas documentation that DataFrame has a to_sql method, and in version 0.14 you could set the 'flavor' to 'mysql', that is:
df.to_sql(con=con, name=name, flavor='mysql')
But my pandas version is 0.19.2, where the flavor is reduced to 'sqlite' only. I still tried to use
df.to_sql(con=con, name=name, flavor='sqlite')
and it gives me an error.
Is there any convenient way to do this?
Later pandas versions support SQLAlchemy connectables instead of flavor='mysql'.
First, install dependencies:
pip install mysql-connector-python-rf==2.2.2
pip install MySQL-python==1.2.5
pip install SQLAlchemy==1.1.1
Then create the engine:
from sqlalchemy import create_engine
connection_string = "mysql+mysqlconnector://root:password@localhost/MyDatabase"  # dialect+driver://user:password@host/database
engine = create_engine(connection_string)
Then you can use df.to_sql(...):
df.to_sql('MyTable', engine)
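For a matrix like the one in the question it can also help to reshape to long (date, code, value) format and let to_sql write in chunks, rather than building one giant INSERT string by hand. A minimal sketch under those assumptions (connection string as above, table test1 as in the question):
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical credentials; replace with your own user/password/database.
engine = create_engine("mysql+mysqlconnector://root:password@localhost/MyDatabase")

df = pd.read_csv('C:\\test.csv', header=0, index_col=0)

# Reshape the wide matrix into (date, code, value) rows, mirroring the question's table layout.
long_df = df.stack().reset_index()
long_df.columns = ['date', 'code', 'value']

# chunksize splits the load into many smaller INSERTs instead of one huge statement.
long_df.to_sql('test1', engine, if_exists='append', index=False, chunksize=10000)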
Here are some things you can do in MySQL to speed up your data load:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
SET sql_log_bin = 0;
#LOAD DATA LOCAL INFILE....
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
SET SESSION tx_isolation='REPEATABLE-READ';
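If you go the to_sql route, note that these are session-level settings, so they have to be issued on the same connection that performs the load. A hedged sketch with SQLAlchemy 1.x (engine as created above, df the DataFrame to load and test1 the target table; whether to_sql accepts a Connection rather than an Engine depends on your pandas version):
# Relax the checks, run the bulk load on the same connection, then restore them.
with engine.begin() as conn:
    conn.execute("SET FOREIGN_KEY_CHECKS = 0")
    conn.execute("SET UNIQUE_CHECKS = 0")
    df.to_sql('test1', conn, if_exists='append', index=False, chunksize=10000)
    conn.execute("SET UNIQUE_CHECKS = 1")
    conn.execute("SET FOREIGN_KEY_CHECKS = 1")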
I have Airflow installed on Ubuntu under WSL on Windows.
I am trying to load a delimited file stored on my C drive into a MySQL database using the code below:
import datetime
import logging
import os
import csv
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.hooks.mysql_hook import MySqlHook

def bulk_load_sql(table_name, **kwargs):
    local_filepath = 'some c drive path'
    conn = MySqlHook(conn_name_attr='mysql_default')
    conn.bulk_load(table_name, local_filepath)
    return table_name

dag = DAG(
    "dag_name",
    start_date=datetime.datetime.now() - datetime.timedelta(days=1),
    schedule_interval=None)

t1 = PythonOperator(
    task_id='csv_to_stgtbl',
    provide_context=True,
    python_callable=bulk_load_sql,
    op_kwargs={'table_name': 'mysqltablnm'},
    dag=dag
)
It gives the following exception:
MySQLdb._exceptions.OperationalError: (2068, 'LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.')
I have checked the following setting on MySQL and it's ON:
SHOW GLOBAL VARIABLES LIKE 'local_infile'
Could someone please provide some pointers on how to fix it?
Is there any other way I can load a delimited file into MySQL using Airflow?
For now, I have implemented a workaround as follows:
def load_staging():
    mysqlHook = MySqlHook(conn_name_attr='mysql_default')
    #cursor = conn.cursor()
    conn = mysqlHook.get_conn()
    cursor = conn.cursor()
    csv_data = csv.reader(open('c drive file path'))
    header = next(csv_data)
    logging.info('Importing the CSV Files')
    for row in csv_data:
        #print(row)
        cursor.execute("INSERT INTO table_name (col1,col2,col3) VALUES (%s, %s, %s)",
                       row)
    conn.commit()
    cursor.close()

t1 = PythonOperator(
    task_id='csv_to_stgtbl',
    python_callable=load_staging,
    dag=dag
)
However, it would have been great if LOAD DATA LOCAL INFILE had worked.
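If you do want LOAD DATA LOCAL INFILE to work, the server-side local_infile variable is only half of it: the client connection must also allow local infile. Below is a hedged sketch with plain MySQLdb (mysqlclient), where the connection values and the /mnt/c/... path are placeholders; the local_infile=1 connection argument is what lifts the client-side restriction.
import MySQLdb

# Hypothetical connection values; local_infile=1 enables LOAD DATA LOCAL on the client side.
conn = MySQLdb.connect(host='localhost', user='user', passwd='password',
                       db='mydb', local_infile=1)
cursor = conn.cursor()
cursor.execute("""
    LOAD DATA LOCAL INFILE '/mnt/c/path/to/file.csv'
    INTO TABLE mysqltablnm
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")
conn.commit()
conn.close()
For the Airflow route, MySqlHook typically forwards a local_infile flag from the connection's Extra field (e.g. {"local_infile": true} on mysql_default), but how it is spelled depends on your Airflow/provider version.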
I have MySQL running on-prem and would like to migrate it to MySQL running on Cloud SQL (GCP). I first want to export the tables to Cloud Storage as JSON files and then move them from there into MySQL (Cloud SQL) and BigQuery.
Now I wonder how I should do this: export each table as JSON, or just dump the whole database to Cloud Storage? (We might need to change the schemas of some tables, which is why I'm thinking of doing it one by one.)
Is there any way to do it with Python pandas?
I found this --> Pandas Dataframe to Cloud Storage Bucket
but I don't understand how to connect it to my GCP Cloud Storage, or how to run mycursor.execute("SELECT * FROM table") for all my tables.
EDIT 1:
So I came up with this, but it works only for the selected schema + table. How can I do this for all tables in the schema?
#!/usr/bin/env python3
import mysql.connector
import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account
import os
import csv
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/home/python2/key.json"
#export GOOGLE_APPLICATION_CREDENTIALS="/home/python2/key.json"
#credentials = storage.Client.from_service_account_json('/home/python2/key.json')
#credentials = service_account.Credentials.from_service_account_file('key.json')
mydb = mysql.connector.connect(
    host="localhost", user="root", passwd="pass_word", database="test")
mycursor = mydb.cursor(named_tuple=True)
mycursor.execute("SELECT * FROM test")
myresult = mycursor.fetchall()
df = pd.DataFrame(data=myresult)
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-buckets-1234567")
blob = bucket.blob("file.json")
df = pd.DataFrame(data=myresult).to_json(orient='records')
#df = pd.DataFrame(data=myresult).to_csv(sep=";", index=False, quotechar='"', quoting=csv.QUOTE_ALL, encoding="UTF-8")
blob.upload_from_string(data=df)
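One way to extend this to every table in the schema is to list the tables first and then loop over them. A hedged sketch reusing the names from the snippet above (same connection, cursor and bucket; the per-table blob name is just an assumption):
# List every table in the connected schema, then export each one as JSON.
mycursor.execute("SHOW TABLES")
tables = [row[0] for row in mycursor.fetchall()]

for table in tables:
    mycursor.execute(f"SELECT * FROM `{table}`")
    rows = mycursor.fetchall()
    payload = pd.DataFrame(data=rows).to_json(orient='records')
    bucket.blob(f"{table}.json").upload_from_string(data=payload)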
After making a connection with the MySQL library, I'd like to download the whole database from the connection into my local workspace (transforming the tables into pandas DataFrames).
Here's my code:
import MySQLdb
import pandas as pd
conn = MySQLdb.connect(host='host' , user='datbase', passwd='password', db='databases' )
cursor = conn.cursor()
query = cursor.execute(' SELECT * FROM pro ')
df = pd.read_sql(query , conn)
row = cursor.fetchone()
conn.close()
I finally got the connection working, so I can run queries. But I'd like to use these SQL tables as pandas DataFrames; how can I do it?
Just use
query = ' SELECT * FROM pro '
df = pd.read_sql(query , conn)
And df should already be your desired dataframe.
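Since the goal is the whole database, you can also list the tables and read each one into its own DataFrame. A minimal sketch under that assumption (connection as in the question; the dataframes dict is just an illustrative name):
import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host='host', user='user', passwd='password', db='databases')

# List every table in the schema, then read each one into a DataFrame keyed by table name.
tables = pd.read_sql('SHOW TABLES', conn).iloc[:, 0].tolist()
dataframes = {table: pd.read_sql('SELECT * FROM `{}`'.format(table), conn) for table in tables}

conn.close()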
I am trying to insert each row from about 2000 CSV files into a MySQL table. With the following code, I have inserted only one row from just one file. How can I automate the code so that it inserts all rows from each file? The insertions need to be done just once.
import pymysql.cursors

connection = pymysql.connect(host='localhost',
                             user='s',
                             password='n9',
                             db='si',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (%s,%s,%s,%s,%s,%s,%s, %s)"
        cursor.execute(sql, ('03', 'P Exp','SF','HWH', 'none','no arr today','n/a','ND'))
    connection.commit()
finally:
    connection.close()
How about this code?
To run it, put all your .csv files in one folder, os.walk(folder_location) that folder to get the locations of all the .csv files, then open them one by one and insert each row into the required DB (MySQL), like this:
import pandas as pd
import os
import subprocess
import warnings

import pymysql.cursors

warnings.simplefilter("ignore")

cwd = os.getcwd()

connection = pymysql.connect(host='localhost',
                             user='s',
                             password='n9',
                             db='si',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

files_csv = []
for subdir, dirs, files in os.walk(cwd):
    files_csv += [fi for fi in files if fi.endswith(".csv")]
print(files_csv)

for i in range(len(files_csv)):
    with open(os.path.join(cwd, files_csv[i])) as f:
        lis = [line.strip().split(',') for line in f]  # split each CSV line into its fields
        for i, x in enumerate(lis):
            #print("line{0} = {1}".format(i,x))
            #HERE x contains the row data and you can access it individually using x[0], x[1], etc
            #USE YOUR MySQL INSERTION commands here and insert the x row here.
            with connection.cursor() as cursor:
                sql = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (%s,%s,%s,%s,%s,%s,%s, %s)"
                cursor.execute(sql, (#CONVERTED VALUES FROM x))
                connection.commit()
Update: getting the values for (#CONVERTED VALUES FROM x)
values = ""
for i in range(len(columns)):
    values = values + x[i] + ","  # x[i] is the i-th field of the row; append every value to be inserted into the SQL table.
values = values[:-1]  # Remove the trailing comma.

command = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (" + str(values) + ")"
cursor.execute(command)
#Then commit using connection.commit()
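A hedged alternative that avoids building the SQL string by hand (concatenating values as above breaks on fields containing commas or quotes and is open to SQL injection): parse each file with csv.reader and send the rows as one parameterized batch with executemany. Table, column names and connection values are the ones from the question; it assumes each CSV row has exactly eight fields and no header.
import csv
import os

import pymysql.cursors

# Connection values copied from the question.
connection = pymysql.connect(host='localhost', user='s', password='n9', db='si',
                             charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)

sql = ("INSERT INTO `TrainsS` (`No.`, `Name`, `Zone`, `From`, `Delay`, `ETA`, `Location`, `To`) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

try:
    with connection.cursor() as cursor:
        for subdir, dirs, files in os.walk(os.getcwd()):
            for name in files:
                if not name.endswith('.csv'):
                    continue
                with open(os.path.join(subdir, name), newline='') as f:
                    rows = list(csv.reader(f))
                # One parameterized batch per file; the driver escapes every value.
                cursor.executemany(sql, rows)
    connection.commit()
finally:
    connection.close()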
import psycopg2
import time
import csv

conn = psycopg2.connect(
    host="localhost",
    database="postgres",
    user="postgres",
    password="postgres"
)
cur = conn.cursor()

start = time.time()

with open('combined_category_data_100 copy.csv', 'r') as file:
    reader = csv.reader(file)
    header = next(reader)                          # read the header row once
    placeholders = ','.join(['%s'] * len(header))  # one %s per column
    for row in reader:
        cur.execute("INSERT INTO data VALUES ({})".format(placeholders), row)
    conn.commit()
    print("data entered successfully")

end = time.time()
print(f" time taken is {end - start}")
cur.close()
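If the timing is the concern, psycopg2's copy_expert streams the CSV through COPY, which is usually much faster than the per-row INSERT loop above. A hedged sketch that would replace that loop (same connection, table data, and file):
# COPY ... FROM STDIN loads the whole file in one round trip; HEADER skips the first line.
cur = conn.cursor()
with open('combined_category_data_100 copy.csv', 'r') as f:
    cur.copy_expert("COPY data FROM STDIN WITH CSV HEADER", f)
conn.commit()
cur.close()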
How can I escape the input to a MySQL database in Python 3?
I'm using PyMySQL and it works fine, but when I try to do something like:
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = '{}'".format(request[1]))
it won't work if the string has ' or ". I also tried:
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = %s",request[1])
The problem with this is that the library (PyMySQL) uses the Python 2.x formatting syntax, %, which doesn't work anymore.
I also found this possible solution
conn.escape_string()
in here, but I don't know where to add this code.
This is all I got:
import pymysql
import sys
conn = pymysql.connect(host="localhost",
                       user="test",
                       passwd="",
                       db="test")
cursor = conn.cursor()
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = {}".format(request[1]))
result = cursor.fetchall()
cursor.close()
conn.close()
Edit: I solved it! In PyMySQL the right way is like this:
import pymysql
import sys
conn = pymysql.connect(host="localhost",
                       user="test",
                       passwd="",
                       db="test")
cursor = conn.cursor()
text = conn.escape(request[1])
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = {}".format(text))
cursor.close()
conn.close()
The text = conn.escape(request[1]) line is what escapes the input; I found it inside the PyMySQL code. Here, request[1] is the input.
Although the "solved" answer works, it is not best practice. When using a library conforming to the Python DB API, you should use bind variables rather than formatting a string yourself and passing it to execute. There are dangers inherent in that approach.
Therefore, this is the right way to do it:
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = %s", text)
Note that this is not a format string but a bind variable passed to the executing cursor.
For details: the Python DB API specification (PEP 249).
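For completeness, a minimal end-to-end sketch of the parameterized version (same hypothetical connection values as the question; passing the parameters as a tuple is the most common form):
import pymysql

conn = pymysql.connect(host="localhost", user="test", passwd="", db="test")
cursor = conn.cursor()

# The driver escapes the value itself; no manual quoting or .format() is needed.
cursor.execute("SELECT * FROM `Codes` WHERE `ShortCode` = %s", (request[1],))
result = cursor.fetchall()

cursor.close()
conn.close()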
Ready-to-use helper function:
def mysql_insert(conn, table, row):
    cols = ', '.join('`{}`'.format(col) for col in row.keys())
    vals = ', '.join('%({})s'.format(col) for col in row.keys())
    sql = 'INSERT INTO `{0}` ({1}) VALUES ({2})'.format(table, cols, vals)
    conn.cursor().execute(sql, row)
    conn.commit()
Usage example
mysql_insert(conn, 'people', {
    'firstname': 'John',
    'lastname': 'Doe',
    'age': 18,
})
Reference: https://github.com/PyMySQL/PyMySQL/blob/master/pymysql/cursors.py#L157-L158
def execute(self, query, args=None):
    """
    If args is a list or tuple, %s can be used as a placeholder in the query.
    If args is a dict, %(name)s can be used as a placeholder in the query.
    """