Is opening multiple cursors in MySQL a costly operation when using Python?

I have a few tables in MySQL which are to be loaded into Teradata. I am going with a file-based approach here, meaning I export the MySQL tables into delimited files and then try to load those files into Teradata. The clarification I am looking for is this: we maintain a MySQL stored procedure to extract the data from the tables, and I call this stored procedure from a Python script to fetch the table data. Is it good/optimal to use a stored procedure here? Because to get the list of tables, retention period, database and other details, I create one cursor to fetch data from a metadata table, and then I have to create another cursor to call the stored procedure.
Is creating a cursor a costly operation in MySQL?
Instead of a table holding the list of tables, retention period, database and other details, would it be better to maintain them in a flat file?
Please share your thoughts.
import sys
import mysql.connector
from mysql.connector import MySQLConnection, Error
import csv
output_file_path='/home/XXXXXXX/'
sys.path.insert(0, '/home/XXXXXXX/')
from mysql_config import *
def stored_proc_call(tbl):
    print('SP call:', tbl)
    conn_sp = mysql.connector.connect(host=dsn, database=database, user=username,
                                      passwd=password, allow_local_infile=True)
    conn_sp_cursor = conn_sp.cursor(buffered=True)
    conn_sp_cursor.callproc('mysql_stored_proc', [tbl])
    output_file = output_file_path + tbl + '.txt'
    print('output_file:', output_file)
    with open(output_file, 'w') as filehandle:
        writer = csv.writer(filehandle, delimiter='\x10')
        for result in conn_sp_cursor.stored_results():
            print('Stored proc cursor:{}, value:{}'.format(type(result), result))
            for row in result:
                writer.writerow(row)
                #print('cursor row', row)

# Allow loading client-side files using the LOAD DATA LOCAL INFILE statement.
con = mysql.connector.connect(host=dsn, database=database, user=username,
                              passwd=password, allow_local_infile=True)
cursor = con.cursor(buffered=True)
cursor.execute("select * from table")
for row in cursor:
    print('Archive table cursor:{}, value:{}'.format(type(row), row))
    (db, table, col, orgid, *allvalues) = row
    stored_proc_call(table)
    #print('db:{}, table:{}, col:{}, orgid:{}, ret_period:{}, allvalues:{}'.format(db, table, col, orgid, ret_period, allvalues))
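For comparison, here is a minimal sketch (not from the original script; it reuses the connection parameters and the placeholder metadata query above) of the same flow with a single connection and two cursors. In mysql-connector-python a cursor is a lightweight client-side object, while every new connection pays for a fresh network handshake and authentication, so reusing one connection is usually the cheaper part to optimise:

# Hedged sketch: one connection, two cursors (buffered, as in the script above).
con = mysql.connector.connect(host=dsn, database=database, user=username,
                              passwd=password, allow_local_infile=True)
meta_cursor = con.cursor(buffered=True)
meta_cursor.execute("select * from table")   # same placeholder metadata query as above

for (db, table, col, orgid, *allvalues) in meta_cursor:
    sp_cursor = con.cursor(buffered=True)    # second cursor on the same connection
    sp_cursor.callproc('mysql_stored_proc', [table])
    for result in sp_cursor.stored_results():
        for row in result:
            pass                             # write rows out as in stored_proc_call()
    sp_cursor.close()

meta_cursor.close()
con.close()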

Related

to_mysql inserts more rows in SQL table than there are in pandas dataframe

So I have a MySQL database, let's call it "MySQLDB". When trying to create a new table (let's call it datatable) and insert data from a pandas dataframe, my code keeps adding rows to the SQL table, and I'm not sure if they are duplicates or not. For reference, there are around 50,000 rows in my pandas dataframe, but after running my code, the SQL table contains over 1 million rows. Note that I am using XAMPP to run a local MySQL server on which the database "MySQLDB" is stored. Below is a simplified/generic version of what I am running. Note I have removed the port number and replaced it with a generic [port] in this post.
import pandas as pd
from sqlalchemy import create_engine
import mysql.connector
pandas_db = pd.read_csv('filename.csv', index_col = [0])
engine = create_engine('mysql+mysqlconnector://root:#localhost:[port]/MySQLDB', echo=False)
pandas_db.to_sql(name='datatable', con=engine, if_exists = 'replace', chunksize = 100, index=False)
Is something wrong with the code? Or could it be something to do with XAMPP or the way I set up my database? If there is anything I could improve, please let me know.
I haven't found any other good posts that describe having the same issue.
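Not from the original post, but one way to narrow this down is to compare the dataframe's length with the row count MySQL reports right after the load; since if_exists='replace' drops and recreates the table, exactly one copy of the dataframe should remain. Credentials and port below are placeholders:

import pandas as pd
from sqlalchemy import create_engine, text

pandas_db = pd.read_csv('filename.csv', index_col=[0])
engine = create_engine('mysql+mysqlconnector://root:yourpassword@localhost:3306/MySQLDB', echo=False)
pandas_db.to_sql(name='datatable', con=engine, if_exists='replace', chunksize=100, index=False)

# Count what actually landed in MySQL and compare with the dataframe.
with engine.connect() as conn:
    db_rows = conn.execute(text("SELECT COUNT(*) FROM datatable")).scalar()
print(len(pandas_db), db_rows)   # after a clean 'replace' these two numbers should match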

ETL script in Python to load data from another server .csv file into mysql

I work as a Business Analyst and am new to Python.
In one of my projects, I want to extract data from a .csv file and load that data into my MySQL DB (staging).
Can anyone guide me with sample code and the frameworks I should use?
Here is a simple program that creates an SQLite database. You can read the CSV file and use dynamic_data_entry to insert into your desired target table.
import sqlite3
import time
import datetime
import random

conn = sqlite3.connect('test.db')
c = conn.cursor()

def create_table():
    c.execute('create table if not exists stuffToPlot(unix REAL, datestamp TEXT, keyword TEXT, value REAL)')

def data_entry():
    c.execute("INSERT INTO stuffToPlot VALUES(1452549219,'2016-01-11 13:53:39','Python',6)")
    conn.commit()
    c.close()
    conn.close()

def dynamic_data_entry():
    unix = time.time()
    date = str(datetime.datetime.fromtimestamp(unix).strftime('%Y-%m-%d %H:%M:%S'))
    keyword = 'python'
    value = random.randrange(0, 10)
    c.execute("INSERT INTO stuffToPlot(unix,datestamp,keyword,value) values(?,?,?,?)",
              (unix, date, keyword, value))
    conn.commit()

def read_from_db():
    c.execute('select * from stuffToPlot')
    #data = c.fetchall()
    #print(data)
    for row in c.fetchall():
        print(row)

create_table()   # make sure the table exists before querying it
read_from_db()
c.close()
conn.close()
You can iterate through the data in the CSV file and load it into SQLite3. Please refer to the link below as well.
Quick easy way to migrate SQLite3 to MySQL?
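A rough sketch of that CSV loop (not from the original answer; 'input.csv' and its column order are assumptions), reusing the test.db table from the program above:

import csv
import sqlite3

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute('create table if not exists stuffToPlot(unix REAL, datestamp TEXT, keyword TEXT, value REAL)')

# Assumes each CSV row holds exactly: unix, datestamp, keyword, value.
with open('input.csv', newline='') as f:
    for unix, datestamp, keyword, value in csv.reader(f):
        c.execute("INSERT INTO stuffToPlot(unix,datestamp,keyword,value) VALUES(?,?,?,?)",
                  (unix, datestamp, keyword, value))

conn.commit()
conn.close()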
If that's a properly formatted CSV file, you can use the MySQL LOAD DATA INFILE command and you won't need any Python. After it is loaded into the staging area (without processing), you can continue transforming it with SQL or the ETL tool of your choice.
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
One drawback is that you need to map all the columns, but even if the file contains data you don't need, you may still prefer to load everything into staging.
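If you do end up driving it from Python anyway, here is a hedged sketch of issuing LOAD DATA LOCAL INFILE through mysql-connector-python. The connection details, table name and file path are placeholders, not from the original answer, and the server must also permit local_infile:

import mysql.connector

conn = mysql.connector.connect(host='localhost', database='staging_db', user='etl_user',
                               passwd='etl_password', allow_local_infile=True)
cursor = conn.cursor()
# Load the client-side CSV straight into the staging table, skipping a header row.
cursor.execute("""
    LOAD DATA LOCAL INFILE '/path/to/input.csv'
    INTO TABLE staging_table
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()
conn.close()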

PostgreSQL multiple CSV import and add filename to each column

I've got 200k CSV files and I need to import them all into a single PostgreSQL table. It's a list of parameters from various devices; each CSV's file name contains the device's serial number, and I need that serial number to be in one of the columns for each row.
So to simplify: I've got a few columns of data (no headers). Let's say the columns in each CSV file are Date, Variable, Value, and the file name looks like SERIALNUMBER_and_someOtherStuffIDontNeed.csv.
I'm trying to use Cygwin to write a bash script that iterates over the files and does it for me; however, for some reason it won't work, showing: syntax error at or near "as"
Here's my code:
#!/bin/bash
FILELIST=/cygdrive/c/devices/files/*
for INPUT_FILE in $FILELIST
do
    psql -U postgres -d devices -c "copy devicelist
    (
    Date,
    Variable,
    Value,
    SN as CURRENT_LOAD_SOURCE(),
    )
    from '$INPUT_FILE
    delimiter ',' ;"
done
I'm learning SQL so it might be an obvious mistake, but I can't see it.
Also, I know that in this form I will get the full file name, not just the serial-number part I want, but I can probably handle that somehow later.
Please advise.
Thanks.
I don't think there is a CURRENT_LOAD_SOURCE() function in Postgres. A workaround is to leave the name column NULL on copy, and patch it to the desired value just after the copy. I prefer a shell here-document because that makes quoting inside the SQL body easier. (BTW: for 10K files, the globbing needed to obtain FILELIST might exceed ARG_MAX for the shell...)
#!/bin/bash
FILELIST="`ls /tmp/*.c`"
for INPUT_FILE in $FILELIST
do
    echo "File:" $INPUT_FILE
    psql -U postgres -d devices <<OMG
    -- I have a schema "tmp" for testing purposes
    CREATE TABLE IF NOT EXISTS tmp.filelist(name text, content text);
    COPY tmp.filelist ( content)
    from '/$INPUT_FILE' delimiter ',' ;
    UPDATE tmp.filelist SET name = '$INPUT_FILE'
    WHERE name IS NULL;
OMG
done
For anyone interested in an answer: I've used a Python script to change the file names, and then another script using psycopg2 to connect to the database, doing everything in one connection. It took 10 minutes instead of 10 hours.
Here's the code:
Renaming the files (also, apparently to import from CSV you need all the rows to be filled, and the information I needed was in the first 4 columns anyway, so I put together a solution that generates whole new CSVs instead of just renaming them):
import os
import csv

path = 'C:/devices/files'
os.chdir(path)
i = 0
for file in os.listdir(path):
    try:
        i += 1
        if i % 10000 == 0:
            #just to see the progress
            print(i)
        serial_number = (file[:8])
        creader = csv.reader(open(file))
        cwriter = csv.writer(open('processed_' + file, 'w'))
        for cline in creader:
            new_line = [val for col, val in enumerate(cline) if col not in (4, 5, 6, 7)]
            new_line.insert(0, serial_number)
            #print(new_line)
            cwriter.writerow(new_line)
    except:
        print('problem with file: ' + file)
        pass
Updating database:
import os
import psycopg2

path = "C:\\devices\\files"
directory_listing = os.listdir(path)

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()
print(len(directory_listing))

i = 100001
while i < 218792:
    current_file = (directory_listing[i])
    i += 1
    full_path = "C:/devices/files/" + current_file
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")
    conn.commit()

conn.close()
Don't mind the while loop and the odd numbers; that's just because I was doing it in portions for testing purposes. It can easily be replaced with a for loop, as sketched below.
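A minimal sketch of the same load with a plain for loop (not from the original post; it reuses the paths, table name and psycopg2 calls from the script above):

import os
import psycopg2

path = "C:\\devices\\files"
conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()

# Copy every file in the directory, committing after each one.
for current_file in os.listdir(path):
    full_path = "C:/devices/files/" + current_file
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")
    conn.commit()

conn.close()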

How to parsimoniously refer to a data frame in RMySQL

I have a MySQL table that I am reading with the RMySQL package of R. I would like to be able to refer directly to the data frame stored in the table so I can seamlessly interact with it, rather than having to execute an RMySQL statement every time I want to do something. Is there a way to accomplish this? I tried:
data <- dbReadTable(conn = con, name = 'tablename')
For example, if I now want to check how many rows I have in this table I would run:
nrow(data)
Does this go through the database connection, or am I now storing the object "data" locally, defeating the whole purpose of using an external database?
data <- dbReadTable(conn = con, name = 'tablename')
This command downloads all the data into a local R data frame (assuming you have enough RAM). Any operations on that data from that point forward do not require the SQL connection.

How to Create Master script file to run mysql scripts

I want to create the DB structure for my application in MySQL. I have some 100 scripts which will create tables, stored procedures, and functions in different schemas.
Please suggest how I can run the scripts one after the other, and how I can stop if a previous script fails. I am using MySQL 5.6.
I am currently running them using a text file:
mysql> source /mypath/CreateDB.sql
which contains
tee /logout/session.txt
source /mypath/00-CreateSchema.sql
source /mypath/01-CreateTable1.sql
source /mypath/01-CreateTable2.sql
source /mypath/01-CreateTable3.sql
But they appear to be running simultaneously, and I have foreign keys in the tables below, due to which it is giving an error.
The scripts are not running simultaneously. The mysql client does not execute in a multi-threaded manner.
But it's possible that you are sourcing the scripts in an order that causes foreign keys to reference tables that you haven't defined yet, and this is a problem.
You have two possible fixes for this problem:
Create the tables in an order that avoids this problem.
Create all the tables without their foreign keys, then run another script that contains ALTER TABLE ADD FOREIGN KEY... statements.
I wrote a Python function to execute SQL files:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Download it at http://sourceforge.net/projects/mysql-python/?source=dlp
# Tutorials: http://mysql-python.sourceforge.net/MySQLdb.html
# http://zetcode.com/db/mysqlpython/
import MySQLdb as mdb
import datetime, time

def run_sql_file(filename, connection):
    '''
    The function takes a filename and a connection as input
    and will run the SQL query on the given connection
    '''
    start = time.time()
    file = open(filename, 'r')
    sql = " ".join(file.readlines())
    file.close()
    print "Start executing: " + filename + " at " + str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M")) + "\n" + sql
    cursor = connection.cursor()
    cursor.execute(sql)
    connection.commit()
    end = time.time()
    print "Time elapsed to run the query:"
    print str((end - start)*1000) + ' ms'

def main():
    connection = mdb.connect('127.0.0.1', 'root', 'password', 'database_name')
    run_sql_file("my_query_file.sql", connection)
    connection.close()

if __name__ == "__main__":
    main()
I haven't tried it with stored procedures or large SQL statements. Also, if you have SQL files containing several SQL queries, you may have to split on ";" to extract each query and call cursor.execute(sql) for each one (a rough sketch of that follows below). Feel free to edit this answer to incorporate these improvements.
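For what it's worth, a naive sketch of that split-based variant (not part of the original answer; splitting on ";" will break on semicolons inside string literals or stored-procedure bodies):

def run_sql_file_multi(filename, connection):
    '''Naive variant: execute each ";"-separated statement in the file one by one.'''
    with open(filename, 'r') as f:
        script = f.read()
    cursor = connection.cursor()
    for statement in script.split(';'):
        statement = statement.strip()
        if statement:              # skip empty fragments between/after semicolons
            cursor.execute(statement)
    connection.commit()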