PostgreSQL multiple CSV import and add filename to a column - mysql

I've got 200k CSV files and I need to import them all into a single PostgreSQL table. It's a list of parameters from various devices, and each CSV's file name contains the device's serial number, which I need to end up in one of the columns for each row.
So to simplify: I've got a few columns of data (no headers); let's say the columns in each CSV file are Date, Variable, Value, and the file name looks like SERIALNUMBER_and_someOtherStuffIDontNeed.csv
I'm trying to use Cygwin to write a bash script that iterates over the files and does it for me; however, for some reason it won't work, showing 'syntax error at or near "as"'.
Here's my code:
#!/bin/bash

FILELIST=/cygdrive/c/devices/files/*

for INPUT_FILE in $FILELIST
do
    psql -U postgres -d devices -c "copy devicelist
    (
    Date,
    Variable,
    Value,
    SN as CURRENT_LOAD_SOURCE(),
    )
    from '$INPUT_FILE
    delimiter ',' ;"
done
I'm learning SQL so it might be an obvious mistake, but I can't see it.
Also, I know that in this form I will get the full file name, not just the serial number bit I want, but I can probably handle that somehow later.
Please advise.
Thanks.

I don't think there is a CURRENT_LOAD_SOURCE() function in Postgres. A work-around is to leave the name column NULL on copy, and patch it to the desired value just after the copy. I prefer a shell here-document because that makes quoting inside the SQL body easier. (BTW: for 10K files, the globbing needed to obtain FILELIST might exceed ARG_MAX for the shell ...)
#!/bin/bash

FILELIST="`ls /tmp/*.c`"

for INPUT_FILE in $FILELIST
do
    echo "File:" $INPUT_FILE
    psql -U postgres -d devices <<OMG
-- I have a schema "tmp" for testing purposes
CREATE TABLE IF NOT EXISTS tmp.filelist(name text, content text);
COPY tmp.filelist ( content)
from '/$INPUT_FILE' delimiter ',' ;
-- tag the rows that were just copied with this file's name
UPDATE tmp.filelist SET name = '$INPUT_FILE'
WHERE name IS NULL;
OMG
done
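The same leave-it-NULL-then-patch idea can also be driven from Python with psycopg2 instead of shelling out to psql per file. Here is a minimal sketch, assuming the devicelist table from the question ends up with lower-cased columns date, variable, value and sn, and reusing the question's Cygwin-style file path and the connection string that appears further down:

import glob
import psycopg2

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cur = conn.cursor()

for input_file in glob.glob('/cygdrive/c/devices/files/*'):
    with open(input_file) as f:
        # Load only the data columns; sn stays NULL for now.
        cur.copy_from(f, 'devicelist', sep=',', columns=('date', 'variable', 'value'))
    # Patch the rows that were just copied with this file's name.
    cur.execute("UPDATE devicelist SET sn = %s WHERE sn IS NULL", (input_file,))

conn.commit()
conn.close()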

For anyone interested in an answer: I used a Python script to change the file names, and then another script that uses psycopg2 to connect to the database and does everything in one connection. It took 10 minutes instead of 10 hours.
Here's the code:
Renaming the files (also, apparently, to import from CSV you need all the rows to be filled, and the information I needed was in the first 4 columns anyway, so I put together a solution that generates whole new CSVs instead of just renaming them):
import os
import csv

path='C:/devices/files'
os.chdir(path)
i=0
for file in os.listdir(path):
    try:
        i+=1
        if i%10000 == 0:
            #just to see the progress
            print(i)
        serial_number = (file[:8])
        creader = csv.reader(open(file))
        cwriter = csv.writer(open('processed_'+file, 'w'))
        for cline in creader:
            new_line = [val for col, val in enumerate(cline) if col not in (4, 5, 6, 7)]
            new_line.insert(0, serial_number)
            #print(new_line)
            cwriter.writerow(new_line)
    except:
        print('problem with file: ' + file)
        pass
Updating the database:
import os
import psycopg2

path="C:\\devices\\files"
directory_listing = os.listdir(path)

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()
print(len(directory_listing))

i=100001
while i < 218792:
    current_file=(directory_listing[i])
    i+=1
    full_path = "C:/devices/files/" + current_file
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")

conn.commit()
conn.close()
Don't mind the while loop and the weird numbers; that's just because I was loading in portions for testing purposes. It can easily be replaced with a for loop (a sketch follows).
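For reference, a minimal sketch of the same load written with a for loop over the whole directory listing (same connection string and table as above; the filter on the processed_ prefix is an assumption, since the renaming script writes its output into the same folder):

import os
import psycopg2

path = "C:/devices/files"

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()

for current_file in os.listdir(path):
    # Only load the regenerated CSVs, not the originals (assumption).
    if not current_file.startswith('processed_'):
        continue
    full_path = os.path.join(path, current_file)
    with open(full_path) as f:
        # copy_from streams the whole file into the target table.
        cursor.copy_from(file=f, table='devicelistlive', sep=",")

conn.commit()
conn.close()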

Related

Is opening multiple cursors in Mysql a costly operation using Python?

I have a few tables in MySQL that need to be loaded into Teradata. I am going with a file-based approach here, meaning I export the MySQL tables into delimited files and then try to load those files into Teradata. The clarification I am looking for: we maintain a MySQL stored procedure to extract the data from the tables, and I call this stored procedure from a Python script to fetch the table data. Is it good/optimal to use a stored procedure? Because to get the list of tables, retention period, database and other details, I create one cursor to fetch data from one table, and then I have to create another cursor to call the stored procedure.
Is creating a cursor a costly operation in MySQL?
Instead of a table to fetch the list of tables, retention period, database and other details, is it a good idea to maintain them in a flat file?
Please share your thoughts.
import sys
import mysql.connector
from mysql.connector import MySQLConnection, Error
import csv

output_file_path='/home/XXXXXXX/'
sys.path.insert(0, '/home/XXXXXXX/')
from mysql_config import *

def stored_proc_call(tbl):
    print('SP call:', tbl)
    conn_sp = mysql.connector.connect(host=dsn,database=database,user=username,passwd=password,allow_local_infile=True)
    conn_sp_cursor = conn_sp.cursor(buffered=True)
    conn_sp_cursor.callproc('mysql_stored_proc', [tbl])
    output_file = output_file_path + tbl + '.txt'
    print('output_file:', output_file)
    with open(output_file, 'w') as filehandle:
        writer = csv.writer(filehandle, delimiter='\x10')
        for result in conn_sp_cursor.stored_results():
            print('Stored proc cursor:{}, value:{}'.format(type(result), result))
            for row in result:
                writer.writerow(row)
                #print('cursor row', row)

# Allow loading client-side files using the LOAD DATA LOCAL INFILE statement.
con = mysql.connector.connect(host=dsn,database=database,user=username,passwd=password,allow_local_infile=True)
cursor = con.cursor(buffered=True)
cursor.execute("select * from table")
for row in cursor:
    print('Archive table cursor:{}, value:{}'.format(type(row), row))
    (db,table,col,orgid,*allvalues)=row
    stored_proc_call(table)
    #print('db:{}, table:{}, col:{}, orgid:{}, ret_period:{}, allvalues:{}'.format(db,table,col,orgid,ret_period,allvalues))
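For what it's worth, opening a cursor on an existing connection is generally much cheaper than opening a new connection for every stored-procedure call. A minimal sketch of reusing the one connection (same mysql_config settings, placeholder table name and output path as in the snippet above; treat the details as assumptions):

import csv
import sys
import mysql.connector

sys.path.insert(0, '/home/XXXXXXX/')
from mysql_config import *  # dsn, database, username, password, as in the snippet above

output_file_path = '/home/XXXXXXX/'

# One connection for the whole run.
con = mysql.connector.connect(host=dsn, database=database, user=username,
                              passwd=password, allow_local_infile=True)

def stored_proc_call(tbl, connection):
    # A second cursor on the existing connection instead of a new connection.
    sp_cursor = connection.cursor(buffered=True)
    sp_cursor.callproc('mysql_stored_proc', [tbl])
    output_file = output_file_path + tbl + '.txt'
    with open(output_file, 'w') as filehandle:
        writer = csv.writer(filehandle, delimiter='\x10')
        for result in sp_cursor.stored_results():
            for row in result:
                writer.writerow(row)
    sp_cursor.close()

cursor = con.cursor(buffered=True)
cursor.execute("select * from table")  # placeholder table name, as in the question
for (db, table, col, orgid, *allvalues) in cursor.fetchall():
    stored_proc_call(table, con)

con.close()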

ETL script in Python to load data from another server .csv file into mysql

I work as a Business Analyst and am new to Python.
In one of my projects, I want to extract data from a .csv file and load that data into my MySQL DB (staging).
Can anyone guide me with sample code and the frameworks I should use?
Here is a simple program that creates an SQLite database. You can read the CSV file and use dynamic_data_entry to insert into your desired target table.
import sqlite3
import time
import datetime
import random

conn = sqlite3.connect('test.db')
c = conn.cursor()

def create_table():
    c.execute('create table if not exists stuffToPlot(unix REAL, datestamp TEXT, keyword TEXT, value REAL)')

def data_entry():
    c.execute("INSERT INTO stuffToPlot VALUES(1452549219,'2016-01-11 13:53:39','Python',6)")
    conn.commit()
    c.close()
    conn.close()

def dynamic_data_entry():
    unix = time.time()
    date = str(datetime.datetime.fromtimestamp(unix).strftime('%Y-%m-%d %H:%M:%S'))
    keyword = 'python'
    value = random.randrange(0,10)
    c.execute("INSERT INTO stuffToPlot(unix,datestamp,keyword,value) values(?,?,?,?)",
              (unix,date,keyword,value))
    conn.commit()

def read_from_db():
    c.execute('select * from stuffToPlot')
    #data = c.fetchall()
    #print(data)
    for row in c.fetchall():
        print(row)

create_table()   # make sure the table exists before querying it
read_from_db()
c.close()
conn.close()
You can iterate through the data in the CSV and load it into SQLite3. Please refer to the link below as well.
Quick easy way to migrate SQLite3 to MySQL?
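As a rough sketch of that idea (the file name input.csv and its column layout are assumptions; the table reuses the stuffToPlot definition above):

import csv
import sqlite3

conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute('create table if not exists stuffToPlot(unix REAL, datestamp TEXT, keyword TEXT, value REAL)')

# input.csv is a placeholder; each row is expected to hold unix, datestamp, keyword, value.
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for unix, datestamp, keyword, value in reader:
        c.execute("INSERT INTO stuffToPlot(unix,datestamp,keyword,value) VALUES (?,?,?,?)",
                  (unix, datestamp, keyword, value))

conn.commit()
conn.close()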
If that's a properly formatted CSV file, you can use the LOAD DATA INFILE MySQL command and you won't need any Python. Then, after it is loaded into the staging area (without processing), you can continue transforming it using the SQL/ETL tool of your choice.
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
One drawback is that you need to map all of the columns, but even if the file contains data you don't need, you might still prefer to load everything into staging.
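If you do end up driving it from Python anyway, here is a minimal sketch of issuing that statement through mysql.connector (connection settings, file path and table name are placeholders; LOCAL also needs local_infile enabled on the server):

import mysql.connector

conn = mysql.connector.connect(host='localhost', database='staging',
                               user='user', passwd='password',
                               allow_local_infile=True)
cursor = conn.cursor()

# Raw string so the \r\n escape is passed through to MySQL untouched.
cursor.execute(r"""
    LOAD DATA LOCAL INFILE '/path/to/file.csv'
    INTO TABLE staging_table
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\r\n'
    IGNORE 1 LINES
""")

conn.commit()
conn.close()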

RMySQL - dbWriteTable() - The used command is not allowed with this MySQL version

I am trying to read a few Excel files into a data frame and then write it to a MySQL database. The following program is able to read the files and create the data frame, but when it tries to write to the DB using the dbWriteTable command, I get this error message:
Error in .local(conn, statement, ...) :
could not run statement: The used command is not allowed with this MySQL version
library(readxl)
library(RMySQL)
library(DBI)
mydb = dbConnect(RMySQL::MySQL(), host='<ip>', user='username', password='password', dbname="db",port=3306)
setwd("<directory path>")
file.list <- list.files(pattern='*.xlsx')
print(file.list)
dat = lapply(file.list, function(i){
  print(i)
  x = read_xlsx(i, sheet=NULL, range=cell_cols("A:D"), col_names=TRUE, skip=1, trim_ws=TRUE, guess_max=1000)
  x$file = i
  x
})
df = do.call("rbind.data.frame", dat)
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
dbDisconnect(mydb)
I checked the definition of the dbWriteTable function, and it looks like it is using LOAD DATA LOCAL INFILE to store the data in the database. As per some other answered questions on Stack Overflow, I understand that the word LOCAL could be the cause for concern, but since it is already in the function definition, I don't know what I can do. Also, this statement uses "," as the separator, but my data has "," in some of the values, which is why I was interested in using data frames, hoping they would preserve the source structure. But now I am not so sure.
Is there any other way/function to write the data frame to the MySQL tables?
I solved this on my system by adding the following line to the my.cnf file on the server (you may need to use root and vi to edit!). In my case it is just below the '[mysqld]' line:
local-infile=1
Then restart the server.
Good luck!
You may need to change
dbWriteTable(mydb, name="table_name", value=df, append=TRUE )
to
dbWriteTable(mydb, name="table_name", value=df,field.types = c(artist="varchar(50)", song.title="varchar(50)"), row.names=FALSE, append=TRUE)
That way, you specify the field types in R and append data to your MySQL table.
Source: Unknown column in field list error Rmysql

MySQL File or Directory not Found ODBC

I am writing a program that does data transformations via MySQL, and it deals with big files.
I asked a question earlier about another issue I was having; while I was trying out someone's answer, I got the following error:
[MySQL][ODBC 5.3(a) Driver][mysqld-5.5.5-10.1.9-MariaDB]File 'C:\xampp\mysql\data\ingram\' not found (Errcode: 2 "No such file or directory")
I am certain that the directory exists, and when I change the code back to its original state it works perfectly.
What is going on there?
This is the piece of code that gives me the problem
Cmd.CommandText = String.Format("LOAD DATA INFILE ""{0}"" IGNORE INTO TABLE libros_nueva FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '""' ESCAPED BY '""' LINES TERMINATED BY '\r\n';", filepath)
Cmd.Execute()
Any help will be appreciated!
Given the salient portion of the error message:
File 'C:\xampp\mysql\data\ingram\' not found (Errcode: 2 "No such file or directory")
I am pretty sure you are passing just a path when a full path and file name are required. There is certainly no file name in the path it echoed back.
Can you please explain it [MySqlBulkLoader] to me?
Another way to import is to use MySqlBulkLoader from the MySql.Data.MySqlClient namespace:
' columns in the order they appear in the CSV file:
Dim cols As String() = {"Name", "Descr", "`Group`", "ValueA",
                        "Bird", "Fish", "zDate", "Color", "Active"}
Dim csvFile As String = "C:\Temp\mysqlImport.csv"
Dim rows As Int32

Using dbcon As New MySqlConnection(MySQLConnStr)
    Dim bulk = New MySqlBulkLoader(dbcon)

    bulk.TableName = "importer"
    bulk.FieldTerminator = ","          ' this is a CSV
    bulk.LineTerminator = "\r\n"        ' == CR/LF
    bulk.FileName = csvFile             ' full file path name to CSV
    bulk.NumberOfLinesToSkip = 0        ' has a header?

    bulk.Columns.Clear()
    For Each s In cols
        bulk.Columns.Add(s)             ' tell MySQL the order
    Next

    rows = bulk.Load()                  ' Make it so.
End Using
Times to import 100k rows: 3619, 2719 and 2987 ms. There is also a LoadAsync method which may be of interest given your last question.
If there are data transforms to do before the insert, CSVHelper can provide an easy way to load records so you can do whatever needs to be done, then use normal SQL Inserts to update the DB.
Part of this answer shows using CSVHelper to import into Access in batches of 50k, which was pretty fast.

How to Create Master script file to run mysql scripts

I want to create the DB structure for my application in MySQL. I have some 100 scripts which will create tables, stored procedures, and functions in different schemas.
Please suggest how I can run the scripts one after the other, and how I can stop if a previous script fails. I am using MySQL 5.6.
I am currently running them using a text file:
mysql> source /mypath/CreateDB.sql
which contains
tee /logout/session.txt
source /mypath/00-CreateSchema.sql
source /mypath/01-CreateTable1.sql
source /mypath/01-CreateTable2.sql
source /mypath/01-CreateTable3.sql
But they seem to be running simultaneously, and I have foreign keys in the later tables, due to which it is giving an error.
The scripts are not running simultaneously. The mysql client does not execute in a multi-threaded manner.
But it's possible that you are sourcing the scripts in an order that causes foreign keys to reference tables that you haven't defined yet, and this is a problem.
You have two possible fixes for this problem:
Create the tables in an order that avoids this problem.
Create all the tables without their foreign keys, then run another script that contains ALTER TABLE ADD FOREIGN KEY... statements.
I wrote a Python function to execute SQL files:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Download it at http://sourceforge.net/projects/mysql-python/?source=dlp
# Tutorials: http://mysql-python.sourceforge.net/MySQLdb.html
#            http://zetcode.com/db/mysqlpython/
import MySQLdb as mdb
import datetime, time

def run_sql_file(filename, connection):
    '''
    The function takes a filename and a connection as input
    and will run the SQL query on the given connection
    '''
    start = time.time()
    file = open(filename, 'r')
    sql = s = " ".join(file.readlines())
    print "Start executing: " + filename + " at " + str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M")) + "\n" + sql
    cursor = connection.cursor()
    cursor.execute(sql)
    connection.commit()
    end = time.time()
    print "Time elapsed to run the query:"
    print str((end - start)*1000) + ' ms'

def main():
    connection = mdb.connect('127.0.0.1', 'root', 'password', 'database_name')
    run_sql_file("my_query_file.sql", connection)
    connection.close()

if __name__ == "__main__":
    main()
I haven't tried it with stored procedures or large SQL statements. Also, if you have SQL files containing several SQL queries, you might have to split on ";" to extract each query and call cursor.execute(sql) for each one (a rough sketch follows). Feel free to edit this answer to incorporate these improvements.
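As a rough sketch of that split(";") suggestion (note that naive splitting breaks on semicolons inside string literals or stored-procedure bodies):

def run_sql_file_multi(filename, connection):
    '''
    Run a file that may contain several ';'-separated SQL statements
    on the given connection, one statement at a time.
    '''
    with open(filename, 'r') as f:
        content = f.read()
    cursor = connection.cursor()
    for statement in content.split(';'):
        statement = statement.strip()
        if statement:                 # skip empty chunks, e.g. after the last ';'
            cursor.execute(statement)
    connection.commit()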