Workaround for LOAD DATA LOCAL INFILE? [closed] - mysql

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 3 years ago.
My web host has upgraded its servers. The newer 5.7.27 version of MySQL that they installed has LOAD DATA LOCAL INFILE disabled by default, resulting in Error 1148 when I try to execute the command. Unfortunately I can't start or stop the MySQL instance as that is under the control of the web host. What are some workarounds or alternate methods that will allow me to import data with the least effort? All the data I want to import are currently in TSV (tab separated value) format, but I could switch to CSV or something else if required. I have Workbench installed as well if it helps.
The problem is basically the same as in this question, except that I cannot access and reconfigure the server, which is what the selected answer to that question suggests.

I'd write a Python script to take your TSV input, and use it to generate INSERT statements in a loop. Each statement would handle perhaps 100-200* new rows. Then it would execute those statements.
Run it on the same server. Do it in a transaction so you don't make a mess on your first few tries if there are errors.
There you have it: TSV import.
* Or, well, whatever you want. Doing them one at a time will be slow (because there is a small overhead associated with the execution of each SQL statement), but you probably can't just dump them all into a single INSERT unless the amount of information is small. Check your server settings/limits, and come up with a reasonable batch size for your use case. For <2000 rows, and reasonably "short" row data, 100-200 rows per statement would usually be appropriate.
In Python, roughly:
import csv

BATCH_SIZE = 100          # rows per INSERT; tune to your server's limits

def flush_buffer(cursor, buffer):
    # Build one multi-row, parameterized INSERT for everything buffered so far.
    if not buffer:
        return
    placeholders = ", ".join(["(%s, %s, %s)"] * len(buffer))
    sql = "INSERT INTO tbl (col1, col2, col3) VALUES " + placeholders
    params = [value for row in buffer for value in row]   # flatten rows into one parameter list
    cursor.execute(sql, params)
    buffer.clear()

def handle_input(cursor, tsv_path):
    buffer = []
    with open(tsv_path, newline="") as tsv_file:
        for row in csv.reader(tsv_file, delimiter="\t"):
            buffer.append(row[:3])                 # col1, col2, col3
            if len(buffer) >= BATCH_SIZE:
                flush_buffer(cursor, buffer)
    flush_buffer(cursor, buffer)                   # flush the final partial batch
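A minimal usage sketch, assuming the mysql.connector driver is available on the host (any DB-API driver would work the same way) and using placeholder connection details; the whole import runs inside one transaction, as suggested above, so a failed attempt leaves nothing behind:
import mysql.connector

# Placeholder credentials; use whatever your web host provides.
conn = mysql.connector.connect(host="localhost", user="dbuser",
                               password="secret", database="mydb")
cur = conn.cursor()

# Optional sanity check: each batched INSERT must stay below max_allowed_packet.
cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
print(cur.fetchone())

try:
    handle_input(cur, "data.tsv")
    conn.commit()        # keep the rows only if every batch succeeded
except Exception:
    conn.rollback()      # otherwise undo everything
    raise
finally:
    cur.close()
    conn.close()
If mysql.connector is not installed on the host, PyMySQL or MySQLdb expose the same cursor/execute interface.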

Related

Query of load csv not completing even after 12 hours

I have been using Neo4j for quite a while now. I ran this query before my computer crashed 7 days ago, and somehow I am unable to run it now. I need to create a graph database out of a CSV of bank transactions. The original dataset has around 5 million rows and around 60 columns.
This is the query I used, starting from the 'Export CSV from real data' demo by Nicole White:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE= line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);
I tried:
Using LIMIT 0 before running the query, as per Michael Hunger's suggestion in a post about loading large datasets.
Used a single MERGE per statement (this is the first MERGE; there are 4 other merges to be used), as suggested by Michael again in another post.
Tried CALL apoc.periodic.iterate and apoc.cypher.parallel, but they don't work with LOAD CSV (they seem to work only with MERGE and CREATE queries without LOAD CSV).
I get following error with CALL apoc.periodic.iterate(""):
Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))
Increased the max heap size to 16G, as my laptop has 16 GB of RAM. By the way, I am finding it difficult to write this post because I tried running the query again with PROFILE and it has now been running for an hour.
Help is needed to load this 5-million-row dataset; any help would be highly appreciated. Thanks in advance! I am using Neo4j 3.5.1 on a PC.
MOST IMPORTANT: Create Index/Constraint on the key property.
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;
Don't set the max heap size to all of the system RAM; set it to about 50%.
Try ON CREATE SET instead of SET.
You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine.
Importantly, if you are using USING PERIODIC COMMIT and the query is not finishing or is running out of memory, it is likely because of DISTINCT. Avoid DISTINCT here; duplicate transactions will be handled by MERGE anyway.
NOTE: If you use apoc.periodic.iterate to MERGE nodes/relationships with the parameter parallel=true, it fails with a NullPointerException. Use it carefully.
Questioner edit: Removing DISTINCT from the 3rd line (for the Transaction node) and re-running the query worked!

Python hangs on fetchall using MySQL connector

I am fairly new to Python and MySQL. I am writing code that queries 60 different tables, each containing records for each second in a five-minute period. The code executes every five minutes. A few of the queries can reach 1/2 MB of data, but most are in the 50 KB range. I am running on a workstation with Windows 7, 64-bit, using MySQL Connector/Python. I am testing my code using PowerShell windows, but the code will eventually run as a scheduled task. The workstation has plenty of RAM (8 GB). Other processes are running, but according to the Task Manager only half of memory is being used. Mostly, everything performs as expected, but sometimes processing hangs. I have inserted print statements in the code (I've also used debugger tracing) to determine where the hang occurs. It is occurring on a call to fetchall. Below are the germane parts of the code. All-caps names are (pseudo)constants.
mncdb = mysql.connector.connect(
    option_files=ENV_MCG_MYSQL_OPTION_FILE,
    option_groups=ENV_MCG_MYSQL_OPTION_GROUP,
    host=ut_get_workstation_hostname(),
    database=ENV_MNC_DATABASE_NAME
)
for generic_table_id in DBR_TABLE_INDEX:
    site_table_id = DBR_SITE_TABLE_NAMES[site_id][generic_table_id]
    db_cursor = mncdb.cursor()
    db_command = (
        "SELECT *"
        + " FROM "
        + site_table_id
        + " WHERE "
        + DBR_DATETIME_FIELD
        + " >= '"
        + query_start_time + "'"
        + " AND "
        + DBR_DATETIME_FIELD
        + " < '"
        + query_end_time + "'"
    )
    try:
        db_cursor.execute(db_command)
        print "selected data for table " + site_table_id
        try:
            table_info = db_cursor.fetchall()
            print "extracted data for table " + site_table_id
        except:
            print "DB exception " + formatExceptionInfo()
            print "FETCH failed to return any rows..."
            table_info = []
            raise
    except:
        print "uncaught DB error " + formatExceptionInfo()
        raise
    .
    .
    .
    other processing that uses the data
    .
    .
    .
    db_cursor.close()
mncdb.close()
.
.
.
No exceptions are being raised. In a separate PowerShell window I can access the data being processed by the code. For my testing all data in the database is loaded before the code is executed. No processes are updating the database while the code is being tested. The hanging can occur on the first execution of the code or after several hours of execution.
My question is what could be causing the code to hang on the fetchall statement?
You can alleviate this by setting the fetch size:
mncdb = mysql.connector.connect(option_files=ENV_MCG_MYSQL_OPTION_FILE,
                                option_groups=ENV_MCG_MYSQL_OPTION_GROUP,
                                host=ut_get_workstation_hostname(),
                                database=ENV_MNC_DATABASE_NAME,
                                cursorclass=MySQLdb.cursors.SSCursor)
But before you do this, you should also use MySQL's prepared-statement support (parameterized queries) instead of string concatenation when building your statement, for example:
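A minimal sketch reusing the question's names (site_table_id, DBR_DATETIME_FIELD, query_start_time, query_end_time, db_cursor); identifiers still have to be interpolated, but the two datetime bounds become bound parameters:
# Table and column names cannot be bound as parameters, so they are still
# interpolated; the datetime bounds are passed separately and escaped by the driver.
db_command = (
    "SELECT * FROM " + site_table_id
    + " WHERE " + DBR_DATETIME_FIELD + " >= %s"
    + " AND " + DBR_DATETIME_FIELD + " < %s"
)
db_cursor.execute(db_command, (query_start_time, query_end_time))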
Hanging could involve the MySQL tables themselves and not specifically the Python code. Do they contain many records? Are they very wide tables? Are they indexed on the datetime_field?
Consider various strategies:
Specifically select the needed columns instead of the asterisk, which pulls in all columns.
Add an index on the DBR_DATETIME_FIELD column being used in the WHERE clause.
Diagnose further with printed timers, e.g. print(datetime.datetime.now()), to see which tables are the bottlenecks; see the sketch below. In doing so, be sure to import the datetime module.
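For the timing suggestion, a small sketch wrapped around the question's existing execute/fetchall calls (db_command and db_cursor come from the question's code):
import datetime

start = datetime.datetime.now()
db_cursor.execute(db_command)
table_info = db_cursor.fetchall()
elapsed = datetime.datetime.now() - start   # log this per table to spot the slow ones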

Mysql LOAD DATA from Powershell with variable

I am trying to insert data from a CSV file into a MySQL database using a PowerShell script. When using a (dummy) variable in the LOAD DATA query, I run into trouble.
Reproducible example:
Create a Mysql database and table with
CREATE DATABASE loadfiletest;
USE loadfiletest;
CREATE TABLE testtable (field1 INT, field2 INT DEFAULT 0);
Create a csv file named loadfiletestdata.csv containing
1,3
2,4
Create the powershell script (don't forget to change the db password and possibly the username)
[system.reflection.assembly]::LoadWithPartialName("MySql.Data")
$mysqlConn = New-Object -TypeName MySql.Data.MySqlClient.MySqlConnection
$mysqlConn.ConnectionString = "SERVER=localhost;DATABASE=loadfiletest;UID=root;PWD=pwd"
$mysqlConn.Open()
$MysqlQuery = New-Object -TypeName MySql.Data.MySqlClient.MySqlCommand
$MysqlQuery.Connection = $mysqlConn
$MysqlQuery.CommandText = "LOAD DATA LOCAL INFILE 'C:/path/to/files/loadfiletestdata.csv' INTO TABLE loadfiletest.testtable FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '""' LINES TERMINATED BY '\r\n' (field1, field2)"
$MysqlQuery.ExecuteNonQuery()
Put everything in the folder C:/path/to/files/ (should also be your path in the powershell script) and run the script. This populates the table testtable with
field1 field2
1 3
2 4
as one would expect. This implies that the quoting and such is as it should be. Each time the script is executed, those values are inserted into the table. Now, when I replace (field1, field2) with (field1, @dummy) in the next-to-last line of the PowerShell script, I would expect that the values
field1 field2
1 0
2 0
are inserted into the table. However, I receive the error
Exception calling "ExecuteNonQuery" with "0" argument(s): "Fatal error encountered during command execution."
At C:\path\to\files\loadfiletest.ps1:8 char:1
+ $queryOutput = $MysqlQuery.ExecuteNonQuery()
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : MySqlException
When running the query with @dummy from a MySQL client it works. Also, the syntax looks the same to me as what can be found in the MySQL manual (somewhere in the middle of the page, look for @dummy).
A few further experiments that I did suggest that any LOAD DATA query containing a variable @whatever gives this error.
So the questions:
Why doesn't it work?
Is there a way to execute a LOAD DATA query with (dummy) variables from powershell?
If not, is there an elegant workaround?
Obvious workarounds are creating an intermediate CSV file according to the layout of the table, or creating an intermediate table matching the layout of the CSV file. However, that seems ugly and cumbersome for something that imho should "just work".
Note: The present question is a follow up and generalization of this question. I chose to start a new one since replacing the old content would make the answers already given obsolete and adding the content of this question would make the old question veeeeery long and full of useless sidetracks.
I know this is old, but I had the same problem and I found the solution here:
http://blog.tjitjing.com/index.php/2009/05/mysqldatamysqlclientmysqlexception-parameter-id-must-be-defined.html
Quoting from the above blog:
"Starting from version 5.2.2 of the Connector you should add the Allow User Variables=True Connection String Setting in order to use User Defined Variables in your SQL statements.
Example of Connection String:
Database=testdb;Data Source=localhost;User Id=root;Password=hello;Allow User Variables=True"
Thank you for down-voting my answer.

How to generate script in SQL SERVER 2008 R2 without using UI? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Hi, I am using SQL Server 2008 R2. Can someone please help me generate scripts (CREATE, ALTER, and so on) for database objects without using the UI?
Stored procedures, views, functions etc. can all be scripted from sys.sql_modules as long as they're not encrypted:
SELECT definition
FROM sys.sql_modules
WHERE [object_id] = OBJECT_ID(N'dbo.object_name');
Or if you want to script multiple:
SELECT definition + CHAR(13) + CHAR(10) + 'GO'
FROM sys.sql_modules
WHERE OBJECT_NAME([object_id]) IN (N'name1', N'name2', ...);
Or all:
SELECT '--' + QUOTENAME(OBJECT_SCHEMA_NAME([object_id]))
+ '.' + QUOTENAME(OBJECT_NAME([object_id]))
+ CHAR(13) + CHAR(10) + definition
+ CHAR(13) + CHAR(10) + 'GO' + CHAR(13) + CHAR(10)
FROM sys.sql_modules
WHERE definition IS NOT NULL;
(Of course these are all doomed if you run them in Management Studio and any of them exceeds the maximum length of an output string there, ~8K in results-to-text mode. But it sounds like you want to consume these elsewhere.)
Note that this won't script the SET settings that were in force at the time the object was created, but you could extend this query to include settings like ANSI_NULLS and QUOTED_IDENTIFIER - which you can get from the same view.
Tables are a little trickier. If you generate the script in SSMS while Profiler is running, you will see that it does this through a slew of queries and constructs the CREATE TABLE script within the code (in other words, you can't sniff it out). It can be quite complex depending on what options you're using for your table, whether you need to script all foreign keys and dependent objects, etc. For this I would prefer the SMO method highlighted in podiluska's answer.
If you're already using SSMS, then I don't understand the purpose of NOT using the Generate Scripts menu items. You can do so for multiple objects by using Object Explorer Details instead of Object Explorer, if the singleton approach is the problem.
You can use the Scripter class in SQL Management Objects (SMO) to do this.
eg: http://www.mssqltips.com/sqlservertip/1833/generate-scripts-for-database-objects-with-smo-for-sql-server/
Try this:
Create a stored procedure with the following steps.
1. First get all the table names for which you need a CREATE TABLE script.
2. Loop through each table and get its column metadata:
select COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'tablename'
3. Inside the loop, dynamically build the CREATE TABLE script from that information.
I wrote an open source command line utility named SchemaZen that does this. It's much faster than scripting from Management Studio and its output is more version-control friendly. It supports scripting both schema and data.
To generate scripts run:
schemazen.exe script --server localhost --database db --scriptDir c:\somedir
Then to recreate the database from scripts run:
schemazen.exe create --server localhost --database db --scriptDir c:\somedir

MySQL LOAD DATA INFILE slows down after initial insert using raw sql in django

I'm using the following custom handler for doing bulk inserts using raw SQL in Django, with a MySQLdb backend and InnoDB tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)
    subject = Subject.objects.get(id=subject_pi)
    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()
        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)
        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)
        transaction.commit_unless_managed()
    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering if it is normal for load times to get longer as files of constant size (# rows) are added and if there is anyway to improve performance of bulk inserts.
I found the main issue with bulk inserting to my innodb table was a mysql innodb setting I had overlooked.
The innodb_buffer_pool_size setting defaults to 8M for my version of MySQL, and it was causing a huge slowdown as my table grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to the articles is 70 to 80 percent of the memory if using a dedicated mysql server. After increasing the buffer pool size, my inserts went from an hour+ to less than 10 minutes with no other changes.
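To confirm what the server is actually using before and after such a change, a small sketch (assuming Django's default database connection, as in the question's code):
from django.db import connection

cursor = connection.cursor()
cursor.execute("SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")
name, value = cursor.fetchone()               # value is reported in bytes
buffer_pool_mb = int(value) / (1024 * 1024)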
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks @f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
Turns out this is related to using ubuntu and this bug.
1. Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
2. sudo aa-complain /usr/sbin/mysqld
3. Try to load a file from your designated loading directory: 'load data infile '/var/opt/mysql-load/import.csv' into table ...'
4. sudo aa-logprof. aa-logprof will identify the access violation triggered by the 'load data infile ...' query, and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole AppArmor process.)
5. sudo aa-enforce /usr/sbin/mysqld
6. Try to load your file again. It should work this time.