I have a MacBook Pro 2019 (Touch Bar), i5, 8 GB RAM, 256 GB HD (75 GB used), macOS 10.15.2. I installed AMPPS 3.9 and MySQL Workbench 8.0.
I'm importing a MySQL table from a 1.9 GB CSV file with 15 million records into localhost; after 12 hours only 3 million rows have been imported.
I already tried XAMPP VM, and LOAD DATA INFILE failed in the MariaDB console: it disconnects from the database.
I'm going to process the table with PHP. Any help or recommendations to make these processes faster?
I use Sequel Pro to read remote databases and it's very fast, but it no longer works with localhost.
I had a similar problem importing a CSV file (approx. 5 million records) into a MySQL table. I managed to do it using Node.js, as it is able to open and read a file line by line (without loading the entire file into RAM). The script reads 100 lines, makes an INSERT, then reads another 100 lines, and so on. It processed the entire file in about 9 minutes. It uses the readline module. The main part of the script looks something like this:
const Fs = require('fs');
const Mysql = require('mysql2/promise'); // promise wrapper so the queries can be awaited
const Readline = require('readline');

const fileStream = Fs.createReadStream('/path/to/file');

const rl = Readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
});

async function run() {
    // Open the connection inside the async function so it can be awaited.
    const dbConnection = await Mysql.createConnection({
        host: "yourHost",
        user: "yourUser",
        password: "yourPassword",
        database: "yourDatabase"
    });

    for await (const line of rl) {
        // Each line of the file is successively available here as `line`.
        const lineElements = line.split(",");
        // Now lineElements is an array containing all the values of the current line.
        // Here you can queue up multiple lines to make a bigger insert,
        // or just insert line by line.
        await dbConnection.query('INSERT INTO ..........');
    }

    await dbConnection.end();
}

run();
The script above inserts one line per query. Feel free to modify it if you want each query to insert 100 lines, or any other batch size; a sketch of that variant is shown below.
As a side note: because my file was "trusted" I did not use prepared statements, as I think a plain query is faster. I do not know whether the speed gain was significant, as I did not run any tests.
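For reference, here is a minimal sketch of the batched variant. It reuses the `rl` interface from the script above and takes the database connection as a parameter; the table name `my_table` and the columns `col_a`, `col_b` are placeholders you would replace with your own schema. It relies on the mysql2 behaviour that a nested array passed to a single `?` expands into grouped value lists for a bulk insert.

const BATCH_SIZE = 100;
let rows = [];

async function flush(dbConnection) {
    if (rows.length === 0) return;
    // [[a, b], [a, b], ...] expands to (a, b), (a, b), ... in mysql2
    await dbConnection.query('INSERT INTO my_table (col_a, col_b) VALUES ?', [rows]);
    rows = [];
}

async function runBatched(dbConnection) {
    for await (const line of rl) {
        rows.push(line.split(','));
        if (rows.length >= BATCH_SIZE) {
            await flush(dbConnection);
        }
    }
    await flush(dbConnection); // insert whatever is left over
}

Batching this way cuts the number of round trips to the server by a factor of 100, which is usually where most of the time goes.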
What I want to know is: what on earth happens under the hood when I upload data through R, and why is it so much faster than MySQL Workbench or KNIME?
I work with data and, every day, I upload data into a MySQL server. I used to upload data with KNIME, since it was much faster than uploading with MySQL Workbench (select the table -> "import data").
Some info: the CSV has 4,000 rows and 15 columns. The R library I used is RMySQL. The KNIME node I used is Database Writer.
library('RMySQL')
df=read.csv('C:/Users/my_user/Documents/file.csv', encoding = 'UTF-8', sep=';')
connection <- dbConnect(
RMySQL::MySQL(),
dbname = "db_name",
host = "yyy.xxxxxxx.com",
user = "vitor",
password = "****"
)
dbWriteTable(connection, "table_name", df, append=TRUE, row.names=FALSE)
So, to test, I did the exact same process, using the same file. It took 2 minutes in KNIME and only seconds in R.
Everything happens under the hood! Upload speed depends on factors such as the interface between the DB and the tool, network connectivity, the configured batch size, the memory available to the tool, the tool's own data-processing speed, and probably a few more. In your case, the RMySQL package uses a batch size of 500 by default while KNIME uses only 1, so that is probably where the difference comes from. Try setting it to 500 in KNIME and then compare. I have no clue how MySQL Workbench works...
I am trying to query a comments table in a MySQL database by language.
Whenever I query by language to fetch Chinese comments, it returns garbled characters, but when I use Python to run the same query, it works.
Cloud Platform: Google Cloud SQL
Database location: Google Cloud SQL
Programming Language: Nodejs
Below is my code
// Require process, so we can mock environment variables
const process = require('process');
const Knex = require('knex');
const express = require('express');

const app = express();

const config = {
    user: process.env.SQL_USER,
    password: process.env.SQL_PASSWORD,
    database: process.env.SQL_DATABASE,
    socketPath: `/cloudsql/${process.env.INSTANCE_CONNECTION_NAME}`
};

var knex = Knex({
    client: 'mysql',
    connection: config
});

app.get('/', (req, res) => {
    knex.select('post')
        .from('comment')
        .where({
            'language': 'zh'
        })
        .limit(1)
        .then((rows) => {
            res.send(rows);
        })
        .catch((err) => {
            res.send(err);
        });
});
This is my query result:
"post": "最白痴的部长ï¼æœ€åŸºæœ¬çš„常识和逻辑都没有。真丢人ï¼"
please help.....
The text "最白痴的部长ï¼æœ€åŸºæœ¬çš„常识和逻辑都没有。真丢人ï¼" is what you get if "最白痴的部长基本的常识和逻辑都没有。真丢人" is sent encoded as UTF-8, but is then read and decoded as windows-1252 character set.
There are several different places this mis-decoding could happen:
1. From the client to the application writing to the database, when the data was first added
2. Between the application and MySQL when adding the data
3. Across a configuration change in MySQL that wasn't applied correctly
4. Between MySQL and the application reading the data
5. Between the application and the end client displaying the data to you
To investigate, I suggest being systematic. Start by accessing the data using other tools, e.g. phpMyAdmin or the mysql command line in Cloud Shell. If you see the right data, you know the issue is (4) or (5). If the database definitely has the wrong data in it, then it's (1), (2) or (3).
The most common place for this error to happen is (5), so I'll go into that a bit more. This is because websites often set the character set to something wrong, or not at all. To fix this, we must make the character set explicit. You can do this in express.js by adding:
res.set('Content-Type', 'text/plain; charset=utf-8')
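As an illustration, here is a minimal sketch of how the garbling arises and of the two places the charset can be made explicit, reusing `Knex`, `config` and `app` from the question's code. The Buffer round-trip and the `charset` connection option are assumptions for demonstration (Node's `latin1` is close to, but not identical to, windows-1252):

// Reproduce the symptom: UTF-8 bytes re-decoded as a single-byte charset.
const original = '真丢人';
const garbled = Buffer.from(original, 'utf8').toString('latin1');
console.log(garbled); // prints mojibake similar to the output in the question

// Declare the charset on the HTTP response (case 5):
app.get('/', (req, res) => {
    res.set('Content-Type', 'application/json; charset=utf-8');
    // ... run the knex query and send the rows as before ...
});

// It can also help to make the MySQL connection itself use UTF-8 (case 4);
// the `charset` key is passed through to the underlying mysql driver:
var knexUtf8 = Knex({
    client: 'mysql',
    connection: { ...config, charset: 'utf8mb4' }
});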
I've made a dashboard using the shiny, shinydashboard and RMySQL packages.
The following is what I wrote in order to refresh the data every 10 minutes if any change occurred.
In global.R
con = dbConnect(MySQL(), host, user, pass, db)
check_func <- function() {dbGetQuery(con, check_query)}
get_func <- function() {dbGetQuery(con, get_query)}
In server.R
function(input, output, session) {
    # check every 10 minutes for any change
    data <- reactivePoll(10*60*1000, session, checkFunc = check_func, valueFunc = get_func)
    session$onSessionEnded(function() {dbDisconnect(con)})
}
However, the code above intermittently generates a corrupt connection handle error from check_func.
Warning: Error in .local: internal error in RS_DBI_getConnection: corrupt connection handle
Should I put the dbConnect code inside the server function?
Any better ideas?
link: using session$onsessionend to disconnect rshiny app from the mysql server
"pool" package is the answer: http://shiny.rstudio.com/articles/pool-basics.html
This adds a new level of abstraction when connecting to a database: instead of directly fetching a connection from the database, you will create an object (called a pool) with a reference to that database. The pool holds a number of connections to the database. ... Each time you make a query, you are querying the pool, rather than the database. ... You never have to create or close connections directly: the pool knows when it should grow, shrink or keep steady.
I got the answer from here: https://stackoverflow.com/a/39661853/4672289
I'm working on my first node.js/socket.io project. Until now I have coded only in PHP. In PHP it is common to close the MySQL connection when it is not needed any more.
My question: does it make sense to keep just one MySQL connection open while the server is running, or should I handle this like in PHP?
Info: at peak hours I will have about 5 requests/second from socket clients, and for almost all of them I have to perform a MySQL CRUD operation.
Which one would you prefer?
io = require('socket.io').listen(3000);
var mysql = require('mysql');
var connection = mysql.createConnection({
host:'localhost',user:'root',password :'pass',database :'myDB'
});
connection.connect(); // and never 'end' or 'destroy'
// ...
or
var app = {};
app.set_geolocation = function(driver_id, driver_location) {
connection.connect();
connection.query('UPDATE drivers set ....', function (err) {
/* do something */
})
connection.end();
}
...
The whole idea of Node.js is async I/O, and that includes DB queries.
The rule with a MySQL connection is that you can only run one query per connection at a time. So you either keep a single connection and queue queries on it, as in the first option, or create a connection each time, as in option 2.
I personally would go with option 2, as opening and closing connections is not that big an overhead; a sketch of that pattern is shown below.
Here are some code samples to help you out:
https://codeforgeek.com/2015/01/nodejs-mysql-tutorial/
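For illustration, here is a minimal sketch of option 2 as a reusable helper. The `drivers` table with `lat`/`lng` columns is a hypothetical schema; note that with the `mysql` module an explicit connect() call is optional (the first query connects implicitly), and a connection cannot be reused after end(), so each call creates a fresh one:

var mysql = require('mysql');

// Option 2: one short-lived connection per query.
function setGeolocation(driverId, lat, lng, callback) {
    var connection = mysql.createConnection({
        host: 'localhost', user: 'root', password: 'pass', database: 'myDB'
    });

    connection.query(
        'UPDATE drivers SET lat = ?, lng = ? WHERE id = ?',
        [lat, lng, driverId],
        function (err, result) {
            connection.end(); // close the connection once the query has finished
            callback(err, result);
        }
    );
}

// Usage from a socket.io handler:
// setGeolocation(42, 3.14159, 42.42, function (err) { if (err) console.error(err); });

If the per-query connection overhead ever becomes noticeable, mysql.createPool() from the same module gives you option 1's reuse without having to queue queries yourself.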
I'm using the following custom handler to do bulk inserts with raw SQL in Django, with a MySQLdb backend and InnoDB tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)
    subject = Subject.objects.get(id=subject_pi)

    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()
        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)
        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)
        transaction.commit_unless_managed()

    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering whether it is normal for load times to get longer as files of constant size (same number of rows) are added, and whether there is any way to improve the performance of these bulk inserts.
I found that the main issue with bulk inserting into my InnoDB table was a MySQL InnoDB setting I had overlooked.
The innodb_buffer_pool_size setting defaults to 8M for my version of MySQL, and that was causing a huge slowdown as my table size grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to the articles is 70 to 80 percent of the memory if using a dedicated mysql server. After increasing the buffer pool size, my inserts went from an hour+ to less than 10 minutes with no other changes.
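For reference, a minimal my.cnf sketch of the change; the 6G figure is purely illustrative and should be sized to roughly 70 to 80 percent of RAM on a dedicated MySQL server (on older MySQL versions this change requires a server restart):

[mysqld]
# Default was 8M; raise it so the table and index working set
# stays in memory during bulk loads.
innodb_buffer_pool_size = 6G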
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks #f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
It turns out this is related to using Ubuntu and this bug, which can be worked around by adjusting the AppArmor profile for mysqld:
1. Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
2. Run sudo aa-complain /usr/sbin/mysqld
3. Try to load a file from your designated loading directory: 'load data infile '/var/opt/mysql-load/import.csv' into table ...'
4. Run sudo aa-logprof. It will identify the access violation triggered by the 'load data infile ...' query and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole AppArmor process.)
5. Run sudo aa-enforce /usr/sbin/mysqld
6. Try to load your file again. It should work this time.