Why does R upload data much faster than KNIME or Workbench? - mysql

What I want to know is: what the heck happens under the hood when I upload data through R, and why does it turn out to be so much faster than MySQL Workbench or KNIME?
I work with data and, every day, I upload data into a MySQL server. I used to upload with KNIME, since it was much faster than MySQL Workbench (select the table -> "import data").
Some info: the CSV has 4000 rows and 15 columns. The library I used in R is RMySQL. The node I used in KNIME is Database Writer.
library('RMySQL')
df=read.csv('C:/Users/my_user/Documents/file.csv', encoding = 'UTF-8', sep=';')
connection <- dbConnect(
  RMySQL::MySQL(),
  dbname   = "db_name",
  host     = "yyy.xxxxxxx.com",
  user     = "vitor",
  password = "****"
)
dbWriteTable(connection, "table_name", df, append=TRUE, row.names=FALSE)
So, to test, I did the exact same process, using the same file. It took 2 minutes in KNIME and only seconds in R.

Everything happens under the hood! Upload speed depends on several factors: the interface between the tool and the database, network connectivity, the batch size used, the memory available to the tool, the tool's own processing speed, and probably a few more. In your case the RMySQL package uses a batch size of 500 by default, while the KNIME node defaults to 1, so that is most likely where the difference comes from. Try setting the batch size to 500 in KNIME and then compare. I have no clue how MySQL Workbench works...
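To make the batch-size point concrete, here is a rough sketch in plain MySQL SQL (the table and column names are made up) of what the two strategies send over the wire. One statement per row means one network round trip and one statement parse per row, while a batched insert packs many rows into a single round trip:

-- Batch size 1: one round trip per row (4000 round trips for this file)
INSERT INTO table_name (col1, col2) VALUES (1, 'a');
INSERT INTO table_name (col1, col2) VALUES (2, 'b');
-- ...and so on, once per row

-- Batch size 500: many rows per round trip (8 round trips for this file)
INSERT INTO table_name (col1, col2) VALUES
  (1, 'a'),
  (2, 'b'),
  (3, 'c');  -- ...up to 500 rows per statement

Against a remote server the round-trip latency usually dominates for rows this small, so the batch size tends to matter far more than the raw speed of the client tool.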

Related

How to parsimoniously refer to a data frame in RMySQL

I have a MySQL table that I am reading with the RMySQL package in R. I would like to be able to refer directly to the data frame stored in the table so I can interact with it seamlessly, rather than having to execute an RMySQL statement every time I want to do something. Is there a way to accomplish this? I tried:
data <- dbReadTable(conn = con, name = 'tablename')
For example, if I now want to check how many rows I have in this table I would run:
nrow(data)
Does this go through the database connection, or am I now storing the object "data" locally, defeating the whole purpose of using an external database?
data <- dbReadTable(conn = con, name = 'tablename')
This command downloads all the data into a local R data frame (assuming you have enough RAM). Any operations on data from that point forward do not require the SQL connection.
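That said, if all you need is the row count (or any other aggregate), you can push the work to the server instead of downloading the whole table, for example by sending a query like the one below with dbGetQuery (the table name is just the placeholder from above):

-- Runs on the MySQL server; only the single count value is sent back
SELECT COUNT(*) AS row_count
FROM tablename;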

Most effective way to push data from a SQL Server database into a Greenplum database?

Greenplum Database version:
PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version:
Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benefits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error we can catch, even when it errors out and loads no data or only a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using the DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE, but dog-slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()
Dim v_shell
Dim v_psql As String
v_psql = "C:\Program Files\pgAdmin III\1.10\psql.exe -d ""MyGPDatabase"" -h ""MyGPHost"" -p ""5432"" -U ""MyServiceAccount"" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)
End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
If you're getting your data loaded in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that abstracts much of this process and is driven by YAML control files. The general idea is that one or more gpfdist instances are spun up at the locations where your data is staged, preferably as CSV text files, and the EXTERNAL TABLE definition within Greenplum is then made aware of the URIs of those gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
    name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
    name=20, address=30, age=4,
    preserve_blanks='on', null='NULL');
The above example expects to read text files whose fields from left to right are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.
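As a rough sketch of that last point, using the example table above (the distribution column here is an arbitrary choice, not a recommendation): Greenplum lets you name the distribution key directly in the CREATE TABLE AS statement, so the staging table isn't left randomly distributed:

-- Load from the external table in parallel across all segments,
-- distributing rows on a chosen key instead of randomly
CREATE TABLE staging_table AS
SELECT * FROM students
DISTRIBUTED BY (name);

Picking a high-cardinality column (ideally the key you join on most often) keeps rows evenly spread across segments, which matters for later query performance.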

MySQL LOAD DATA INFILE slows down after initial insert using raw sql in django

I'm using the following custom handler for doing bulk inserts using raw SQL in Django, with a MySQLdb backend and InnoDB tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime

    # Write the uploaded file to disk so LOAD DATA INFILE can read it
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)

    subject = Subject.objects.get(id=subject_pi)

    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()

        # Remove any previous rows for this subject before re-importing
        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)

        # Bulk-load the file; @dummy1/@dummy2 discard the unwanted columns
        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)
        transaction.commit_unless_managed()

    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering if it is normal for load times to get longer as files of constant size (# of rows) are added, and if there is any way to improve the performance of these bulk inserts.
I found that the main issue with bulk inserting into my InnoDB table was a MySQL InnoDB setting I had overlooked.
The setting innodb_buffer_pool_size defaults to 8M for my version of MySQL, and it caused a huge slowdown as my table size grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to the articles is 70 to 80 percent of available memory on a dedicated MySQL server. After increasing the buffer pool size, my inserts went from over an hour to less than 10 minutes, with no other changes.
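For reference, you can check the current value from any client and then raise it in the server config; the 2G below is only an illustrative value, not a recommendation for any particular machine:

-- Check the current buffer pool size (reported in bytes)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- On older MySQL versions the value can only be changed in my.cnf
-- under [mysqld] and requires a server restart, e.g.:
--   innodb_buffer_pool_size = 2G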
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks #f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
Turns out this is related to using Ubuntu and this bug.
1) Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
2) sudo aa-complain /usr/sbin/mysqld
3) Try to load a file from your designated loading directory: 'load data infile '/var/opt/mysql-load/import.csv' into table ...'
4) sudo aa-logprof
aa-logprof will identify the access violation triggered by the 'load data infile ...' query, and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole AppArmor process.)
5) sudo aa-enforce /usr/sbin/mysqld
6) Try to load your file again. It should work this time.

Share 1 table between 2 different types of databases

The problem that I have is that I want to synchronize one table between two different databases.
Database 1 is on a XP server with MySQL
Database 2 is on a Novell server with Clarion.
Is it possible to share one table, users, between the two databases?
So that when data is put into database 1, it is automatically synchronized with database 2, and the users table ends up identical in both databases?
Thanks in advance!
Diederik,
Your question isn't very clear in that we don't know if you have access to the source code or can only operate on a database level.
You didn't mention clearly if you're using Clarion to drive those databases. I'm assuming you are, since you tagged your post with it.
Also, you didn't mention which file format you're using on the Novell server. I'm assuming the TopSpeed file format. A bit of background on it: most programmers think it is the "native" file format for Clarion for Windows. It is not. Clarion for Windows doesn't have such a thing as a native file format; it uses a completely driver-driven approach. Clarion Professional Developer (a DOS IDE) did have a native file format, the Clarion .DAT format. Clarion for Windows can use any file format that offers a driver or an ODBC driver, including the old .DAT.
If you have access to the source code, then it is a pretty straightforward situation. In Clarion you can easily have different buffers pointing to different tables.
  PROGRAM

  MAP
  END

szConnMySQL  CSTRING( 256 )

users_mysql  FILE, DRIVER( 'ODBC' ), OWNER( szConnMySQL ), NAME( 'users' )
RECORD         RECORD
id               LONG
name             STRING( 20 )
               END
             END

users_tps    FILE, DRIVER( 'TopSpeed' ), NAME( 'users' )
RECORD         RECORD
name             STRING( 20 )
id               LONG
               END
             END

  CODE
  szConnMySQL = 'Driver={{MySQL ODBC 3.51 Driver};' & |
                'Server=myServerAddress;Database=myDataBase;User=myUsername;' & |
                'Password=myPassword;Option=3;'

  OPEN( users_mysql, 42h )
  OPEN( users_tps, 42h )

  users_mysql.id   = 1
  users_mysql.name = 'GUSTAVO PINSARD'
  ADD( users_mysql )
  IF NOT ERRORCODE()
    users_tps.RECORD :=: users_mysql.RECORD
    ADD( users_tps )
  ELSE
    ! Do your thing
  END

  CLOSE( users_mysql )
  CLOSE( users_tps )
If you don't have access to the source code, then you'll have to write a MySQL stored procedure to update the remote file. The problem is that the remote file, being a TopSpeed file, would not be directly accessible from the MySQL server, since MySQL doesn't know anything about that format.
One way to overcome this is to install the TopSpeed ODBC driver on the MySQL server and have the MySQL stored procedure go through it. I consider the TopSpeed ODBC driver a must-have, because it offers a way out of situations like this and promotes better integration.
Details of the MySQL stored procedure are out of the scope of this post, partly because I don't know MySQL stored procedures to that level.
Regards

How do I customise the CREATE DATABASE statement in VSTS DB Edition Deploy?

I'm using VSTS Database Edition GDR Version 9.1.31024.02
I've got a project where we will be creating multiple databases with identical schema, on the fly, as customers are added to the system. It's one DB per customer. I thought I should be able to use the deploy script to do this. Unfortunately, the generated CREATE DATABASE statement always spells out the file names literally. For example:
CREATE DATABASE [$(DatabaseName)]
ON
PRIMARY(NAME = [targetDBName], FILENAME = N'$(DefaultDataPath)targetDBName.mdf')
LOG ON (NAME = [targetDBName_log], FILENAME = N'$(DefaultDataPath)targetDBName_log.ldf')
GO
I'd expected something more like this
CREATE DATABASE [$(DatabaseName)]
ON
PRIMARY(NAME = [targetDBName], FILENAME = N'$(DefaultDataPath)$(DatabaseName).mdf')
LOG ON (NAME = [targetDBName_log], FILENAME = N'$(DefaultDataPath)$(DatabaseName)_log.ldf')
GO
Or even
CREATE DATABASE [$(DatabaseName)]
I'm not going to be running this on an ongoing basis, so I'd like to make it as simple as possible for the next guy. There are a bunch of deployment options in the project properties, but I can't get this to work the way I'd like.
Anyone know how to set this up?
Better late than never: here's how to get the $(DefaultDataPath)$(DatabaseName) file names from your second example.
The SQL you're showing in your first code snippet suggests that you don't have scripts for creating the database files in your VSTS:DB project, perhaps because you deliberately excluded them from any schema comparisons you've done. I found it a little counter-intuitive, but the solution is to let VSTS:DB script the MDF and LDF files in your development environment, then edit those scripts to use the SQLCMD variables.
In your database project, go to the folder Schema Objects > Database Level Objects > Storage > Files. In there, add these two files:
Database.sqlfile.sql
ALTER DATABASE [$(DatabaseName)]
ADD FILE (NAME = [$(DatabaseName)],
FILENAME = '$(DefaultDataPath)$(DatabaseName).mdf',
SIZE = 2304 KB, MAXSIZE = UNLIMITED, FILEGROWTH = 1024 KB)
TO FILEGROUP [PRIMARY];
Database_log.sqlfile.sql
ALTER DATABASE [$(DatabaseName)]
ADD LOG FILE (NAME = [$(DatabaseName)_log],
FILENAME = '$(DefaultDataPath)$(DatabaseName)_log.ldf',
SIZE = 1024 KB, MAXSIZE = 2097152 MB, FILEGROWTH = 10 %);
The full database creation script that VSTS:DB, or for that matter VSDBCMD.exe, generates will now use the SQLCMD variables for naming the MDF and LDF files, allowing you to specify them on the command line, or in MSBuild.
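For what it's worth, if you ever run the generated script by hand instead of through the deploy step, the same variables can be supplied with standard SQLCMD syntax; the values below are just placeholders:

:setvar DatabaseName "Customer123"
:setvar DefaultDataPath "D:\SQLData\"
-- ...followed by the generated CREATE DATABASE / ALTER DATABASE statements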
We do this using a template database that we back up, copy, and restore as new customers are brought online. We don't do any of the schema creation with scripts; we use a live, empty DB instead.
Hmm, well it seems that the best answer so far (given the overwhelming response) is to edit the file after the fact... Still looking.