I am attempting to analyze some data in RStudio which originates from a MySQL database, so I used dbConnect to connect to that database and copied the single table I needed for this project. I then used R to clean the data a bit, getting rid of some unneeded columns. So far, so good.
My problems arose when I realized my data had some outliers, and I needed to delete rows which contained obvious outlier data. This is something I have no problem doing in SQL, but lack the R experience to do effectively. So I looked into it, and found out about sqldf, a package which bills itself as a way to use SQL commands to manipulate data.frames. Perfect! But I'm having some trouble with this, as sqldf seems to require a database connection of some kind. Is there a way to simply connect to a data.frame I have in my global environment in RStudio?
Q: Couldn't you just manipulate the data in MySQL before importing it to R?
A: Yes, and that's what I'll do if I have to, but I'd like to understand sqldf better.
Try:
# force sqldf to use its built-in SQLite backend rather than the MySQL driver you loaded
options(sqldf.driver = "SQLite")
sqldf("select * from book;", drv = 'SQLite')  # 'book' is a data.frame in your global environment
I'm migrating a large (approx. 10 GB) MySQL database (InnoDB engine).
I've figured out the migration part. Export -> mysqldump, Import -> mysql.
However, I'm trying to figure out the optimum way to validate if the migrated data is correct. I thought of the following approaches but they don't completely work for me.
One approach could have been using CHECKSUM TABLE. However, I can't use it since the target database would have data continuously written to it (from other sources) even during migration.
Another approach could have been using the combination of MD5(), GROUP_CONCAT, and CONCAT. However, that also won't work for me as some of the columns contain large JSON data.
So, what would be the best way to validate that the migrated data is correct?
Thanks.
How about this?
Do SELECT ... INTO OUTFILE from each old and new table, writing them into .csv files. Then run diff(1) between the files, eyeball the results, and convince yourself that the new tables' rows are an appropriate superset of the old tables'.
These flat files are modest in size compared to a whole database and diff is fast enough to be practical.
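If you would rather drive the comparison from R instead of diff(1), a rough sketch follows; it only works table by table and only if each table fits in memory, and the connection details, table name and key column are all placeholders:
library(RMySQL)
old_con <- dbConnect(MySQL(), dbname = "old_db", host = "old-host", user = "user", password = "password")
new_con <- dbConnect(MySQL(), dbname = "new_db", host = "new-host", user = "user", password = "password")
old_rows <- dbGetQuery(old_con, "select * from mytable order by id;")
new_rows <- dbGetQuery(new_con, "select * from mytable order by id;")
# collapse each row to a single string, then count old rows missing from the new table
old_key <- do.call(paste, c(old_rows, sep = "\r"))
new_key <- do.call(paste, c(new_rows, sep = "\r"))
sum(!old_key %in% new_key)   # 0 means every old row is present in the new table
dbDisconnect(old_con); dbDisconnect(new_con)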
We have an R program that filters some data in a table and creates a new table with the results. On Windows and OSX, the program runs and our table is created properly. However, on our Linux (Ubuntu 12.04) server, the same R program produces a table with garbage data.
When we compare the garbage data produced on Linux to the proper data, we find that:
Seemingly arbitrary numbers in columns that should have text values
Extra rows
We think the issue is something with encoding, but all of our efforts to change the encoding of the database have failed so far.
Our R script uses RMySQL to connect with a MySQL Database, filter the contents, and write it to a new table (using the dbReadTable and dbWriteTable commands).
We know that the commands themselves are not the problem, as we are able to examine the data.frame before and after filtering it - the problem is with dbWriteTable.
These two links seem to be closest to the solution to our problem, but we have to wait for the pull request to go through:
https://github.com/jeffreyhorner/RMySQL/issues/6
https://github.com/gagern/RMySQL/commit/b0fbef105ca61d69992a2ec5a5eafde30530b8d5
And these are also relevant:
http://zee.balogh.sk/?p=928
What does character set and collation mean exactly?
From past experience, I will suggest that this is not a problem in dbWriteTable, and it is not even an encoding issue! It is likely that stringsAsFactors = TRUE was in effect when the data.frame was created, and those numbers are the factor codes.
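A tiny illustration of what that looks like (the column name and values here are made up, not from the question):
# with stringsAsFactors = TRUE, character columns become factors, and a factor
# is stored internally as integer codes plus a table of levels
df <- data.frame(city = c("Paris", "Berlin", "Paris"), stringsAsFactors = TRUE)
as.integer(df$city)        # 2 1 2 -- these codes are what can end up in the table
# convert to character before writing (or build the data.frame with stringsAsFactors = FALSE)
df$city <- as.character(df$city)
# dbWriteTable(con, "mytable", df, row.names = FALSE)   # 'con' would be your RMySQL connection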
I have a large .sql file, created as a backup from a MySQL database (containing several tables), and I would like to search elements within it from R.
Ideally, there would be a read.sql function that would turn the tables into some R list with data.frames in it. Is there something that comes close? If not, can RSQLite or RMySQL help? (going through the reference manuals, I don't see a simple function for what I described)
No can do, boss. For R to interpret your MySQL database file, it would have to do a large part of what the DBMS itself does. That's a tall order, infeasible in the general case.
Would this return what you seek (which, I think you will admit on review, is not yet particularly well described)?
require(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv)  # you will usually also need user=, password=, dbname=, host= here
dbListTables(con)
# Or
names(dbGetInfo(drv))
If these are just source code, then all you would need is readLines. If you are looking for an R engine that can take SQL code and produce useful results, then the sqldf package may provide some help. It parses SQL code embedded in quoted strings and applies it either to data.frame objects in memory or to disk-resident tables (or both). Its default driver for disk files is SQLite, but other drivers can be used.
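As a rough sketch of the readLines route, this treats the dump purely as text; the file name and table name below are placeholders:
lines <- readLines("backup.sql")                       # the whole dump as a character vector
grep("CREATE TABLE", lines, value = TRUE)              # list the table definitions it contains
grep("^INSERT INTO `customers`", lines, value = TRUE)  # the INSERT statements for one table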
My workaround so far (I am also a newbie with db) is to export the database as .csv file in the phpMyAdmin (need to tick "Export tables as separate files" in the "custom" method). And then use read_csv() on tables I want to work with.
It is not ideal because I would like to export the database and work on it on my computer with R (creating functions that will work when accessing the database that is online) and access the real database later, when I have done all my testing. But from the answers here, it seems the .sql export would not help for that anyway (?) and that I would need to recreate the db locally...
I am trying to transfer bulk data on a constant and continuous basis from a SQL Server database to a MySQL database. I wanted to use SQL Server's replication through SSMS, but this apparently only supports SQL Server to Oracle or IBM DB2 connections. Currently we are using SSIS to transform data and push it to a temporary location at the MySQL database, where it is copied over. I would like the fastest way to transfer data and am contemplating several methods.
I have a new way I plan on transforming the data which I am sure will solve most time issues, but I want to make sure we do not run into time problems in the future. I have set up a linked server that uses a MySQL ODBC driver to talk between SQL Server and MySQL. This seems VERY slow. I have some code that also uses Microsoft's ODBC driver, but it is used so little that I cannot gauge the performance. Does anyone know of lightning-fast ways to communicate between these two databases? I have been researching MySQL's data providers that seem to communicate with an OLE DB layer. I'm not too sure what to believe and which way to steer towards. Any ideas?
I used the jdbc-odbc bridge in Java to do just this in the past, but performance through ODBC is not great. I would suggest looking at something like http://jtds.sourceforge.net/ which is a pure Java driver that you can drop into a simple Groovy script like the following:
import groovy.sql.Sql

sql = Sql.newInstance( 'jdbc:jtds:sqlserver://serverName/dbName-CLASS;domain=domainName',
        'username', 'password', 'net.sourceforge.jtds.jdbc.Driver' )

sql.eachRow( 'select * from tableName' ) {
    println "$it.id -- ${it.firstName} --"
    // probably write to mysql connection here or write to file, compress, transfer, load
}
The following performance numbers give you a feel for how it might perform:
http://jtds.sourceforge.net/benchTest.html
You may find some performance advantages by dumping data to a MySQL dump-file format and using MySQL's LOAD DATA instead of writing row by row. MySQL has some significant performance improvements for large data sets if you use LOAD DATA INFILE and do things like atomic table swaps.
We use something like the following to quickly load large data files into MySQL from one system to another. This is the fastest mechanism to load data into MySQL. But real-time, row-by-row transfer might be a simple loop to do in Groovy, plus some table to keep track of which rows have been moved.
mysql> select * from tablename into outfile 'tablename.dat';
shell> myisamchk --keys-used=0 -rq '/data/mysql/schema_name/tablename'
mysql> load data infile 'tablename.dat' into table tablename;
shell> myisamchk -rq /data/mysql/schema_name/tablename
mysql> flush tables;
mysql> exit;
shell> rm tablename.dat
The best way I have found to transfer SQL data (if you have the space) is a SQL dump in one dialect, and then a converter tool (or Perl script; both are prevalent) to convert the dump from MSSQL to MySQL. See my answer to this question about which converter you may be interested in :).
We've used the ADO.NET driver for MySQL in SSIS with quite a bit of success. Basically, install the driver on the machine with Integration Services installed, restart BIDS, and it should show up in the driver list when you create an ADO.NET connection manager.
As for replication, what exactly are you trying to accomplish?
If you are monitoring changes, treat it as a type 1 slowly changing dimension (data warehouse terminology, but the same principle applies). Insert new records, update changed records.
If you are only interested in new records and have no plans to update previously loaded data, try an incremental load strategy. Insert records where source.id > max(destination.id).
After you've tested the package, schedule a job in SQL Server Agent to run the package every x minutes.
You can also try the following.
http://kofler.info/english/mssql2mysql/
I tried this a while ago and it worked for me, but I wouldn't recommend it to you.
What is the real problem? What are you trying to do?
Can't you get an MSSQL DB connection, for example from Linux?
I have reached the limit of RAM in analyzing large datasets in R. I think my next step is to import these data into a MySQL database and use the RMySQL package. Largely because I don't know database lingo, I haven't been able to figure out how to get beyond installing MySQL with hours of Googling and RSeeking (I am running MySQL and MySQL Workbench on Mac OSX 10.6, but can also run Ubuntu 10.04).
Is there a good reference on how to get started with this usage? At this point I don't want to do any sort of relational databasing. I just want to import .csv files into a local MySQL database and do the subsetting with RMySQL.
I appreciate any pointers (including "You're way off base!") as I'm new to R and newer to large datasets... this one's around 80 MB.
The documentation for RMySQL is pretty good - but it does assume that you know the basics of SQL. These are:
creating a database
creating a table
getting data into the table
getting data out of the table
Step 1 is easy: in the MySQL console, simply "create database DBNAME". Or from the command line, use mysqladmin, or there are often MySQL admin GUIs.
Step 2 is a little more difficult, since you have to specify the table fields and their type. This will depend on the contents of your CSV (or other delimited) file. A simple example would look something like:
use DBNAME;
create table mydata(
id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
height FLOAT(3,2)
);
This creates a table with 2 fields: id, which will be the primary key (so it has to be unique) and will auto-increment as new records are added; and height, which here is specified as a float (a numeric type) with 3 digits in total and 2 after the decimal point (e.g. 1.27). It's important that you understand data types.
Step 3 - there are various ways to import data to a table. One of the easiest is to use the mysqlimport utility. In the example above, assuming that your data are in a file with the same name as the table (mydata), with the first column left empty (just a tab character, so that id auto-increments) and the second column holding the height variable (with no header row), this would work:
mysqlimport -u DBUSERNAME -pDBPASSWORD DBNAME mydata
Step 4 - requires that you know how to run MySQL queries. Again, a simple example:
select * from mydata where height > 50;
Means "fetch all rows (id + height) from the table mydata where height is more than 50".
Once you have mastered those basics, you can move to more complex examples such as creating 2 or more tables and running queries that join data from each.
Then - you can turn to the RMySQL manual. In RMySQL, you set up the database connection, then use SQL query syntax to return rows from the table as a data frame. So it really is important that you get the SQL part - the RMySQL part is easy.
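For example, a minimal RMySQL session built on the table from the earlier steps might look like this; the credentials are placeholders:
library(RMySQL)
con <- dbConnect(MySQL(), user = "DBUSERNAME", password = "DBPASSWORD",
                 dbname = "DBNAME", host = "localhost")
tall <- dbGetQuery(con, "select * from mydata where height > 50;")  # comes back as a data.frame
head(tall)
dbDisconnect(con)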
There are heaps of MySQL and SQL tutorials on the web, including the "official" tutorial at the MySQL website. Just Google search "mysql tutorial".
Personally, I don't consider 80 MB to be a large dataset at all; I'm surprised that this is causing a RAM issue and I'm sure that native R functions can handle it quite easily. But it's good to learn new skills such as SQL, even if you don't need them for this problem.
I have a pretty good suggestion. For 80 MB, use SQLite. SQLite is a public-domain, lightweight, very fast file-based database that works (almost) just like a SQL database.
http://www.sqlite.org/index.html
You don't have to worry about running any kind of server or permissions, your database handle is just a file.
Also, SQLite is flexible about column types (it uses dynamic typing), so you don't have to worry much about declaring types up front (since all you need to do is emulate a single text table anyway).
Someone else mentioned sqldf:
http://code.google.com/p/sqldf/
which does interact with SQLite:
http://code.google.com/p/sqldf/#9._How_do_I_examine_the_layout_that_SQLite_uses_for_a_table?_whi
So your SQL create statement would look something like this:
create table tablename (
id INTEGER PRIMARY KEY,
first_column_name TEXT,
second_column_name TEXT,
third_column_name TEXT
);
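A sketch of the same idea from R with the RSQLite package; the file names are placeholders, and since dbWriteTable creates the table for you, you can even skip the create statement:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "mydata.sqlite")   # the whole database is just this file
df <- read.csv("mydata.csv", stringsAsFactors = FALSE)
dbWriteTable(con, "tablename", df, row.names = FALSE)  # creates and fills the table
dbGetQuery(con, "select * from tablename limit 5;")
dbDisconnect(con)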
Otherwise, neilfws' explanation is a pretty good one.
P.S. I'm also a little surprised that your script is choking on 80 MB. Is it not possible in R to just read through the file in chunks without loading it all into memory?
The sqldf package might give you an easier way to do what you need: http://code.google.com/p/sqldf/. Especially if you are the only person using the database.
Edit: Here is why I think it would be useful in this case (from the website):
With sqldf the user is freed from having to do the following, all of which are automatically done:
database setup
writing the create table statement which defines each table
importing and exporting to and from the database
coercing of the returned columns to the appropriate class in common cases
See also here: Quickly reading very large tables as dataframes in R
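In particular, sqldf's read.csv.sql can filter a large CSV through SQLite so that only the rows you ask for ever reach R; a hedged sketch, with the file and column names as placeholders:
library(sqldf)
small <- read.csv.sql("mydata.csv",
                      sql = "select * from file where height > 50",
                      header = TRUE, sep = ",")   # the CSV is referred to as 'file' in the query
head(small)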
I agree with what's been said so far, though I think getting started with MySQL (and databases in general) is not a bad idea for the long run if you are going to deal with data. I checked your profile, which says finance PhD student. I don't know if that means quantitative finance, but it is likely that you will come across really large datasets in your career. If you can afford some time, I would recommend learning something about databases. It just helps.
The documentation of MySQL itself is pretty solid, and you can get a lot of additional (specific) help here at SO.
I run MySQL with MySQL Workbench on Mac OS X Snow Leopard too. So here's what helped me to get it done comparatively easily.
I installed MAMP, which gives me a local Apache web server with PHP, MySQL and the MySQL tool phpMyAdmin, which can be used as a nice web-based alternative to MySQL Workbench (which is not always super stable on a Mac :). You get a little widget to start and stop the servers and can access some basic configuration settings (such as ports) through your browser. It's really a one-click install.
Install the R package RMySQL. I will put a sample connection call below (see the sketch after this list); maybe that helps.
Create your databases with MySQL workbench. INT and VARCHAR (for categorical variables that contain characters) should be the field types you basically need at the beginning.
Try to find the import routine that works best for you. I don't know if you are a shell/terminal person – if so, you'll like what was suggested by neilfws. You could also use LOAD DATA INFILE, which I prefer since it's only one query as opposed to INSERT INTO (line by line); see the sketch after this list.
If you specify the problems that you have more accurately, you'll get some more specific help – so feel free to ask ;)
I assume you have to work a lot with time series data – there is a project (TSMySQL) around that uses R and relational databases (such as MySQL, but also available for other DBMSs) to store time series data. Besides, you can even connect R to FAME (which is popular among people in finance, but expensive). This last paragraph is certainly nothing basic, but I thought it might help you decide whether it's worth the hassle to dive into it a little deeper.
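Since the connection string got left out above, here is a rough sketch of both points: the connection call and a LOAD DATA INFILE sent from R. Every credential, port, file name and table name is a placeholder (MAMP in particular often uses a non-standard MySQL port), and LOAD DATA LOCAL INFILE must be enabled on your server:
library(RMySQL)
con <- dbConnect(MySQL(), user = "youruser", password = "yourpassword",
                 dbname = "yourdb", host = "127.0.0.1", port = 3306)
# one query for the whole file instead of row-by-row INSERTs
res <- dbSendQuery(con, "LOAD DATA LOCAL INFILE 'mydata.csv'
                         INTO TABLE mydata
                         FIELDS TERMINATED BY ','
                         IGNORE 1 LINES")
dbClearResult(res)
dbDisconnect(con)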
Practical Computing for Biologists is a nice (though subject-specific) introduction to SQLite; see Chapter 15, "Data Organization and Databases".