Loading large data sets into MySQL tables - mysql

I'd like to start tinkering with large government data sets—in particular, I want to work with campaign contribution records and lobbying disclosure records. The Sunlight Foundation and the Center for Responsive Politics offer cleaned-up versions of these data sets for download.
I want to load these data sets into MySQL tables, since MySQL is the database management system I'm most familiar with.
I have two questions:
What is the best method for loading these large CSV files into MySQL tables?
Is there a better method for loading these data sets into a database and running queries? Should I consider a different database platform? I'm open to alternatives, but I don't know where to start. For now, I'm only going to perform local queries, but at some point I'd like to build a public web app.

To answer your first question, try MySQL's LOAD DATA INFILE command. It is normally quite fast for this type of data load.

To answer your second question, properly indexed MySQL tables have no problem with 10s of millions of rows. Especially if you are only performing reads after your import.

Related

Store in memory or in local database

I'm developing an app in which I'll need to collect, from a MySQL server, a 5 years daily data (so, approximately 1825 rows of a table with about 6, 7 columns).
So, for handling this data, I can, after retrieving it, store it in a local SQLite database, or just keep it in memory.
I admit that, so far, the only advantage I could find for storing it in a local database, instead of just using what's already loaded, would be to have the data accessible in a next time the user were to open the app.
But I think I might not be taking into account all important factors.
Which factors should I take into account to decide between storing data in a local database or keep it in memory?
Best regards,
Nicolas Reichert
With respect, you're overthinking this. You're talking about a small amount of data: 2K rows is nothing for a MySQL server.
Therefore, I suggest you keep your app simple. When you need those rows in your app fetch them from MySQL. If you run the app again tomorrow, run the query again and fetch them again.
Are the rows the result of some complex query? To keep things simple you might consider creating a VIEW from the query. On the other hand, you can just as easily keep the query in your app.
Are the rows the result of a time-consuming query? In that case you could create a table in MySQL to hold your historical data. That way you'd only have to do the time-consuming query on your newer data.
At any rate, adding some alternative storage tech to your app (be it RAM or be it a local sqlite instance) isn't worth the trouble IMHO. Keep It Simple™.
If you're going to store the data locally, you have to figure out how to make it persistent. sqlite does that. It's not clear to me how RAM would do that unless you dump it to the file system.

Solutions for transition from small scale to mid-scale MySQL database

I'm studying up on the future of the database I maintain. Right now we have one database server running MySQL using InnoDB and MyISAM tables. I'm watching the metrics closely and I can see that this will not be sustainable forever. Where does one go next? I have reviewed solutions like Cassandra, but I want to stick to an SQL approach so I'm not sure about that. I have also reviewed NDB cluster and federated database solutions, but I've noticed no one has anything good to say about those. Basically, I looking for advice on intermediate solutions. We do not yet need a vast multi-node array operating on tens of DB servers, but one server is about to reach its limit. I don't want to just throw another server on the pile without making sure that the DB architecture at hand benefits well from the extra power. What do you guys suggest for when it is time to move beyond a single server and how to manage this transition. Thank you to anyone who can help.
Edit to better explain: At present, we have about a hundred tables. We run many join operations to gather the data the end user needs to see, such that most of our queries join at least two tables to complete any operation. The data set is not too big yet, only a few hundred Megs, but the data is accessed in such a way that each table has a few writes everyday, the heaviest of which has about a thousand writes a day. We probably have about a few hundred thousand reads a day too, so read do outnumber writes about 9 to 1.
First Solutions:
Indices go a LONG way
Use profiling software to find your slow queries and optimize them
Depending on your hosting company you can usually update the RAM/CPU of the server
Second Solutions:
Split your reads and your writes into two databases. (I don't know if you're using PHP or not but PHP has a plugin that will automatically split them for you without having to change any of your code http://php.net/manual/en/mysqlnd-ms.rwsplit.php)
Use software like memcache to store database information that is frequently queried but not frequently updated

Solr vs. MySQL performance for autocomplete

In one of our applications, we need to hold some plain tabular data and we need to be able to perform user-side autocompletion on one of the columns.
The initial solution we came up with, was to couple MySQL with Solr to achieve this (MySQL to hold data and Solr to just hold the tokenized column and return ids as result). But something unpleasant happened recently (developers started storing some of the data in Solr, because the MySQL table and the operations done on it are nothing that Solr can not provide) and we thought maybe we could merge them together and eliminate one of the two.
So we had to either: (1) move all the data to Solr (2) use MySQL for autocompletion
(1) sounded terrible so I gave it a shot with (2), I started with loading that single column's data into MySQL, disabled all caches on both MySQL and Solr, wrote a tiny webapp that is able to perform very similar queries [1] on both databases, and fired up a few JMeter scenarios against both in a local and similar environment. The results show a 2.5-3.5x advantage for Solr, however, I think the results may be totally wrong and fault prone.
So, what would you suggest for:
Correctly benchmarking these two systems, I believe you need to
provide similar[to MySQL] environment to JVM.
Designing this system.
Thanks for any leads.
[1] SELECT column FROM table WHERE column LIKE 'USER-INPUT%' on MySQL and column:"USER-INPUT" on Solr.
I recently moved a website over from getting its data from the database (postgres) to getting all data from Solr. Unbelievable difference in speed. We also have autocomplete for Australian suburbs (about 15K of them) and it finds them in a couple of milliseconds, so the ajax auto-complete (we used jQuery) reacts almost instantly.
All updates are done against the original database, but our site is a mostly-read site. We used triggers to fire events when records were updated and that spawns a reindex into Solr of the record.
The other big speed improvement was pre-caching data required to render the items - ie we denormalize data and pre-calculate lots of stuff at Solr indexing time so the rendering is easy for the web guys and super fast.
Another advantage is that we can put our site into read-only mode if the database needs to be taken offline for some reason - we just fall back to Solr. At least the site doesn't go fully down.
I would recommend using Solr as much as possible, for both speed and scalability.

sqlite or mysql for large datasets

I am working with large datasets (10s of millions of records, at times, 100s of millions), and want to use a database program that links well with R. I am trying to decide between mysql and sqlite. The data is static, but there are lot of queries that I need to do.
In this link to sqlite help, it states that:
"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (241 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests. I'm wondering if mysql is a better choice for me than sqlite due to the size of my dataset. The description above seems to suggest that this might be the case, but my data is no where near 2TB.
I'd appreciate any insights into understanding this constraint of maximum file size from the filesystem and how this could affect speed for indexing tables and running queries. This could really help me in my decision of which database to use for my analysis.
The SQLite database engine stores the entire database into a single file. This may not be very efficient for incredibly large files (SQLite's limit is 2TB, as you've found in the help). In addition, SQLite is limited to one user at a time. If your application is web based or might end up being multi-threaded (like an AsyncTask on Android), mysql is probably the way to go.
Personally, since you've done tests and mysql is faster, I'd just go with mysql. It will be more scalable going into the future and will allow you to do more.
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests.
The short short version is:
If your app needs to fit on a phone or some other embedded system, use SQLite. That's what it was designed for.
If your app might ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.
It seems that (in R, at least), that SQLite is awesome for ad hoc analysis. With the RSQLite or sqldf packages it is really easy to load data and get started. But for data that you'll use over and over again, it seems to me that MySQL (or SQL Server) is the way to go because it offers a lot more features in terms of modifying your database (e.g., adding or changing keys).
SQL if you are mainly using this as a web service.
SQLite, if you want it to able to function offline.
SQLite generally is much much faster, as majority (or ALL) of data/indexes will be cached in memory. However, in the case of SQLite. If the data is split up across multiple tables, or even multiple SQLite database files, from my experience so far. For even millions of records (i yet to have 100's of millions though), it is far more effective then SQL (compensate the latency / etc). However that is when the records are split apart in differant tables, and queries are specific to such tables (dun query all tables).
An example would be a item database used in a simple game. While this may not sound much, a UID would be issued for even variations. So the generator soon quickly work out to more then a million set of 'stats' with variations. However this was mainly due to each 1000 sets of records being split among different tables. (as we mainly pull records via its UID). Though the performance of splitting was not properly measured. We were getting queries that were easily 10 times faster then SQL (Mainly due to network latency).
Amusingly though, we ended up reducing the database to a few 1000 entries, having item [pre-fix] / [suf-fix] determine the variations. (Like diablo, only that it was hidden). Which proved to be much faster at the end of the day.
On a side note though, my case was mainly due to the queries being lined up one after another (waiting for the one before it). If however, you are able to do multiple connections / queries to the server at the same time. The performance drop in SQL, is more then compensated, from your client side. Assuming this queries do not branch / interact with one another (eg. if got result query this, else that)

MongoDB as a cache for frequent joins and queries from MySQL

Im thinking about doing the following and need suggestions if it makes sense to approach it this way. Basically since I am able to do queries in MongoDB and MongoDb is wicked fast at these since the hotspots of the data are cached in memory. I was thinking of storing data I would normally do a join from in mysql in mongoDB. While I am using memcached to store simple query results (for example a movie description page), for bigger stuff that requires more realtime/ondemand queries I was thinking about storing this in MongoDB. For example the view count for movies and who saw it, and doing analysis on it.
Hopefully I explained it clearly.
more info:
We dont want to keep writing to our mysql server on every rating like etc, MongoDB seemed like a good option to store the ratings,views of movies etc and then later on be able to do processing on that data. Whereas with Memcached data is not persisted and were unable to do queries
Thanks,
Faisal
Memory caching alone is not a good reason to go with MongoDB. Any properly configured RDBMS will cache frequently used data in memory.
What aspect of MySQL is currently limiting your performance? Do you have enough RAM in your server? Are your disks fast enough? Do you have a low latency cache device like an SSD configured appropriately?
There's nothing wrong with using both solutions in your application. As a matter of fact, I'm using mysql to store user sessions as suppose to cookies. In addition, I have another project that utilizes mysql but for certain parts of my application, I will be using MongoDB. Why? It's wicked fast and I hate writing join queries. It's so much easier to pop data in/out of mongo as suppose to having to do join queries in mysql.
i.e.
When saving tags for a particular user, it's so gosh darn eary to save/modify/delete the tag that's stored in mongodb. With MySQL, I would've had to write a query that JOINs multiple tables. For data such as user account, password, city, state - I saved everything into MySQL.
What your talking about is the idea of having normalized and de-normalized data. Using MongoDB as a denormalized data store for your normalized sql data is fine. Using Mongodb as your only data store for certain kinds of data is also fine. Just make sure that its clear in the system design as to where the true data is, and where the denormalized data is.
Normalized data is true fact. Denormalized data is gossip - you aren't sure if its up to date or not.