Porting from MySql to Redis - mysql

This is probably more of a two-part question, but the main focus is really how to port data from MySQL into Redis. I've read over [the process here][1], but it's a bit over my head in terms of how that would work for my multi-column set of data. The end goal is to move my activity feed from SQL to Redis. From what I can tell, the best move is to store this data in hashes and then sort it using sorted sets. The most common sort to start with would just be created_at.
So the question is:
1) What is a fast way to port this over from SQL to Redis via the command line?
2) How could this be structured to sort via created_at?

Warning: moving to Redis is not necessarily a good idea for you. Most of the time, moving an entire MySQL database to Redis is a bad idea because MySQL lets you query much more loosely and makes complicated queries much easier than Redis does. I'll tell you how to move things over, but there's a good chance you only want to do this for some of your data (or maybe not at all), for functionality that requires speed above all else.
So, assuming you need to make the switch while data is coming in live, you'll need to do a few things:
1) Start sending data to both MySQL and Redis, and mark the rows that have been sent to both in some way (your updated_at column probably works fine for this: just note the time when you first started sending data to both).
2) Move all older data over to Redis. You can do this by writing a quick script in a language like Python that grabs all data with updated_at <= redis_start_date and inserts it into Redis. Now your Redis database should mirror your MySQL database exactly, and you can expect it to continue mirroring it into the future.
3) Make sure all appropriate APIs that used to rely on MySQL have new versions that work with Redis.
4) Test everything heavily. Make sure the data in Redis matches what's in MySQL and is in the format you want, and make sure all appropriate APIs work with it.
5) Once you're confident all of this is in place, switch your live APIs over to the versions that use Redis instead of MySQL. You can stagger this process, switching them one at a time and verifying that all is as it should be. Since data is being written to both MySQL and Redis, you can roll back at any time in case something goes wrong.
6) Once you've switched over everything and are confident it all works with Redis the way you want, turn off the code that's writing to MySQL and enjoy your fast new Redis database.
In terms of structuring this, it really depends on how you use it. If you just need to be able to look up individual items by id and have them sorted by created_at, you can use a sorted set of (member = id; score = unix_time(created_at)) along with one hash per id (id -> columns) holding every other column. Again, this isn't something you necessarily want to do, and it depends on your use case.
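A minimal sketch of that layout, combined with a command-line import: the script below turns each row into one HSET plus one ZADD, encoded in the Redis wire protocol so that its output can be piped into redis-cli --pipe. The key names activity:<id> and activity:by_created_at, the sample rows, and the assumption that created_at is a UTC datetime are all illustrative and not from the original setup; a real script would read the rows from a MySQL query instead.

```python
import calendar
import sys
from datetime import datetime


def resp(*args):
    """Encode one Redis command in the RESP wire format read by redis-cli --pipe."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def row_to_commands(row):
    """Return the HSET + ZADD commands for one activity row (a dict)."""
    item_id = row["id"]
    # Score = Unix time of created_at, assuming the datetime is in UTC.
    score = calendar.timegm(row["created_at"].utctimetuple())
    fields = []
    for column, value in row.items():
        if column != "id" and value is not None:
            fields += [column, value]
    return (resp("HSET", f"activity:{item_id}", *fields)
            + resp("ZADD", "activity:by_created_at", score, item_id))


if __name__ == "__main__":
    # Hypothetical rows; a real script would fetch them from MySQL.
    rows = [
        {"id": 1, "user_id": 42, "verb": "liked", "created_at": datetime(2013, 1, 1)},
        {"id": 2, "user_id": 7, "verb": "posted", "created_at": datetime(2013, 1, 2)},
    ]
    for row in rows:
        sys.stdout.buffer.write(row_to_commands(row))
```

Saved under a hypothetical name such as export_activity.py, this could be run as python export_activity.py | redis-cli --pipe. Afterwards ZREVRANGE activity:by_created_at 0 9 would return the ten newest ids, and HGETALL activity:<id> the columns for one of them. HSET with several field/value pairs needs Redis 4.0 or later; on older servers HMSET takes the same arguments.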

Related

Store a VarChar or JSON in mySQL

I am at a point where, while implementing a simple database schema, I am attempting to think ahead and do things that would not hinder performance should the database grow larger.
The main object that will be manipulated within the application using the database will be a JSON string. I am fully aware that a JSON formatted string (JSONfs) can be stored within SQL as just a VARCHAR, but I recently learned that MySQL has adopted a JSON type into its built-in types.
Now comes the question: the JSONfs will be manipulated, possibly heavily, and I am unsure whether this would cause issues when using the JSON type in MySQL. I am not sure whether a JSONfs that has already been stored in the DB can be altered with simple MySQL commands, and honestly I would rather leave that task to the initial JSON parser/formatter because of how complex the JSONfs may become.
Questions:
(1) Does modifying the JSONfs in place within SQL add more overhead than simply running an UPDATE that replaces the data in that column?
(2) Does calling UPDATE many times on a DB cause some kind of bottleneck after a while if, say, 25K UPDATEs are done within a short amount of time? I hope DBs can handle far more requests than that, but this is a rough estimate of the possible reads/writes to the system.
(3) Am I completely wrong about how I believe MySQL handles its internal objects/DB entries? If so, could someone shed some light on this, as I am still in the process of creating the ER model? Simple is better than overcomplicated for now, but the long term is also something I try to keep in mind.

storing telemetry data from 10000s of nodes

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be.
Each node has a unique ID, and there will be a timestamp for each packet of variables. (probably will need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
The size of each telemetry packet is 64 bytes, including the device ID and timestamp. So around 100GB+ per year.
I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards this. If I were to go for MySQL would it make sense to have a separate table for each device ID? - Would this make queries much faster or would having 10000s of tables be a problem?
I don't think querying the variables from all devices in one go is going to be needed but it might be. Or should I just stick it all in a single table and use MySQL cluster if it gets really big?
Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas? If anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
HBase time-series databases:
OpenTSDB
KairosDB
Axibase Time-Series Database - my affiliation
If you want to go with MySQL, keep in mind that although it will easily handle something like 100 GB per year on modern hardware, you cannot easily execute schema changes afterwards on a live system. This means you'll have to have a good, complete database schema to begin with.
I don't know if this telemetry data might grow more features, but if it does, you don't want to have to lock your database for hours when you need to add a column or index.
However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays which make such changes somewhat easier. No performance problems to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online, and sometimes is somewhat smarter about the use of indexes. (For example, http://kb.askmonty.org/en/index-condition-pushdown is a new trick for MySQL/MariaDB, and allows you to combine two indices at query time. PostgreSQL has been doing this for a long time.)
Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
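To put the 64-byte figure in context, here is a rough estimate of the raw payload volume before any storage overhead. The question only says "every few minutes", so the 3-minute interval below is an assumption; it happens to land close to the 100GB+ per year quoted in the question.

```python
def raw_bytes_per_year(nodes, packet_bytes, interval_minutes):
    """Raw payload volume per year, ignoring indexes and per-row overhead."""
    packets_per_day = 24 * 60 / interval_minutes
    return nodes * packet_bytes * packets_per_day * 365


# 10,000 nodes, 64-byte packets, one packet every 3 minutes (assumed interval).
gb_per_year = raw_bytes_per_year(10_000, 64, 3) / 1e9
print(f"{gb_per_year:.0f} GB/year")  # prints "112 GB/year"
```

With a 5-minute interval the same calculation gives roughly 67 GB per year, so the estimate is sensitive to how often the nodes actually report.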
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.

Document-oriented dbms as primary db and a RDBMS db as secondary db?

I'm having some performance issues with a MySQL database due to its normalization.
Most of my applications that use a database need to do some heavy nested queries, which in my case take a lot of time. Queries can take up to 2 seconds to run with indexes, and about 45 seconds without.
A solution I came across a few months back was to use a faster, more linear document-based database, in my case Solr, as the primary database. As soon as something was changed in the MySQL database, Solr was notified.
This worked really well. All queries using the Solr database took only about 3ms.
The numbers look good, but I'm having some problems.
Huge database
The MySQL database is about 200mb, the Solr db contains about 1.4Gb of data.
Each time I need to change a table/column, the database needs to be reindexed, which in this example took over 12 hours.
Difficult to render both a Solr object and an Active Record (MySQL) object without getting wet.
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
Like this.
# Controller
@song = Song.first
# View
@song.artist.urls.first.service.name
The problem in my case is that the data being returned from Solr is flat like this.
{
id: 123,
song: "Waterloo",
artist: "ABBA",
service_name: "Groveshark",
urls: ["url1", "url2", "url3"]
}
This forces me to build an active record object that can be passed to the view.
My question
Is there a better way to solve the problem?
Some kind of super duper fast primary read only database that can handle complex queries fast would be nice.
Solr individual fields update
About reindexing all on schema change: Solr does not support updating individual fields yet, but there is a JIRA issue about this that's still unresolved. However, how many times do you change schema?
MongoDB
If you can live without an RDBMS (without joins, schema, transactions, foreign key constraints), a document-based DB like MongoDB or CouchDB would be a perfect fit. (here is a good comparison between them)
Why use MongoDB:
data is in native format (you can use an ORM mapper like Mongoid directly in the views, so you don't need to adapt your records as you do with Solr)
dynamic queries
very good performance on non-full text search queries
schema-less (no need for migrations)
built-in, easy-to-set-up replication
Why use SOLR:
advanced, very performant full-text search
Why use MySQL
joins, constraints, transactions
Solutions
So, the solutions (combinations) would be:
Use MongoDB + Solr
but you would still need to reindex all on schema change
Use only MongoDB
but drop support for advanced full-text search
Use MySQL in a master-slave configuration, and balance reads from slave(s) (using a plugin like octopus) + Solr
setup complexity
Keep current setup, denormalize data in MySQL
messy
Solr reindexing slowness
The MySQL database is about 200mb, the Solr db contains about 1.4Gb of
data. Each time I need to change a table/column the database need to
be reindexed, which in this example took over 12 hours.
Reindexing 200MB DB in Solr SHOULD NOT take 12 hours! Most probably you have also other issues like:
MySQL:
n+1 issue
indexes
SOLR:
commit after each request - this is the default setup if you use a plugin like sunspot, but it's a perf killer in production
From http://outoftime.github.com/pivotal-sunspot-presentation.html:
By default, Sunspot::Rails commits at the end of every request that updates the Solr index. Turn that off.
Use Solr's autoCommit functionality. That's configured in solr/conf/solrconfig.xml.
Be glad for assumed inconsistency. Don't use search where results need to be up-to-the-second.
other setup issues (http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance)
Look at the logs for more details
Instead of pushing your data into Solr to flatten the records, why don't you just create a separate table in your MySQL database that is optimized for read-only access?
Also you seem to contradict yourself
The view is relying on a certain object. It doesn't care if the object it self is an Active Record object or an Solr object, as long as it can call a set of attributes on the it.
The problem in my case is that the data being returned from Solr is flat... This forces me to build a fake active record object that can be rendered by the view.

What is the most efficient way to keep track of all user traffic inside a database

Currently I am using mysql to log all traffic from all users coming into a website that I manage. The database has grown to almost 11m rows in a month, and queries are getting quite slow. Is there a more efficient way to log user information? All we are storing is their request, useragent, and their ip, and associating it with a certain website.
Why not try Google Analytics? Even if you might not think it would be sufficient for you, I bet you it can track 99% of what you want to be tracked.
The answer depends completely on what you expect to retrieve in the query side. Are you looking for aggregate information, are you looking for all of history or only a portion? Often, if you need to look at every row to find out what you need, storing in basic text files is quickest.
What kind of queries do you want to run on the data? I assume most of your queries are over data in the current or a recent time window. I would suggest using time-based partitioning of the table. This will make such queries faster because they will hit only the partitions that hold the relevant data, which means fewer disk seeks. Also, regularly purge old data and move it into summary tables. Some useful links are:
http://forge.mysql.com/w/images/a/a2/FOSDEM_2009-Giuseppe_Maxia-Partitions_Performance.pdf
http://www.slideshare.net/bluesmoon/scaling-mysql-writes-through-partitioning-3397422
The most efficient way is probably to have Apache (assuming that's what the site is running on) simply use its built-in logging to text logs, and configure something like AWStats. This removes the need to log this information yourself, and should provide you with the information you are looking for, probably already configured in existing reports. The benefit of this over something like Google Analytics is that the tracking is server-side.
Maybe stating the obvious, but have you got a good index for the queries that you are making?
1) Look at using Piwik to perform Google Analytics-style tracking while retaining control of the MySQL data.
2) If you must continue to use your own system, look at using the InnoDB Plugin in order to support compressed table types. In addition, convert the IP to an unsigned integer, and replace both useragent and request with unsigned integer IDs referencing lookup tables that are compressed using either InnoDB compression or the archive engine.
3) Skip partitioning and shard the DB by month.
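As a small illustration of the IP conversion in point 2, here is a sketch using Python's standard ipaddress module. It covers IPv4 only, and the function names are made up for this example.

```python
import ipaddress


def ip_to_uint(ip):
    """Convert a dotted-quad IPv4 string to the integer stored in an INT UNSIGNED column."""
    return int(ipaddress.IPv4Address(ip))


def uint_to_ip(value):
    """Convert the stored integer back to dotted-quad form."""
    return str(ipaddress.IPv4Address(value))


print(ip_to_uint("192.168.1.1"))  # 3232235777
print(uint_to_ip(3232235777))     # 192.168.1.1
```

MySQL can do the same conversion server-side with INET_ATON() and INET_NTOA(), so the application does not strictly need its own helper. Either way, the value fits in 4 bytes instead of up to 15 characters.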
This is what "Data Warehousing" is for. Consider buying a good book on warehousing.
Collect the raw data in some "current activity" schema.
Periodically, move it into a "warehouse" (or "datamart") star schema that's (a) separate from the current activity schema and (b) optimized for count/sum/group-by queries.
Move, BTW, means insert into warehouse schema and delete from current activity schema.
Separate your ongoing transactional processing from your query/analytical processing.
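The "move" step can be sketched as an insert followed by a delete inside one transaction. The example below uses an in-memory SQLite database purely as a stand-in for MySQL so that it runs anywhere. The table names hits_raw and hits_daily are hypothetical, and a pre-aggregated daily table is a simplification of a full star schema.

```python
import sqlite3


def move_to_warehouse(conn, cutoff):
    """Summarise raw hits older than `cutoff` into daily counts, then delete them."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO hits_daily (day, site_id, hits) "
            "SELECT date(ts), site_id, COUNT(*) FROM hits_raw "
            "WHERE ts < ? GROUP BY date(ts), site_id",
            (cutoff,),
        )
        conn.execute("DELETE FROM hits_raw WHERE ts < ?", (cutoff,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits_raw (ts TEXT, site_id INTEGER, ip INTEGER, request TEXT)")
conn.execute("CREATE TABLE hits_daily (day TEXT, site_id INTEGER, hits INTEGER)")
conn.executemany(
    "INSERT INTO hits_raw VALUES (?, ?, ?, ?)",
    [
        ("2013-01-01 10:00:00", 1, 3232235777, "/"),
        ("2013-01-01 11:30:00", 1, 3232235778, "/about"),
        ("2013-01-02 09:00:00", 1, 3232235777, "/"),
    ],
)
move_to_warehouse(conn, "2013-01-02 00:00:00")
print(conn.execute("SELECT day, site_id, hits FROM hits_daily").fetchall())
print(conn.execute("SELECT COUNT(*) FROM hits_raw").fetchone()[0])
```

Choosing cutoffs on day boundaries keeps each day in a single summary row. A cutoff in the middle of a day would produce two partial rows for that day across successive runs.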

MySQL table modified timestamp

I have a test server that uses data from a test database. When I'm done testing, it gets moved to the live database.
The problem is, I have other projects that rely on the data now in production, so I have to run a script that grabs the data from the tables I need, deletes the data in the test DB and inserts the data from the live DB.
I have been trying to figure out a way to improve this model. The problem isn't so much in the migration, since the data only gets updated once or twice a week (without any action on my part). The problem is having the migration take place only when it needs to. I would like to have my migration script include a quick check against the live tables and the test tables and, if need be, make the move. If there haven't been updates, the script quits.
This way, I can include the update script in my other scripts and not have to worry if the data is in sync.
I can't use timestamps. For one, I have no control over the tables on the live side once it goes live, and also because it seems a bit silly to bulk up the tables more just for convenience.
I tried doing a "SHOW TABLE STATUS FROM livedb" but because the tables are all InnoDB, there is no "Update Time", plus, it appears that the "Create Time" was this morning, leading me to believe that the database is backed up and re-created daily.
Is there any other property in the table that would show which of the two is newer? A "Newest Row Date" perhaps?
In short: Make the development-live updating first-class in your application. Instead of depending on the database engine to supply you with the necessary information to enable you to make a decision (to update or not to update ... that is the question), just implement it as part of your application. Otherwise, you're trying to fit a round peg into a square hole.
Without knowing what your data model is, and without understanding at all what your synchronization model is, you have a few options:
Match primary keys against live database vs. the test database. When test > live IDs, do an update.
Use timestamps in a table to determine if it needs to be updated
Use the md5 hash of a database table and modification date (UTC) to determine if a table has changed.
Long story short: Database synchronization is very hard. Implement a solution which is specific to your application. There is no "generic" solution which will work ideally.
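A minimal sketch of the hash-based option, covering only the content hash and not the modification date. It assumes the rows of each table are fetched in a stable order (for example ORDER BY the primary key) and that both sides return the same Python types; the sample rows are invented.

```python
import hashlib


def table_digest(rows):
    """Return an MD5 hex digest over rows, which must arrive in a stable order."""
    md5 = hashlib.md5()
    for row in rows:
        md5.update(repr(tuple(row)).encode("utf-8"))
        md5.update(b"\n")
    return md5.hexdigest()


# Hypothetical result sets from the live and test databases.
live_rows = [(1, "alpha"), (2, "beta")]
test_rows = [(1, "alpha"), (2, "beta-old")]

needs_update = table_digest(live_rows) != table_digest(test_rows)
print(needs_update)  # True here, because row 2 differs
```

This reads every row on both sides, so it suits tables that are small or rarely checked. MySQL's CHECKSUM TABLE statement may be a server-side alternative worth testing.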
If you have an autoincrement in your tables, you could compare the maximum autoincrement values to see if they're different.
But which version of MySQL are you using?
Rather than rolling your own, you could use a preexisting solution for keeping databases in sync. I've heard good things about SQLYog's SJA (see here). I've never used it myself, but I've been very impressed with their other programs.