Strategy for handling large datasets in a heavily inserted-into table - MySQL

I have a web application that has a MySQL database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2000+ rows per device per day, and we estimate having 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months when this table has over 10 million rows? And then in a year, when it has 100 million rows? This is the simplest option, but it seems like a table that large would have terrible performance.
Should I archive older data after some time period (a month, a week) and then have the web app query the live table for recent reports, and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation? Cron? A DB trigger? Also, I would probably still need to archive.
There must be a more elegant solution to handling this type of data.

I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your third option. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you specifically need to track each occurrence over the last hour, you might be overdoing it to store every event and aggregate later. Instead of bothering with a cron job to perform regular aggregation, you can simply check for an existing entry with matching specs; if you find one, update a count field on that row instead of inserting a new row.
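For example, a rough sketch of that upsert pattern in MySQL, assuming a hypothetical daily_ad_views table keyed on (ad_id, view_date); adjust the names to your own schema:

    CREATE TABLE daily_ad_views (
        ad_id     INT NOT NULL,
        view_date DATE NOT NULL,
        views     INT NOT NULL DEFAULT 0,
        PRIMARY KEY (ad_id, view_date)
    );

    -- One statement per view: inserts the day's row if it doesn't exist,
    -- otherwise bumps the counter on the existing row.
    INSERT INTO daily_ad_views (ad_id, view_date, views)
    VALUES (42, CURDATE(), 1)
    ON DUPLICATE KEY UPDATE views = views + 1;

The same idea would apply to the device_status case: one row per device per hour (or day) with a counter per status, updated in place rather than inserted.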

Related

ColdFusion/jQuery like/dislike functionality - MySQL DB size optimization concern

We are only a few days away from launching our new website, so we have started putting the finishing touches on the development process. Although everything works great thanks to our effort to optimize every part of the application, my partner has raised a question about MySQL database size and a possible issue we could run into over time. More specifically, we built like/dislike functionality (CFC, jQuery) which, though it works flawlessly, will significantly increase the DB size if we manage to attract a lot of visitors.
Here is our logic:
- every user can vote only once on one article (vote up or vote down) as we store IP in the DB
- say 10,000 users come to the website and vote on 10 posts; that's 100,000 inserts (via a stored procedure) and 100,000 rows in our database. Multiply that by 10, 100 or 1000 and you get the picture.
The votes table has 4 columns:
- typeID (voteUp = 1 and voteDown = 2)
- articleID
- IP
- vCount (we SUM it to count how many votes each article has)
Are we missing the point here? In your experience, what's the best approach to handling this type of functionality?
I'd say there's nothing wrong with your approach. Assuming your data storage capacity is not too limited, you should not run out of space for a long time.
You could of course use just one record per article, but this might create a bottleneck when the record needs to be locked for updating every time a user votes.
What you might consider is adding a date/time field to your votes table to store when a vote was recorded. By creating an additional table with one row per article to keep track of the overall votes, you could query all votes that are older than, let's say, 12 months, update your new table accordingly, and delete the old votes from the votes table. Stuff that functionality into a scheduled task and you're done. That way you will lose the IP information (after 12 months or whatever timespan you chose) but gain back some storage.
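A rough sketch of that scheduled roll-up, assuming a hypothetical article_vote_totals table with articleID as its primary key and a voted_at date/time column on votes (both names are illustrative):

    -- Capture one cutoff so the INSERT and DELETE agree on the same rows.
    SET @cutoff := NOW() - INTERVAL 12 MONTH;

    -- Fold old votes into per-article totals...
    INSERT INTO article_vote_totals (articleID, up_votes, down_votes)
    SELECT articleID,
           SUM(typeID = 1),
           SUM(typeID = 2)
    FROM votes
    WHERE voted_at < @cutoff
    GROUP BY articleID
    ON DUPLICATE KEY UPDATE
        up_votes   = up_votes + VALUES(up_votes),
        down_votes = down_votes + VALUES(down_votes);

    -- ...then drop the detail rows (and their IPs) that were just folded in.
    DELETE FROM votes
    WHERE voted_at < @cutoff;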

What's the best way to create summary records from detail records with MySQL?

I have maybe 10 to 20 million detail records coming in per day (statistical and performance data) that must be read in and summarized into 24 hourly summary records and 1 daily summary record.
The process calculates averages on several fields and gets the max and min values of others; nothing significant CPU-wise.
Is it better to:
A) Summarize the detail records into the summary records while the records are coming in, delaying each detail record insert slightly? I assume there will be a lot of locking (SELECT ... FOR UPDATE, etc.) on the summary tables, as there are several different systems importing data.
B) Wait until the hour is over, and then select the entire previous hour's data and create the summary records? There would be a delay before users see the statistics, but the detail records would be available during that time.
Perhaps there are alternative methods to this?
Just create views for the summary tables. All your inserts will work as usual; just define views that return the summaries you need, and they will stay in sync with the main tables automatically.
You can build both the 24 hourly summaries and the 1 daily summary this way. Views are stored queries that produce a result set when invoked; a view acts as a virtual table.
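For instance, a minimal sketch, assuming a detail table called detail_records with a recorded_at timestamp and a value column (adjust to your actual schema):

    CREATE VIEW hourly_summary_view AS
    SELECT DATE(recorded_at) AS summary_date,
           HOUR(recorded_at) AS summary_hour,
           COUNT(*)          AS record_count,
           AVG(value)        AS avg_value,
           MIN(value)        AS min_value,
           MAX(value)        AS max_value
    FROM detail_records
    GROUP BY DATE(recorded_at), HOUR(recorded_at);

Keep in mind that a view is re-evaluated against the detail table every time it is queried, so it saves storage and code rather than read time.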
For more details about views, refer to: http://dev.mysql.com/doc/refman/5.0/en/create-view.html
Let me know if you want further assistance regarding MySQL views.
It'd depend on the load required to run the single update, but I'd probably go with a separate summary run. I'd put a small bet on a single batched run taking less total time than the cumulative cost of the on-every-insert approach.
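A sketch of such a separate run, assuming hypothetical detail_records and hourly_summary tables; a cron job (or MySQL event) would fire it a few minutes after each hour for the hour that just ended:

    -- Compute the boundaries of the previous hour once.
    SET @hour_start = DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00');
    SET @hour_end   = DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00');

    -- One scan of the previous hour's detail rows produces one summary row.
    INSERT INTO hourly_summary (hour_start, record_count, avg_value, min_value, max_value)
    SELECT @hour_start, COUNT(*), AVG(value), MIN(value), MAX(value)
    FROM detail_records
    WHERE recorded_at >= @hour_start
      AND recorded_at <  @hour_end;

The daily summary can then be built the same way from the 24 hourly rows, which keeps the heavy detail table out of the daily run entirely.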

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was simply to grab all the values for each of the top 1500 players every hour with a PHP script and insert them into MySQL as one row per player. So every day the table will grow by 36,000 rows. I will have a second script that deletes every row older than 28 days, otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query that should return all players who didn't gain points1 during a certain time period counted back from the latest entry. This should give a list of players who didn't gain points (over the last 24h, for example).
The third is a query that should list all players who lost a certain amount of points2 or more in a certain time period counted back from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes, especially since your DB will have over 1 million rows in that one table at any given time. Since your writes happen via cron, you should be okay as long as you insert your 1500 rows in batches rather than making one round trip to the DB for every row. I'd also look into prepared statements so that you save the query-parsing overhead as well.
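For example, a multi-row insert like this covers all 1500 players in one round trip (the table and column names are only a sketch of your description):

    INSERT INTO rankings
        (snapshot_time, pos, player_id, points1, points2)
    VALUES
        ('2024-05-01 13:00:00',    1, 101, 982345, 1200),
        ('2024-05-01 13:00:00',    2, 207, 975010, 1180),
        -- ... one tuple per player, built by the PHP script ...
        ('2024-05-01 13:00:00', 1500, 993,  10450,   75);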
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use its primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing those are more or less enumerated values that you can store in other lookup tables to normalize your design; they can then be returned in your results with the appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
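A rough sketch of that normalization (column names are illustrative, and most of the points/level columns are trimmed for brevity):

    CREATE TABLE players (
        player_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        name      VARCHAR(64) NOT NULL,
        PRIMARY KEY (player_id),
        UNIQUE KEY uq_name (name)
    );

    CREATE TABLE rankings (
        player_id     INT UNSIGNED NOT NULL,   -- references players.player_id
        alliance_id   INT UNSIGNED,            -- references a small alliances lookup table
        race_id       TINYINT UNSIGNED,        -- references a small races lookup table
        points1       INT NOT NULL,
        points2       INT NOT NULL,
        snapshot_time DATETIME NOT NULL,
        KEY idx_player_time (player_id, snapshot_time),
        KEY idx_points1 (points1),
        KEY idx_points2 (points2)
    );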
Your design may also be flawed with regard to the ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done in PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/pricing/etc. can be used to calculate the order total.
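For example, on MySQL 8.0 or later the position can be derived at query time with a window function (on older versions a user variable or the PHP side can number the ordered rows instead):

    SELECT RANK() OVER (ORDER BY points1 DESC) AS pos,
           player_id,
           points1
    FROM rankings
    WHERE snapshot_time = '2024-05-01 13:00:00'
    ORDER BY points1 DESC;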
With all the adding/deleting, I'd be sure to run OPTIMIZE TABLE frequently and keep good backups. MySQL tables (if using MyISAM) can become corrupted easily in heavy write/delete scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: if you have hundreds of thousands of users and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 and 200,000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility, though the specifics are somewhat beyond my realm of expertise: you seem to have an ideal candidate for an OLAP cube here. You have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
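A minimal sketch of such a job, assuming RankHistory has a recorded_on date column (illustrative name); it could run from cron via the mysql client, or as a MySQL event if the event scheduler is enabled:

    -- Purge ranking rows older than 90 days, once a day.
    CREATE EVENT purge_rank_history
    ON SCHEDULE EVERY 1 DAY
    DO
        DELETE FROM RankHistory
        WHERE recorded_on < CURDATE() - INTERVAL 90 DAY;

On a table this size you may want the delete to run in chunks (e.g. with a LIMIT in a loop) to keep locks short.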
Another option: you could create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), all stored per user + date of activity. Then you could keep your pre-aggregated rollups in a much smaller table for however far back in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
With 222k+ users, I would seriously question how important the "ranking" really is for someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter?

MySQL speed optimization on a table with many rows : what is the best way to handle it?

I'm developing a chat application. I want to keep everything logged in a table (i.e. "who said what and when").
I hope that in the near future I'll have thousands of rows.
I was wondering: what is the best way to optimize the table, knowing that I'll do frequent row insertions and occasional group reads (i.e. showing an entire conversation from a user: look at when he/she logged in/started to chat, then look at when he/she quit, then show the entire conversation)?
This table should be able to handle (I hope!) many, many rows (15,000 per day => 4.5M each month => 54M rows at the end of the year).
Conversations older than 15 days could be archived (but I don't know the right way to go about that).
Any ideas?
I have two pieces of advice for you:
- If you are expecting lots of writes with few, low-priority reads, then you are better off with as few indexes as possible. Indexes make inserts slower; only add what you really need.
- If the log table is going to get bigger and bigger over time, you should consider log rotation. Otherwise you might end up with one gigantic corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrent_insert is set to 1: inserts will always be added to the end of the table, so SELECTs and INSERTs can happen simultaneously. So you don't have to use InnoDB-based tables (which can't be part of MERGE tables anyway).
You could have one table per month, named something like data200905, data200904, etc. Your MERGE table would then include all the underlying tables you need to search on. Inserts are done against the MERGE table, so you don't have to worry about changing table names in your code. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
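A rough sketch of that setup (the monthly tables must be identical MyISAM tables; the columns here are just placeholders for your chat log):

    CREATE TABLE data200904 (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created DATETIME NOT NULL,
        message VARCHAR(255),
        PRIMARY KEY (id)
    ) ENGINE=MyISAM;

    CREATE TABLE data200905 LIKE data200904;

    -- The MERGE table spans the underlying tables; INSERT_METHOD=LAST
    -- sends inserts to the most recent one in the UNION list.
    CREATE TABLE data_all (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created DATETIME NOT NULL,
        message VARCHAR(255),
        INDEX (id)
    ) ENGINE=MERGE UNION=(data200904, data200905) INSERT_METHOD=LAST;

    -- When a new month starts, just redeclare the union:
    -- ALTER TABLE data_all UNION=(data200905, data200906);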
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
I've done this setup on databases that added 30 million records per month.
MySQL does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in a database and it ran just fine on MySQL.
MySQL does have an ARCHIVE storage engine for handling many rows, but its lack of index support makes it not a great option for you, except perhaps for historical data.
Index creation will be required, but you have to balance your indexes and not just create them because you can. They allow for faster queries (and will be required for usable queries on a table that large), but the more indexes you have, the more expensive inserts become.
If you are just querying on your "user" id column, an index there will not be a problem, but if you are looking to do full-text queries on the messages, you may want to consider indexing only the user column in MySQL and using something like Sphinx or Lucene for the full-text searches, as full-text search in MySQL is not the fastest and significantly slows down insert time.
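For instance, a minimal sketch of such a log table, indexed only on the columns the conversation query actually filters on (the names are illustrative):

    CREATE TABLE chat_log (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id INT UNSIGNED NOT NULL,
        message TEXT NOT NULL,
        created DATETIME NOT NULL,
        PRIMARY KEY (id),
        -- Supports "show this user's conversation in time order".
        KEY idx_user_created (user_id, created)
    );
    -- No FULLTEXT index here; message search would be delegated to Sphinx/Lucene.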
You could handle this with two tables: one for the current chat history and one archive table. At the end of a period (week, month or day, depending on your traffic) you can archive current chat messages, remove them from the small table and add them to the archive.
This way your application will handle the most common case well: querying the current chat state, which is going to be really fast.
For queries like "what did X say last month" you will query the archive table, and that will take a little longer, but this is OK since there won't be many such queries, and if someone does a search like this they will be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle: if there will be a lot of queries for chat messages from the last 6 months, store those in a separate table too.
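A rough sketch of the periodic rotation described above, assuming chat_current and chat_archive have the same structure and a created column (run it from a scheduled job):

    -- Capture one cutoff so the copy and the delete agree on the same rows.
    SET @cutoff := NOW() - INTERVAL 15 DAY;

    -- Copy messages older than the cutoff into the archive...
    INSERT INTO chat_archive
    SELECT * FROM chat_current
    WHERE created < @cutoff;

    -- ...then remove them from the hot table.
    DELETE FROM chat_current
    WHERE created < @cutoff;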
A similar principle (in a completely different area) is used by the .NET garbage collector, which has different storage for short-lived objects, long-lived objects, large objects, etc.