I have a huge amount of data that is generated over the course of a year, spanning many tables. Keeping all of this data in the same tables over the years is making queries slower. While I want to preserve the old data for record-keeping and for some end-user queries, it becomes much less relevant once a new calendar year starts. Earlier I tried keeping an archive table for each such table, into which I moved data older than 1 year, but that approach makes the archive table grow very large in a short time.
Would it be better to have separate tables altogether for every new calendar year? i.e.
myTable_2011
myTable_2012
myTable_2013
myTable_2014
...
You might use partitioning. Maybe these links can give you an idea:
http://dev.mysql.com/tech-resources/articles/partitioning.html
How to partition a MySQL table based on char column?
http://refcardz.dzone.com/refcardz/database-partitioning
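For illustration, here is a minimal sketch of year-based range partitioning, assuming a table named myTable with a DATE column called created (both names are placeholders):

    -- Hypothetical table partitioned by calendar year, instead of maintaining
    -- myTable_2011, myTable_2012, ... by hand.
    CREATE TABLE myTable (
        id      INT NOT NULL AUTO_INCREMENT,
        created DATE NOT NULL,
        data    VARCHAR(255),
        PRIMARY KEY (id, created)   -- the partitioning column must be part of every unique key
    )
    PARTITION BY RANGE (YEAR(created)) (
        PARTITION p2011 VALUES LESS THAN (2012),
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION p2013 VALUES LESS THAN (2014),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Queries still target the single logical table; MySQL prunes partitions
    -- when the WHERE clause restricts the partitioning column.
    SELECT * FROM myTable
    WHERE created >= '2013-01-01' AND created < '2014-01-01';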
Suppose I have a site with 10,000 members and I want to collect one particular metric from those users every day. These metric points would be used to form points on a graph. So as you can imagine, the data points are very much tied to the day on which they were collected. To keep things relevant, any data older than 30 days would be deleted.
What would be the best way to store this data? I'm currently using MariaDB. I would like to have a table where the first column is for the member id, and the other columns are for each of the days on which the data was collected. With each new day, a new column would be created at the end of the table to store that day's data, and the oldest day's column would be deleted. This seems like the most computationally efficient approach.
This would mean the titles for those columns would be the date. Similar questions have been posed on here before and people are quick to condemn the database design for having dates as column titles. I'm not necessarily disagreeing but I would like to know what a better alternative is.
The other big problem here is that trying to get MariaDB / MySQL to automatically add a column with a unique title every day and delete the first column is quite a bit more complicated than I think it ought to be. I wonder if a different database service such as PostgreSQL would be better suited for these sorts of things.
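For comparison, the design usually recommended instead of date-named columns is one row per member per day; a rough sketch (all names here are invented):

    -- One row per member per day instead of one column per day.
    CREATE TABLE member_metrics (
        member_id    INT  NOT NULL,
        metric_date  DATE NOT NULL,
        metric_value INT  NOT NULL,
        PRIMARY KEY (member_id, metric_date)
    );

    -- Collecting a day's data is a plain INSERT, and pruning is a DELETE,
    -- so the schema never has to change.
    INSERT INTO member_metrics (member_id, metric_date, metric_value)
    VALUES (1, CURDATE(), 42);

    DELETE FROM member_metrics
    WHERE metric_date < CURDATE() - INTERVAL 30 DAY;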
Every month I get sent a file from an external company which needs to be stored in a database, with each file containing up to a million records. The main data fields are Month, Year, Postcode and TransactionType.
I was proposing that we save the data in our database as a new SQL table each month, so we know there is only a finite amount of data in each table. However, one of my colleagues said he was once told that creating a new table every month is bad practice, but he didn't know why.
If I were to have multiple tables, there would only be a maximum of 60 tables, though there may be far fewer (down to 12) depending on how far into the past my client needs to look. This also means that every month I will need to delete a month's worth of data.
However when I do my SQL queries I will only need a single row of data from a single table per query. I would think in theory this would be more efficient than having a single table filled with millions of rows.
I was wondering if anyone had any definitive reasons as to why splitting the data this way would be a bad thing to do?
All "like" items should be stored together in a database for the following reasons:
You should be able to provide any subset of the items using a single SELECT statement, changing only the WHERE clause of that statement. With separate tables you will have to write code to decompose each request into the part that computes the table name and the part that filters that table, and you will have to duplicate that logic in every application, or teach it to every user, that wants to use your database. (A sketch follows this list.)
You should not artificially limit the use to which your data can be put. If you have separate monthly tables you have already substantially limited the types of queries you can enter against them without having to write more complex UNION queries.
The addition of more instances of a known data type to your database should not require ALTERing the structure of your database and, as a general principle, regularly-run code should not even have ALTER permissions.
If proper indexes are maintained, there is very little performance difference when SELECTing data from a table 60 times the size of a smaller table. (There can be more of an effect on INSERT and UPDATE commands, but it sounds like you'll be doing a monthly bulk load rather than updating the data constantly.)
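To make the first point concrete, here is a rough sketch (table and column names are invented stand-ins for the Month/Year/Postcode/TransactionType fields) of the same request against one combined table versus per-month tables:

    -- One table: any month, or range of months, is just a WHERE clause.
    SELECT postcode, transaction_type
    FROM transactions
    WHERE txn_year = 2014 AND txn_month IN (1, 2, 3);

    -- Per-month tables: the application must compute the table names and
    -- stitch the pieces together with UNION ALL.
    SELECT postcode, transaction_type FROM transactions_2014_01
    UNION ALL
    SELECT postcode, transaction_type FROM transactions_2014_02
    UNION ALL
    SELECT postcode, transaction_type FROM transactions_2014_03;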
I can think of only two reasons for sharding data into separate tables:
You discover that you have a performance issue that can't be resolved through better data design.
You have records with different levels of security and are relying on GRANT SELECT permissions to allow only some users to see the records at higher levels of security.
A simpler method would be to add a column to that table which contains a datetime stamp of when each row was loaded into the system. That way you can filter on that particular column to segregate the data into the months/years in which it was loaded.
Another advantage, from a performance perspective, is that if you regularly filter data this way, you can create an index on this date column.
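A rough sketch of that approach, assuming the table is called transactions (names here are placeholders):

    -- Record when each row was loaded, and index that column so month/year
    -- filters stay cheap (a DATETIME default needs MySQL 5.6+; use TIMESTAMP
    -- on older versions).
    ALTER TABLE transactions
        ADD COLUMN loaded_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP;

    CREATE INDEX idx_transactions_loaded_at ON transactions (loaded_at);

    -- Pulling one month's worth of data is then a simple range filter.
    SELECT *
    FROM transactions
    WHERE loaded_at >= '2014-03-01' AND loaded_at < '2014-04-01';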
Having multiple tables that contain the same kind of information is not recommended, both for performance reasons and because of how information is stored in SQL. Eventually it will take up more space, and if one month's data needs to reference another month's data, those queries will be quite slow.
Hope this helps.
If you think it won't make your application difficult to manage, you can do it.
For example: do you need to change your SQL queries every month?
What happens if a user needs a report that spans more than one month?
With partitioning, the DBMS splits your data into multiple tables on the physical storage, but you can query all of them under the same name. The DBMS works out which partition(s) each query needs to read. The performance difference is not significant.
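As an illustration (the table and column names below are assumptions), monthly range partitions also make the "delete a month's worth of data" step cheap, because dropping a partition is much faster than a row-by-row DELETE:

    -- Hypothetical monthly range partitioning; TO_DAYS keeps the ranges contiguous.
    CREATE TABLE monthly_feed (
        id       BIGINT NOT NULL AUTO_INCREMENT,
        txn_date DATE NOT NULL,
        postcode VARCHAR(10),
        txn_type VARCHAR(20),
        PRIMARY KEY (id, txn_date)
    )
    PARTITION BY RANGE (TO_DAYS(txn_date)) (
        PARTITION p201401 VALUES LESS THAN (TO_DAYS('2014-02-01')),
        PARTITION p201402 VALUES LESS THAN (TO_DAYS('2014-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Removing the oldest month is a metadata operation, not a row-by-row DELETE.
    ALTER TABLE monthly_feed DROP PARTITION p201401;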
Problem: we have a very big table, and it is growing. Most of its entries (say 80%) are historical data (rows whose "DATE" field is already in the past) that are seldom queried, while a small part (say 20%) is current data (rows whose "DATE" field is still in the future); most queries search these current entries.
Consider two possible scenarios; which one would be better (considering overall implementation difficulty, performance, ...)?
A. Breaking the big table into two tables, Historical and Current, and on a daily basis moving the records whose date has expired from the Current table to the Historical table.
B. Keeping all records in one table (with the DATE field indexed).
Scenario A implies more hassle in implementation and maintenance, plus a daily overhead for moving data between tables, while scenario B means searching a big table (though indexed). Does it impose memory problems? Which scenario is recommended? Are there any other recommendations?
You usually don't want to break a big table into multiple tables, although having a current and historical table is totally reasonable. Your process makes sense. You can then optimize the current table for your query needs. I would probably go for two tables (given the limited information you provide), because it allows such optimization.
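If you do go with two tables, the nightly move from scenario A can be quite short; a sketch, with the table and column names as placeholders:

    -- Nightly job: move rows whose date has expired into the historical table.
    -- Wrapped in a transaction (InnoDB assumed) so a failure doesn't lose rows.
    START TRANSACTION;

    INSERT INTO big_table_history
    SELECT * FROM big_table_current
    WHERE date_col < CURDATE();

    DELETE FROM big_table_current
    WHERE date_col < CURDATE();

    COMMIT;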
However, don't split the historical data further. Instead, use partitioning. See the documentation. One caveat: queries need to specify the partitioning key in the WHERE clause to take advantage of the partitions. With a large table, this is typical anyway.
Question: is the historical data necessary for system functionality or are these records stored for other purposes (e.g. audits)? It may be time to clean house by moving the historical data to an archive.
In my experience, most systems with big data have historical tables. In most cases I have seen, the current data and the historical data have different user groups. The current data is used by front-end users dealing with customers and their current or recent transactions. The historical data is usually used by user groups who do not have to talk to customers/clients directly.
Do not worry much about implementation and maintenance, as I think your main consideration is performance. Implementation is a one-time effort; the archival job then runs at a specified frequency (weekly, monthly or yearly) once the programs are moved to production. Maintenance is very small and you can largely forget about it once it is implemented; you just have to make sure that you test the programs thoroughly.
With normalized historical tables, the current and historical tables have the same structure and field names, which makes the data copy much easier. This way, one can simply do a join between the two tables when needed.
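One way to keep the structures identical (a sketch, with hypothetical table names) is to clone the current table's definition:

    -- Clone the current table's structure (columns and indexes) so archiving
    -- stays a plain INSERT ... SELECT between identically shaped tables.
    CREATE TABLE big_table_history LIKE big_table_current;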
If you choose to not split the data, you will continue to add index after index. But somewhere down the road, you will still encounter the same issue again.
I have a web application that has a MySQL database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2000+ rows per device per day, and we estimate having 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months, when this table has over 10 million rows, and then in a year, when it has 100 million rows? This is the simplest option, but it seems like a table that large would have terrible performance.
Should I archive older data after some time period (a week, a month) and then make the web app query the live table for recent reports, and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation? Cron? DB Trigger? Also I would probably still need to archive.
There must be a more elegant solution to handling this type of data.
I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your #3 solution. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you specifically need to track each occurrence over the last hour, you might be overdoing it to store them individually and aggregate later. Instead of bothering with a cron job to perform regular aggregation, you could simply check for an entry with matching specs. If you find one, you update a count field of the matching row instead of inserting a new row.
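In MySQL/MariaDB, that "insert the first time, update afterwards" pattern can be a single statement; a minimal sketch with invented table and column names:

    -- One row per ad per day; the composite primary key lets the counter
    -- be updated in place instead of inserting a new row per view.
    CREATE TABLE ad_views (
        ad_id    INT  NOT NULL,
        view_day DATE NOT NULL,
        views    INT  NOT NULL DEFAULT 0,
        PRIMARY KEY (ad_id, view_day)
    );

    -- Run once per view: inserts the day's row the first time,
    -- then just bumps the counter on every later view.
    INSERT INTO ad_views (ad_id, view_day, views)
    VALUES (42, CURDATE(), 1)
    ON DUPLICATE KEY UPDATE views = views + 1;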
I'm developing a chat application. I want to keep everything logged in a table (i.e. "who said what and when").
I hope that in a near future I'll have thousands of rows.
I was wondering: what is the best way to optimize the table, knowing that I'll do frequent row insertions and occasional bulk reads (i.e. showing an entire conversation from a user: find when they logged in/started to chat, find when they quit, then show the whole conversation in between)?
This table should be able to handle (I hope!) a great many rows (15,000 / day => 4.5 M each month => 54 M rows at the end of the year).
Conversations older than 15 days could be archived (but I don't know how to do that properly).
Any ideas?
I have two pieces of advice for you:
1. If you are expecting lots of writes with few, low-priority reads, then you are better off with as few indexes as possible. Indexes make inserts slower; only add what you really need.
2. If the log table is going to get bigger and bigger over time, you should consider log rotation. Otherwise you might end up with one gigantic corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrent inserts are enabled. Inserts will then always be added to the end of the table, so SELECTs and INSERTs can happen simultaneously. This also means you don't have to use InnoDB-based tables (which can't be part of MERGE tables anyway).
You could have 1 table per month, named something like data200905, data200904, etc. Your MERGE table would then include all the underlying tables you need to search on. Inserts are done on the MERGE table, so you don't have to worry about changing names. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
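A rough sketch of that layout (all table and column names here are invented); note that MERGE only works over identical MyISAM tables:

    -- Identical MyISAM tables, one per month (columns kept minimal here).
    CREATE TABLE data200904 (
        user_id INT NOT NULL,
        message TEXT,
        created DATETIME NOT NULL,
        KEY idx_created (created)
    ) ENGINE=MyISAM;

    CREATE TABLE data200905 LIKE data200904;

    -- The MERGE table presents them as one logical table;
    -- INSERT_METHOD=LAST routes new rows to the last table in the UNION list.
    CREATE TABLE data_all (
        user_id INT NOT NULL,
        message TEXT,
        created DATETIME NOT NULL,
        KEY idx_created (created)
    ) ENGINE=MERGE UNION=(data200904, data200905) INSERT_METHOD=LAST;

    -- When a new month starts, create the next table and redeclare the list.
    CREATE TABLE data200906 LIKE data200904;
    ALTER TABLE data_all UNION=(data200904, data200905, data200906);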
I've done this setup on databases that added 30 million records per month.
MySQL does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in a database and was able to run it just fine on MySQL.
MySQL does have an ARCHIVE storage engine option for handling many rows, but its lack of index support makes it not a great option for you, except perhaps for historical data.
Index creation will be required, but you have to balance the indexes and not just create them because you can. They allow faster queries (and will be required for usable queries on a table that large), but the more indexes you have, the higher the cost of each insert.
If you are just querying on your user id column, an index there will not be a problem, but if you want to do full-text queries on the messages, you may want to consider indexing only the user column in MySQL and using something like Sphinx or Lucene for the full-text searches, as full-text search in MySQL is not the fastest and significantly slows down insert times.
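A small sketch of that indexing choice (names assumed): index only the column you filter on and leave the message body to the external full-text engine:

    -- Index only the column queries filter on; the message body stays
    -- unindexed in MySQL and is handed to Sphinx/Lucene for full-text search.
    CREATE INDEX idx_messages_user_id ON messages (user_id);

    SELECT message, created
    FROM messages
    WHERE user_id = 123
    ORDER BY created;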
You could handle this with two tables - one for the current chat history and one archive table. At the end of a period (day, week or month, depending on your traffic) you can archive the current chat messages, remove them from the small table and add them to the archive.
This way your application is going to handle well the most common case - query the current chat status and this is going to be really fast.
For queries like "what did X say last month" you will query the archive table, and it is going to take a little longer, but that is OK since there won't be that many of these queries, and someone doing such a search will usually be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle - if there will be a lot of queries for chat messages from the last 6 months, store those in a separate table too.
A similar principle (for a completely different area) is used by the .NET garbage collector, which has different storage for short-lived objects, long-lived objects, large objects, etc.