Suppose I have a site with 10,000 members and I want to collect one particular metric from those users every day. These metrics would form the points on a graph, so each data point is tied to the day on which it was collected. To keep things relevant, any data older than 30 days would be deleted.
What would be the best way to store this data? I'm currently using MariaDB. I would like to have a table where the first column is the member id and the other columns hold the data for each day on which it was collected. Each day, a new column would be added to the end of the table to store that day's data, and the oldest day's column (the first one after the member id) would be deleted. This seems like the most computationally efficient approach.
This would mean the titles for those columns would be the date. Similar questions have been posed on here before and people are quick to condemn the database design for having dates as column titles. I'm not necessarily disagreeing but I would like to know what a better alternative is.
The other big problem here is that getting MariaDB / MySQL to automatically add a column with a unique title every day and delete the oldest column is quite a bit more complicated than I think it ought to be. I wonder if a different database service such as PostgreSQL would be better suited to this sort of thing.
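The alternative usually suggested to date-named columns is one row per member per day, so that adding a day is an INSERT rather than a schema change. A minimal sketch of that layout follows. It uses Python's built-in sqlite3 module only so that it runs without a server, and the names member_metric, record and series are made up for illustration.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema: one row per member per day instead of one column per day.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE member_metric (
        member_id INTEGER NOT NULL,
        day       TEXT    NOT NULL,   -- ISO date, e.g. '2024-01-31'
        value     REAL    NOT NULL,
        PRIMARY KEY (member_id, day)
    )
    """
)

def record(member_id, day, value):
    """Store one day's metric for one member; no schema change is needed."""
    conn.execute(
        "INSERT INTO member_metric (member_id, day, value) VALUES (?, ?, ?)",
        (member_id, day.isoformat(), value),
    )

def series(member_id):
    """Return (day, value) pairs for one member, oldest first, ready to plot."""
    rows = conn.execute(
        "SELECT day, value FROM member_metric WHERE member_id = ? ORDER BY day",
        (member_id,),
    )
    return rows.fetchall()

# Illustrative data only: three days for member 42, one day for member 7.
start = date(2024, 1, 1)
for offset in range(3):
    record(42, start + timedelta(days=offset), float(offset))
record(7, start, 9.5)
```

With 10,000 members and 30 days this is about 300,000 rows, and the composite primary key on (member_id, day) is what the per-member graph query uses.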
Every month I get sent a file from an external company which needs to be stored in a database, each file containing up to a million records. The main data fields are Month, Year, Postcode and TransactionType.
I was proposing that we should save the data in our database as a new SQL table each month so we know there is only a finite amount of data in each table. However, one of my colleagues said he was once told that creating a new table every month is bad practice, but he didn't know why.
If I was to have multiple tables, there would only be a maximum of 60 tables, though there may be far fewer (down to 12) dependent on how far into the past my client needs to look. This means that every month I will need to delete a month's worth of data.
However when I do my SQL queries I will only need a single row of data from a single table per query. I would think in theory this would be more efficient than having a single table filled with millions of rows.
I was wondering if anyone had any definitive reasons as to why splitting the data this way would be a bad thing to do?
All "like" items should be stored together in a database for the following reasons:
You should be able to provide any subset of the items using a single SELECT statement only by changing the WHERE clause of that statement. With separate tables you will have to write code to decompose the request into the parts that compute the table name and the parts that filter that table. And you will have to duplicate that logic in each application, or teach it to each user, that wants to use your database.
You should not artificially limit the use to which your data can be put. If you have separate monthly tables you have already substantially limited the types of queries you can enter against them without having to write more complex UNION queries.
The addition of more instances of a known data type to your database should not require ALTERing the structure of your database and, as a general principle, regularly-run code should not even have ALTER permissions.
If proper indexes are maintained, there is very little performance difference when SELECTing data from a table 60 times the size of a smaller table. (There can be more effect on INSERT and UPDATE commands, but it sounds like you'll be doing a bulk update rather than updating the data constantly.)
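The points above can be sketched with a single table. This is an illustration under assumptions, not the asker's real schema: the table name transactions, the index columns, and the helper count_where are invented here, and sqlite3 stands in for the real database so the snippet runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        year             INTEGER NOT NULL,
        month            INTEGER NOT NULL,
        postcode         TEXT    NOT NULL,
        transaction_type TEXT    NOT NULL
    )
    """
)
# Index chosen for lookups by period and postcode (an assumption about the queries).
conn.execute("CREATE INDEX idx_period_postcode ON transactions (year, month, postcode)")

conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (2023, 12, "AB1 2CD", "sale"),
        (2024, 1, "AB1 2CD", "sale"),
        (2024, 1, "EF3 4GH", "refund"),
        (2024, 2, "AB1 2CD", "refund"),
    ],
)

def count_where(where, params=()):
    """Same SELECT every time; only the WHERE clause changes.

    `where` must come from trusted code; values always go through `params`.
    """
    sql = "SELECT COUNT(*) FROM transactions WHERE " + where
    return conn.execute(sql, params).fetchone()[0]

one_month = count_where("year = ? AND month = ?", (2024, 1))
across_months = count_where("postcode = ?", ("AB1 2CD",))  # no UNION needed

# Retiring the oldest month is a DELETE, not a DROP TABLE or an ALTER.
conn.execute("DELETE FROM transactions WHERE year = ? AND month = ?", (2023, 12))
remaining = count_where("1 = 1")
```

The cross-month query is the case that monthly tables make awkward: here it is one WHERE clause, whereas with separate tables it would need a UNION over every table involved.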
I can think of only two reasons for sharding data into separate tables:
You discover that you have a performance issue that can't be resolved through better data design.
You have records with different levels of security and are relying on GRANT SELECT permissions to allow some users to see the records at higher levels of security.
A simpler method would be to add a column to that table containing a timestamp of when each record was loaded into the system. That way you can filter on that particular column to segregate the data by the month and year in which it was loaded.
Another advantage, from a performance perspective, is that if you regularly filter data this way, you can create an index on this date column.
Having multiple tables that contain the same kind of information is not recommended, both for performance reasons and because of how information is stored in SQL. Eventually it will take up more space, and if one month's data needs to reference another month's data it will be quite slow.
Hope this helps.
If you think it won't be difficult to manage in your application, you can do it.
For example: will you need to change your SQL queries every month?
If a user needs a report that covers more than one month of data, what happens?
With partitioning, the DBMS splits your data into multiple tables in physical storage, but you can still refer to all of them by the same name. The DBMS works out which partition it should read. Performance isn't significantly different.
Problem: We have a very big table, and it is growing. Most of its entries (say 80%) are historical data (with the "DATE" field before the current date) that are seldom queried, while a small part of it (say 20%) is current data ("DATE" field after the current date), and most queries search these current entries.
Consider two possible scenarios. Which one would be better, considering the overall implementation difficulty, performance, and so on?
A. Breaking the big table into two tables, Historical and Current. On a daily basis I move the records with an expired date from the Current table to the Historical table.
B. Keeping the records in one table (with the DATE field defined as INDEXED).
Scenario A would mean more hassle in implementation and maintenance, and extra load on a daily basis for moving data between tables, while scenario B would mean searching a big table (though indexed). Does it impose memory problems? Which scenario is recommended? Are there any other recommendations?
You usually don't want to break a big table into multiple tables, although having a current and historical table is totally reasonable. Your process makes sense. You can then optimize the current table for your query needs. I would probably go for two tables (given the limited information you provide), because it allows such optimization.
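The daily move from the current table to the historical table could look roughly like the sketch below. The table names current_data and historical_data, the column entry_date, and the function archive_expired are assumptions made for the example, and sqlite3 is used only so the code runs without a server. The main point is that the copy and the delete happen in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("current_data", "historical_data"):
    # Both tables share one structure, which keeps the copy a plain INSERT ... SELECT.
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, entry_date TEXT NOT NULL, payload TEXT)"
    )

conn.executemany(
    "INSERT INTO current_data (id, entry_date, payload) VALUES (?, ?, ?)",
    [(1, "2024-01-10", "a"), (2, "2024-01-15", "b"), (3, "2024-01-20", "c")],
)
conn.commit()

def archive_expired(conn, today):
    """Move rows dated before `today` from current_data to historical_data.

    Both statements run in one transaction, so a failure part-way through
    leaves neither a duplicated nor a lost row.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO historical_data SELECT * FROM current_data WHERE entry_date < ?",
            (today,),
        )
        moved = conn.execute(
            "DELETE FROM current_data WHERE entry_date < ?", (today,)
        ).rowcount
    return moved

moved = archive_expired(conn, "2024-01-15")
```

Dates are stored as ISO strings here so that string comparison matches date order; a real MySQL schema would more likely use a DATE column.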
However, don't split the historical data. Instead, use partitioning. See the documentation. One caveat: queries need to specify the partitioning key in the where clause to take advantage of the partitions. With a large table, this is typical anyway.
Question: is the historical data necessary for system functionality or are these records stored for other purposes (e.g. audits)? It may be time to clean house by moving the historical data to an archive.
In my experience, most systems with big data have historical tables. In most cases I have seen, the current data and the historical data have different user groups. The current data is used by the front-end users who deal with customers and their current or recent transactions. The historical data is usually used by groups who do not have to talk with customers or clients directly.
Do not worry much about implementation and maintenance, as I think your main consideration is performance. Implementation is a one-time job that will then run at a specified frequency (such as a weekly, monthly or yearly archival) once you have moved the programs into production. Maintenance is very small and you can largely forget about it once it is implemented. You just have to make sure that you test the programs thoroughly.
With normalized historical tables, the tables have the same structure and field names, which makes the data copy much easier. This way, one can simply do a join between the tables.
If you choose to not split the data, you will continue to add index after index. But somewhere down the road, you will still encounter the same issue again.
I have a huge amount of data that is generated over the period of a year, spanning many tables. However, keeping all this data in the same tables over the years is making queries slower. While I want to preserve the old data for the purpose of maintaining records and for some end user queries, it becomes less relevant when a new calendar year starts. Earlier, I tried having an archive table for each such table, where I would store data older than 1 year, but this approach causes the archive table to grow pretty big in a short time.
Would it be better to have separate tables altogether for every new calendar year? i.e.
myTable_2011
myTable_2012
myTable_2013
myTable_2014
...
thanks
You might use partitions. Maybe these links can give you an idea:
http://dev.mysql.com/tech-resources/articles/partitioning.html
How to partition a MySQL table based on char column?
http://refcardz.dzone.com/refcardz/database-partitioning
Regards
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 and 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
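A purge job of that kind can be very small. The sketch below assumes a table named rank_history with a day column holding ISO dates; those names, and the function purge_old_ranks, are invented for the example, and sqlite3 stands in for MySQL so that it runs as-is. A script like this could be run once a day from cron.

```python
import sqlite3
from datetime import date, timedelta

def purge_old_ranks(conn, today, keep_days=90):
    """Delete rank_history rows older than `keep_days` days; return rows removed."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute("DELETE FROM rank_history WHERE day < ?", (cutoff,))
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rank_history (
        username  TEXT    NOT NULL,
        day       TEXT    NOT NULL,   -- ISO date
        user_rank INTEGER NOT NULL,
        PRIMARY KEY (username, day)
    )
    """
)
conn.executemany(
    "INSERT INTO rank_history VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-10", 5),  # 91 days before the run date below
        ("alice", "2024-01-11", 4),  # exactly 90 days before: kept
        ("alice", "2024-04-10", 3),
    ],
)

removed = purge_old_ranks(conn, today=date(2024, 4, 10))
```

Because the job deletes by date rather than by row count, running it late or twice does no harm; the table simply settles back to at most 90 days per user.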
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are... counts, sales, whatever, all stored by employee + date of activity? Then you could keep your pre-aggregated rollups in a much smaller table for however far back in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important "Ranking" is when you have someone in the 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter???
I was wondering if somebody knows an elegant solution to the following:
Suppose I have a table that holds orders, with a bunch of data. So I'm at 1M records, and searches begin to take time. So I want to speed it up by archiving some data that is more than 3 years old - saving it into a table called orders-archive, and then purging them from the orders table. So if we need to research something or customer wants to pull older information - they still can, but 99% of the lookups are done on the orders no older than a year and a half - so there is no reason to keep looking through older data all the time. These move & purge operations can be then croned to be done on a weekly basis. I already did some tests and I know that I will slash my search times by about 4 times. So far so good, right?
However, I was thinking about how to implement lookups in the older archive, and the only reasonable thing I can think of is some sort of if-else: if not found in orders, do a search in orders-archive. However, I have about 20 tables that I want to archive and god knows how many searches / finds are done throughout the code that I don't want to modify. So I was wondering if there is an elegant rails-way solution to this problem, by extending a model somehow? Has anyone dealt with a similar case before?
Thank you.
MySQL 5.x can handle this natively using Horizontal Partitioning.
The basic idea behind partitioning is that you tell the database to store records in a certain range in a separate file. You can still query against all the records, but as long as you're querying only current records, the database engine won't be encumbered with all of the archived records.
You can use the order_date column or something similar as the cutoff for your partitions. This is the elegant solution.
Overview of Partitioning in MySQL
Otherwise, your if/else idea with dynamically generated queries seems about right. You can add year numbers after the archival tables and use reflection to build a list of tables, then have at it.
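For comparison, the if/else fallback can be kept in one place instead of being repeated at every call site. The sketch below is in Python with sqlite3 rather than Rails, purely to show the shape of the idea. The function find_order and the column customer are hypothetical, and the archive table is named orders_archive here because a hyphenated name would need quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("orders", "orders_archive"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES (2, 'recent customer')")
conn.execute("INSERT INTO orders_archive VALUES (1, 'old customer')")

def find_order(conn, order_id):
    """Look in orders first, then fall back to orders_archive; None if absent."""
    for table in ("orders", "orders_archive"):  # fixed list, never user input
        row = conn.execute(
            f"SELECT id, customer FROM {table} WHERE id = ?", (order_id,)
        ).fetchone()
        if row is not None:
            return row
    return None
```

Most lookups hit the small live table and return after one query; only misses pay for the second query against the archive.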