I have a table for news articles, containing, among other fields, the author, the time posted and the word count for each article. The table is rather large, with more than one million entries, and it grows by about 10,000 entries each day.
Based on this data, a statistical analysis is done to determine the total number of words a specific author has published in a specific time window (one for each hour of each day, one for each day, one for each month), combined with an average over a longer time span. Here are two examples:
Author A published 3298 words on 2011-11-04 and 943.2 words on average per day over the two months prior (from 2011-09-04 to 2011-11-03)
Author B published 435 words on 2012-01-21 between 1pm and 2pm, and an average of 163.94 words per day between 1pm and 2pm over the 30 days before
Current practice is to start a script via a cron job at the end of each defined time window; it calculates the counts and averages and stores them in a separate table per window type (one for the hourly windows, one for the daily ones, one for the monthly ones, and so on).
The calculation of sums and averages can easily be done in SQL, so I think views might be a more elegant solution, but I don't know about the performance implications.
Are Views an appropriate solution to the problem described above?
I think you can use materialized views for this. MySQL does not implement them natively, but you can emulate them with ordinary tables that you refresh yourself.
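For example, here is a minimal sketch of that approach; the articles table and every column name below are assumptions, not taken from your schema:

-- Summary table holding one row per author per day
CREATE TABLE author_daily_words (
    author_id   INT UNSIGNED NOT NULL,
    summary_day DATE         NOT NULL,
    total_words INT UNSIGNED NOT NULL,
    PRIMARY KEY (author_id, summary_day)
);

-- Refresh run by cron (or a MySQL EVENT) shortly after midnight;
-- REPLACE makes the job safe to re-run for the same day.
REPLACE INTO author_daily_words (author_id, summary_day, total_words)
SELECT author_id, DATE(posted_at), SUM(word_count)
FROM articles
WHERE posted_at >= CURDATE() - INTERVAL 1 DAY
  AND posted_at <  CURDATE()
GROUP BY author_id, DATE(posted_at);

Averages over a longer span (for example the previous two months) can then be computed from this much smaller table instead of from the raw articles.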
Views will not be equivalent to your denormalization.
If you move aggregate numbers somewhere else, that has a cost, which you pay in order to keep the data correct, and a benefit, which is much less data to look through when querying.
A view will save you from having to think too hard about the query each time you run it, but it will still have to scan the larger amount of data in the original tables.
While I'm not a fan of denormalization, since you have already done it, I don't think a view will help here.
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have two dimensions, DimEmployee and DimDate, and a fact table called FactAttendance. The records in this fact table are stored as follows:
To represent this graphically —
A report needs to be created from this data that will allow the end user to filter it by selecting a date range. Let's assume the user selects the range D1 to D20. On making this selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the sum of the light-blue segments at the bottom, i.e. 11 days.
An approach I am considering is to store one row per employee per date for each leave. The only problem with this approach is that it will drastically increase the number of records in the fact table. Besides, other columns in the fact table will end up holding redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id, employee_id, hours, ...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
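For instance, the original date-range question could then be answered roughly like this; dim_date.full_date, the join columns, and the literal dates standing in for D1 to D20 are all assumptions:

-- Number of days in the selected range on which at least one employee
-- has a leave row in the fact (add a filter on your leave type/measure as needed)
SELECT COUNT(DISTINCT d.full_date) AS days_with_leave
FROM fact_attendance f
JOIN dim_date d ON d.date_id = f.date_id
WHERE d.full_date BETWEEN '2017-01-01' AND '2017-01-20';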
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage (probably not a real worry), use a DBMS with columnar compression, and you'll see large savings on disk.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
I have maybe 10 to 20 million detail records coming in per day (statistical and performance data) that must be read in and summarized into 24 hourly summary records and 1 daily summary record.
The process calculates averages on several fields and takes the max and min values of others; nothing significant CPU-wise.
Is it better to:
A) Summarize the detail records into the summary records as they come in, delaying each detail-record insert slightly? I assume there would be a lot of locking (SELECT ... FOR UPDATE and the like) on the summary tables, as several different systems import data.
B) Wait until the hour is over, then select the entire previous hour's data and create the summary records? Users would see the statistics with some delay, but the detail records would be available in the meantime.
Perhaps there are alternative methods to this?
Just create views for the summary tables. All your inserts will work as usual; simply define views for whatever summaries you need, and they will stay in sync with the main tables automatically.
You can define them on the hourly (24 per day) and daily basis as well. Views are stored queries that produce a result set when invoked; a view acts as a virtual table.
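As a rough sketch of such a view (the detail table and column names are assumptions, not from your schema):

CREATE VIEW hourly_summary_view AS
SELECT DATE(recorded_at) AS summary_day,
       HOUR(recorded_at) AS summary_hour,
       AVG(metric_value) AS avg_value,
       MAX(metric_value) AS max_value,
       MIN(metric_value) AS min_value
FROM detail_records
GROUP BY DATE(recorded_at), HOUR(recorded_at);

Note that the view is not materialised: each query against it re-runs the grouping over the detail rows.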
For more details about views, see: http://dev.mysql.com/doc/refman/5.0/en/create-view.html
Let me know if you want further assistance with MySQL views.
It'd depend on the load required to run the single update, but I'd probably go with a separate summary run. I'd wager that a single batch update would take less time overall than the cumulative cost of the on-every-insert approach.
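A minimal sketch of such a separate hourly run, assuming a detail_records table and an hourly_summary table (all names illustrative):

-- Run once per hour, just after the hour closes, via cron or a MySQL EVENT
INSERT INTO hourly_summary (summary_day, summary_hour, avg_value, max_value, min_value, row_count)
SELECT DATE(recorded_at), HOUR(recorded_at),
       AVG(metric_value), MAX(metric_value), MIN(metric_value), COUNT(*)
FROM detail_records
WHERE recorded_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND recorded_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY DATE(recorded_at), HOUR(recorded_at);

The daily record can then be built from the 24 hourly rows rather than from the detail table.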
I'm looking for the best way to implement the following.
I have a MySQL query that analyses the performance of a company and outputs sales, revenue, costs and so on, ending with final Gross Profit and Net Profit figures. It's working well: managers can select any date range to run it for, and it will output everything for that range, so in theory they can see how the company performed today, yesterday, this week, last month, or on the 17th three months ago... you get the point.
A problem arises, however, when some of the figures used for the report are variable and involve fluctuating external costs, such as overheads. I allow users to specify these costs and overheads in a settings table, and the performance query uses them to calculate its figures. But these variable figures represent the present, so they bear no relevance when looking at the company's performance from X months or years in the past: today's overheads would be offset against the old figures, creating inaccuracy.
I thought of a couple of solutions.
I could allow the managers to set a date range that each overhead applies to. For example, for June 2011 the daily overhead was £2000, whereas in July 2011 it is £2250.
Or I could save the performance report/query results to another table, which would lock in the variable figures as of the time it ran. This could even be automated with a crontab entry and perhaps run every night.
Which way would you recommend?
If I were you I would go with the first option, and create a table that stores the overheads that applied on specific dates. This is much more flexible, because you can run any kind of query at any point in time against the pure, unmodified data in your main tables.
On the other hand, the second option doesn't seem feasible to me, because you can't possibly pre-calculate every query and report that might be needed for every possible date range.
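As a rough sketch of the first option (all table and column names below are illustrative, not from your schema):

-- One row per overhead period; daily_overhead applies to every day in the range
CREATE TABLE overhead_periods (
    period_start   DATE          NOT NULL,
    period_end     DATE          NOT NULL,
    daily_overhead DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (period_start)
);

-- In the performance query, pick up the overhead that was in force on each day
SELECT s.sale_date,
       SUM(s.revenue) - o.daily_overhead AS daily_net
FROM sales s
JOIN overhead_periods o
  ON s.sale_date BETWEEN o.period_start AND o.period_end
GROUP BY s.sale_date, o.daily_overhead;

As long as the periods don't overlap, every sale date matches exactly one overhead row, and historical reports automatically use the overhead that applied at the time.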
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: if you have hundreds of thousands of users and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data; there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and that rank is simply the user's numeric position among all users (an integer between 1 and 200,000, for example), then storing twenty million such records should not put unreasonable strain on your database resources.
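To give a sense of scale, a table along these lines (names are illustrative) keeps each daily data point very small:

CREATE TABLE rank_history (
    user_id   INT UNSIGNED       NOT NULL,
    rank_date DATE               NOT NULL,
    user_rank MEDIUMINT UNSIGNED NOT NULL,  -- supports ranks up to about 16 million
    PRIMARY KEY (user_id, rank_date)
);

Twenty million rows of that shape stay well under the scale at which disk space becomes the main concern.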
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance with indexes up to a certain point, beyond which the truncation and partitioning ideas mentioned above come into play (keeping blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility, though the specifics are somewhat beyond my area of expertise: you seem to have an ideal candidate for an OLAP cube here. You have a fact (rank) that you want to view in the context of two dimensions (user and date), and there are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
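For example, assuming a rank_date column (names are illustrative), either a daily cron job or MySQL's event scheduler could do the purge:

-- Daily cron job: remove ranking rows older than 90 days
DELETE FROM rank_history
WHERE rank_date < CURDATE() - INTERVAL 90 DAY;

-- Or the same thing with MySQL's event scheduler (requires event_scheduler = ON)
CREATE EVENT purge_old_ranks
  ON SCHEDULE EVERY 1 DAY
  DO DELETE FROM rank_history
     WHERE rank_date < CURDATE() - INTERVAL 90 DAY;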
Another option: you could create "roll-up" aggregates per user based on whatever the criteria are (counts, sales, whatever), all stored by user plus date of activity. You could then keep your pre-aggregated roll-ups in a much smaller table for however far back in history you need. Triggers or a nightly procedure can run a query for the day and append the results to the daily summary. Your queries and graphs can then go against that table without performance problems, and it would also make it easier to move old records into a historical archive.
Uh... oops... that sounds like what you WERE doing, and you STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users?
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
With 222k+ users, I would seriously question how important the ranking really is for someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know how important it is, but if someone doesn't make the top 1,000, does it really matter?
I've created an affiliate system that tracks leads and conversions. The lead and conversion records will run into the millions, so I need a good way to store them. Users will need to track the stats hourly, daily, weekly and monthly.
What's the best way to store the leads and conversions?
For this type of system you need to keep all of the detail records, the reason being that at some point someone is going to contest an invoice.
However, you should also have some roll-up tables. Each hour, compute the current hour's totals and store the results; do the same daily, weekly, and monthly.
If some skew is acceptable, you can compute the daily amounts from the 24 hourly records and the weekly amounts from the last 7 daily records. For the monthly figures you might want to go back to the hourly records, because a month doesn't quite add up to 4 full weeks. Rolling up also helps reduce noise in any averaging you might be doing.
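For instance, the daily figures could be built from the 24 hourly rows roughly like this (the hourly_affiliate_stats and daily_affiliate_stats tables and their columns are assumptions):

-- Nightly job: yesterday's daily totals from the hourly roll-up, not the raw detail
INSERT INTO daily_affiliate_stats (affiliate_id, stat_day, leads, conversions)
SELECT affiliate_id,
       DATE(stat_hour),
       SUM(leads),
       SUM(conversions)
FROM hourly_affiliate_stats
WHERE stat_hour >= CURDATE() - INTERVAL 1 DAY
  AND stat_hour <  CURDATE()
GROUP BY affiliate_id, DATE(stat_hour);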
I'd recommend a two-step archival process. The first step should run once a day and move the records into a separate "hot" database; try to keep 3 months of data hot for any research queries you need to run.
The second step is up to you. You could move any records older than 3 months into some kind of CSV file and back it up, then after some period of time (a year?) delete them, depending on your data-retention agreements.
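The CSV step can be as simple as MySQL's SELECT ... INTO OUTFILE; a sketch, with the table, columns and path all assumed:

-- Export rows older than 3 months to a CSV file on the database server
-- (requires the FILE privilege and a path permitted by secure_file_priv),
-- then remove them from the hot database.
SELECT * FROM affiliate_events
WHERE created_at < CURDATE() - INTERVAL 3 MONTH
INTO OUTFILE '/var/backups/affiliate_events_archive.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

DELETE FROM affiliate_events
WHERE created_at < CURDATE() - INTERVAL 3 MONTH;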
Depending on the load, you may need multiple web servers handling the lead and conversion pixels firing. One option is to store the raw data records on each web/MySQL server and then run an archival process every 5-10 minutes that moves them into a highly normalized table structure and performs any roll-ups required to achieve the performance you are looking for.
Make sure you keep the row size as small as possible: store IPs as unsigned ints, store referrers as INTs that reference lookup tables, and so on.
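For example, a compact raw-event row could look something like this (a sketch; every name here is an assumption):

-- Narrow rows: IPs via INET_ATON(), referrers via an INT pointing at a lookup table
CREATE TABLE affiliate_events (
    event_id     BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    affiliate_id INT UNSIGNED     NOT NULL,
    event_type   TINYINT UNSIGNED NOT NULL,  -- e.g. 1 = lead, 2 = conversion
    ip_address   INT UNSIGNED     NOT NULL,  -- store INET_ATON('203.0.113.9'), read back with INET_NTOA()
    referrer_id  INT UNSIGNED     NOT NULL,  -- references a referrer lookup table
    created_at   DATETIME         NOT NULL
);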