I'm looking for the best way to implement the following.
I have a MySQL query that analyses the performance of a company: it outputs sales, revenue, costs and so on, and ends with final Gross Profit and Net Profit figures. It works well, and managers can run it for any date range they choose - today, yesterday, this week, last month, or the 17th of three months ago... you get the point.
A problem arises, however, when some of the figures used in the report are variable and involve fluctuating external costs, such as overheads. I allow users to specify these costs and overheads in a settings table, and the performance query uses them to calculate its figures. But these variable figures represent the present, so they would bear no relevance if you wanted to look at the company's performance from X months or years in the past: today's overheads would be offset against historical figures, creating inaccuracy.
I thought of a couple of solutions.
I could allow the managers to set a date range to apply the overheads for. For example, for June 2011, the daily overhead was £2000, whereas in July 2011 the overhead is £2250.
Or I could save the performance report/query to another table, which would obviously have the variable figures locked in from the time it ran. This could even be automated with a crontab, perhaps run every night.
Which way would you recommend?
If I were you I would go with the first option, and create a table to store the dated overheads for specific date ranges. This would be much more flexible, letting you run any kind of query at any point in time against the pure, "virgin" data in your tables.
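A minimal sketch of that first option might look like this - table and column names here are assumptions for illustration, not from the original post:

```sql
-- Overheads stored with an effective date range, so historical
-- reports pick up the overhead that was in force at the time.
CREATE TABLE overhead_settings (
    id             INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    daily_cost     DECIMAL(10,2) NOT NULL,
    effective_from DATE NOT NULL,
    effective_to   DATE NULL            -- NULL = still in effect
);

-- Join each sale date to the overhead in force on that day
SELECT s.sale_date,
       SUM(s.revenue) - o.daily_cost AS daily_net
FROM   sales s
JOIN   overhead_settings o
       ON s.sale_date >= o.effective_from
      AND (o.effective_to IS NULL OR s.sale_date <= o.effective_to)
WHERE  s.sale_date BETWEEN '2011-06-01' AND '2011-07-31'
GROUP  BY s.sale_date, o.daily_cost;
```

Closing the old row's `effective_to` and inserting a new row whenever a manager changes the overhead keeps the full history intact.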
The second option, on the other hand, doesn't seem feasible to me: you can't possibly pre-calculate every query and report you might need, over every possible date range.
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created from this data that allows the end-user to filter by a date range. Let's assume the user selects the range D1 to D20. On making this selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the sum of the light-blue segments at the bottom, i.e. 11 days.
An approach I am considering is to store one row per employee per date for each leave. The only problem with this approach is that it will greatly increase the number of records in the fact table. Besides, there are other columns in the fact that will carry redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id, employee_id, hours, ...)
This will enable you to answer your original question by simply filtering on the Date dimension, and you will also be able to handle issues like leave credits and fractional-day leave usage.
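With one row per employee per leave day, the original question reduces to a count of distinct dates. A sketch, assuming conventional dimension names (`dim_date`, `full_date`) that may differ in your warehouse:

```sql
-- "On how many days was at least one employee on leave?"
-- One fact row per employee per leave day makes this a simple filter.
SELECT COUNT(DISTINCT f.date_id) AS leave_days
FROM   fact_attendance f
JOIN   dim_date d ON d.date_id = f.date_id
WHERE  d.full_date BETWEEN '2013-01-01' AND '2013-01-20';   -- D1..D20
```

The `DISTINCT` collapses overlapping leaves across employees, which is exactly the overlap problem the question asks about.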
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large disk savings.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
Suppose I have a table where website visitors' information is stored, with the following fields:
ID
visitor_id
visit_time (stored as milliseconds in UTC since '1970-01-01 00:00:00')
Millions of rows are in this table and it's still growing.
In that case, if I want to see a report (day vs. visitors) from any timezone, one solution is:
Solution #1:
Get the timezone of the report viewer (i.e. client)
Aggregate the data from this table considering the client's timezone
Show the result day wise
But in that case, performance will degrade. Another solution may be the following:
Solution #2:
Using Pre-aggregated tables / summary tables where client's timezone is ignored
But in either case there is a trade off between performance and correctness.
Solution #1 ensures correctness and Solution #2 ensures better performance.
I want to know what is the best practice in this particular scenario?
The issue of handling time comes up a fair amount when you get into distributed systems, users and matching events between various sources of data.
I would strongly suggest that you ensure all logging systems use UTC. This allows collection from any variety of servers (which are all hopefully kept synchronized with respect to their view of the current UTC time) located anywhere in the world.
Then, as requests come in, you can convert from the user's timezone to UTC. At that point you face the same decision: perform a real-time query, or access previously summarized data.
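For the real-time version, MySQL can do the conversion at query time. A sketch, assuming a `visits` table as described and using a fixed example offset (`'+05:30'`) where you would substitute the viewer's zone:

```sql
-- visit_time is UTC milliseconds since the epoch; convert to the
-- viewer's zone, then group by the resulting local calendar day.
SELECT DATE(CONVERT_TZ(FROM_UNIXTIME(visit_time / 1000),
                       '+00:00', '+05:30')) AS local_day,
       COUNT(DISTINCT visitor_id)           AS visitors
FROM   visits
GROUP  BY local_day;
```

If the timezone tables are loaded, named zones like `'Asia/Kolkata'` can be passed to `CONVERT_TZ` instead of offsets, which also handles daylight-saving shifts.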
Whether you want to aggregate the data in advance depends on a number of things: whether it lets you reduce the amount of data kept, how much processing is needed to support queries, how often queries will be run, and even the cost of building such a system versus the amount of use it will see.
With respect to best practices - keep the display characteristics (e.g. time zone) independent of the processing of the data.
If you haven't already, be sure you consider the lifetime of the data you are keeping. Will you need ten years of back data available? Hopefully not. Do you have a strategy for culling old data when it is no longer required? Do you know how much data you'll have if you store every record (estimate with various traffic growth rates)?
Again, a best practice for larger data sets is to understand how you are going to deal with the size and how you are going to manage that data over time as it ages. This might involve long term storage, deletion, or perhaps reduction to summarized form.
Oh, and to slip in a Matrix analogy, what is really going to bake your noodle in terms of "correctness" is the fact that correctness is not at issue here. Every timezone has a different view of traffic during a "day" in their own zone and every one of them is "correct". Even those oddball time zones that differ from yours by an adjustment that isn't measured only in hours.
I have a table for news articles, containing amongst others the author, the time posted and the word count for each article. The table is rather large, with more than one million entries, and it grows by about 10,000 entries each day.
Based on this data, a statistical analysis is done, to determine the total number of words a specific author has published in a specific time-window (i.e. one for each hour of each day, one for each day, one for each month) combined with an average for a time-span. Here are two examples:
Author A published 3,298 words on 2011-11-04, and an average of 943.2 words per day over the two months prior (from 2011-09-04 to 2011-11-03)
Author B published 435 words on 2012-01-21 between 1pm and 2pm, and an average of 163.94 words per day between 1pm and 2pm over the 30 days before
Current practice is to start a script via cron job at the end of each defined time-window, which calculates the counts and averages and stores them in a separate table per time-window (one for hourly windows, one for daily, one for monthly, etc.).
The calculation of sums and averages can easily be done in SQL, so I think Views might be a more elegant solution to this, but I don't know about the implications on performance.
Are Views an appropriate solution to the problem described above?
I think you can use materialized views for this. They're not really implemented in MySQL, but you can emulate them with tables.
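Emulating a materialized view usually means a summary table plus a scheduled refresh. A sketch, with illustrative table and column names (your `articles` schema will differ):

```sql
-- Summary table standing in for a materialized view
CREATE TABLE author_daily_words (
    author_id   INT UNSIGNED NOT NULL,
    pub_date    DATE         NOT NULL,
    total_words INT UNSIGNED NOT NULL,
    PRIMARY KEY (author_id, pub_date)
);

-- Refresh via MySQL's event scheduler (or keep using cron);
-- REPLACE makes the refresh idempotent for the covered day.
CREATE EVENT refresh_author_daily_words
ON SCHEDULE EVERY 1 DAY
DO
  REPLACE INTO author_daily_words (author_id, pub_date, total_words)
  SELECT author_id, DATE(posted_at), SUM(word_count)
  FROM   articles
  WHERE  posted_at >= CURRENT_DATE - INTERVAL 1 DAY
  GROUP  BY author_id, DATE(posted_at);
```

Averages over any span can then be computed cheaply from `author_daily_words` rather than the million-row base table.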
Views will not be equivalent to your denormalization.
If you move aggregate numbers somewhere else, that has a certain cost, which you pay in order to keep the data correct, and a certain benefit: much less data to look through when querying.
A view will save you from having to think too hard about the query each time you run it, but it will still need to scan the larger amount of data in the original tables.
While I'm not a fan of denormalization, since you already did it, I think the view will not help.
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely to generate a graph showing how a user's rank has changed over the past 30 days. Is there a better way of doing this than keeping this massive database that will grow forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
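If the stored datum really is just an integer position per user per day, the footprint is tiny. A sketch of such a minimal schema (column names are illustrative; `rank_pos` avoids the reserved word `RANK` in newer MySQL versions):

```sql
CREATE TABLE RankHistory (
    user_id   INT UNSIGNED       NOT NULL,
    rank_date DATE               NOT NULL,
    rank_pos  MEDIUMINT UNSIGNED NOT NULL,  -- position 1..~200,000
    PRIMARY KEY (user_id, rank_date)        -- also serves the graph query
);
```

At roughly 9 bytes of payload per row, twenty million rows is on the order of a few hundred megabytes including index overhead - not a strain for a modern server.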
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
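That pruning job is a one-line query; it could run from cron or from MySQL's own event scheduler. A sketch, assuming `RankHistory` has a DATE column (called `rank_date` here for illustration):

```sql
-- Nightly prune: drop anything older than the 90-day window
CREATE EVENT prune_rank_history
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM RankHistory
  WHERE rank_date < CURRENT_DATE - INTERVAL 90 DAY;
```

On a table this size it may be worth deleting in batches (`LIMIT` in a loop) to keep lock times short.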
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are - counts, sales, whatever - stored by user + date of activity? Then you could keep your pre-aggregated rollups in a much smaller table for however long a history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Your queries and graphs can then go against that without performance issues, and it would also ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing, and you STILL have 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users?
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
With 222k+ users, I would seriously consider how important "ranking" is when someone is in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter?
I've created an affiliate system that tracks leads and conversions. The lead and conversion records will run into the millions, so I need a good way to store them. Users will need to track the stats hourly, daily, weekly and monthly.
What's the best way to store the leads and conversions?
For this type of system, you need to keep all of the detail records. Reason being at some point someone is going to contest an invoice.
However, you should have some roll-up tables. Each hour, compute the current hour's totals and store the results. Do the same daily, weekly, and monthly.
If some skew is okay, you can compute the daily amounts from the 24 hourly records, and the weekly amounts from the last 7 daily records. For monthly you might want to compute back off the hourly records, because a month doesn't quite add up to 4 full weeks. It also helps reduce noise from any averaging you might be doing.
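One step of that roll-up chain might look like this - a daily total computed from the hourly summary table rather than from the raw detail rows (all names here are assumptions):

```sql
-- Roll yesterday's 24 hourly rows up into one daily row per affiliate
INSERT INTO daily_stats (affiliate_id, stat_date, leads, conversions)
SELECT affiliate_id,
       DATE(stat_hour),
       SUM(leads),
       SUM(conversions)
FROM   hourly_stats
WHERE  stat_hour >= CURRENT_DATE - INTERVAL 1 DAY
  AND  stat_hour <  CURRENT_DATE
GROUP  BY affiliate_id, DATE(stat_hour);
```

Weekly and monthly jobs follow the same pattern, each reading the next table down the chain.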
I'd recommend a two step archival process. The first one should run once a day and move the records into a separate "hot" database. Try to keep 3 months hot for any type of research queries you need to do.
The second archive process is up to you. You could simply move any records older than 3 months into some type of csv file and simply back it up. After some period of time (a year?) delete them depending on your data retention agreements.
Depending on the load, you may need to have multiple web servers handling the lead and conversion pixels firing. One option is to store the raw data records on each web/mysql server, and then run an archival process every 5-10 minutes that stores them in a highly normalized table structure, and which performs any required roll-ups to achieve the performance you are looking for.
Make sure you keep the row size as small as possible: store IPs as unsigned ints, store referrers as INTs referencing lookup tables, etc.
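For the IP part, MySQL's built-ins do the IPv4-to-integer conversion for you, so the column can be a 4-byte `INT UNSIGNED` instead of a 15-character string:

```sql
-- Store:   ip INT UNSIGNED NOT NULL
SELECT INET_ATON('192.168.1.10');   -- 3232235786
SELECT INET_NTOA(3232235786);       -- '192.168.1.10'
```

The same trick applies to any repeated string (user agents, referrer URLs): keep an INT foreign key in the detail table and the text once in a lookup table.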