Day wise aggregation considering client's timezone from millions of rows

Suppose I have a table that stores information about website visitors, with the following fields:
ID
visitor_id
visit_time (stored as milliseconds in UTC since '1970-01-01 00:00:00')
Millions of rows are in this table and it's still growing.
In that case, if I want to see a report (day vs. visitors) from any timezone, one solution is:
Solution #1:
Get the timezone of the report viewer (i.e. client)
Aggregate the data from this table considering the client's timezone
Show the result day wise
But in that case performance will degrade. Another solution may be the following:
Solution #2:
Using Pre-aggregated tables / summary tables where client's timezone is ignored
But in either case there is a trade off between performance and correctness.
Solution #1 ensures correctness and Solution #2 ensures better performance.
What is the best practice in this particular scenario?

The issue of handling time comes up a fair amount when you get into distributed systems, users and matching events between various sources of data.
I would strongly suggest that you ensure all logging systems use UTC. This allows collection from any variety of servers (which are all hopefully kept synchronized with respect to their view of the current UTC time) located anywhere in the world.
Then, as requests come in, you can convert from the user's time zone to UTC. At this point you have the same decision -- perform a real-time query or perhaps access some data previously summarized.
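As a rough sketch of that conversion (Python is used only for illustration; the function name is made up, and a fixed UTC offset stands in for a real time zone, so daylight saving is ignored), a client's calendar day can be turned into a half-open range of UTC milliseconds that matches how visit_time is stored:

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def local_day_to_utc_ms(day: date, tz: tzinfo) -> tuple[int, int]:
    """Return [start, end) of a client's calendar day as UTC epoch milliseconds."""
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    one_ms = timedelta(milliseconds=1)
    # Dividing timedeltas with // keeps the result an exact integer.
    return (start - EPOCH) // one_ms, (end - EPOCH) // one_ms


# Hypothetical client at UTC+05:30 asking about 15 January 2024.
client_tz = timezone(timedelta(hours=5, minutes=30))
start_ms, end_ms = local_day_to_utc_ms(date(2024, 1, 15), client_tz)
# The range can then drive a filter such as:
#   WHERE visit_time >= start_ms AND visit_time < end_ms
```

Any tzinfo object can be passed in, so a named zone such as zoneinfo.ZoneInfo("Asia/Kolkata") could replace the fixed offset where DST matters.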
Whether or not you want to aggregate the data in advance will depend on a bunch of things. Some of these might entail the ability to reduce the amount of data kept, reducing the amount of processing to support queries, how often queries will be performed or even the cost of building a system versus the amount of use it might see.
With respect to best practices -- keep the display characteristics (e.g. time zone) independent from the processing of the data.
If you haven't already, be sure you consider the lifetime of the data you are keeping. Will you need ten years of back data available? Hopefully not. Do you have a strategy for culling old data when it is no longer required? Do you know how much data you'll have if you store every record (estimate with various traffic growth rates)?
Again, a best practice for larger data sets is to understand how you are going to deal with the size and how you are going to manage that data over time as it ages. This might involve long term storage, deletion, or perhaps reduction to summarized form.
Oh, and to slip in a Matrix analogy, what is really going to bake your noodle in terms of "correctness" is the fact that correctness is not at issue here. Every timezone has a different view of traffic during a "day" in their own zone and every one of them is "correct". Even those oddball time zones that differ from yours by an adjustment that isn't measured only in hours.
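To make that last point concrete, here is a small sketch (hypothetical function names and sample visits; fixed offsets are used instead of named zones, so daylight saving is not modelled) that buckets the same UTC visits into days for three different viewers. The three reports differ, and each one is right for its own zone.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone, tzinfo

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def visitors_per_day(visits, tz: tzinfo) -> dict[str, int]:
    """Count distinct visitors per calendar day as seen from time zone tz."""
    days = defaultdict(set)
    for visitor_id, visit_time_ms in visits:
        local = (EPOCH + timedelta(milliseconds=visit_time_ms)).astimezone(tz)
        days[local.date().isoformat()].add(visitor_id)
    return {day: len(ids) for day, ids in sorted(days.items())}


def utc_ms(*parts: int) -> int:
    """Helper for the sample data: UTC wall-clock fields -> epoch milliseconds."""
    return (datetime(*parts, tzinfo=timezone.utc) - EPOCH) // timedelta(milliseconds=1)


# Made-up rows of (visitor_id, visit_time in UTC milliseconds).
visits = [
    (1, utc_ms(2024, 1, 14, 20, 0)),
    (2, utc_ms(2024, 1, 15, 2, 0)),
    (1, utc_ms(2024, 1, 15, 2, 30)),
]

utc_view = visitors_per_day(visits, timezone.utc)
india_view = visitors_per_day(visits, timezone(timedelta(hours=5, minutes=30)))
pacific_view = visitors_per_day(visits, timezone(timedelta(hours=-8)))
```

With this sample the UTC viewer sees visits on two days, while the UTC+05:30 and UTC-08:00 viewers each see a single, different day.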

Related

Optimal way of storing performance data for statistics (graphs)

Currently I'm working on a dashboard in PHP/MySQL which contains several statistics/facts such as: amount of items sold, revenue, gender (male/female) ratio of users etc. (all filterable on last week/month/year). The amount of data is (currently) not that much: 20.000 user rows, 1.000 items, 500 items sold per day but is expected to grow in the future, perhaps even exponentially.
Now, there is a wish to have several graphs displaying the performance to see whether strategy changes have impacts on the amount of users, revenue, gender ratio etc. For this, it is necessary to have numbers per day. Currently, the dashboard can only display "NOW() - 1 week/1 month/1 year" but for showing a graph outlining the growth, these numbers should be saved on a daily basis.
My question is: what are the options in this case? A cronjob could be set in place to save these numbers and write them to a separate 'performance' or 'history' table that saves the visitors, sales, gender ratio etc. in rows linked to the date of that day. This is good for performance, but certain data gets lost. Another option is to compute these numbers with complex queries (group by day) etc, but that seems too intensive since the queries are performed on the production database. Especially since the database structure is a little complex. Thinking of avoiding doing this on the production database, is setting up a data-warehouse with ETL-processes a better option to avoid overloading the production database? In that case the data would not be displayed live.
I honestly have no idea what is the best option in this case. I'm very curious about the answers! Many thanks.

Running queries on a production database (especially one that is growing in volume and complexity) becomes a losing proposition very quickly. There are a lot of possible alternatives; basically the entire field of Business Intelligence grew up as a solution to this problem.
For a small system where you just want to avoid querying the production database, developing a full-blown data warehouse is probably overkill. It is impossible to give a reasonable answer without knowing more, but I would go for one of the following (in increasing order of complexity and quality of result):
1. Instead of showing the result of the query directly, save it in a table and query the table.
2. Clone your production database, then query the clone.
3. Extract the relevant data from the production database into a structure that stores it and preserves history (google Data Vault).
4. Directly over the production DB, or over solution 2 or 3, build a dimensional model (google Kimball Dimensional Model). Note that to do a good job you have to consider what kind of queries you want to run. You could end up with different designs for different requirements.
It also matters which technology you are using and what options are available in your architecture. Depending on what you have on hand, some solutions, even complex ones, could be greatly simplified. Do some research.
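The simplest option in that list (save the query result in a table, then query the table) might look roughly like this. It is only a sketch: Python's built-in sqlite3 stands in for MySQL, and the sales and daily_sales tables and their columns are invented for the example.

```python
import sqlite3


def refresh_daily_sales(conn: sqlite3.Connection) -> None:
    """Rebuild the per-day summary from the (hypothetical) raw sales table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_sales (
               day           TEXT PRIMARY KEY,
               items_sold    INTEGER NOT NULL,
               revenue_cents INTEGER NOT NULL)"""
    )
    # One grouped query does the heavy lifting; the dashboard never runs it.
    conn.execute(
        """INSERT OR REPLACE INTO daily_sales (day, items_sold, revenue_cents)
           SELECT date(sold_at), COUNT(*), SUM(price_cents)
           FROM sales
           GROUP BY date(sold_at)"""
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, price_cents INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-14 09:15:00", 1500),
        ("2024-01-14 17:40:00", 2500),
        ("2024-01-15 11:05:00", 1000),
    ],
)

refresh_daily_sales(conn)  # in practice this would run from a cron job
report = conn.execute(
    "SELECT day, items_sold, revenue_cents FROM daily_sales ORDER BY day"
).fetchall()
```

INSERT OR REPLACE is SQLite syntax; in MySQL the same idea is usually written with REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE.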

Store overlapping date ranges to be filtered using a custom date range

I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created out of this data, that will allow the end-user to filter it by making a selection of a date range. Let's assume user selects date range D1 to D20. On making this selection, the user should see the value for how many days at least one of the employees was on leave. In this particular example, I should see the addition of light-blue segments in the bottom i.e. 11 days.
An approach that I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will exponentially increase the number of records in the fact table. Besides, there are other columns in the fact that will have redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?

Consider modelling your fact like this:
fact_attendance (date_id,employee_id,hours,...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
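As an illustration of the one-row-per-employee-per-date layout, here is a small sketch using Python's sqlite3. The leave periods are made up (they are not the ones from the question's figure), date_id is simplified to a day number D1..D20, and hours is assumed to mean hours of leave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_attendance (date_id INTEGER, employee_id INTEGER, hours REAL)"
)

# Hypothetical leave periods: (employee_id, first_day, last_day), inclusive.
leaves = [(1, 1, 5), (2, 3, 8), (3, 15, 17)]

# Expand each period into one fact row per employee per date.
rows = [
    (day, employee_id, 8.0)
    for employee_id, first_day, last_day in leaves
    for day in range(first_day, last_day + 1)
]
conn.executemany("INSERT INTO fact_attendance VALUES (?, ?, ?)", rows)


def days_with_leave(conn: sqlite3.Connection, first_day: int, last_day: int) -> int:
    """Number of days in the range on which at least one employee was on leave."""
    (count,) = conn.execute(
        "SELECT COUNT(DISTINCT date_id) FROM fact_attendance "
        "WHERE date_id BETWEEN ? AND ?",
        (first_day, last_day),
    ).fetchone()
    return count
```

Overlapping leaves need no special handling here: COUNT(DISTINCT date_id) counts a day once no matter how many employees were away, and the user's date range becomes a plain filter.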

Recurring data demand - automated query, or store data directly in SQL?

This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have scripting language do this for me after querying the SQL database - but that seems horribly inefficient, perhaps.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store them in a separate table? The advantage of this is that it would reduce querying time of any user, since all the query work is done before a website or button is even loaded/ pushed. Even with a view, I guess that aggregation calculation would have to be done as soon as the view was queried.
The only downside to having the queries aggregated from the weeks/ months perhaps even once a day (instead of every time the website is loaded) -- is that it won't be up-to-date/ may reflect inconsistencies.
I'm not really an expert when it comes to this bigger picture stuff -- anyone have any thoughts? thanks

It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key when watching this month's statistics? To cover this scenario, you might want to have a view with criteria that presents a window always relative to getdate(). Keep in mind that good indexing strategies and query design should reduce the impact of this sort of approach to nearly nothing.
Is the user expecting informational data that doesn't include today's data? More performance might be seen out of a nightly job that does the aggregation into a new table.
Of all the scenarios, though, I would not recommend manual aggregation. Down that road are unexpected bugs and exceptions that a good SQL statement would handle for you. Aggregates are a big part of every DBMS; let the software handle that and work on the rest of your application.
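The difference between the two database-side options can be sketched like this (Python's sqlite3 is used as a stand-in for whatever DBMS you run; the table, view and column names are invented). The view is recomputed on every read and is always current; the snapshot table is as fresh as the last job run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE daily_metrics (day TEXT, metric TEXT, value INTEGER);

    -- Option A: a view, aggregated by the database each time it is queried.
    CREATE VIEW monthly_metrics AS
        SELECT strftime('%Y-%m', day) AS month, metric, SUM(value) AS total
        FROM daily_metrics
        GROUP BY strftime('%Y-%m', day), metric;
    """
)
conn.executemany(
    "INSERT INTO daily_metrics VALUES (?, ?, ?)",
    [
        ("2024-01-30", "signups", 10),
        ("2024-01-31", "signups", 5),
        ("2024-02-01", "signups", 7),
    ],
)


def monthly(conn: sqlite3.Connection, source: str):
    # source is one of the two fixed object names used below, never user input.
    return conn.execute(
        f"SELECT month, total FROM {source} WHERE metric = 'signups' ORDER BY month"
    ).fetchall()


# Option B: a nightly job copies the aggregate into a plain table.
conn.execute("CREATE TABLE monthly_snapshot AS SELECT * FROM monthly_metrics")

# New data arrives after the job has run.
conn.execute("INSERT INTO daily_metrics VALUES ('2024-02-02', 'signups', 3)")

live = monthly(conn, "monthly_metrics")    # includes the new row
stale = monthly(conn, "monthly_snapshot")  # still shows last night's totals
```

Either way the aggregation itself stays in SQL rather than in application code.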

Should i recalculate big amounts of data from tables, or should I save it in my database?

My question is more general than specific, yet I am using an example to transfer the idea.
I have a forum, and in each reply I present the number of messages the users have.
Assuming that in some pages there are 15 different users, each has over 20,000 messages, should I recalculate the number of messages by counting how many entries in the messages table the user has, or would it be better to create a column in the users table that contains this data, and update the column every time a reply is made?
I know it defies the database normalizations rules, but it seems like a big waste to calculate it every time.
I'm using MySQL, if it matters.

Generally no, but in some specific cases, yes.
You should avoid having redundant data in a database. However, sometimes you have to make that tradeoff to get a decent performance.
I have actually done exactly the thing in your example. It works great for the performance, but it's really hard to keep the message count correct. You will get some inconsistent values sooner or later, so you need a plan for how to go through the values periodically and recalculate them.
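A periodic recalculation of that kind can be a single statement. The sketch below uses Python's sqlite3 in place of MySQL, with made-up users and messages tables and a message_count column standing in for the redundant counter.

```python
import sqlite3


def recalculate_message_counts(conn: sqlite3.Connection) -> None:
    """Overwrite every stored counter with the true number of message rows."""
    conn.execute(
        """UPDATE users
           SET message_count = (SELECT COUNT(*) FROM messages
                                WHERE messages.user_id = users.id)"""
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        message_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);

    -- User 1's counter has drifted to 5; user 2's was never incremented.
    INSERT INTO users (id, message_count) VALUES (1, 5), (2, 0);
    INSERT INTO messages (user_id) VALUES (1), (1), (1), (2);
    """
)

recalculate_message_counts(conn)  # e.g. run from a nightly job
counts = conn.execute("SELECT id, message_count FROM users ORDER BY id").fetchall()
```

On a large forum you would probably restrict the update to recently active users, but the principle is the same.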

You are talking about denormalization. Quoting Wikipedia:
"denormalization is the process of attempting to optimise the read performance of a database by adding redundant data or by grouping data."
Keeping denormalized data consistent in 'plain' code is not easy. Remember that:
You can maintain redundant data with triggers.
If your architecture includes an ORM, it is easier to maintain redundant data.
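For the trigger route, a minimal sketch could look like the following. It runs on Python's sqlite3 so it can be tried without a server; MySQL trigger syntax differs slightly (for example it requires FOR EACH ROW), and the table, column and trigger names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        message_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);

    -- Keep the redundant counter in step with inserts...
    CREATE TRIGGER messages_after_insert AFTER INSERT ON messages
    BEGIN
        UPDATE users SET message_count = message_count + 1
        WHERE id = NEW.user_id;
    END;

    -- ...and with deletes.
    CREATE TRIGGER messages_after_delete AFTER DELETE ON messages
    BEGIN
        UPDATE users SET message_count = message_count - 1
        WHERE id = OLD.user_id;
    END;

    INSERT INTO users (id) VALUES (1);
    INSERT INTO messages (user_id) VALUES (1), (1), (1);
    DELETE FROM messages WHERE id = 1;
    """
)

(count,) = conn.execute(
    "SELECT message_count FROM users WHERE id = 1"
).fetchone()
```

The application code only inserts and deletes messages; the database keeps the counter up to date.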

You could also go halfway in your denormalisation: have a table with monthly data per user, filled by a monthly job, and calculate the number of messages on the fly by counting the messages since the 1st of the month and adding the sum of the monthly data. Or, if you don't need the monthly data, you can still calculate on the fly over the current month and add a monthly process that updates the end-of-month figures. That will avoid triggers...
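A sketch of that halfway approach, again with Python's sqlite3 standing in for MySQL and with invented table and column names (monthly_counts holds one row per user per closed month):

```python
import sqlite3


def message_count(conn: sqlite3.Connection, user_id: int, month_start: str) -> int:
    """Closed months come from the summary table; the open month is counted live."""
    (total,) = conn.execute(
        """SELECT COALESCE((SELECT SUM(msg_count) FROM monthly_counts
                            WHERE user_id = :u AND month < :m), 0)
                + (SELECT COUNT(*) FROM messages
                   WHERE user_id = :u AND posted_at >= :m)""",
        {"u": user_id, "m": month_start},
    ).fetchone()
    return total


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE monthly_counts (user_id INTEGER, month TEXT, msg_count INTEGER);
    CREATE TABLE messages (user_id INTEGER, posted_at TEXT);

    -- Filled by the monthly job (made-up figures).
    INSERT INTO monthly_counts VALUES (1, '2023-12-01', 120), (1, '2024-01-01', 80);

    INSERT INTO messages VALUES
        (1, '2024-01-20 10:00:00'),  -- already included in January's summary
        (1, '2024-02-01 08:00:00'),
        (1, '2024-02-03 12:00:00'),
        (2, '2024-02-02 09:30:00');
    """
)

user1_total = message_count(conn, 1, "2024-02-01")
user2_total = message_count(conn, 2, "2024-02-01")
```

The live COUNT(*) only ever scans the current month's rows, which is what keeps it cheap.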

I'm surprised nobody has mentioned materialized views. These objects are very helpful when it comes to maintaining aggregates of data for performance reasons without violating the normalisation of our actual data.

Have you tried to benchmark the results of counting the number of rows?
I'd recommend you just do your calculation in a view. With the denormalization you're proposing, you're just exposing yourself to the risk of data corruption. The post count column will then end up with some arbitrary value that's got nothing to do with the real number of posts.

dynamic or pre calculate data

I'm a bit new to programming and have a general question that I just thought of.
Say I have a database with a bunch of stock information, with one column for price and another for earnings. To get the price/earnings ratio, would it be better to calculate it every day or to calculate it on demand? I think that performance-wise it would be quicker to just read a stored value, but I'm wondering whether, for math-type functions, it's worth the batch job to pre-calculate it (is it even noticeable?).
So how do the professionals do it? Have the application process the data for them, or have it already available in the database?

The professionals use a variety of methods. It all depends on what you're going for. Do the new real ratios need to be displayed immediately? How often is the core data changing? Ideally you would only calculate the ratio any time the price or earnings change, but this takes extra development, and it's probably not worth it if you don't have a substantial amount of activity on the site.
On the other hand, if you're receiving hundreds of visits every minute, you're definitely going to want to cache whatever you're calculating, as the time required to re-display a cached result is much less than recreating the result (in most scenarios).
However, as a general rule of thumb, don't get stuck trying to optimize something you haven't anticipated any performance issues with.
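For what it's worth, the "recalculate only when the inputs change, otherwise serve the cached value" idea can be sketched in a few lines of Python. The class and its method names are invented for illustration, and the computations counter exists only to show when real work happens.

```python
class RatioCache:
    """Caches price/earnings ratios and recomputes only after an input changes."""

    def __init__(self):
        self._inputs = {}  # symbol -> (price, earnings_per_share)
        self._ratios = {}  # symbol -> cached ratio
        self.computations = 0

    def update(self, symbol, price, earnings_per_share):
        self._inputs[symbol] = (price, earnings_per_share)
        # Drop the stale cached value; it is rebuilt on the next read.
        self._ratios.pop(symbol, None)

    def pe_ratio(self, symbol):
        if symbol not in self._ratios:
            price, eps = self._inputs[symbol]
            self.computations += 1
            # No meaningful ratio when earnings are zero.
            self._ratios[symbol] = price / eps if eps else None
        return self._ratios[symbol]


cache = RatioCache()
cache.update("ACME", 100.0, 4.0)  # hypothetical stock
first = cache.pe_ratio("ACME")    # computed
second = cache.pe_ratio("ACME")   # served from the cache
cache.update("ACME", 120.0, 4.0)  # price change invalidates the entry
third = cache.pe_ratio("ACME")    # computed again
```

For something as cheap as one division this buys nothing, which is the point of the rule of thumb above; the pattern pays off when the calculation is expensive.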

It would be good to keep statistical data in a separate, read-only table. You could calculate average, max and min values directly with SQL functions and save them. In the meantime, for the current period (day), you could calculate the values dynamically and show them. This statistical information can be used for reports or forecasting.

Pre-calculated value is (of course) faster.
However, it all depends on the requirement itself.
Will this value be requested frequently? If so, using a precalculated value will bring a huge advantage.
Does the calculation really need a long time and/or a lot of resources? If so, using a precalculated value will be helpful.
Please bear in mind, sometimes a slow process or a large resource consumption is caused by the programming implementation itself, not by a wrongly designed system.
