Dynamic or pre-calculated data - MySQL

I'm a bit new to programming and had a general question that I just thought of.
Say I have a database with a bunch of stock information, with one column for price and another for earnings. To get the price/earnings ratio, would it be better to calculate it every day or to calculate it on demand? Performance-wise, I think it would be quicker to just read a stored value, but I'm wondering whether, for math-type calculations, it's worth the batch job to pre-calculate it (is the difference even noticeable?).
So how do the professionals do it? Have the application process the data for them, or have it already available in the database?

The professionals use a variety of methods. It all depends on what you're going for. Do the new ratios need to be displayed immediately? How often is the core data changing? Ideally you would only recalculate the ratio when the price or earnings change, but this takes extra development, and it's probably not worth it if you don't have a substantial amount of activity on the site.
On the other hand, if you're receiving hundreds of visits every minute, you'll definitely want to cache whatever you're calculating, as the time required to re-display a cached result is much less than the time to recreate it (in most scenarios).
As a general rule of thumb, though, don't get stuck optimizing something that hasn't shown any performance issues.
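If you do want the ratio kept current without a batch job, one option in MySQL 5.7+ is a generated column. This is only a minimal sketch; the table and column names here are made up:

    -- Hypothetical table: a STORED generated column is recomputed only
    -- when price or earnings change, so reads are as cheap as any other column.
    CREATE TABLE stock_quotes (
        symbol   VARCHAR(10) PRIMARY KEY,
        price    DECIMAL(12,4) NOT NULL,
        earnings DECIMAL(12,4) NOT NULL,
        pe_ratio DECIMAL(12,4) AS (price / NULLIF(earnings, 0)) STORED
    );

    -- Reading the pre-calculated ratio is then a plain lookup:
    SELECT symbol, pe_ratio FROM stock_quotes WHERE symbol = 'ABC';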

It would be good to keep statistical data in a separate table, since those rows are effectively read-only. You could calculate average, max, and min values directly with SQL functions and save them. In the meantime, for the current period (day), you could calculate and show the values dynamically. This statistical information can be used for reports or forecasting.
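As a rough sketch of that split (the table and column names are assumptions), closed-out days get rolled into a summary table once, while the current day is computed on demand:

    -- Roll yesterday's rows into the read-only summary table:
    INSERT INTO daily_price_stats (stat_date, avg_price, max_price, min_price)
    SELECT DATE(traded_at), AVG(price), MAX(price), MIN(price)
    FROM trades
    WHERE traded_at >= CURDATE() - INTERVAL 1 DAY
      AND traded_at <  CURDATE()
    GROUP BY DATE(traded_at);

    -- The current (still open) day is calculated on the fly:
    SELECT AVG(price), MAX(price), MIN(price)
    FROM trades
    WHERE traded_at >= CURDATE();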

Pre-calculated value is (of course) faster.
However, it all depends on the requirement itself.
Will this value be invoked frequently? If it will, then using a precalculated value brings a huge advantage.
Does the calculation take a long time and/or consume a lot of resources? If so, precalculating will be helpful.
Please bear in mind that sometimes a slow process or large resource consumption is caused by the programming implementation itself, not by a badly designed system.

Related

Day wise aggregation considering client's timezone from millions of rows

Suppose I have a table where visitor (website visitor) information is stored. The table structure consists of the following fields:
ID
visitor_id
visit_time (stored as milliseconds in UTC since '1970-01-01 00:00:00')
Millions of rows are in this table, and it's still growing.
In that case, if I want to see a report (day vs. visitors) from any timezone, then one solution is:
Solution #1:
Get the timezone of the report viewer (i.e. the client)
Aggregate the data from this table considering the client's timezone
Show the result day-wise
But in that case performance will degrade. Another solution may be the following:
Solution #2:
Using pre-aggregated tables / summary tables where the client's timezone is ignored
But in either case there is a trade off between performance and correctness.
Solution #1 ensures correctness and Solution #2 ensures better performance.
I want to know: what is the best practice in this particular scenario?
The issue of handling time comes up a fair amount when you get into distributed systems, users and matching events between various sources of data.
I would strongly suggest that you ensure all logging systems use UTC. This allows collection from any variety of servers (which are all hopefully kept synchronized with respect to their view of the current UTC time) located anywhere in the world.
Then, as requests come in, you can convert from the user's timezone to UTC. At this point you have the same decision -- perform a real-time query, or access some data previously summarized.
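As a rough illustration of the real-time option (the table name visits and the '+05:30' offset are placeholders; visit_time is the UTC-milliseconds column described above):

    -- Make FROM_UNIXTIME return datetimes in the client's timezone,
    -- then group by local calendar day.
    SET time_zone = '+05:30';   -- the report viewer's offset

    SELECT DATE(FROM_UNIXTIME(visit_time / 1000)) AS local_day,
           COUNT(DISTINCT visitor_id)             AS visitors
    FROM visits
    GROUP BY local_day
    ORDER BY local_day;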
Whether or not you want to aggregate the data in advance will depend on a number of things: whether it lets you reduce the amount of data kept, how much processing is needed to support queries, how often queries will be performed, and even the cost of building such a system versus the amount of use it might see.
With respect to best practices -- keep the display characteristics (e.g. time zone) independent from the processing of the data.
If you haven't already, be sure you consider the lifetime of the data you are keeping. Will you need ten years of back data available? Hopefully not. Do you have a strategy for culling old data when it is no longer required? Do you know how much data you'll have if you store every record (estimate with various traffic growth rates)?
Again, a best practice for larger data sets is to understand how you are going to deal with the size and how you are going to manage that data over time as it ages. This might involve long term storage, deletion, or perhaps reduction to summarized form.
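As a minimal sketch of the culling idea (the retention window and table name are assumptions, and the raw rows are presumed to be summarized first):

    -- Remove raw visit rows older than the retention window, in batches
    -- so the table stays responsive. visit_time is UTC milliseconds.
    DELETE FROM visits
    WHERE visit_time < UNIX_TIMESTAMP(CURDATE() - INTERVAL 13 MONTH) * 1000
    LIMIT 10000;

Run something like this repeatedly (for example from a scheduled job) until it affects zero rows.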
Oh, and to slip in a Matrix analogy, what is really going to bake your noodle in terms of "correctness" is the fact that correctness is not at issue here. Every timezone has a different view of traffic during a "day" in their own zone and every one of them is "correct". Even those oddball time zones that differ from yours by an adjustment that isn't measured only in hours.

Recurring data demand - automated query, or store data directly in SQL?

This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have a scripting language do this for me after querying the SQL database -- but that seems horribly inefficient, perhaps.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store the results in a separate table? The advantage is that it would reduce query time for any user, since all the query work is done before a page is even loaded or a button pushed. Even with a view, I'd guess the aggregation would have to be computed as soon as the view was queried.
The only downside to aggregating the weeks/months in advance, perhaps even just once a day (instead of every time the website is loaded), is that it won't be fully up to date and may show inconsistencies.
I'm not really an expert when it comes to this bigger-picture stuff -- anyone have any thoughts? Thanks.
It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key while this month's statistics roll in? To cover this scenario, you might want a view whose criteria present a window that is always relative to getdate(). Keep in mind that good indexing strategies and query design should mitigate the impact of this approach to nearly nothing.
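A sketch of such a relative-window view (CURDATE() standing in for getdate() in MySQL terms; the table and column names are invented):

    -- Always reflects the current month relative to "today",
    -- so refreshing the page re-runs it against live data.
    CREATE VIEW current_month_metrics AS
    SELECT metric_name, SUM(metric_value) AS month_to_date
    FROM daily_metrics
    WHERE metric_date >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
    GROUP BY metric_name;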
Is the user expecting informational data that doesn't include today's data? More performance might be seen out of a nightly job that does the aggregation into a new table.
Of all the scenarios, though, I would not recommend manual aggregation. Down that road lie unexpected bugs and exceptions that can really be handled with a good SQL statement. Aggregates are a big part of every DBMS; let the database handle them and work on the rest of your application.
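For the nightly-job scenario above, a rough sketch with made-up table names (this could be driven by a MySQL EVENT or an external cron job):

    -- Rebuild the recent weekly summaries from the daily metrics table.
    -- weekly_metrics needs a unique key on (year_week, metric_name)
    -- so REPLACE overwrites stale rows. Mode 1 starts weeks on Monday.
    REPLACE INTO weekly_metrics (year_week, metric_name, metric_total)
    SELECT YEARWEEK(metric_date, 1), metric_name, SUM(metric_value)
    FROM daily_metrics
    WHERE metric_date >= CURDATE() - INTERVAL 14 DAY
    GROUP BY YEARWEEK(metric_date, 1), metric_name;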

Best way to log sensor data with MySQL

I am new to SQL, please advise.
I wish to log incoming data from a sensor every 5 seconds for future graph plotting.
What is the best way to design the database in MySQL?
Could I log each reading with a timestamp and use AVG functions when I'd like to display a graph by hour, day, week, or month?
Or could I log and then average every minute, hour, and day to reduce database size?
Is it possible to use a trigger function to compute the average once a minute's worth of data has been collected?
The answer is that it depends on how much data you are actually going to be logging, how often you are going to be querying it, and how fast your response time needs to be. If it's just one sensor, every 5 seconds, you could probably go on for eternity without running into too many problems with regular SQL queries to pull out averages, sums, etc. in a reasonable period of time.
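For instance, with a single log table (the names here are placeholders), those averages are plain GROUP BY queries:

    -- Raw log: one row per reading, every 5 seconds.
    CREATE TABLE sensor_log (
        id        BIGINT AUTO_INCREMENT PRIMARY KEY,
        sensor_id INT NOT NULL,
        reading   DOUBLE NOT NULL,
        logged_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        KEY idx_sensor_time (sensor_id, logged_at)
    );

    -- Hourly averages for a graph; change the DATE_FORMAT pattern
    -- for day ('%Y-%m-%d'), month ('%Y-%m'), etc.
    SELECT DATE_FORMAT(logged_at, '%Y-%m-%d %H:00') AS bucket,
           AVG(reading) AS avg_reading
    FROM sensor_log
    WHERE sensor_id = 1
    GROUP BY bucket
    ORDER BY bucket;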
I will say that from experience, you can do a lot with SQL and time series data, but you have to be very careful how you design your queries. I've worked with time series tables with billions of rows and tens of thousands of individual sensors among those rows; it's possible to achieve very fast execution over that many time series rows, but you might spend a week trying to fine-tune the database. It's definitely a trade-off between flexibility and speed.
Again, for your purposes, it probably is not going to make very much difference if you are just talking about one sensor; just write a regular SQL query. However, if you anticipate adding several hundred more sensors or increasing the sample rate, you may want to consider doing periodic "rollup" functions as you suggest. And in that case, I would be more inclined to write a custom solution using a NoSQL database (e.g. Cassandra, Couchbase, etc.) and using a program that runs periodically to do the rollup. If you are interested, I can provide details, but I really don't think you will need to go that far.
This post has a pretty good discussion on storing time series data in SQL vs NOSQL: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
You should read about RRDtool.
From RRDtool website:
RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data.
http://oss.oetiker.ch/rrdtool/
If you don't want to use it (it may be too complicated or too big for your application, etc.), take a look at how it is made, how the information is stored, and so on.

Show calculated results on the fly vs. Storing pre-calculated results in a table

I have results that are calculated from multiple rows across many different tables. These results are then displayed in a profile. To show the most current results on request, should I store these results in a separate table and update them on change, or should I calculate them on the fly?
As usual with performance questions, the answer is "it depends".
I'd probably start by calculating the results on the fly and go with precomputing them when it starts to be a problem.
If you have precomputed/summarized copies of your main data, you'll have to set up an updating process to make sure your summaries are correct. This can be quite tricky and can add a lot of complexity to your application so I wouldn't do it unless I had to. You'll also want to have a set of sanity check tools to make sure your summaries are, in fact, correct. And a set of "kill it all and rebuild the generated summaries" tools will also come in handy.
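As a small illustration of the "kill it all and rebuild" tool (the summary and source table names are invented), the generated summary should always be reproducible from the source rows:

    -- Throw away the generated summary and rebuild it from scratch.
    TRUNCATE TABLE profile_summary;

    INSERT INTO profile_summary (profile_id, item_count, total_score)
    SELECT profile_id, COUNT(*), SUM(score)
    FROM profile_items
    GROUP BY profile_id;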
If these calculations are a problem (and by that I mean you have measured the performance, know that the queries in question are a bottleneck, and have the numbers to prove it), then it might be worth the extra coding and maintenance effort.

CPU bound applications vs. IO bound

For 'number-crunching' style applications that use a lot of data (read: hundreds of MB, but not into the GB range, i.e. it will fit nicely into memory alongside the OS), does it make sense to read all your data into memory before starting processing, so the program loads from RAM instead of potentially becoming IO-bound while reading large related datasets?
Does this answer change between different data backings? I.e., would the answer be the same irrespective of whether you were using XML files, flat files, a full DBMS, etc.?
Your program is as fast as whatever its bottleneck is. It makes sense to do things like storing your data in memory if that improves the overall performance. There is no hard and fast rule that says it will improve performance however. When you fix one bottleneck, something new becomes the bottleneck. So resolving one issue may get a 1% increase in performance or 1000% depending on what the next bottleneck is. The thing you're improving may still be the bottleneck.
I think about these things as generally fitting into one of three levels:
Eager. When you need something from disk or from a network or the result of a calculation you go and get or do it. This is the simplest to program, the easiest to test and debug but the worst for performance. This is fine so long as this aspect isn't the bottleneck;
Lazy. Once you've done a particular read or calculation don't do it again for some period of time that may be anything from a few milliseconds to forever. This can add a lot of complexity to your program but if the read or calculation is expensive, can reap enormous benefits; and
Over-eager. This is much like a combination of the previous two. Results are cached, but instead of doing the read or calculation only when requested, there is a certain amount of preemptive activity to anticipate what you might want. For example, if you read 10K from a file, there is a reasonably high likelihood that you will later want the next 10K block, so rather than delay execution you fetch it just in case it's requested.
The lesson to take from this is the (somewhat over-used and often mis-quoted) quote from Donald Knuth that "premature optimization is the root of all evil." Eager and over-eager solutions add a huge amount of complexity so there is no point doing them for something that won't yield a useful benefit.
Programmers often make the mistake of creating some highly (allegedly) optimized version of something before determining whether they need to and whether it will actually be useful.
My own take on this is: don't solve a problem until you have a problem.
I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.
Most database tables have regular offsets for fields in each row. For example, a customer record may be 50 bytes long with a pants_size column starting at the 12th byte. Selecting all pants sizes is as easy as reading the values at offsets 12, 62, 112, 162, ad nauseam.
XML, however, is a lousy format for fast data access. You'll need to slog through a bunch of variable-length tags and attributes to get at your data, and you won't be able to jump instantly from one record to the next -- unless you parse the file into a data structure like the one mentioned above, in which case you'd have something very much like an RDBMS, so there you go.