de-normalization, weighted aggregates for updated tables in MySQL

This time I have a more general question: should I use multiple views rather than stored procedures for weighted aggregation of data, if the original data is updated periodically?
Basically I have a local MySQL database that is updated periodically by importing the same kind of data (tables) from a bigger transaction database.
The local database is used for statistical analysis, so I de-normalize (basically aggregate) the data locally for use with statistical software packages. So far I have used stored procedures because I felt they were easier to handle (and more clearly arranged) once weighting schemes (basically other tables containing weights that are multiplied with the variables) came into play.
The disadvantage of stored procedures, though, is that I have to run all of them again when the tables are populated with new data. Obviously I am not a DBA... so don't shy away from stating the obvious :) What's the best approach to handle this kind of scenario? Stored procedures or views? Or something completely different?
thx for any suggestions in advance!

It depends (that's the generic answer to any "general" questions, isn't it? :) ). You need to evaluate the tradeoffs to see what the best solution is for your needs.
Views are basically just query re-writing (in MySQL), so using a view means the aggregation/denormalization is performed every time the query is run. That may make your queries slower than you would like. Also, if your procedures are really complicated, maybe it's not practical to try to put that logic into a view.
Stored procedures do the work once, so queries will be faster. But then your updates won't show up automatically. So I think the answer depends on how often the data changes, how often queries are run, and how important the performance of the queries is.
As for alternative suggestions, you could also run your stored procedures using events, if your data updates are regular, and you are just trying to save yourself from the manual task of running the procedures.
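A minimal sketch of that, assuming a hypothetical procedure sp_rebuild_weighted_aggregates() that re-runs the aggregation after each import:

-- The event scheduler must be enabled once.
SET GLOBAL event_scheduler = ON;

CREATE EVENT ev_rebuild_weighted_aggregates
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_TIMESTAMP + INTERVAL 1 HOUR
DO
  CALL sp_rebuild_weighted_aggregates();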
Another option is to have denormalization/aggregation tables that are updated with triggers. As you update your data in the source table, the triggers will automatically keep the aggregate tables current.
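A minimal sketch of the trigger approach, assuming a hypothetical fact table fact_values(group_id, weight, value) and an aggregate table agg_values(group_id, weighted_sum) with group_id as its primary key:

DELIMITER //
CREATE TRIGGER trg_fact_values_ai
AFTER INSERT ON fact_values
FOR EACH ROW
BEGIN
  -- Keep the running weighted sum current as each new row arrives.
  INSERT INTO agg_values (group_id, weighted_sum)
  VALUES (NEW.group_id, NEW.weight * NEW.value)
  ON DUPLICATE KEY UPDATE weighted_sum = weighted_sum + NEW.weight * NEW.value;
END//
DELIMITER ;

Note that updates and deletes on the fact table would need corresponding triggers, otherwise the aggregate table drifts out of date.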
Here is a link to documentation for stored procedures, views, triggers, and events.

Related

How to cache infrequently changing mysql query?

I have a mysql query that is taking 8 seconds to execute/fetch (in workbench).
I won't go into the details of why it may be slow (I think GROUP BY isn't helping, though).
What I really want to know is how I can basically cache it so it works more quickly, because the tables only change about 5-10 times/hour while users access the site 1000s of times/hour.
Is there a way to just have the results regenerated/cached when the db changes so results are not constantly regenerated?
I'm quite new to sql so any basic thought may go a long way.
I am not familiar with such a caching facility in MySQL. There are alternatives.
One mechanism would be to use application level caching. The application would store the previous result and use that if possible. Note this wouldn't really work well for multiple users.
What you might want to do is store the report in a separate table. Then you can run that every five minutes or so. This would be a simple mechanism using a job scheduler to run the job.
A variation on this would be to have a stored procedure that first checks if the data has changed. If the underlying data has changed, then the stored procedure would regenerate the report table. When the stored procedure is done, the report table would be up-to-date.
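A rough sketch of that pattern, with hypothetical names throughout (orders as the source table with an updated_at column, report_daily_totals as the report table, report_meta holding the last refresh time):

DELIMITER //
CREATE PROCEDURE sp_refresh_report()
BEGIN
  DECLARE latest DATETIME;
  SELECT MAX(updated_at) INTO latest FROM orders;
  -- Only rebuild the report if the source data changed since the last refresh.
  IF latest > (SELECT last_refreshed FROM report_meta) THEN
    TRUNCATE TABLE report_daily_totals;
    INSERT INTO report_daily_totals (order_date, total_amount)
    SELECT DATE(created_at), SUM(amount)
    FROM orders
    GROUP BY DATE(created_at);
    UPDATE report_meta SET last_refreshed = NOW();
  END IF;
END//
DELIMITER ;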
An alternative would be to use triggers, whenever the underlying data changes. The trigger could run the query, storing the results in a table (as above). Alternatively, the trigger could just update the rows in the report that would have changed (harder, because it involves understanding the business logic behind the report).
All of these require some change to the application. If your application query is stored in a view (something like vw_FetchReport1) then the change is trivial and all on the server side. If the query is embedded in the application, then you need to replace it with something else. I strongly advocate using views (or in other databases user defined functions or stored procedures) for database access. This defines the API for the database application and greatly facilitates changes such as the ones described here.
EDIT: (in response to comment)
More information about scheduling jobs in MySQL is here. I would expect the SQL code to be something like:
truncate table ReportTable;
insert into ReportTable
select * from <ReportQuery>;
(In practice, you would include column lists in the select and insert statements.)
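One way to wire that refresh to the MySQL event scheduler might look like the following; the event name and the stand-in report query are assumptions, and the real report query and column lists would go in the INSERT:

DELIMITER //
CREATE EVENT ev_refresh_report
ON SCHEDULE EVERY 5 MINUTE
DO
BEGIN
  TRUNCATE TABLE ReportTable;
  INSERT INTO ReportTable
  SELECT customer_id, COUNT(*) AS order_count  -- stand-in for the real report query
  FROM orders
  GROUP BY customer_id;
END//
DELIMITER ;
-- The scheduler itself must be enabled once: SET GLOBAL event_scheduler = ON;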
A simple solution that can be used to speed up the response time of long-running queries is to periodically generate summary tables, based on how often the underlying data refreshes or on business needs.
For example, if your business doesn't care about sub-minute "accuracy", you can run the process once each minute and have your user interface query this calculated table instead of summarizing the raw data online.

Run analytics on huge MySQL database

I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-schema data warehouse. The table sizes range from 700GB (the fact table) down to 1GB, and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database might be "find the number of smokers per state and display it in descending order". This requirement can be converted into a simple query like:
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a lot of time to execute on this database; the time taken is on the order of tens of hours.
This database is also heavily used for insertion, which means thousands of rows are added every few minutes.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it will be as easy for running analytical queries, especially when I have to use WHERE clauses and GROUP BY constructs.
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries. I am not sure I want to invest right away in at least three machines for the name-node, ZooKeeper and data-nodes!! Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data into simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the MySQL environment setup:
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for these kinds of ad-hoc aggregated values keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
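For instance, the first step on the query from the question (column names as given there) would simply be:

EXPLAIN
SELECT state, COUNT(*) AS smokers
FROM abc
WHERE smokingStatus = 'current smoker'
GROUP BY state;
-- Check the type, key and rows columns: a full table scan ("ALL") over the 700GB fact table,
-- or an index that is present but ignored, is the first thing to rule out.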
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have integer keys in the fact table (this will help the sizes of the indexes too); see the sketch after this list.
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
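A sketch of the lookup-table idea from the list above; the smoking_status table and the smoking_status_id column are assumptions for illustration:

CREATE TABLE smoking_status (
  status_id   TINYINT UNSIGNED PRIMARY KEY,
  status_name VARCHAR(32) NOT NULL UNIQUE
);

INSERT INTO smoking_status VALUES
  (1, 'current smoker'), (2, 'recently quit'),
  (3, 'quit 1+ years'),  (4, 'never smoked');

-- The fact table then carries an indexed smoking_status_id instead of the string,
-- and the analytical query joins back to the label only for filtering:
SELECT f.state, COUNT(*) AS smokers
FROM abc AS f
JOIN smoking_status AS s ON s.status_id = f.smoking_status_id
WHERE s.status_name = 'current smoker'
GROUP BY f.state;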
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ and based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries at the Infobright server, keep the rest of the system as it is, job done) - or at Vertica, if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care and feeding than the other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day, and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand, e.g. pre-calculate the number of new smokers every day (see the sketch after these questions)?
Can you use Hadoop for the aggregation process above? Hadoop is quite good at that kind of thing. Basically, use Hadoop just for daily or batch processing and store the results back into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on string columns? Can you make them ints, etc.?
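As a sketch of the daily pre-aggregation idea mentioned above (the summary table, the created_at column and the schedule are all assumptions, not part of the question):

CREATE TABLE daily_smokers (
  stat_date DATE NOT NULL,
  state     VARCHAR(32) NOT NULL,
  smokers   INT UNSIGNED NOT NULL,
  PRIMARY KEY (stat_date, state)
);

-- Run once per day (via cron or a MySQL event) over only the previous day's new rows.
INSERT INTO daily_smokers (stat_date, state, smokers)
SELECT DATE(created_at), state, COUNT(*)
FROM abc
WHERE smokingStatus = 'current smoker'
  AND created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), state;

-- Ad-hoc queries then scan the small summary table instead of the 700GB fact table.
SELECT state, SUM(smokers) AS smokers
FROM daily_smokers
GROUP BY state
ORDER BY smokers DESC;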
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: the lack of parallel query capability. It cannot utilize multiple CPUs for a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application that translates your HiveQL queries (an SQL-like language) into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating the daily data to it and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive - each query takes at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema into Cassandra, and you still do not get the flexibility and indexing capabilities of an RDBMS.
As a bottom line - Hive alongside MySQL should be a good solution.

Functions within Stored Procedures - SQL 2008

I have a SQL query which returns 30,000+ records, with 15 columns. I am passing an NVARCHAR(50) parameter to the stored procedure.
At the moment I am using stored procedure to get the data from the database.
As there are 30,000+ records to be fetched and it's taking time, what would you suggest?
Do I get any performance benefits if I use functions within the stored procedure (to get individual columns based on the parameter I am passing)?
Please let me know, if you need more info on the same.
Thank you
I wouldn't use functions unless there is no other way to get your data.
Since SQL Server 2005 you have extra functionality available in stored procedures, such as WITH (common table expressions) and CROSS APPLY, which removes certain restrictions we had in previous versions of SQL Server that previously had to be worked around with UDFs.
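As a hedged illustration of the CROSS APPLY point (the Customers and Orders tables here are made up): instead of calling a scalar UDF once per row, the per-row lookup can be expressed inline.

-- Fetch each customer's most recent order without a per-row UDF call.
SELECT c.CustomerId, c.Name, o.OrderDate, o.Total
FROM Customers AS c
CROSS APPLY (
    SELECT TOP (1) OrderDate, Total
    FROM Orders
    WHERE Orders.CustomerId = c.CustomerId
    ORDER BY OrderDate DESC
) AS o;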
In terms of performance, the stored procedure will generally be quicker, but it depends on how optimized your query is and/or how the tables have been designed. Maybe you could give us an example of what you are trying to achieve.
Functions would probably not be the way to go. 30,000 rows isn't that many, depending on how complex the query is. You would be better off focusing on optimising the SQL in the proc, or on checking that your indexing is set up correctly.

mysql stored routine vs. mysql-alternative?

We are using a MySQL database with about 150,000 records (names) total. Our searches on the 'names' field are done through an autocomplete function in PHP. We have the table indexed but still feel that the searching is a bit sluggish (a few full seconds vs. something like Google Finance with its near-instant response). We came up with two possibilities, but wanted to get more insight:
Can we create a bunch (many thousands or more) of stored procedures to speed up searches, or will creating that many stored procedures bog down the db?
Is there a faster alternative to MySQL for "select" statements (speed on inserting & updating rows isn't too important, so we can sacrifice that if necessary)? I've vaguely heard of BigTable & others that don't support JOIN statements... we need JOINs for some of the other queries we do.
thx
Forget about stored procedures. They won't do any good for you.
MySQL is a good choice; it's often considered one of the fastest RDBMSs, and there is no need to look for a 'faster alternative to the select statement'.
The abnormal query execution time you mention is the result of server misconfiguration, a wrong database schema, or both. Please read this response on Server Fault or update your question here: provide the server configuration, the relevant part of the database schema, and the problem query along with its explain select ... output.
You need to cache the information in memory to avoid making repeated calls to the database.
Yes, you need to expire the cache if you change the data, but as you said that's not common, so you can even do that on a semi-automated basis and not worry about it if necessary. You should check out this MySQL.com article, as well as perhaps explore the MEMORY storage engine (sorry, I'm new and can't post more than one hyperlink per post?!), which takes a little bit of coding around but can be extremely efficient.
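A rough sketch of that MEMORY-engine idea, with assumed table and column names (names as the source table); note that MEMORY tables lose their contents on server restart, so the cache table must be repopulated:

CREATE TABLE names_cache (
  id   INT UNSIGNED NOT NULL,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id),
  INDEX idx_name (name) USING BTREE  -- BTREE so a leading-prefix LIKE can use the index
) ENGINE=MEMORY;

-- Repopulate whenever the source data changes (and after every server restart).
TRUNCATE TABLE names_cache;
INSERT INTO names_cache (id, name)
SELECT id, name FROM names;

-- The autocomplete query is then served entirely from memory.
SELECT name FROM names_cache WHERE name LIKE 'smi%' ORDER BY name LIMIT 10;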
What's the actual query time (vs page time)? On a reasonably modern server that's not loaded to hell, MySQL should be able to do an autocomplete query on 150k rows much, much faster than two seconds. Missing some indexes?
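For example, if the autocomplete runs a leading-prefix LIKE, a check along these lines (table and column names assumed) would confirm whether an index is being used:

ALTER TABLE names ADD INDEX idx_name (name);

EXPLAIN SELECT name FROM names WHERE name LIKE 'smi%' ORDER BY name LIMIT 10;
-- The plan should show a "range" access on idx_name rather than a full table scan;
-- on 150k rows that should return in milliseconds, not seconds.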

Stored procedures for complex queries

I have a few complex queries that are run very often.
Caching the results is not possible, as they're updated most of the time, and seeing the updated data is the whole point.
I'm not allowed to change the database settings, and those who are won't do it unless hell freezes over first, so I have to do everything I can to optimize queries and tables.
Since I think I already did all I could for these queries and the tables they use, I was thinking if there would be any gains in speed if I were to create stored procedures for them.
Would it work to increase speed, or should I look for something else?
No, using a stored procedure will not increase the performance of a "hard" query.
Mostly, hard queries are caused by the database needing to do a lot of work to find the answer. That won't be any different inside a stored procedure.
Changing the database settings might affect some things, but usually the best way of optimising a query is to change the structure of your data so that you need to query either fewer rows or fewer columns. Alternatively, you might be able to get it to use better indexes, or find some other way of improving the query.
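As one hedged example of the "better indexes" point (the table and column names here are invented, not from the question): a composite index that covers both the filter and the grouping column lets MySQL answer the query from the index alone.

ALTER TABLE orders ADD INDEX idx_status_created (status, created_date);

EXPLAIN
SELECT created_date, COUNT(*) AS open_orders
FROM orders
WHERE status = 'open'
GROUP BY created_date;
-- "Using index" in the Extra column indicates a covering-index scan;
-- without the composite index, MySQL has to read the full rows.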
Use EXPLAIN. Use a non-production system for performance testing. Don't bother putting your queries into a procedure (if performance is your only reason for wanting to do so).
Refer to the following link:
MySQL Stored Procedure vs. complex query
It will give you a small performance boost.
Yes, using stored procedures will increase performance, since the SP is compiled and stored in the database server.
But it also depends upon the structure of the table and the query!
As the data grows, performance will be low if you have a poor database structure and a non-optimized query.