I'm wondering what's best. At the moment I have 3 'activation' codes for certain functionality within our back-end (shop) software. These three codes are currently checked for validity with 3 separate queries. This could also be done with 1 query using subselects. The point is that more and more codes may be added in the future, so what is considered best practice in this situation? The perspective I'm interested in is reducing load on the DB server and getting the best performance in this scenario. (Indexes are set properly, of course.)
I think almost the only scenario where breaking the query into several makes sense is when the results of some of them are cached. That way the overall performance might be better.
Another scenario might be when you want to move business logic out of the DB to the application, even though the performance might degrade.
Otherwise, I would use a single query.
One well-written query is practically always better than several queries.
In most cases the best approach is to rewrite your queries so that you can retrieve the required information with a single query.
BTW, subqueries are often treated like joins by the internal optimizer, so it's sometimes useful to learn how to write SQL queries with joins and the like.
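To illustrate the trade-off, here is a minimal sketch using Python's sqlite3 and a hypothetical activation_codes table (the table and column names are invented for the example). It contrasts one round trip per code with a single query using an IN list, which scales naturally as codes are added:

```python
import sqlite3

# Hypothetical schema: one row per activation code, with an active flag.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activation_codes (code TEXT PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO activation_codes VALUES (?, ?)",
                 [("ABC", 1), ("DEF", 1), ("GHI", 0)])

codes = ["ABC", "DEF", "GHI"]

# Approach 1: three round trips, one query per code.
valid_separately = {
    c for c in codes
    if conn.execute("SELECT 1 FROM activation_codes WHERE code = ? AND active = 1",
                    (c,)).fetchone()
}

# Approach 2: one round trip with an IN list; adding a fourth code
# just means one more placeholder, not another query.
placeholders = ",".join("?" * len(codes))
rows = conn.execute(
    f"SELECT code FROM activation_codes "
    f"WHERE code IN ({placeholders}) AND active = 1", codes).fetchall()
valid_in_one = {r[0] for r in rows}

assert valid_separately == valid_in_one == {"ABC", "DEF"}
```

Both approaches return the same set of valid codes; the single-query form simply does it in one server round trip.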
I have this sense that, inefficiently written queries aside, getting the information you want out of a database is always faster the fewer queries you make to do so. I don't know where I got that idea from, and it gets challenged the more complicated my queries become (am I really doing MySQL any favors with all these joins?). I'm not asking for an opinion on ease for the programmer or best coding practices, but do conditions exist under which a program would perform faster with a query broken out into multiple steps? If so, how might one make an educated guess that a query might reach such a limit before going through the effort of coding and comparing?
Yes, although it is less likely with MySQL. The reason is that MySQL doesn't have a really sophisticated cost-based optimizer. The advantage to intermediate tables is that the sizes are known. A cost-based optimizer can take advantage of this information to improve the query plan.
One place where this can help is when a subquery is repeated multiple times in a query. An intermediate table ensures that it is processed only once (although CTEs would normally do the same thing).
Another place where this can really help is when you add indexes to the intermediate tables. Adding the indexes -- and using them -- can be a big cost savings, more than making up for the cost of creating the index.
That said, I generally discourage using intermediate tables for this purpose, unless the results are needed for multiple queries. I find that just the overhead in debugging makes it not worth it -- for some reason, I don't always delete the intermediate tables and then waste time wondering why some modification doesn't work.
More importantly, as the data changes, modifying the queries can be a pain. I find that changing a column name, for instance, is simpler in a single query than when the logic is spread across multiple queries.
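The intermediate-table idea above can be sketched in a few lines. This uses Python's sqlite3 with an invented orders table; the point is materializing a subquery that would otherwise be repeated, indexing it, and then remembering to drop it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(1, 10.0), (1, 20.0), (2, 5.0), (3, 7.5), (3, 2.5)])

# Materialize a subquery that several later queries would otherwise repeat...
conn.execute("""
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(total) AS total
    FROM orders GROUP BY customer_id
""")
# ...and index it, so each later lookup is a cheap index search
# instead of re-running the aggregate.
conn.execute("CREATE INDEX idx_ct ON customer_totals (customer_id)")

row = conn.execute(
    "SELECT total FROM customer_totals WHERE customer_id = ?", (1,)).fetchone()
assert row[0] == 30.0

# Clean up, or the leftover table will confuse later debugging,
# as the answer above warns.
conn.execute("DROP TABLE customer_totals")
```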
I'm using CakePHP 2.x. When I inspect the SQL dump, I notice that its "automagic" is causing one of my find()s to run several separate SELECT queries (and then presumably merging them all together into a single pretty array of data).
This is normally fine, but I need to run one very large query on a table of 10K rows with several joins, and this is proving too much for the magic to handle: when I try to construct it through find('all', $conditions), the query times out after 300 seconds. But when I write an equivalent query manually with JOINs, it runs very fast.
My theory is that whatever PHP "magic" is required to weave the separate queries together is causing a bottleneck for this one large query.
Is my theory a plausible explanation for what's going on?
Is there a way to tell Cake to just keep it simple and make one big fat SELECT instead of its fancy automagic?
Update: I forgot to mention that I already know about $this->Model->query(); Using this is how I figured out that the slow-down was coming from PHP magic. It works when we do it this way, but it feels a little clunky to maintain the same query in two different forms. That's why I was hoping CakePHP offered an alternative to the way it builds up big queries from multiple smaller ones.
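Setting CakePHP's internals aside, the general pattern the question describes can be sketched in plain Python with sqlite3 (invented posts/comments tables): one query per parent row merged in application code, versus a single hand-written JOIN. Both produce the same structure, but the first costs a round trip per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'a'), (2, 'b');
    INSERT INTO comments VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 2, 'z');
""")

# ORM-style "automagic": one query per parent row, merged in application code.
merged = {}
for post_id, title in conn.execute("SELECT id, title FROM posts").fetchall():
    bodies = [b for (b,) in conn.execute(
        "SELECT body FROM comments WHERE post_id = ?", (post_id,))]
    merged[title] = bodies

# Hand-written equivalent: one JOIN, a single round trip to the database.
joined = {}
for title, body in conn.execute("""
    SELECT p.title, c.body FROM posts p JOIN comments c ON c.post_id = p.id
"""):
    joined.setdefault(title, []).append(body)

assert merged == joined == {'a': ['x', 'y'], 'b': ['z']}
```

With 2 posts the difference is invisible; with 10K rows and several associations, the per-row queries plus the in-PHP merge are a plausible source of the slowdown described above.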
In cases like this, where you query tables with 10k records, you shouldn't be doing a find('all') without limiting the associations. These are some of the strategies you can apply:
Set recursive to 0 if you don't need related models.
Use the Containable behavior to bring in only the associated models you need.
Apply limits to your query.
Caching is a good friend.
Create and destroy associations on the fly as you need them.
Since you didn't specify the problem, I've just given you general ideas to apply depending on the problem you have.
I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins in there and, as there should be exactly one result displayed per ID, it also works with rather complex subqueries and GROUP BY clauses. It is also sorted according to a user-set sort method, and pagination is in play as well, done with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins), using ... WHERE item_id IN (...), passing the list of "valid" IDs gathered in the first query.
Note: obviously the IN () could actually contain the first query in full, instead of relying on PHP to build up a comma-separated list.
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
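The proposed two-step approach can be sketched with sqlite3 and invented items/tags tables: a filter-only query that yields IDs, a display query restricted with WHERE ... IN (...), and the equivalent single statement where the IN clause holds the first query verbatim (as the question's note suggests), letting the database choose the plan instead of shuttling IDs through PHP:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (item_id INTEGER, tag TEXT);
    INSERT INTO items VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO tags VALUES (1, 'red'), (2, 'red'), (2, 'blue'), (3, 'blue');
""")

# Step 1: the filter-only query returns just the matching IDs.
ids = [i for (i,) in conn.execute(
    "SELECT DISTINCT item_id FROM tags WHERE tag = ?", ("red",))]

# Step 2: the display query, restricted with WHERE ... IN (...).
placeholders = ",".join("?" * len(ids))
names = [n for (n,) in conn.execute(
    f"SELECT name FROM items WHERE item_id IN ({placeholders}) ORDER BY item_id",
    ids)]

# Single-statement variant: the IN list contains the first query in full.
names_sub = [n for (n,) in conn.execute("""
    SELECT name FROM items
    WHERE item_id IN (SELECT DISTINCT item_id FROM tags WHERE tag = 'red')
    ORDER BY item_id
""")]

assert names == names_sub == ['one', 'two']
```

Note that only the second step can be paginated with LIMIT; the filter query still has to produce the full ID set, which is one of the costs the question asks about.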
My experience has shown that using the WHERE IN(...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce down the simple main table, then join onto that. Make sure your most complex joins are saved to the end to minimize the rows required to search. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINS where possible.
But I agree with Andomar, if you have the time build both and measure.
If you are going down the road of using views, how can you ensure good performance?
Or is it better not to use views in the first place and just incorporate the equivalent into your select statements?
It Depends.
It totally depends on what you are viewing through the view, but it will most probably reduce your effort and may give higher performance. When a SQL statement references a nonindexed view, the parser and query optimizer analyze the source of both the SQL statement and the view and then resolve them into a single execution plan. There is not one plan for the SQL statement and a separate plan for the view.
A view is not compiled. It's a virtual table made up of other tables. When you create it, it doesn't reside somewhere on your server. The underlying queries that make up the view are subject to the same performance gains or dings from the query optimizer. I've never tested performance of a view vs. its underlying query, but I would imagine the performance may vary slightly. You can get better performance on an indexed view if the data is relatively static. This may be what you are thinking of in terms of "compiled".
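The "single execution plan" point can be demonstrated with sqlite3 (an invented employees table; the merge behavior is analogous in MySQL's default MERGE algorithm). The view stores no data, only its defining SELECT, and querying it yields exactly what the underlying query yields:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
    INSERT INTO employees VALUES (1, 'ann', 50000), (2, 'bob', 60000);
    -- The view stores no rows, only its defining SELECT (here it also
    -- hides the salary column, a classic use of views).
    CREATE VIEW employees_public AS SELECT id, name FROM employees;
""")

via_view  = conn.execute("SELECT name FROM employees_public ORDER BY id").fetchall()
via_query = conn.execute("SELECT name FROM employees ORDER BY id").fetchall()

# Identical results: the view's definition is expanded into the outer
# statement and planned as one query.
assert via_view == via_query == [('ann',), ('bob',)]
```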
Advantages of views:
View data without storing it in a separate object.
Restrict the view of a table, i.e. hide some of the columns in the table.
Join two or more tables and present them as one object to the user.
Restrict access to a table so that nobody can insert rows into it.
See these useful links:
Performance of VIEW vs. SQL statement
Is a view faster than a simple query?
Mysql VIEWS vs. PHP query
Are MySql Views Dynamic and Efficient?
Materialized View vs. Tables: What are the advantages?
Is querying over a view slower than executing SQL directly?
A workaround for the performance problems of TEMPTABLE views
See performance gains by using indexed views in SQL Server
Here's a tl;dr summary, you can find detailed evaluations from Peter Zaitsev and elsewhere.
Views in MySQL are generally a bad idea. At Grooveshark we consider them to be harmful and always avoid them. If you are careful you can make them work but at best they are a way to remember how to select data or keep you from having to retype complicated joins. At worst they can cause massive inefficiencies, hide complexity, cause accidental nested subselects (requiring temporary tables and leading to disk thrashing), etc.
It's best to just avoid them, and keep your queries in code.
I think the blog by Peter Zaitsev has most of the details. Speaking from personal experience, views can perform well if you generally keep them simple. At one of my clients they kept layering one view on top of another, and it ended up in a performance nightmare.
Generally I use views to show a different aspect of a table. For example, in my employees table, show me the managers, or hide the salary field from non-HR employees. Also, always run an EXPLAIN on both the query and the view to understand exactly what is happening inside MySQL.
If you want solid proof in your scenario, I would suggest that you test. It is really hard to say that using views is always a performance killer; then again, a badly written view is probably going to kill your performance.
They serve their purpose, but the hidden complexities and inefficiencies usually outweigh a more direct approach. I once encountered a SQL statement that was joining on two views and sorting the results. The views were sorting as well, so the execution time could be measured in what seemed like hours.
A thing not mentioned so far but making a huge difference is adequate indexing of the views' source tables.
As mentioned above, views do not reside in your DB but are rebuilt every time they are queried. Thus everything that makes the rebuild easier for the DB increases the performance of the view.
Often, views join data in a way that is very bad for storage (not in normal form) but very good for further use (doing analysis, presenting data to the user, ...), joining and aggregating data from different tables.
Whether or not the columns the operations work on are indexed makes a huge difference to the performance of a view. If the tables and their relevant columns are already indexed, accessing the view does not require recomputing those lookups from scratch over and over again. (The downside is that the indexes must be maintained whenever data in the source tables is manipulated.)
Index all columns used in the JOIN and GROUP BY clauses of your CREATE VIEW statement!
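The effect of indexing a view's source columns can be seen directly with sqlite3's EXPLAIN QUERY PLAN (invented sales table and view; the principle carries over to MySQL's EXPLAIN). Before the index, querying the view scans the whole table; afterwards, it becomes an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    -- The view filters on region, so region is the column worth indexing.
    CREATE VIEW ny_sales AS SELECT id, amount FROM sales WHERE region = 'NY';
""")

def plan(sql):
    # Concatenate the plan's detail strings for easy inspection.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM ny_sales")   # full table SCAN
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
after = plan("SELECT * FROM ny_sales")    # indexed SEARCH

assert "USING INDEX" not in before.upper()
assert "USING INDEX" in after.upper()
```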
If we are discussing "if you use views, how to ensure performance", and not the performance effect of views in general, I think that it boils down to restraint (as in yourself).
You can get into big trouble if you just write views to make your queries simple in all cases but do not take care that your views are actually useful performance-wise. Any queries you run in the end should execute sanely (see the comment example from that link by #eggyal). Of course that's a tautology, but it's not any less valuable for it.
You especially need to be careful not to build views on top of other views just because that makes the new view easier to write.
In the end you need to look at the reason you are using views. Any time you do this just to make life easier on the programming end, you might be better off with a stored procedure, IMHO.
To keep things under control you might want to write down why you have a certain view, and decide why you are using it. For every 'new' use within your programming, recheck if you actually need the view, why you need it, and if this would still give you a sane execution-path. Keep on checking your uses to keep it speedy, and keep checking if you really need that view.
I have read that the distinct() API call has some performance issues at times. I wanted to try rewriting a query through the ORM to avoid using distinct (or at least to profile the difference).
My understanding is that values() performs a Group By under the hood. When I test out the two methods, though, the Count of objects differs depending on whether I use distinct() or values()/annotate().
zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()
VS.
zip_codes = Location.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()
any thoughts on what is wrong here?
Thanks!
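For reference, here is roughly what the two ORM calls compile to in plain SQL, sketched with sqlite3 (the location table and zip_code column mirror the question; the data is made up). In plain SQL the two counts agree, so a mismatch in Django usually means something extra (for example, a default ordering column) has leaked into the GROUP BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (id INTEGER PRIMARY KEY, zip_code TEXT)")
conn.executemany("INSERT INTO location (zip_code) VALUES (?)",
                 [("10001",), ("10001",), ("94103",), (None,)])

# values('zip_code').annotate(Count(...)).count(): count the GROUP BY rows.
group_count = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT zip_code FROM location
        WHERE zip_code IS NOT NULL GROUP BY zip_code)
""").fetchone()[0]

# distinct(): a COUNT(DISTINCT zip_code) over the same rows.
distinct_count = conn.execute("""
    SELECT COUNT(DISTINCT zip_code) FROM location WHERE zip_code IS NOT NULL
""").fetchone()[0]

assert group_count == distinct_count == 2
```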
I just quickly checked your queries against a database I have with a similar query. The counts were identical, so I'm not sure what about your data is causing the discrepancy.
I'd also be HIGHLY skeptical of the premise, though. DISTINCT is indeed a CPU-intensive operation. However, so is COUNT(*), and your second query is going to first run a count aggregate with a GROUP BY and then run a COUNT on the results. I'd put money on the single DISTINCT call being faster (I'd also check with whichever database backend you're using). All of this has very little to do with Django's ORM and a whole lot more to do with your database backend.
Also think about this: the DISTINCT-based query is an order of magnitude clearer about what it's accomplishing than the annotate-based one. Do you have evidence that DISTINCT is going to be slow in your situation, or better still, that it's forming a bottleneck right now? If not, you're well into the range of premature optimization and should heavily reconsider your path.
Premature Optimization.
Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you'll probably optimize the wrong thing.
The result will be obscure code that is hard to write, hard to debug, and hard to maintain, and that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and maintenance costs, and (b) having no performance effect at all.
In other words write your software clearly and then when you find a problem trace it to the source and fix it. Anything you do before that is counterproductive. Spend your time worrying about which indexes are going to matter on your db, and where to use select_related. Those are 10000% more effective than what you are worrying about here (unless you are counting zip codes all the time, in which case let me introduce you to caching)