SQL Tips for Query Optimization [closed] - mysql

I am a newbie when it comes to SQL and have a general question about optimization.
In your personal experience, what are the things I should consider in order to write an optimized query? Are there particular constructs (e.g. JOIN, CASE) I should favor or avoid whenever possible? Also, how do you measure a query's efficiency?
Sorry for the open-ended question; I am just trying to wrap my head around this subject and would be interested to hear the opinion of someone experienced.
Regards

"Efficiency" means to accomplish a goal with minimum effort. So what is efficient depends on the goal, and you cannot say something like "a query is executed efficiently if it takes less than the tenth of a second". Essentially, a query is efficient if there is no substantially faster way to do the task.
Another, more pragmatic, approach is to make queries efficient enough. If a query does what you want it to do and the execution time and resource usage are fine for your purpose, stop worrying. You should also consider that optimizing a query to the theoretical optimum (e.g., by creating a specialized index) might negatively affect other parts of the system (e.g., data modifications become slower). You want to optimize the overall performance and resource usage of the system.
All that said, it should be clear that there can be no simple checklist that you can work through to ensure efficiency. But I can give you a short list of SQL anti-patterns that, in my experience, often lead to inefficient queries:
Don't use DISTINCT unless you are certain that it is required. It usually requires sorting, which is very expensive for large sets.
Avoid OR in WHERE conditions. It tends to prevent indexes from being used.
Use outer joins only if you are certain that an inner join won't do the trick. The database has fewer possibilities to rearrange such joins.
Use a normalized data model. Don't fall into the trap of using arrays or JSON in the database.
Use UNION ALL instead of UNION unless you need to eliminate duplicates. This is similar to DISTINCT.
Use WHERE EXISTS (/* subquery */) rather than WHERE x IN (/* subquery */). IN can always be rewritten as EXISTS, and the PostgreSQL optimizer is better at dealing with the latter (see the sketch after this list).
These rules should be understood as rules of thumb.
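To make the last rule of thumb concrete, here is a minimal sketch assuming hypothetical customers and orders tables (neither appears in the question). Both statements return the customers that have at least one order, but the EXISTS form is often handled better by the optimizer:

-- IN form: customers that have at least one order
SELECT c.id, c.name
FROM customers c
WHERE c.id IN (SELECT o.customer_id FROM orders o);

-- Equivalent EXISTS form, which the optimizer often handles better
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);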

Related

How often is database optimization NOT possible? [closed]

Currently I am working on a database that requires me to take raw data from a 3rd party and store it in a database. The problem is that the raw data is obviously not optimized, and the people I'm building the database for don't want any data entry involved when uploading the raw data into the database; they pretty much just want to upload the data and be done with it. Some of the raw data files have empty cells all over the place and many instances of duplicate names/numbers/entries. Is there a way to still optimize the data quickly and efficiently without too much data entry or reworking each time data is uploaded, or is this an instance where optimization is impossible due to constraints? Does this happen a lot, or do I need to tell them their dreams of just uploading are not possible for long-term success?
There are many ways to optimize data, and an approach that works well in one use case may be terrible in another. There are tools that will point out columns with repeated values that could be normalized, but there is no single piece of advice that works in all cases.
Without specific details, the following is generally good advice:
With regard to empty entries: these should not be an issue.
With regard to duplicate data: it may be worth introducing a one-to-many relationship.
One thing to make sure of is to put an index (key) on any field you are going to search on; this will speed up your queries a lot no matter the dataset (see the sketch after this answer).
As far as changing the database schema goes: schemas that never change over time are rare.
My advice is to think through your schema, but do not try to over-optimize, because you cannot plan in advance what the exact usage will be. As long as it is working and there is no bottleneck, focus on other areas. If there is a bottleneck, then by all means rewrite the affected part, making sure indices are present (consider composite indices in some cases). Consider avoiding unions when possible, and remember the KISS principle (Keep It Simple and Sweet).
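As a hedged sketch of the indexing point above (the table and column names are made up for illustration):

-- Hypothetical staging table loaded as-is from the 3rd-party files
CREATE TABLE raw_data (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    customer_name VARCHAR(100),
    phone         VARCHAR(30),
    amount        DECIMAL(10,2)
);

-- Index the fields you search on so lookups don't scan the whole table
CREATE INDEX idx_raw_data_customer_name ON raw_data (customer_name);

The same idea applies to whatever columns your queries actually filter on.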

Inserting redundant information into the database to prevent table joins? [closed]

I'm trying to build an activity stream which has the following structure:
------------------------------------------------------------------------------------
id | activity_by_user_id | activity_by_username | ... other activity related columns
------------------------------------------------------------------------------------
Is it a good approach to store activity_by_username in the activity table as well? I understand that this will clutter up the table with the same username again and again. But if not, I will have to do a join with the users table to fetch the username.
The username in my web application never changes.
With this, I will no longer have to join this table with the users table. Is this an optimal way of achieving what I need?
What you are proposing is to denormalize the data structure. There are advantages and disadvantages to this approach.
Clearly, you think that performance will be an advantage, because you will not need to look up the username for each row. This may not be true. The lookup should be on the primary key of the table and should be quite fast. There are even situations where storing the redundant information could slow down the query: when the field size is large and there are many rows for the same user, you are wasting lots of storage on redundant data, increasing the size of the table. Normally, though, you would expect to see a modest -- very modest -- improvement in performance.
Balanced against that is the fact that you are storing redundant data. So, if the user name were updated, then you would have to change lots of rows with the new information.
On balance, I would only advise you to go with such an approach if you tested it on real data in your environment and the performance improvement is worth it. I am skeptical that you would see much improvement, but the proof is in the pudding.
By the way, there are cases where denormalized data structures are needed to support applications. I don't think that looking up a field using a primary key is likely to be one of them.
There isn't a single answer to your question*
In general, relational database design seeks to avoid redundancy to limit the opportunities for data anomalies. For example, you now have the chance that two given rows might contain the same user id but different user names. Which one is correct? How do you prevent such discrepancies?
On the other hand, denormalization by storing certain columns redundantly is sometimes justified. You're right that you avoid doing a join because of that. But now it's your responsibility to make sure data anomalies don't creep in.
And was it really worth it? In MySQL, doing a join to look up a related row by its primary key is pretty efficient (you see this as a join type "eq_ref" in EXPLAIN). I wouldn't try to solve that problem until you can prove it's a bottleneck.
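To make the eq_ref point concrete, here is a minimal sketch assuming the users table mentioned in the question, with hypothetical column names (id, username) for anything not shown:

-- Normalized lookup of the username via the users primary key
SELECT a.id, u.username
FROM activity a
JOIN users u ON u.id = a.activity_by_user_id;

-- Inspect the plan; the users lookup should appear with type = eq_ref
EXPLAIN
SELECT a.id, u.username
FROM activity a
JOIN users u ON u.id = a.activity_by_user_id;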
Basically, denormalization optimizes one type of query, at the expense of other types of queries. The extra work you do to prevent, detect, and correct data anomalies may be greater than any efficiency you gain by avoiding the join in this case. Or if usernames were to change sometimes, you'd have to change them in two places now (I know you said usernames don't change in your app).
The point is that it depends entirely on how frequently different queries are run by your application, so it's not something anyone can answer for you.
* That might explain why some people are downvoting your question -- some people on Stack Overflow seem to have a rather strict idea of what a "valid" question is. I have seen questions closed or even deleted because they are too subjective and opinion-based. But I have also seen questions deleted because the answer is too "obvious". One of my answers with 100 upvotes was lost because a moderator thought the question "Do I really need version control if I work solo?" was invalid. Go figure. I copied that one to my blog here.
I think it is a bad idea. Databases are optimized for joins (assuming you did your job and indexed correctly), and denormalized data is notoriously hard to maintain. There may be no username changes now, but can you guarantee that for the future? No. Risking your data integrity on such a thing is short-sighted at best.
Only denormalize in rare cases where there is an existing performance problem and other optimization techniques have failed to improve the situation. Denormalizing isn't even always going to get you a performance improvement: as the tables get wider, it may even slow things down. So don't do it unless you have a measurable performance problem, and measure to ensure the denormalization actually helps. It is the last optimization technique to try; if you haven't gone through the very large list of other possibilities first, denormalization should not be an option.
No. This goes against all principles of data normalization.
And it won't even be that difficult (if I'm interpreting what you mean by id, user_id, and user_name correctly); id will be the primary key tying everything together - and the linchpin of your JOINs. So you'll have your main table with id, other activity col, next activity col, etc. (not sure what you mean by activity), then a 2nd table with just id and user_id, and a third with id and username. And when you want to output whatever you're going to output, and do it by user_id or username, you'll just JOIN (implied join syntax - WHERE table1.id = table2.id).

CakePHP Database - MyISAM, InnoDB, or Postgresql [closed]

I've always just used MyISAM for all of my projects, but I am looking for a seasoned opinion before I start this next one.
I'm about to start a project that will be dealing with hundreds of thousands of rows across many tables. (Several tables may even have millions of rows as the years go on). The project will primarily need fast-read access because it is a Web App, but fast-write obviously doesn't hurt. It needs to be very scalable.
The project will also be dealing with sensitive and important information, meaning it needs to be reliable. MySQL seems to be notorious for ignoring validation.
The project is going to be using CakePHP as a framework, and I'm fairly sure it supports MySQL and Postgresql equally, but if anyone disagrees with me on that, please let me know.
I was tempted to go with InnoDB, but I've heard it has terrible performance. Postgresql seems to be the most reliable, but also is not as fast as MyISAM.
If I were able to upgrade the server's version of MySQL to 5.5, would InnoDB be a safer bet than Postgres? Or is MyISAM still a better fit for most needs and more scalable than the others?
The only answer that this really needs is "not MyISAM". Not if you care about your data. After all, /dev/null has truly amazing performance, but it doesn't meet your reliability requirement either ;-)
The rest is the usual MySQL vs PostgreSQL debate that we close every time someone asks a new flavour of it, because it really doesn't lead to much that's useful.
What's way more important than your DB choice is how you use it:
Do you cache commonly hit data that can afford to be a little stale in something like Redis or Memcached?
Do you avoid "n+1" selects from inefficient ORMs in favour of somewhat sane joins?
Do you avoid selecting lots of data you don't need?
Do you do selective cache invalidation (I use LISTEN and NOTIFY for this), or just flush the whole cache when something changes?
Do you minimize pagination, and when you must paginate, do so based on the last-seen ID rather than an offset? SELECT ... FROM ... WHERE id > ? ORDER BY id LIMIT 100 can be immensely faster than SELECT ... FROM ... ORDER BY id OFFSET ? LIMIT 100. (A sketch follows after this list.)
Do you monitor query performance and hand-tune problem queries, create appropriate indexes, etc?
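A minimal sketch of the last-seen-ID approach, assuming a hypothetical events table with an auto-increment primary key id and 100000 as the last id the client saw:

-- Offset pagination: the server still walks and discards the first 100000 rows
SELECT id, title FROM events ORDER BY id LIMIT 100 OFFSET 100000;

-- Keyset pagination: seek directly past the last-seen id using the primary key
SELECT id, title FROM events WHERE id > 100000 ORDER BY id LIMIT 100;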
(Marked community wiki because I close-voted this question and it seems inappropriate to close-vote and answer unless it's CW).

Many Associations Leading to Slow Query [closed]

I currently have a database with a lot of many-to-many associations. I have services, which have many variations, which have many staff who can perform the variation, who in turn have details like name, role, etc.
With 10 services, 3 variations each, and up to 4 out of 20 staff attached to each service, even something as simple as getting all variations and the staff associated with them takes 4s.
Is there a way I can reduce these queries that take a while to process? I've cut down the number of queries by doing eager loading in my DBM to reduce the problems that arise from N+1 issues, but 4s is still a long query time for just a testing stage.
Is there a structure out there that would help make such nested many to many associations much quicker to select?
Maybe combining everything past the service level into a single table with a 'TYPE' column? I'm just not knowledgeable enough to know the solution that turns this 4s query into a 300ms query... Any suggestions would be helpful.
It may be possible to restructure the data to make queries more efficient. This usually implies a trade-off with redundancy (repeated values), which can overly complicate the algorithms for insert/update/delete.
Without seeing the schema, and the query (queries?) you are running, it's impossible to diagnose the problem.
I think the most likely explanation is that MySQL does not have suitable indexes available to efficiently satisfy the query (queries?) being run. Running an EXPLAIN query can be useful to show the access path and give insight into whether suitable indexes are available, whether indexes are even being considered, whether statistics are up to date, etc.
But you also mention "N+1" performance issues and "eager loading", which leads me to believe that you might be using an ORM (like ADO Entity Framework, Hibernate, etc.). These are notorious sources of performance issues, either issuing lots of SQL statements (N+1) or doing a single query that joins down several distinct paths, producing a humongous result set where the query is essentially doing a semi cross join.
To really diagnose the performance issue, you would need the actual SQL statements being issued; in a development environment, enabling the MySQL general log will capture the SQL being issued along with rudimentary timing.
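As a sketch of doing that in a development environment (these are standard MySQL settings; they require appropriate privileges and should not be left on in production):

-- Send the general query log to a table instead of a file (dev only)
SET GLOBAL log_output = 'TABLE';
SET GLOBAL general_log = 'ON';

-- Run the application workload, then inspect what was actually issued
SELECT event_time, argument
FROM mysql.general_log
ORDER BY event_time DESC
LIMIT 50;

-- Turn it back off when done
SET GLOBAL general_log = 'OFF';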
The table schemas would be nice to see for this question. As far as MySQL performance in general goes, make sure you research disk alignment and set the proper block sizes; for this particular issue, check your execution plans and evaluate adding indexes.
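To make the indexing advice concrete, here is a hedged sketch assuming hypothetical junction tables for the associations described in the question (the names are made up):

-- Junction table linking services to their variations
CREATE TABLE service_variation (
    service_id   INT NOT NULL,
    variation_id INT NOT NULL,
    PRIMARY KEY (service_id, variation_id),
    KEY idx_variation_service (variation_id, service_id) -- supports lookups from the variation side
);

-- Junction table linking variations to the staff who can perform them
CREATE TABLE variation_staff (
    variation_id INT NOT NULL,
    staff_id     INT NOT NULL,
    PRIMARY KEY (variation_id, staff_id),
    KEY idx_staff_variation (staff_id, variation_id) -- supports lookups from the staff side
);

With composite keys in both directions, EXPLAIN should show index lookups rather than full scans when traversing the associations either way.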

MySql views performance [closed]

If you are going down the road of using views, how can you ensure good performance?
Or is it better not to use views in the first place and just incorporate the equivalent into your select statements?
It Depends.
It totally depends on what you are viewing through the view, but most probably it reduces your effort and gives good performance. When a SQL statement references a non-indexed view, the parser and query optimizer analyze the source of both the SQL statement and the view and then resolve them into a single execution plan. There is not one plan for the SQL statement and a separate plan for the view.
A view is not compiled. It's a virtual table made up of other tables. When you create it, it doesn't reside somewhere on your server. The underlying queries that make up the view are subject to the same performance gains or dings from the query optimizer. I've never tested the performance of a view vs. its underlying query, but I would imagine the performance may vary slightly. You can get better performance with an indexed view if the data is relatively static. This may be what you are thinking of in terms of "compiled".
Advantages of views:
View the data without storing it in a separate object.
Restrict the view of a table, i.e. hide some of the columns in the tables.
Join two or more tables and show the result as one object to the user.
Restrict access to a table so that nobody can insert rows into it.
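A minimal sketch of the joining and column-hiding points, assuming hypothetical orders and customers tables:

-- One object that joins two tables and exposes only non-sensitive columns
CREATE VIEW order_summary AS
SELECT o.id, o.order_date, c.name AS customer_name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- Consumers query the view like a table
SELECT * FROM order_summary WHERE order_date >= '2024-01-01';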
See these useful links:
Performance of VIEW vs. SQL statement
Is a view faster than a simple query?
Mysql VIEWS vs. PHP query
Are MySql Views Dynamic and Efficient?
Materialized View vs. Tables: What are the advantages?
Is querying over a view slower than executing SQL directly?
A workaround for the performance problems of TEMPTABLE views
See performance gains by using indexed views in SQL Server
Here's a tl;dr summary; you can find detailed evaluations from Peter Zaitsev and elsewhere.
Views in MySQL are generally a bad idea. At Grooveshark we consider them to be harmful and always avoid them. If you are careful you can make them work but at best they are a way to remember how to select data or keep you from having to retype complicated joins. At worst they can cause massive inefficiencies, hide complexity, cause accidental nested subselects (requiring temporary tables and leading to disk thrashing), etc.
It's best to just avoid them, and keep your queries in code.
I think the blog by Peter Zaitsev has most of the details. Speaking from personal experience, views can perform well if you generally keep them simple. At one of my clients they kept layering one view on top of another, and it ended up in a performance nightmare.
Generally I use views to show a different aspect of a table. For example, in my employees table, show me the managers, or hide the salary field from non-HR employees. Also, always make sure you run an EXPLAIN on both the query and the view to understand exactly what is happening inside MySQL (see the sketch after this answer).
If you want solid proof for your scenario, I would suggest that you test. It is really hard to say that using views is always a performance killer; then again, a badly written view is probably going to kill your performance.
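As a hedged sketch of the employees example and the EXPLAIN advice above (the table and columns are assumptions, not taken from a real schema):

-- A view that hides the salary column and shows only managers
CREATE VIEW managers AS
SELECT id, name, department
FROM employees
WHERE is_manager = 1;

-- Compare the plan of the view with the plan of the equivalent direct query
EXPLAIN SELECT id, name, department FROM managers WHERE department = 'HR';
EXPLAIN SELECT id, name, department FROM employees WHERE is_manager = 1 AND department = 'HR';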
They serve their purpose, but the hidden complexities and inefficiencies usually outweigh a more direct approach. I once encountered a SQL statement that was joining two views and sorting the results. The views were sorting as well, so the execution time could be measured in what seemed like hours.
One thing not mentioned so far that makes a huge difference is adequate indexing of the views' source tables.
As mentioned above, views do not reside in your DB but are rebuilt every time they are queried. Thus everything that makes the rebuild easier for the DB increases the performance of the view.
Often, views join data in a way that is very bad for storage (not in normal form) but very good for further usage (doing analysis, presenting data to the user, ...), joining and aggregating data from different tables along the way.
Whether or not the columns the operations are performed on are indexed makes a huge difference to the performance of a view. If the tables and their relevant columns are already indexed, accessing the view does not require recomputing everything from scratch over and over again. (On the downside, that index maintenance happens whenever data is manipulated in the source tables.)
! Index all columns used in JOINS and GROUP BY clauses in your CREATE VIEW statement !
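A minimal sketch of that rule, assuming a hypothetical view that joins orders to customers and groups by customer:

-- A view that joins and aggregates across two source tables
CREATE VIEW sales_per_customer AS
SELECT c.id, c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;

-- Index the join column in the source table; customers.id is already covered by its primary key
CREATE INDEX idx_orders_customer_id ON orders (customer_id);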
If we are discussing "if you use views, how do you ensure good performance", and not the performance effect of views in general, I think it boils down to restraint (on your own part).
You can get into big trouble if you just write views to make your queries simple in all cases but do not take care that your views are actually useful performance-wise. Any queries you end up running should execute sanely (see the comment example from that link by #eggyal). Of course that's a tautology, but it is not any less valuable for it.
You especially need to be careful not to build views from other views just because that makes the new view easier to write.
In the end you need to look at the reason you are using views. Any time you do this to make life easier on the programming end, you might be better off with a stored procedure, IMHO.
To keep things under control, you might want to write down why you have a certain view and why you are using it. For every 'new' use within your programming, recheck whether you actually need the view, why you need it, and whether it would still give you a sane execution path. Keep checking your uses to keep things speedy, and keep checking whether you really need that view.