Are views in MySQL quicker than complex queries? [duplicate] - mysql

This question already has answers here:
MYSQL View vs Select Performance and Latency
I have a problem with a SELECT with multiple inner joins. My code is as follows:
SELECT `movies02`.`id`, `movies02`.`title`,
`movies03`.`talent`,
`movies07`.`character`,
`movies05`.`genre`
FROM `movies02`
INNER JOIN `movies07` ON `movies07`.`movie` = `movies02`.`id`
INNER JOIN `movies03` ON `movies03`.`id` = `movies07`.`performer`
INNER JOIN `movies08` ON `movies08`.`genre` = `movies05`.`id`
INNER JOIN `movies02` ON `movies08`.`movie` = `movies02`.`id`;
Doing an INNER JOIN to get the actors in the movie, as well as the characters they play, seems to work, but the second two joins, which get the movie genre, don't work. So I figure I can just write them as VIEWs and then combine them when I output the results. I would therefore end up with three VIEWs: one to get the genres, one to get the actors and characters, and one to put everything together. The question is whether it is better to do that than one massive SELECT with multiple joins.
I tried rewriting the query a bunch of times and in multiple ways.

When you do a query involving views, MySQL / MariaDB's query planner assembles all the views and your main query into a single query before working out how to access your tables. So, performance is roughly the same when using views, Common Table Expressions, and/or subqueries.
That being said, views are a useful way of encapsulating some query complexity.
And, you can grant a partly-trusted user access to a view without granting them access to the underlying tables.
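For example (just a sketch; the view name, database name, and user are made up, and I'm guessing at which columns belong to which table from your query):
CREATE VIEW movie_credits AS
SELECT m.id, m.title, t.talent, r.`character`
FROM movies02 m
INNER JOIN movies07 r ON r.movie = m.id
INNER JOIN movies03 t ON t.id = r.performer;
-- the reporting user (assuming it already exists) can read the view but not the base tables
GRANT SELECT ON moviedb.movie_credits TO 'report_user'@'localhost';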
The downside of views is the same as the downside of putting any application logic into your DBMS rather than in your application: it's trickier to update, and easier to forget to update. (This isn't relevant if you have a solid application-update workflow that updates views, stored functions, and stored procedures as it updates your application code.)
As for the query itself, a good way to write queries like this is to start with the table containing the "top-level" entity; in your case I think it's the movie. Then LEFT JOIN the other tables rather than INNER JOINing them. That way you'll still see the movie in your results even when some of its subsidiary entities (performer, genre, I guess) are missing.
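Applied to your tables, it might look something like this (a sketch only; I'm assuming movies08 is a movie-to-genre link table and movies05 holds the genre names, which is what your query seems to intend):
SELECT m.id, m.title, t.talent, r.`character`, g.genre
FROM movies02 m
LEFT JOIN movies07 r ON r.movie = m.id        -- character / role rows
LEFT JOIN movies03 t ON t.id = r.performer    -- performers
LEFT JOIN movies08 mg ON mg.movie = m.id      -- movie-to-genre links
LEFT JOIN movies05 g ON g.id = mg.genre;      -- genre names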
Pro tip: If you can, name your tables for the entities they contain (movie, genre, actor, etc) rather than using names like whatever01, whatever02 ... It's really important to be able to look at queries and reason about them, and naming the tables makes that easier.

Views are just syntactic sugar for queries. When you include a view in a query, the engine reads the view's definition and combines it into the query.
They are useful to make queries easier to read and to type.
On the flip side, they can be detrimental to the query performance when naïve developers use them indiscriminately and end up producing queries that become unnecessarily complex behind the scenes. Use them with care.
Now, materialized views are a totally different story, since they are pre-computed and refreshed at specific times or events. They can be quite fast to query since they can be indexed, but on the flip side their refresh interval means they may be showing data that is not 100% up to date.
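MySQL and MariaDB don't ship native materialized views, but the idea can be emulated with an ordinary, indexable summary table that you rebuild on a schedule. A rough sketch reusing the tables from the question (the summary table name and column sizes are made up):
CREATE TABLE movie_genre_summary (
  movie_id INT UNSIGNED NOT NULL,
  genre VARCHAR(100) NOT NULL,
  PRIMARY KEY (movie_id, genre)
);
-- refresh step, run from cron or the event scheduler
TRUNCATE movie_genre_summary;
INSERT INTO movie_genre_summary (movie_id, genre)
SELECT mg.movie, g.genre
FROM movies08 mg
INNER JOIN movies05 g ON g.id = mg.genre;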

Related

MySQL JOIN vs LIKE - faster selects?

Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable properties like length, actors, directors, writers, genres, etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that, for each movie, all the above-mentioned parameters are stored as comma-separated values. Therefore I have so far used a query that covers all these things by using LIKE statements. The query can become quite large, as I pretty much query for every parameter within the table, sometimes with 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I have started my first job, and to teach myself in my free time, I want to work on my own website. Because I have no real concept of what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now, to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else, which essentially means that having the same name will be treated as being the same actor.
Now here comes my actual question:
Assuming I want my SELECT queries to run faster, would it make sense to run a script that splits the comma-separated strings into extra tables (these are n-to-m relations, see my attempt below) and then JOIN all these tables (there will be 8 or more of them), or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster SELECT queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind, I would still have to create a relation table between movies and each of these tables. After doing that, I could remove the columns from the movie table and would end up having to join a lot of tables with EACH query. The only real advantage I can see here is that it would be easier to create an index on the individual tables, rather than one (or a few) covering the one big movie table.
I hope all of this even makes sense to you. I appreciate any answer short or long, like I said this is mostly for self studies and as such, I don't have/need a real business model.
I don't understand what you currently have. It seems that you only showed the sizes of the tables but not their internal structure. You need to separate the data into separate tables using normalization rules and then add the correct indexes. Indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE for your queries? Please also post the query itself; I cannot guess it from the result. There are a lot of optimization videos on YT.
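For example, splitting the comma-separated actor column into its own tables might look roughly like this (a sketch; the table and column names are made up since the real structure wasn't posted):
CREATE TABLE actor (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_actor_name (name)  -- same name = same actor, per the question
);
CREATE TABLE movie_actor (  -- the n-to-m relation between movies and actors
  movie_id INT UNSIGNED NOT NULL,
  actor_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (movie_id, actor_id),
  KEY idx_actor_movie (actor_id, movie_id)
);
-- "movies similar to movie 42 by shared actors" then becomes an indexed join
-- instead of a pile of LIKE '%...%' scans:
SELECT ma2.movie_id, COUNT(*) AS shared_actors
FROM movie_actor ma1
INNER JOIN movie_actor ma2
  ON ma2.actor_id = ma1.actor_id AND ma2.movie_id <> ma1.movie_id
WHERE ma1.movie_id = 42
GROUP BY ma2.movie_id
ORDER BY shared_actors DESC
LIMIT 15;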

large single join queries vs multiple smaller ones

So we are building this app, where the retrieval of data is based on small, modular queries. So for a product it would be something like:
$product = $this->product->getProductData($prod_id); //get main product record
$locations = $this->locations->getAvailableLocations($prod_id); //sale locations
$comments = $this->feedback->getFeedback($prod_id,'COMMENTS'); //user comments
On the other hand we could also do something like: $this->getAllProductData($id)
which would essentially run SQL along the lines of:
SELECT * FROM product_data
LEFT JOIN locations ON <...>
LEFT JOIN comments ON <...>
From a programming perspective, the first option makes it much easier for us to handle data, mix and match, build separate flows/user experiences, etc. Our concern is: from a performance perspective, would this become an issue when the products run into hundreds of thousands of rows?
There's overhead associated with each execution of a SQL statement: packets sent to the server, SQL text parsed, the statement verified to be syntactically correct (keywords, commas, parens, etc.), the statement verified to be semantically correct (the identifiers reference tables, columns, functions, et al. that exist, and the user has sufficient privileges), possible execution plans evaluated and the optimum plan chosen, the plan executed (obtaining locks, accessing data in buffers, etc.), the resultset materialized (metadata and values) and returned to the caller, locks released, resources cleaned up, and so on. On the client side, there's the overhead of retrieving the resultset, fetching rows, and closing the statement.
In general, it's more efficient to retrieve the actual data that is needed with fewer statements, but not if that entails returning a whole slew of information that's not needed. If we only need 20 rows, then we add LIMIT 20 to the query. If we only need rows for a particular product_id, we add WHERE product_id = 42.
When we see tight loops of repeated execution of essentially the same statement, that's a telltale sign that the developer is processing data RBER (row by excruciating row) rather than as a set.
Bottom line, it depends on the use case. Sometimes, it's more efficient to run a couple of smaller statements in place of one humongous statement.
Use your second example (all joins in one query). As long as you have an index on "prod_id" and anything else you're filtering on or joining on, the database query optimizer will do smart things, such as seeing that prod_id will only return a few records and that doing that first will make the query about as fast as it could possibly be. Query Optimizers are very, very good at this in general.
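For example (a sketch; the table and column names are guesses based on the method names in the question):
SELECT p.*, l.*, c.*
FROM product_data p
LEFT JOIN locations l ON l.prod_id = p.prod_id
LEFT JOIN comments c ON c.prod_id = p.prod_id
WHERE p.prod_id = ?;
-- supporting indexes, so each join is an index lookup rather than a scan
CREATE INDEX idx_locations_prod ON locations (prod_id);
CREATE INDEX idx_comments_prod ON comments (prod_id);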
For simple joins you are probably fine with one SQL statement, but in my experience, multiple separate queries are better for performance when more tables are involved.
I worked on a website where the product information was scattered across seven different tables, and usually only two or three of them needed to be joined. However, on one page we had a complicated search function that needed to look at all seven, so on that page we wrote the code to join all the tables in one statement. It worked fine until, a few months later, it gradually got slower and slower until it wouldn't load at all.
We went through all the tables and made sure everything was indexed properly, and nothing was fixing it. We noticed that the SQL statements all ran fine by themselves, so we ended up splitting the work into separate statements, and that fixed it; we haven't had to go back and look at it since.

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table, and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunately, I need to join quite a few of them: a couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins in there and, as there should be exactly one result displayed per ID, it also works with rather complex subqueries and GROUP BY clauses. It also gets sorted according to a user-set sort method, and there's pagination in play as well, done with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" on this and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins), using ... WHERE item_id IN (...), passing the list of "valid" IDs gathered in the first query.
Note: Obviously the IN () could actually contain the first query in full instead of relying on PHP to build up a comma-separated list.
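In other words, something roughly like this (a sketch; the table and column names are placeholders, not the real schema):
-- step 1: filter only, no display joins
SELECT DISTINCT i.item_id
FROM items i
INNER JOIN some_filter_table f ON f.item_id = i.item_id
WHERE f.some_criterion = ?;
-- step 2: fetch display data only for the surviving IDs
SELECT i.item_id, d.display_col
FROM items i
INNER JOIN some_display_table d ON d.item_id = i.item_id
WHERE i.item_id IN (1, 17, 42)  -- list built in PHP, or inline the step-1 query here
ORDER BY i.some_sort_column
LIMIT 0, 25;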
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there, where I need to compare the search parameter not only against the item's own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that using the WHERE IN(...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first. Reduce down the simple main table, then join onto that. Make sure your most complex joins are saved to the end to minimize the rows required to search. Try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINS where possible.
But I agree with Andomar, if you have the time build both and measure.

Rails and queries with complex joins: Can each joined table have an alias?

I'm developing an online application for education research, where I frequently have the need for very complex SQL queries:
queries usually include 5-20 joins, often joining the same table several times
the SELECT list often ends up being 30-40 lines long, between derived fields / calculations and CASE statements
extra WHERE conditions are added in the PHP, based on user's permissions & other security settings
the user interface has search & sort controls to add custom clauses to the WHERE / ORDER / HAVING clauses.
Currently this app is built on PHP + MySQL + jQuery for the moving parts. (This grew out of old Dreamweaver code.) Soon we are going to rebuild the application from scratch, with the intent to consolidate, clean up, and be ready for future expansion. While I'm comfortable in PHP, I'm learning bits about Rails and realizing that maybe it would be better to build version 2.0 on a more modern framework instead. But before I can commit to hours of tutorials, I need to know if the Rails querying system (ActiveRecord?) will meet our query needs.
Here's an example of one query challenge I'm concerned about. A query must select from 3+ "instances" of a table, and get comparable information from each instance:
SELECT p1.name AS my_name, pm.name AS mother_name, pf.name AS father_name
FROM people p1
JOIN people pm ON p1.mother_id = pm.id
JOIN people pf ON p1.father_id = pf.id
# etc. etc. etc.
WHERE p1.age BETWEEN 10 AND 16
# (selects this info for 10-200 people)
Or, a similar example, more representative of our challenges. A "raw data" table joins multiple times to a "coding choices" table, each instance of which in turn has to look up the text associated with a key it stores:
SELECT d.*, c1.coder_name AS name_c1, c2.coder_name AS name_c2, c3.coder_name AS name_c3,
(c1.result + c2.result + c3.result) AS result_combined,
m_c1.selection AS selected_c1, m_c2.selection AS selected_c2, m_c3.selection AS selected_c3
FROM t_data d
LEFT JOIN t_codes c1 ON d.id = c1.data_id AND c1.category = 1
LEFT JOIN t_menu_choice m_c1 ON c1.menu_choice = m_c1.id
LEFT JOIN t_codes c2 ON d.id = c2.data_id AND c2.category = 2
LEFT JOIN t_menu_choice m_c2 ON c2.menu_choice = m_c2.id
LEFT JOIN t_codes c3 ON d.id = c3.data_id AND c3.category = 3
LEFT JOIN t_menu_choice m_c3 ON c3.menu_choice = m_c3.id
WHERE d.date_completed BETWEEN ? AND ?
AND c1.coder_id = ?
These sorts of joins are straightforward to write in pure SQL, and when search filters and other varying elements are needed, a couple PHP loops can help to cobble strings together into a complete query. But I haven't seen any Rails / ActiveRecord examples that address this sort of structure. If I'll need to run every query as pure SQL using find_by_sql(""), then maybe using Rails won't be much of an improvement over sticking with the PHP I know.
My question is: does ActiveRecord support cases where tables need "nicknames", such as in the queries above? Can the primary table have an alias too (in my examples, "p1" or "d")? How much control do I have over which fields are selected in the SELECT statement? Can I create aliases for selected fields? Can I do calculations and select derived fields in the SELECT clause? How about CASE statements?
How about setting WHERE conditions that specify the joined table's alias? Can my WHERE clause include things like (using the top example) " WHERE pm.age BETWEEN p1.age AND 65 "?
This sort of complexity isn't just an occasional bizarre query, it's a constant and central feature of the application (as it's currently structured). My concern is not just whether writing these queries is "possible" within Rails & ActiveRecord; it's whether this sort of need is supported by "the Rails way", because I'll need to be writing a lot of these. So I'm trying to decide whether switching to Rails will cause more trouble than it's worth.
Thanks in advance! - if you have similar experiences with big scary queries in Rails, I'd love to hear your story & how it worked out.
The short answer is yes. Rails takes care of a large part of these requirements through various types of relations, scopes, etc. The most important thing is to properly model your application to support the types of queries and functionality you are going to need. If something is difficult to explain to a person, it will generally be very hard to do in Rails. Rails is optimized to handle most "real world" types of relationships and tasks, so "exceptions" become somewhat difficult to fit into its conventions, and later become harder to maintain, manage, develop further, decouple, etc. The bottom line is that Rails can build the SQL query for you (SomeObject.all_active_objects_with_some_quality), give you complete control over the SQL (SomeObject.find_by_sql("select * from ..."), execute("update blah set something=''...")), and everything in between.
One of the advantages of Rails is that it lets you quickly create prototypes. I would create your model concepts and then test the most complex business requirements that you have. This will give you a quick idea of what is possible and easy to do versus the bottlenecks and potential issues that you might face in development.

Using Linq-to-SQL preload without joins

Like many people, I am trying to squeeze the best performance out of my app while keeping the code as simple and readable as possible. I am using Linq-to-SQL and am really trying to keep my data layer as declarative as possible.
I operate on the assumption that SQL calls are the most expensive operations. Thus, I try to minimize them in quantity, but try to avoid crazy complex queries that are hard to optimize.
Case in point: I am using DataLoadOptions with my DataContext -- its goal is to minimize the quantity of queries by preloading related entities. (Aka, eager loading vs lazy loading.)
Problem: Linq uses joins to achieve the goal. As with everything, it's a trade-off. I am getting fewer queries, but those joined queries are more complex and expensive. Going into SQL Profiler makes this clear.
So, I'd like an option in Linq to preload without joins. Is this possible? Here's what it might look like:
I have a Persons table, an Items table, and a PersonItems table to provide a many-to-many relationship. When loading a collection of Persons, I'd like to have all their PersonItems and Items eagerly loaded as well.
Linq currently does this with one large query, containing two joins. What I'd rather it do is three non-join queries: one for Persons, one for all the PersonItems relating to those Persons, and one for all Items relating to those PersonItems. Then Linq would automagically arrange them into the related entities.
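In SQL terms, the three statements would look roughly like this (a sketch; the table names come from the example above, the key columns and filter are assumptions):
-- 1: the Persons of interest
SELECT * FROM Persons WHERE LastName = 'Smith';
-- 2: all PersonItems for those Persons
SELECT * FROM PersonItems WHERE PersonId IN (1, 2, 3);  -- IDs from query 1
-- 3: all Items referenced by those PersonItems
SELECT * FROM Items WHERE ItemId IN (10, 11, 12);       -- IDs from query 2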
Each of these would be a fast, firehose-type query. Over the long haul, it would allow for predictable, web-scale performance.
Ever seen it done?
I believe what you describe, where three non-join queries are done, is essentially what happens under the hood when a single join query is performed. I could be wrong, but if that is the case the single query will be more efficient, as only one database query is involved as opposed to three. If you are having performance issues, I'd make sure the columns you are joining on are indexed (you should see no table scans in SQL Profiler). If that is not enough, you could write a custom stored procedure to get just the data you need (assuming you don't need every column of every object, this will let you make use of index seeks, which are faster than index scans), or alternatively you could denormalise (duplicate data across your tables) so that no joining occurs at all.
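If the junction table's join columns aren't indexed yet, that would be something along these lines (a sketch; the index and column names are assumed, based on the Persons/PersonItems/Items example above):
CREATE INDEX IX_PersonItems_PersonId ON PersonItems (PersonId);
CREATE INDEX IX_PersonItems_ItemId ON PersonItems (ItemId);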