Best approach to construct complex MySQL joins and groups?

I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.

I don't work in MySQL, but I do frequently write extremely complex SQL, and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attended and the speaker information.
First I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how as yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order, although I prefer to work through the parts that should have one-to-one relationships first, so next I'll add the joins and the fields that will get me all the sales-rep-associated information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one), so I check to make sure that I'm still returning the same number of records as when I just had meeting information. If not, I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table, as we are storing multiple addresses for the rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want, and so only need to add a where condition), or you may need to do some grouping and aggregate functions to get what you want.
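As a sketch of that stage (the table, column, and flag names here are hypothetical, just to show the shape of the query):
SELECT m.meeting_id, m.meeting_name, m.venue, m.start_date, m.end_date,
       r.rep_name, a.city, a.state
FROM meetings m
JOIN sales_reps r ON r.rep_id = m.main_rep_id
JOIN rep_addresses a ON a.rep_id = r.rep_id
WHERE a.is_primary = 1;  -- the flag that picks exactly one address per rep
If no such flag exists, you might instead join against a small derived table that picks one address per rep (for example MIN(address_id) grouped by rep_id).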
Then I go on to the next chunk (working first through all the chunks that should have a 1-1 relationship to the central data, in this case the meeting). Run the query and check the data after each addition.
Finally I move to those records which might have a one-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose that in one of these additions of a join I find the number of distinct meetings has dropped. Oops: there is no data in one of the tables I just added, and I need to change that to a left join.
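Concretely (hypothetical names again), an inner join such as
JOIN speakers s ON s.meeting_id = m.meeting_id
silently drops any meeting that has no speaker rows yet; rewriting it as
LEFT JOIN speakers s ON s.meeting_id = m.meeting_id
keeps those meetings in the result, with NULLs in the speaker columns.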
Another time I may find too many records returned. Then I look to see if my where clause needs more filtering info, or if I need to use an aggregate function to get the data I need. Sometimes I will add other fields to the report temporarily to see what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model and check the data after every new chunk is added to make sure it is returning the results the way you think they should be.
Sometimes, if I'm returning a lot of data, I will temporarily put an additional where clause on the query to restrict it to a few items I can easily check. I also strongly suggest the use of ORDER BY, because it will help you see if you are getting duplicated records.
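As a sketch (the ids and names are again hypothetical), that temporarily restricted version of the query might look like:
SELECT m.meeting_id, m.meeting_name, r.rep_name
FROM meetings m
JOIN sales_reps r ON r.rep_id = m.main_rep_id
WHERE m.meeting_id IN (1001, 1002)  -- temporary filter for spot-checking
ORDER BY m.meeting_id, r.rep_name;  -- makes duplicated rows easy to spot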

The best approach to break down your MySQL query is to run the EXPLAIN command, and to read the MySQL documentation on optimization with EXPLAIN.
MySQL provides some great free GUI tools as well; the MySQL Query Browser is what you need to use.
Running EXPLAIN breaks down how MySQL interprets your query and shows where the complexity lies. It might take some time to decode the output, but that's another question in itself.
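For example, prefixing a SELECT with EXPLAIN (the query below is just an illustration with made-up tables) shows the access plan MySQL chooses for each table:
EXPLAIN
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01';
Each row of the output describes one table in the join: the join type, the possible and chosen keys, and a rough row estimate.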
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More

I haven't used them myself so can't comment on their effectiveness, but perhaps a GUI based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).

Related

Smart Queries That Deal With NULL Values

I currently inherited a table similar to the one in the image below. I don't have the resources to do what should be done in the allotted time, which is obviously to normalize the data by breaking it into a few smaller tables to eliminate redundancy, etc.
My current idea for a short-term solution is to create a query for each product type and store it in a new table based on ParentSKU. In the image below, a different query would be necessary for each of the 3 example ParentSKUs. This will work okay, but if new attributes are added to a SKU the query needs to be adjusted manually. What would be ideal in the short term (but probably not very likely) is to be able to come up with a query that would only include and display attributes where there weren't any NULL values. The desired results for each of the three ParentSKUs would be the same as they are in the examples below. If there were only 3 queries total, that would be easy enough, but there are dozens of combinations based on the products and categories of each product.
I'm certainly not the man for the job, but there are scores of people way smarter than I am that frequent this site every day that may be able to steer me in a better direction. I realize I'm probably asking for the impossible here, but as the saying goes, "There are no stupid questions, only ill-advised questions that deservedly and/or inadvertently draw the ire of StackOverflow users for various reasons." Okay, I embellished a tad, but you get my point...
I should probably add that this is currently a MySQL database.
Thanks in advance to anyone that attempts to help!
First create SKUTypes with the result of:
SELECT ParentSKU, COUNT(Attr1) AS Attr1, ...
FROM tbl_attr
GROUP BY ParentSKU;
Then write a script that generates an SQL query for each row of SKUTypes, selecting every AttrN column whose count is > 0.
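A rough sketch of that idea, assuming SKUTypes is materialized from the query above and that the attribute table also has a SKU column (that column name and the example ParentSKU are assumptions):
CREATE TABLE SKUTypes AS
SELECT ParentSKU, COUNT(Attr1) AS Attr1, COUNT(Attr2) AS Attr2, COUNT(Attr3) AS Attr3
FROM tbl_attr
GROUP BY ParentSKU;
-- generated for one ParentSKU whose SKUTypes row has non-zero counts only for Attr1 and Attr2
SELECT SKU, Attr1, Attr2
FROM tbl_attr
WHERE ParentSKU = 'ABC-123';
The script simply reads each SKUTypes row, keeps the AttrN columns with a count above zero, and emits a SELECT listing only those columns.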

Ranking search results

I have a tutoring website with a search feature. I want tutors to appear on the list according to several weighted criteria, including whether or not they are subscription holders, if they have submitted a profile photo, if they have included a lot of information about themselves, etc...
Basically, I have a lot of criteria by which I would like to weigh their rank.
Instead of writing a complicated SQL query with multiple ORDER BYs (if this is even possible), I was thinking of creating a table (maybe a temporary one), that assigns numerical values based on several criteria to come up with a final search rank.
I'm not entirely sure about how to go about this, or if this is a good idea, so I would like to know what the community thinks about a) this method, and b) possible ways of implementing this in SQL.
I would add a field to one of the existing tables that more or less represents their "weight" for sorting purposes. I would then populate this column with a database procedure that ran every so often (you could make a queue that only runs on records that have been updated, or just run it on all records if you want). That way, I can just pull back the data and order by one column instead of multiple ones.
Also, you could use a View. It really depends on whether you want the number crunching to be done by the procedure or by the database every time you pull data (for a search feature, and for speed's sake, I'd suggest the database procedure).
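A minimal sketch of that idea, assuming a tutors table; the column names and the weights themselves are made up for illustration:
ALTER TABLE tutors ADD COLUMN search_weight INT NOT NULL DEFAULT 0;
-- run periodically, e.g. from a stored procedure or scheduled event
UPDATE tutors
SET search_weight =
      (is_subscriber * 50)                                    -- subscription holders rank higher
    + (has_profile_photo * 20)                                -- profile photo submitted
    + (LEAST(CHAR_LENGTH(COALESCE(bio, '')), 1000) DIV 100);  -- up to 10 points for a fuller profile
-- the search itself then orders by the single precomputed column
SELECT tutor_id, name
FROM tutors
WHERE subject = 'maths'
ORDER BY search_weight DESC;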

Most efficient way to select lots of values from database

I have a database table with around 2400 'items'. A user will have any arbitrary combination of items from the 2400 item set. Each item's details then need to be looked up and displayed to the user. What is the most efficient way to get all of the item details? I can think of three methods:
Select all 2400 items and parse the required details client-side.
Select the specific items with a SELECT which could be a very long SQL string (0-2400 ids)?
Select each item one at a time (too many connections)?
I'm not clued up on SQL efficiency so any guidance would help. It may help to know this is a web app heavily AJAX based.
Edit: On average a user will select ~150 items and very rarely more than 400-500.
The best method is to return the data you want from the database in a single query:
select i.*
from items i
where i.itemID in (<list of ids>);
MySQL queries can be quite large (see here), so I wouldn't worry about being able to pass in the values.
However, if your users have so many items, I would suggest storing them in the database first and then doing a join to get the additional information.
If the users never/rarely select more than ~50 elements, then I agree with Gordon's answer.
If it is really plausible that they might select up to 2400 items, you'll probably be better off by inserting the selected ids into a holding table and then joining with that.
However, a more thorough answer can be found here - which I found through this answer.
He concludes that:
We see that for a large list of parameters, passing them in a temporary table is much faster than as a constant list, while for small lists performance is almost the same.
'Small' and 'large' are hardly absolute, but depend on your hardware - so you should test. My guess would be that with an average of 150 elements in your IN-list, you will see the temp table win.
(If you do test, please come back here and say what is the fastest in your setup.)
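For completeness, a minimal sketch of the holding-table variant, assuming the ids come from the application (the selected_items name is invented; itemID matches the query above):
CREATE TEMPORARY TABLE selected_items (itemID INT PRIMARY KEY);
-- insert the user's selection, batched from the application
INSERT INTO selected_items (itemID) VALUES (17), (42), (893);
SELECT i.*
FROM items i
JOIN selected_items s ON s.itemID = i.itemID;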
2400 items is nothing. I have a MySQL database with hundreds of thousands of rows and relations, and it works perfectly just by optimizing the queries.
What you must do is see how long the execution time is for each SQL query. You can then optimize each query on its own, trying different queries and measuring the execution time.
You can use e.g. MySQL Workbench, Sequel Pro, Microsoft SQL Server Management Studio or other software for building queries. You can also add indexes to your tables, which can improve queries as well.
If you need to scale your database up you can use software like http://hadoop.apache.org
Another thing worth mentioning is NoSQL (Not Only SQL): non-relational databases, which can handle dynamic attributes and are built for handling large amounts of data.
As you mention, you could use AJAX. But that only helps the load time of the page and the stress on your web server (not the SQL server). Just ask if you want more info or a more in-depth explanation.

What is the "Rails way" to do correlated subqueries?

I asked nearly the same question in probably the wrong way, so I apologize for both the near duplicate and lousy original phrasing. I feel like my problem now is attempting to fight Rails, which is, of course, a losing battle. Accordingly, I am looking for the idiomatic Rails way to do this.
I have a table containing rows of user data which is scraped from a third party site periodically. The old data is just as important as the new data; the old data is, in fact, probably used more often. There are no performance concerns about referencing the new data, because only a couple people will ever use my service (I keep my standards realistic). But thousands of users are scraped periodically (i.e., way too often). I have named the corresponding models "User" and "UserScrape"
Table users has columns: id, name, email
Table user_scrapes has columns: id, user_id, created_at, address_id, awesomesauce_preference
Note: These are not the real models - user_scrapes has a lot more columns - but you probably get the point
At any given time, I want to find the most recent user_scrapes values associated with the data retrieved from an external source for a given user. I want to find out what my current awesomesauce_preference is, because lately it's probably 'lamesauce', but before, it was 'saucy_sauce'.
I want to have a convenient method that allows me to access the newest scraped data for each user in such a way that I can combine it with separate WHERE clauses to narrow it down further. That's because in at least a dozen parts of my code, I need to deal with the data from the latest scrape.
What I have done so far is this horrible hack that selects the latest user_scrapes for each user with a regular find_by_sql correlated sub-query, then I pluck out the ids of the scrapes, then I put an additional where clause in any relevant query (that needs the latest data).
This is already an issue performance-wise because I don't want to buffer over a million integers (yes, a lot of pages get scraped very often) then try to pass the MySQL driver a list of these and have it miraculously execute a perfect query plan. In my benchmark it took almost as long as it did for me to write this post, so I lied before. Performance is sort of an issue, but not really.
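Roughly, the correlated sub-query behind that hack looks something like this (simplified to the columns listed above):
SELECT us.*
FROM user_scrapes us
WHERE us.created_at = (
    SELECT MAX(us2.created_at)
    FROM user_scrapes us2
    WHERE us2.user_id = us.user_id
);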
My question
So with my UserScrape class, how can I make a method called 'current', as in: UserScrape.find(1337).current.where(address_id: 1234).awesomesauce_preference when I live at addresses 1234 and 1235 and I want to find out what my awesomesauce_preference is at my latest address?
I think what you are looking for are scopes:
http://guides.rubyonrails.org/active_record_querying.html#scopes
In particular, you can probably use:
scope :current, order("user_scrapes.created_at DESC").limit(1)
Update:
Scopes are meant to return an ActiveRecord object, so that you can continue chaining methods if you wish. There is nothing to prevent you (last I checked anyways) from writing this instead, however:
scope :current, order("user_scrapes.created_at DESC").first
This returns just the one object, and is not chainable, but it may be a more useful function ultimately.
UserScrape.where(address_id: 1234).current.awesomesauce_preference

Should I split up a complex query into one to filter results and one to gather data?

I'm designing a central search function in a PHP web application. It is focused around a single table and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunate, I need to join quite a few of them. A couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins and, as there should be exactly one result displayed per ID, it also works with rather complex subqueries and GROUP BY constructs. It also gets sorted according to a user-set sort method, and there's pagination in play as well, done with LIMIT.
Anyways, this query has become insanely complex and while I nicely build it up in PHP it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" and get a list of IDs
then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins) using ... WHERE item_id IN (....), passing the list of "valid" IDs gathered in the first query.
Note: obviously the IN () could actually contain the first query in full, instead of relying on PHP to build up a comma-separated list (see the sketch below).
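As a sketch of the two-step idea (every table and column name besides item_id is invented for illustration):
-- step 1: filter only, no display joins, no GROUP BY
SELECT DISTINCT i.item_id
FROM items i
JOIN item_tags t ON t.item_id = i.item_id
WHERE t.tag = 'example-criterion';
-- step 2: gather the display data for the matching ids
SELECT i.item_id, i.title, c.category_name
FROM items i
JOIN categories c ON c.category_id = i.category_id
WHERE i.item_id IN (101, 102, 103)  -- or embed the step-1 query here as a subquery
ORDER BY i.title
LIMIT 0, 25;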
How bad will the IN be performance-wise? And how much will it possibly hurt me that I can not LIMIT the first query at all? I'm also wondering if this is a common approach to this or if there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: We're not talking about a few simple joins here. There is even (simple) hierarchical data in there where I need to compare the search parameter against not only the items own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has shown that using the WHERE IN (...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest dataset possible first: reduce down the simple main table, then join onto that. Save your most complex joins for the end to minimize the rows that have to be searched, try to join on indexes wherever possible to improve speed, and ditch wildcards in JOINs where possible.
But I agree with Andomar: if you have the time, build both and measure.
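For example, one way to "reduce down the main table first" is a derived table (names invented for illustration):
SELECT x.item_id, x.title, d.detail_value
FROM (
    SELECT i.item_id, i.title
    FROM items i
    WHERE i.status = 'active'  -- cheap filters first, shrinking the driving set
) AS x
JOIN item_details d ON d.item_id = x.item_id;  -- heavier joins only against the reduced set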