Advanced MySQL grouping issue - brainteaser

I have been using Stack Overflow heavily during the last year - an excellent source with great contributors. Now it's my turn to ask for help.
The setup is normal:
Orders, OrderArticles and Articles
I want to get the total amount of articles sold during the last year, but only during the best 5 weeks.
Never mind the WEEK function and UNIXTIME blah blah - I've got that covered. My question is whether it's possible to do this without resorting to stored procedures or functions.
I have created a subquery that computes the total for each week and article and orders the result by the sum in descending order. Now I only have to LIMIT the query to 5. Easy, but I also have to filter the result on the ArticleID, BUT since I'm inside a subquery I don't have access to the outer ArticleID, and it doesn't help to JOIN the result - it's too late ;-)
The syntax (hard to understand w/o the actual sql, right...?)
SELECT a.ID, [more fields], omg.total
FROM Articles AS a
LEFT JOIN
(
SELECT weeklytotals.articleID, weeklytotals.total
FROM
(
SELECT SUM(ra.quantity) AS total, ra.articleID AS articleID
FROM OrderArticles ra
INNER JOIN Orders r
ON ra.orderID = r.ID
WHERE r.timeCreated >= UNIX_TIMESTAMP('2011-06-30')
GROUP BY ra.articleID, WEEK(FROM_UNIXTIME(r.timeCreated))
ORDER BY SUM(ra.quantity) DESC
) AS weeklytotals
WHERE omg.articleID = a.ID --<-- THIS IS NOT WORKING BUT NECESSARY!
LIMIT 0, 5
) AS omg
ON omg.articleID = a.ID
WHERE a.isEnabled = 1 -- more WHERE-thingys
This here returns the top 5 articles and ties them to the correct Article. yay.
I have left out the SUM-function (which could go into the omg-SELECT).
Do you understand? Do I understand what I want? Yes, of course we do!
Thanx in advance.
Edit: The conditions have been changed - which makes my life easier, but I still would like to know if there is a solution to the problem.

If you require the omg subquery to use data from the a table, place it into the SELECT part not the FROM part. Using terms from the mysql documentation, you want the result of a correlated subquery to appear as a scalar operand in the outer result set.
You wrote about being interested in the sum, i.e. only a single number per article, although you left out the SUM from your example query. My approach relies on that sum, and would probably break in a bad way if you really needed distinct values for each of the best five weeks.
SELECT a.ID, [more fields], IFNULL(SUM(
(
SELECT SUM(ra.quantity) AS total
FROM OrderArticles ra
INNER JOIN Orders r
ON ra.orderID = r.ID
WHERE ra.articleID = a.ID -- <-- reference a.ID here
AND r.timeCreated >= UNIX_TIMESTAMP('2011-06-30')
GROUP BY WEEK(FROM_UNIXTIME(r.timeCreated))
ORDER BY SUM(ra.quantity) DESC
LIMIT 0, 5
)), 0) AS total
FROM Articles AS a
WHERE a.isEnabled = 1 -- more WHERE-thingys
GROUP BY a.ID
I'm not saying anything about performance here. Placing the subquery this way, it will be executed for every row of the result set. So it might be too slow for practical use if you have a large number of articles. But if that should happen, I doubt that stored procedures or similar tricks would fare any better.
Edit: I found out that my original suggestion, which used subqueries nested two levels deep, doesn't work, because the innermost subquery isn't allowed to use a column of the outermost query. But toying with this on sqlfiddle I also found out that one may safely pass the result of a subquery to SUM, thus avoiding one level of nesting. So the above code has now actually been checked and executed by a MySQL server, and should therefore work as intended.
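For comparison, here is a minimal sketch of a different technique for the same "sum of the best 5 weeks per article" figure: ranking the weekly totals with ROW_NUMBER(). It assumes window functions are available (MySQL 8.0+, which postdates this question). To keep it runnable without a server it uses Python's sqlite3 module (SQLite 3.25+) as a stand-in for MySQL, so strftime replaces WEEK(FROM_UNIXTIME(...)). Table and column names follow the question; the sample data is invented.

```python
import sqlite3

# SQLite stands in for MySQL here; the schema mirrors the question's names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles (ID INTEGER PRIMARY KEY, isEnabled INTEGER);
CREATE TABLE Orders (ID INTEGER PRIMARY KEY, timeCreated INTEGER);
CREATE TABLE OrderArticles (orderID INTEGER, articleID INTEGER, quantity INTEGER);
INSERT INTO Articles VALUES (1, 1), (2, 1), (3, 1);
""")

WEEK = 7 * 86400
BASE = 1309737600  # 2011-07-04 00:00:00 UTC
# Invented data: article 1 sells 1..7 units in seven consecutive weeks,
# article 2 sells 10 and 20 units in two weeks, article 3 sells nothing.
sales = [(1, week, week + 1) for week in range(7)] + [(2, 0, 10), (2, 1, 20)]
for order_id, (article_id, week, qty) in enumerate(sales, start=1):
    conn.execute("INSERT INTO Orders VALUES (?, ?)", (order_id, BASE + week * WEEK))
    conn.execute("INSERT INTO OrderArticles VALUES (?, ?, ?)", (order_id, article_id, qty))

query = """
WITH weekly AS (              -- one row per article and week
    SELECT ra.articleID,
           strftime('%Y-%W', r.timeCreated, 'unixepoch') AS wk,
           SUM(ra.quantity) AS total
    FROM OrderArticles ra
    JOIN Orders r ON ra.orderID = r.ID
    GROUP BY ra.articleID, wk
),
ranked AS (                   -- rank each article's weeks, best first
    SELECT articleID, total,
           ROW_NUMBER() OVER (PARTITION BY articleID ORDER BY total DESC) AS rn
    FROM weekly
)
SELECT a.ID, COALESCE(SUM(ranked.total), 0) AS total
FROM Articles a
LEFT JOIN ranked ON ranked.articleID = a.ID AND ranked.rn <= 5
WHERE a.isEnabled = 1
GROUP BY a.ID
ORDER BY a.ID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the ranking is computed once for all articles, nothing is re-run per output row. Whether that is actually faster than the correlated subquery on a real MySQL server is something to check with EXPLAIN.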

Related

Can anyone assist me with optimising my query to reduce result time?

I have written a MySQL query without much expertise in this area but as my database has increased in size, I'm finding the results are taking far too long to be returned. I can understand why but I can't figure out how to better group my query so that MySQL isn't searching through the entire database to return the results. I know there is a far more efficient way to do this but I can't figure out how. If I remove the ORDER BY statement, the results are returned in less than a quarter of the time. As it stands now with a table that has 180,000 entries in it (members), it's taking about 4 seconds to return the results.
SELECT members.mem_id, members.username, members.online,
members.dob, members.regdate, members.sex,
members.mem_type, members.aboutme,
geo_cities.name AS city,
geo_countries.name AS country, photos.photo_path
FROM members
LEFT JOIN geo_cities
ON members.cty_id=geo_cities.cty_id
LEFT OUTER JOIN geo_countries
ON geo_cities.con_id=geo_countries.con_id
RIGHT OUTER JOIN photos
ON members.mem_id=photos.mem_id
WHERE (photos.main=1
AND photos.approved=1
AND members.banned!="1"
AND members.profile_photo="1"
AND members.profile_essentials="1"
AND members.profile_user="1")
ORDER BY lastdate DESC
LIMIT 12
It looks like you want to show the most recent 12 members who meet certain criteria.
A few things.
Your RIGHT JOIN on photos is actually an ordinary inner JOIN: its columns appear in your WHERE clause.
You probably need compound indexes on the members and photos tables.
SELECT many columns FROM ... JOIN ... ORDER BY column... LIMIT 12 is a notorious performance antipattern: It constructs a complex result set, then sorts the whole thing, then discards almost all of it. Wasteful.
You have WHERE ... members.banned != "1". Inequality filters like this make SQL work harder (and so slower) than equalities. If you can change that to = "0" or something like that, do it.
(I guess your lastdate column is in your members table, but you didn't tell us that in your question.)
So try something like this to find the twelve members you want to display.
SELECT members.mem_id
FROM members
JOIN photos ON members.mem_id=photos.mem_id
WHERE photos.main=1
AND photos.approved=1
AND members.banned!="1"
AND members.profile_photo="1"
AND members.profile_essentials="1"
AND members.profile_user="1"
ORDER BY lastdate DESC
LIMIT 12
That gets you the ids of the twelve members you want. Use it in your main query.
SELECT members.mem_id, members.username, members.online,
members.dob, members.regdate, members.sex,
members.mem_type, members.aboutme,
geo_cities.name AS city,
geo_countries.name AS country, photos.photo_path
FROM members
LEFT JOIN geo_cities ON members.cty_id=geo_cities.cty_id
LEFT JOIN geo_countries ON geo_cities.con_id=geo_countries.con_id
JOIN photos ON members.mem_id=photos.mem_id
JOIN (
-- MySQL rejects LIMIT inside an IN (...) subquery, so join to a derived table instead
SELECT members.mem_id
FROM members
JOIN photos ON members.mem_id=photos.mem_id
WHERE photos.main=1
AND photos.approved=1
AND members.banned!="1"
AND members.profile_photo="1"
AND members.profile_essentials="1"
AND members.profile_user="1"
ORDER BY lastdate DESC
LIMIT 12
) AS recent ON recent.mem_id=members.mem_id
ORDER BY lastdate DESC
LIMIT 12
This finds the twelve members you care about, then pulls out only their records, instead of pulling all the records.
Then, create a compound index on members(profile_photo, profile_essentials, profile_user, banned, lastdate). That compound index will speed up your WHERE clause a great deal.
Likewise, create a compound index on photos(mem_id, main, approved, photo_path).
Things always get exciting when databases start to grow! Read Markus Winand's online book https://use-the-index-luke.com/
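The "find the twelve ids first, then fetch the wide rows" idea can be sketched as follows. This is a minimal, hedged illustration using Python's sqlite3 module in place of MySQL. The schema is a cut-down, assumed version of the question's tables (fewer profile flags, integer columns, lastdate assumed to live in members), and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (mem_id INTEGER PRIMARY KEY, username TEXT,
                      banned INTEGER, profile_photo INTEGER, lastdate INTEGER);
CREATE TABLE photos (mem_id INTEGER, main INTEGER, approved INTEGER, photo_path TEXT);
-- Compound indexes in the spirit of the answer (column lists shortened).
CREATE INDEX idx_members_filter ON members (profile_photo, banned, lastdate);
CREATE INDEX idx_photos_main ON photos (mem_id, main, approved, photo_path);
""")

# Invented data: 20 members, lastdate grows with mem_id, members 19 and 20 are banned.
for i in range(1, 21):
    banned = 1 if i >= 19 else 0
    conn.execute("INSERT INTO members VALUES (?, ?, ?, 1, ?)", (i, f"user{i}", banned, i))
    conn.execute("INSERT INTO photos VALUES (?, 1, 1, ?)", (i, f"/p/{i}.jpg"))

query = """
SELECT m.mem_id, m.username, p.photo_path
FROM (
    -- Narrow query: only ids, filtered, sorted and cut to 12 rows.
    SELECT m2.mem_id
    FROM members m2
    JOIN photos p2 ON p2.mem_id = m2.mem_id
    WHERE p2.main = 1 AND p2.approved = 1
      AND m2.banned = 0 AND m2.profile_photo = 1
    ORDER BY m2.lastdate DESC
    LIMIT 12
) AS recent
-- Wide query: fetch the display columns for those 12 ids only.
JOIN members m ON m.mem_id = recent.mem_id
JOIN photos p ON p.mem_id = m.mem_id AND p.main = 1
ORDER BY m.lastdate DESC
"""
rows = conn.execute(query).fetchall()
ids = [row[0] for row in rows]
print(ids)
```

The inner query touches only narrow columns, so the sort works on small rows; the joins that fetch the display columns then run for twelve ids rather than for every matching member.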

Complex MySQL query problems and also SQL hangs

I am trying to write an SQL query which is pretty complex. The requirements are as follows:
I need to return these fields from the query:
track.artist
track.title
track.seconds
track.track_id
track.relative_file
album.image_file
album.album
album.album_id
track.track_number
I can select a random track with the following query:
select
track.artist, track.title, track.seconds, track.track_id,
track.relative_file, album.image_file, album.album,
album.album_id, track.track_number
FROM
track, album
WHERE
album.album_id = track.album_id
ORDER BY RAND() limit 10;
Here is where I am having trouble though. I also have tables called "trackfilters1" through "trackfilters10". Each row has an auto-incrementing ID field. Therefore, row 10 is data for album_id 10. The flags fields are populated with 1's and 0's. For example, if album #10 has 10 tracks, then trackfilters1.flags will contain "1111111111" if all tracks are to be included in the search. If track 10 was to be excluded, then it would contain "1111111110".
My problem is including this clause.
The latest query I have come up with is the following:
select
track.artist, track.title, track.seconds,
track.track_id, track.relative_file, album.image_file,
album.album, album.album_id, track.track_number
FROM
track, album, trackfilters1, trackfilters2
WHERE
album.album_id = track.album_id
AND
( (album.album_id = trackfilters1.id)
OR
(album.album_id=trackfilters2.id) )
AND
( (mid(trackfilters1.flags, track.track_number,1) = 1)
OR
( mid(trackfilters2.flags, track.track_number,1) = 1))
ORDER BY RAND() limit 2;
however this is causing SQL to hang. I'm presuming that I'm doing something wrong. Does anybody know what it is? I would be open to suggestions if there is an easier way to achieve my end result, I am not set on repairing my broken query if there is a better way to accomplish this.
Additionally, in my trials, I have noticed when I had a working query and added say, trackfilters2 to the FROM clause without using it anywhere in the query, it would hang as well. This makes me wonder. Is this correct behavior? I would think adding to the FROM list without making use of the data would just make the server procure more data, I wouldn't have expected it to hang.
There's not enough information here to determine what's causing the performance issue.
But here's a few suggestions and comments.
Ditch the old-school comma syntax for the join operations, and use the JOIN keyword instead. And relocate the join predicates to an ON clause.
And for heaven's sake, format the SQL so that it's decipherable by someone trying to read it.
There's some questions here... will there always be a matching row in both trackfilters1 and trackfilters2 for rows you want to return? Or could a row be missing from trackfilters2, and you still want to return the row if there's a matching row in trackfilters1? (The answer to that question determines whether you'd want to use an outer join vs an inner join to those tables.)
For best performance with large sets, having appropriate indexes defined is going to be critical.
Use EXPLAIN to see the execution plan.
I suggest you try writing your query like this:
SELECT track.artist
, track.title
, track.seconds
, track.track_id
, track.relative_file
, album.image_file
, album.album
, album.album_id
, track.track_number
FROM track
JOIN album
ON album.album_id = track.album_id
LEFT
JOIN trackfilters1
ON trackfilters1.id = album.album_id
LEFT
JOIN trackfilters2
ON trackfilters2.id = album.album_id
WHERE MID(trackfilters1.flags, track.track_number, 1) = '1'
OR MID(trackfilters2.flags, track.track_number, 1) = '1'
ORDER BY RAND()
LIMIT 2
And if you want help with performance, provide the output from EXPLAIN, and what indexes are defined.
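To make the flag logic concrete, here is a small sketch of the LEFT JOIN version run against invented data. It uses Python's sqlite3 module in place of MySQL, so substr stands in for MID, and ORDER BY RAND() is replaced with a fixed ordering so the result is reproducible. The table layouts are assumptions based on the question's description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE album (album_id INTEGER PRIMARY KEY, album TEXT);
CREATE TABLE track (track_id INTEGER PRIMARY KEY, album_id INTEGER,
                    track_number INTEGER, title TEXT);
CREATE TABLE trackfilters1 (id INTEGER PRIMARY KEY, flags TEXT);
CREATE TABLE trackfilters2 (id INTEGER PRIMARY KEY, flags TEXT);
INSERT INTO album VALUES (1, 'First'), (2, 'Second');
INSERT INTO track VALUES (1, 1, 1, 'a1'), (2, 1, 2, 'a2'), (3, 1, 3, 'a3'),
                         (4, 2, 1, 'b1'), (5, 2, 2, 'b2');
-- Album 1 only has a row in trackfilters1, album 2 only in trackfilters2.
INSERT INTO trackfilters1 VALUES (1, '101');
INSERT INTO trackfilters2 VALUES (2, '01');
""")

query = """
SELECT album.album_id, track.track_number
FROM track
JOIN album ON album.album_id = track.album_id
LEFT JOIN trackfilters1 ON trackfilters1.id = album.album_id
LEFT JOIN trackfilters2 ON trackfilters2.id = album.album_id
WHERE substr(trackfilters1.flags, track.track_number, 1) = '1'
   OR substr(trackfilters2.flags, track.track_number, 1) = '1'
ORDER BY album.album_id, track.track_number
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Each filter table is tied to the album by its own ON clause, so a missing row simply yields NULL flags for that table. In the comma-join version, a table listed in FROM without a condition linking it to the others is combined with every other row (a cross product), which is a plausible reason for the apparent hang.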

Select taking too long. Need advice for a better performance

Ok, here we go. There's this messy SELECT crossing other tables and ordering to get the one desired row. Basically I do the "math" inside the ORDER BY.
1 base table.
7 JOINs pointing to local tables.
WHERE with 2 clauses and a NOT IN crossing another table.
You'll see in the code the ORDER BY is pretty damn big/ugly, it sums the result of 5 different calculations. I need that result to order by those calculations in order to get the worst row-case.
The problem is once I execute the Stored Procedure it takes up to 8 seconds to run. That's kind of non-acceptable. So, I'm starting to check Indexes.
So, I'm looking for advice on how to make this query run faster.
I'm indexing the WHERE clauses and the field LINEA. Should I index something else, like the columns I'm using in the JOINs? Or should I approach the query differently?
Query:
SET @LINEA = (
SELECT TOP 1
BOA.LIN
FROM
BAND_BA BOA
LEFT JOIN
TEL PAR
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(PAR.Te,2,10)
LEFT JOIN
TELP CLP
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(CLP.Numtel,2,10)
LEFT JOIN
CA C
ON REPLACE(BOA.Lin,'-','') = C.An
LEFT JOIN
RE R
ON REPLACE(BOA.Lin,'-','') = R.Lin
LEFT JOIN
PRODUCTOS2 P2
ON BOA.PRODUCTO = P2.codigo
LEFT JOIN
EN
ON REPLACE(BOA.Lin,'-','') = EN.G
LEFT JOIN
TIP ID
ON TIPID = ID.ID
WHERE
BOA.EST = 'C' AND
ID.SE = 'boA' AND
BOA.LIN NOT IN (
SELECT
LIN
FROM
BAN
)
ORDER BY (EN.VALUE + ANT.VALUE + REIT.VAL + C.VALUE + TEL.VALUE
) DESC,
I'll be frank, this is some pretty terrible SQL. Without seeing all your table structures, advice here will be incomplete. That being said, please don't post all your table structures because you are already very close to "hire a consultant" territory with this.
All the REPLACE logic should be done away with. If you need to JOIN on these fields, then add comparable fields to the tables so you don't need to manipulate the data. Every single JOIN that uses a REPLACE or SUBSTRING is a table or index scan - those are non-SARGable and a definite anti-pattern.
The ORDER BY is probably the most convoluted ORDER BY I have ever seen. Some major issues there:
Subqueries should all be eliminated and materialized either in the outer query or as variables
String manipulation should be eliminated (see item 1 above)
The entire query is basically a code smell. If you need to write code like this to meet business requirements then you either have a terribly inappropriate design or some other much larger issue in the organization or data.
One thing that can kill performance is using a lot of LEFT JOINs. To improve performance of LEFT JOIN, you might want to make sure that the column(s) to which you join have an index - that can have a huge impact on performance.
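The advice about removing REPLACE and SUBSTRING from the join conditions can be sketched like this: store the cleaned-up key in its own column, index it, and join on plain columns. The sketch uses Python's sqlite3 module rather than SQL Server. The column names lin_digits and te_key and the index name are hypothetical, and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE band_ba (lin TEXT, lin_digits TEXT);
CREATE TABLE tel (te TEXT, te_key TEXT, value INTEGER);
INSERT INTO band_ba (lin) VALUES ('555-123-4567'), ('555-000-1111');
INSERT INTO tel (te, value) VALUES ('05551234567', 3), ('05559999999', 9);
-- One-off backfill of the precomputed join keys, then index the lookup side.
UPDATE band_ba SET lin_digits = REPLACE(lin, '-', '');
UPDATE tel SET te_key = SUBSTR(te, 2, 10);
CREATE INDEX idx_tel_key ON tel (te_key);
""")

# Original style: both sides of the join are expressions, so no plain index applies.
expression_join = """
SELECT b.lin, t.value
FROM band_ba b
LEFT JOIN tel t ON REPLACE(b.lin, '-', '') = SUBSTR(t.te, 2, 10)
ORDER BY b.lin
"""
# Rewritten style: plain column comparison that an index on tel(te_key) can serve.
column_join = """
SELECT b.lin, t.value
FROM band_ba b
LEFT JOIN tel t ON b.lin_digits = t.te_key
ORDER BY b.lin
"""
expression_rows = conn.execute(expression_join).fetchall()
column_rows = conn.execute(column_join).fetchall()
print(column_rows)

# Print SQLite's plan for the rewritten join; the last field is the description.
for step in conn.execute("EXPLAIN QUERY PLAN " + column_join):
    print(step[-1])
```

Both joins return the same rows; the difference is that the second one gives the optimizer a column it can look up in an index. In SQL Server a persisted computed column with an index would be one way to maintain such a key, though that choice depends on the real schema.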

MySQL - Fastest way to select relational data avoiding left join

I've currently got a query that selects metrics data from two tables whilst getting the projects to query from two other tables (one is owned projects, the other is projects to which the user has access).
SELECT v.`projectID`,
(SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`) AS `sessions`,
(SELECT COUNT(pb.`interact`)
FROM `interactionData` pb WHERE pb.`projectID` = v.`projectID` GROUP BY pb.`projectID`) AS `interactions`
FROM `medias` v
LEFT JOIN `projectsExt` pa ON v.`projectsExtID` = pa.`projectsExtID`
WHERE (pa.`user` = '1' OR v.`ownerUser` = '1')
GROUP BY v.`projectID`
It takes too long, 1-2 seconds. This is obviously the multi left-join scenario. But, I've got a couple of ideas to improve speed and wondered what the thoughts were in principle. Do I:-
Try and select the list in the query and then get the data, rather than doing the joins. Not sure how this would work.
Do a select in a separate query to get the projectIDs and then run queries on each projectID afterwards. This may lead to hundreds or potentially thousands of requests, but may be better for the processing?
Other ideas?
There's two questions here:
how can I get my result in less than 2 seconds
how can I avoid a left join.
To answer #1 properly there has to be more information. Technical information, such as the explain plan for this particular query is a good start. Even better if we'd have the SHOW CREATE TABLE of all tables that you access, as well as the number of rows they contain.
But I'd also appreciate more functional information: what exactly is the question you're trying to answer? Right now, it seems you're looking at two different sets of medias:
either there is no matching row in projectsExt, in which case medias.ownerUser must equal '1' (is that '1' supposed to be a string btw?)
or there is exactly one matching row in projectsExt for which projectsExt.user must equal '1' (is that '1' supposed to be a string btw?)
By lack of enough information to answer #1, I can answer #2 - "how to avoid a left join". Answer is: write a UNION of the two sets, one where there is a match and one where there isn't a match.
SELECT v.`projectID`
, (
SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`
) AS `sessions`
, (
SELECT COUNT(pb.`interact`)
FROM `interactionData` pb
WHERE pb.`projectID` = v.`projectID`
GROUP BY pb.`projectID`
) AS `interactions`
FROM (
SELECT v.projectID
FROM medias v
WHERE v.ownerUser = '1'
GROUP BY v.projectID
UNION ALL
SELECT v.projectID
FROM medias v
INNER JOIN projectsExt pa
ON v.projectsExtID = pa.projectsExtID
WHERE v.ownerUser != '1'
AND pa.user = '1'
GROUP BY v.`projectID`
) v
Have you tried, instead, to refactor everything into left joins? Seeing as how you're always grouping on the same field, it shouldn't be a problem. Try that and post an EXPLAIN to see what the bottlenecks are.
Subselects are often less performant than joins, because the engine can optimize joins to a much higher degree. In fact, where possible the engine will usually rewrite subselects into joins itself.
As a rule of thumb, there is no gain in splitting queries; all you gain is overhead and a confused optimizer. There are, as always, exceptions to this rule, but they come into play only after you've done what you can traditionally and know you need such an approach.
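One way to read "refactor everything into joins" is to aggregate each detail table once per projectID and LEFT JOIN the aggregates, rather than running a subselect per output row. Below is a minimal sketch under that reading, using Python's sqlite3 module in place of MySQL. The projectsExt access check is left out for brevity, the data is invented, and missing counts are reported as 0 via COALESCE, which differs slightly from the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medias (projectID INTEGER, ownerUser INTEGER);
CREATE TABLE metricData (projectID INTEGER, session TEXT);
CREATE TABLE interactionData (projectID INTEGER, interact TEXT);
INSERT INTO medias VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO metricData VALUES (1, 's1'), (1, 's2'), (1, 's3'), (2, 's4');
INSERT INTO interactionData VALUES (1, 'i1'), (1, 'i2');
""")

query = """
SELECT v.projectID,
       COALESCE(m.sessions, 0)     AS sessions,
       COALESCE(i.interactions, 0) AS interactions
FROM (SELECT DISTINCT projectID FROM medias WHERE ownerUser = 1) v
-- Each detail table is aggregated on its own, so the two counts cannot
-- multiply each other the way a direct double join would.
LEFT JOIN (SELECT projectID, COUNT(session) AS sessions
           FROM metricData GROUP BY projectID) m
       ON m.projectID = v.projectID
LEFT JOIN (SELECT projectID, COUNT(interact) AS interactions
           FROM interactionData GROUP BY projectID) i
       ON i.projectID = v.projectID
ORDER BY v.projectID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Joining metricData and interactionData directly to medias and counting afterwards would pair every session with every interaction (3 x 2 rows for project 1 here), which is why the aggregation happens inside the derived tables.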

Using Joins, Group By and Sub Queries, Oh My!

I have a database with a table for details of ponies, another for details of contacts (owners and breeders), and then several other small tables for parameters (colours, counties, area codes, etc.). To give me a list of existing pony profiles, with their various details given, i use the following query:
SELECT *
FROM profiles
INNER JOIN prm_breedgender
ON profiles.ProfileGenderID = prm_breedgender.BreedGenderID
LEFT JOIN contacts
ON profiles.ProfileOwnerID = contacts.ContactID
INNER JOIN prm_breedcolour
ON profiles.ProfileAdultColourID = prm_breedcolour.BreedColourID
ORDER BY profiles.ProfileYearOfBirth ASC $limit
In the above sample, the 'profiles' table is my primary table (holding the Ponies info), 'contacts' is second in importance holding as it does the owner and breeder info. The lesser parameter tables can be identified by their prm_ prefix. The above query works fine, but i want to do more.
The first big issue is that I wish to GROUP the results by gender: Stallions, Mares, Geldings... I used << GROUP BY prm_breedgender.BreedGender >> or << GROUP BY ProfileBreedGenderID >> before my ORDER BY line, but that only returns two results from all my available profiles. I have read up on this, and apparently need to reorganise my query to accommodate GROUP within my primary SELECT clause. How to do this, however, gets me verrrrrrry confused. Step by step help here would be fantabulous.
As a further note on the above - You may have noticed the $limit var at the end of my query. This is for pagination, a feature I want to keep. I shouldn't think that's an issue however.
My secondary issue is more of an organisational one. You can see where I have pulled my Owner information from the contacts table here:
LEFT JOIN contacts
ON profiles.ProfileOwnerID = contacts.ContactID
I could add another stipulation:
AND profiles.ProfileBreederID = contacts.ContactID
with the intention of being able to list a pony's Owner and Breeder, where info on either is available. I'm not sure how to echo out this info though, as $row['ContactName'] could apply in either the capacity of owner OR breeder.
Is this a case of simply running two queries rather than one? Assigning a variable $foo to the first run of the query, then just run another separate query altogether and assign $bar to those results? Or is there a smarter way of doing it all in the one query (e.g. $row['ContactName']First-iteration, $row['ContactName']Second-iteration)? Advice here would be much appreciated.
And That's it! I've tried to be as clear as possible, and do really appreciate any help or advice at all you can give. Thanks in advance.
EDIT
My query currently stands as an amalgam of that provided by Cularis and Symcbean:
SELECT *
FROM (
profiles
INNER JOIN prm_breedgender
ON profiles.ProfileGenderID = prm_breedgender.BreedGenderID
LEFT JOIN contacts AS owners
ON profiles.ProfileOwnerID = owners.ContactID
INNER JOIN prm_breedcolour
ON profiles.ProfileAdultColourID = prm_breedcolour.BreedColourID
)
LEFT JOIN contacts AS breeders
ON profiles.ProfileBreederID = breeders.ContactID
ORDER BY prm_breedgender.BreedGender ASC, profiles.ProfileYearOfBirth ASC $limit
It works insofar as the results are being arranged as I had hoped: i.e. by age and gender. However, I cannot seem to get the aliases to work in relation to the contacts queries (breeder and owner). No error is displayed, and neither are any Owners or Breeders. Any further clarification on this would be hugely appreciated.
P.s. I dropped the alias given to the final LEFT JOIN by Symcbean's example, as I could not get the resulting ORDER BY statement to work for me - my own fault, I'm certain. Nonetheless, it works now although this may be what is causing the issue with the contacts query.
GROUP in SQL terms means using aggregate functions over a group of entries. I guess what you want is order by gender:
ORDER BY prm_breedgender.BreedGender ASC, profiles.ProfileYearOfBirth ASC $limit
This will output all Stallions, etc. next to each other.
To also get the breeders contact, you need to join with the contacts table again, using an alias:
LEFT JOIN contacts AS owners
ON profiles.ProfileOwnerID = owners.ContactID
LEFT JOIN contacts AS breeders
ON profiles.ProfileBreederID = breeders.ContactID
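A sketch of how the two aliases can be told apart in the result: give each ContactName its own column alias instead of relying on SELECT *. With SELECT *, both joins produce a column named ContactName, and code that reads rows by column name typically sees only one of them, which may be why no owners or breeders appear. The sketch uses Python's sqlite3 module in place of PHP and MySQL. ProfileID, ProfileName, OwnerName and BreederName are hypothetical names, and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be read by column name
conn.executescript("""
CREATE TABLE contacts (ContactID INTEGER PRIMARY KEY, ContactName TEXT);
CREATE TABLE profiles (ProfileID INTEGER PRIMARY KEY, ProfileName TEXT,
                       ProfileOwnerID INTEGER, ProfileBreederID INTEGER);
INSERT INTO contacts VALUES (1, 'Alice'), (2, 'Bob');
-- 'Comet' has an owner but no recorded breeder.
INSERT INTO profiles VALUES (1, 'Star', 1, 2), (2, 'Comet', 2, NULL);
""")

query = """
SELECT profiles.ProfileName,
       owners.ContactName   AS OwnerName,
       breeders.ContactName AS BreederName
FROM profiles
LEFT JOIN contacts AS owners
       ON profiles.ProfileOwnerID = owners.ContactID
LEFT JOIN contacts AS breeders
       ON profiles.ProfileBreederID = breeders.ContactID
ORDER BY profiles.ProfileID
"""
rows = [dict(row) for row in conn.execute(query)]
for row in rows:
    print(row["ProfileName"], row["OwnerName"], row["BreederName"])
```

In the PHP code the equivalent would be reading $row['OwnerName'] and $row['BreederName'] rather than $row['ContactName'].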
To further expand on what @cularis stated, group by is for aggregations down to the lowest level of "grouping" criteria. For example, and I'm not doing per your specific tables, but you'll see the impact. Say you want to show a page grouped by Breed. Then, a user picks a breed and they can see all entries of that breed.
PonyID  ProfileGenderID  Breeder
1       1                1
2       1                1
3       2                2
4       3                3
5       1                2
6       1                3
7       2                3
Assuming your Gender table is a lookup where ex:
BreedGenderID  Description
1              Stallion
2              Mare
3              Geldings
SELECT *
FROM profiles
INNER JOIN prm_breedgender
ON profiles.ProfileGenderID = prm_breedgender.BreedGenderID
select
BG.Description,
count(*) as CountPerBreed
from
Profiles P
join prm_BreedGender BG
on p.ProfileGenderID = BG.BreedGenderID
group by
BG.Description
order by
BG.Description
would result in something like (counts are only coincidentally sequential)
Description  CountPerBreed
Geldings     1
Mare         2
Stallion     4
change the "order by" clause to "order by CountPerBreed Desc" (for descending) and you would get
Description  CountPerBreed
Stallion     4
Mare         2
Geldings     1
To expand, if you wanted the aggregations to be broken down per breeder... It is a best practice to group by all things that are NOT AGGREGATES (such as MIN(), MAX(), AVG(), COUNT(), SUM(), etc)
select
BG.Description,
BR.BreaderName,
count(*) as CountPerBreed
from
Profiles P
join prm_BreedGender BG
on p.ProfileGenderID = BG.BreedGenderID
join Breeders BR
on p.Breeder = BR.BreaderID
group by
BG.Description,
BR.BreaderName
order by
BG.Description
would result in something like (counts are only coincidentally sequential)
Description  BreaderName  CountPerBreed
Geldings     Bill         1
Mare         John         1
Mare         Sally        1
Stallion     George       2
Stallion     Tom          1
Stallion     Wayne        1
As you can see, the more granularity you provide to the group by, the aggregation per that level is smaller.
Your join conditions otherwise are obviously understood from what you've provided. Hopefully this sample clearly shows what the querying process will do. Your group by does not have to be the same as the final order... it's just common to see, so that someone looking at the results is not trying to guess how the data was organized.
In your sample, you had an order by the birth year. When doing an aggregation, you will never have the specific birth year of a single pony to order by... UNLESS you include YEAR( ProfileYearOfBirth ) as BirthYear as a column, and include that WITH your group by... such as having 100 ponies 1 yr old and 37 at 2 yrs old of a given breed.
It would have been helpful if you'd provided details of the table structure and approximate numbers of rows. Also using '*' for a SELECT is a messy practice - and will cause you problems later (see below).
What version of MySQL is this?
apparently need to reorganise my query to accommodate GROUP within my primary SELECT clause
Not necessarily: since v4 (? IIRC) you could just wrap your query in a consolidating select (but move the limit into the outer select):
SELECT ProfileGenderID, COUNT(*)
FROM (
[your query without the LIMIT]
) ilv
GROUP BY ProfileGenderID
LIMIT $limit;
(note you can't ORDER BY ilv.ProfileYearOfBirth since it is not a selected column / group by expression)
How many records/columns do you have in prm_breedgender? Is it just Stallions, Mares, Geldings...? Do you think this list is likely to change? Do you have ponies with multiple genders? I suspect that this domain would be better represented by an enum in the profiles table.
with the intention of being able to list a pony's Owner and Breeder,
Using the code you suggest, you'll only get returned instances where the owner and breeder are the same! You need to add a second instance of the contacts table with a different alias to get them all, e.g.
SELECT *
FROM (
SELECT *
FROM profiles
INNER JOIN prm_breedgender
ON profiles.ProfileGenderID = prm_breedgender.BreedGenderID
LEFT JOIN contacts ownerContact
ON profiles.ProfileOwnerID = ownerContact.ContactID
INNER JOIN prm_breedcolour
ON profiles.ProfileAdultColourID = prm_breedcolour.BreedColourID
) ilv LEFT JOIN contacts breederContact
ON ilv.ProfileBreederID = breederContact.ContactID
ORDER BY ilv.ProfileYearOfBirth ASC $limit