I am trying to write an SQL query which is pretty complex. The requirements are as follows:
I need to return these fields from the query:
track.artist
track.title
track.seconds
track.track_id
track.relative_file
album.image_file
album.album
album.album_id
track.track_number
I can select a random track with the following query:
select
track.artist, track.title, track.seconds, track.track_id,
track.relative_file, album.image_file, album.album,
album.album_id, track.track_number
FROM
track, album
WHERE
album.album_id = track.album_id
ORDER BY RAND() limit 10;
Here is where I am having trouble though. I also have tables called "trackfilters1" through "trackfilters10". Each row has an auto-incrementing ID field, so row 10 holds the data for album_id 10. These fields are populated with 1's and 0's. For example, if album #10 has 10 tracks, then trackfilters1.flags will contain "1111111111" if all tracks are to be included in the search. If track 10 were to be excluded, it would contain "1111111110".
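In other words, character N of the flags string is the include/exclude flag for track N, which MID can read directly:

SELECT MID('1111111110', 10, 1);  -- returns '0', i.e. track 10 is excluded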
My problem is including this clause.
The latest query I have come up with is the following:
select
track.artist, track.title, track.seconds,
track.track_id, track.relative_file, album.image_file,
album.album, album.album_id, track.track_number
FROM
track, album, trackfilters1, trackfilters2
WHERE
album.album_id = track.album_id
AND
( (album.album_id = trackfilters1.id)
OR
(album.album_id=trackfilters2.id) )
AND
( (mid(trackfilters1.flags, track.track_number,1) = 1)
OR
( mid(trackfilters2.flags, track.track_number,1) = 1))
ORDER BY RAND() limit 2;
However, this is causing the server to hang. I presume I'm doing something wrong. Does anybody know what it is? I'm also open to suggestions if there is an easier way to achieve my end result; I'm not set on repairing my broken query if there is a better way to accomplish this.
Additionally, in my trials, I noticed that when I had a working query and added, say, trackfilters2 to the FROM clause without using it anywhere else in the query, it would hang as well. This makes me wonder: is this correct behavior? I would have thought that adding a table to the FROM list without using its data would just make the server fetch more data; I wouldn't have expected it to hang.
There's not enough information here to determine what's causing the performance issue.
But here are a few suggestions and comments.
Ditch the old-school comma syntax for the join operations, and use the JOIN keyword instead. And relocate the join predicates to an ON clause.
And for heaven's sake, format the SQL so that it's decipherable by someone trying to read it.
There are some questions here... will there always be a matching row in both trackfilters1 and trackfilters2 for the rows you want to return? Or could a row be missing from trackfilters2, and you still want to return the row if there's a matching row in trackfilters1? (The answer to that question determines whether you'd want an outer join or an inner join to those tables.)
For best performance with large sets, having appropriate indexes defined is going to be critical.
Use EXPLAIN to see the execution plan.
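For example, just prefix the statement with EXPLAIN to see how MySQL plans to join the tables (shown here on the original random-track query):

EXPLAIN
SELECT track.artist, track.title, track.seconds, track.track_id
  FROM track
  JOIN album
    ON album.album_id = track.album_id
 ORDER BY RAND()
 LIMIT 10;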
I suggest you try writing your query like this:
SELECT track.artist
, track.title
, track.seconds
, track.track_id
, track.relative_file
, album.image_file
, album.album
, album.album_id
, track.track_number
FROM track
JOIN album
ON album.album_id = track.album_id
LEFT JOIN trackfilters1
ON trackfilters1.id = album.album_id
LEFT JOIN trackfilters2
ON trackfilters2.id = album.album_id
WHERE MID(trackfilters1.flags, track.track_number, 1) = '1'
OR MID(trackfilters2.flags, track.track_number, 1) = '1'
ORDER BY RAND()
LIMIT 2
And if you want help with performance, provide the output from EXPLAIN, and what indexes are defined.
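As a rough starting point for the indexes (a sketch; skip any that duplicate an existing primary key):

CREATE INDEX track_album_ix ON track (album_id);
-- trackfilters1.id and trackfilters2.id are probably already primary keys;
-- if not, index them too:
CREATE INDEX trackfilters1_id_ix ON trackfilters1 (id);
CREATE INDEX trackfilters2_id_ix ON trackfilters2 (id);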
Related
I have a query which goes like this:
SELECT insanlyBigTable.description_short,
insanlyBigTable.id AS insanlyBigTable,
insanlyBigTable.type AS insanlyBigTableLol,
catalogpartner.id AS catalogpartner_id
FROM insanlyBigTable
INNER JOIN smallerTable ON smallerTable.id = insanlyBigTable.catalog_id
INNER JOIN smallerTable1 ON smallerTable1.catalog_id = smallerTable.id
AND smallerTable1.buyer_id = 'xxx'
WHERE smallerTable1.cont = 'Y' AND insanlyBigTable.type IN ('111','222','33')
GROUP BY smallerTable.id;
Now, when I run the query the first time, it copies the giant table into a temp table... I want to know how I can prevent that. I am considering a nested query, or even reversing the join (though I'm not sure the effect would be to run faster), but that is, well, not nice. Any other suggestions?
To figure out how to optimize your query, we first have to boil down exactly what it is selecting so that we can preserve that information while we change things around.
What your query does
So, it looks like we need the following
The GROUP BY clause limits the results to at most one row per catalog_id
smallerTable1.cont = 'Y', insanlyBigTable.type IN ('111','222','33'), and buyer_id = 'xxx' appear to be the filters on the query.
And we want data from insanlyBigTable and ... catalogpartner? I would guess that catalogpartner is smallerTable1, due to the id of smallerTable being linked to the catalog_id of the other tables.
I'm not sure what the purpose of putting the buyer_id filter in the ON clause was, but unless you tell me differently, I'll assume the fact that it is in the ON clause is unimportant.
The point of the query
I am unsure about the intent of the query, based on that GROUP BY statement. You will obtain just one row per catalog_id in the insanlyBigTable, but you don't appear to care which row it is. Indeed, the fact that you can run this query at all is due to a special non-standard feature in MySQL that lets you SELECT columns that do not appear in the GROUP BY statement... however, you don't get to choose WHICH row those values come from. This means you could have information from 4 different rows for each of your selected items.
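To see the indeterminacy in isolation (a minimal sketch; it only runs with ONLY_FULL_GROUP_BY disabled, which is the non-standard mode described above):

SELECT catalog_id, type    -- type is not in the GROUP BY,
FROM insanlyBigTable       -- so MySQL returns it from an arbitrary row of each group
GROUP BY catalog_id;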
My best guess, based on column names, is that you are trying to bring back a list of items that are in the same catalog as something that was purchased by a given buyer, but without any more than one item per catalog. In addition, you want something to connect back to the purchased item in that catalog, via the catalogpartner table's id.
So, something probably akin to amazon's "You may like these items because you purchased these other items" feature.
The new query
We want 1 row per insanlyBigTable.catalog_id, based on which catalog_id exists in smallerTable1, after filtering.
SELECT
ibt.description_short,
ibt.id AS insanlyBigTable,
ibt.type AS insanlyBigTableLol,
(
SELECT st.id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND st.catalog_id = ibt.catalog_id
LIMIT 1
) AS catalogpartner_id
FROM insanlyBigTable ibt
WHERE ibt.id IN (
SELECT (
SELECT ibt.id AS ibt_id
FROM insanlyBigTable ibt
WHERE ibt.catalog_id = sti.catalog_id
LIMIT 1
) AS ibt_id
FROM (
SELECT DISTINCT(catalog_id) FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND EXISTS (
SELECT * FROM insanlyBigTable ibt
WHERE ibt.type IN ('111','222','33')
AND ibt.catalog_id = st.catalog_id
)
) AS sti
)
This query should generate the same result as your original query, but it breaks things down into smaller queries to avoid the use (and abuse) of the GROUP BY clause on the insanlyBigTable.
Give it a try and let me know if you run into problems.
I'm having an issue getting this SQL query to work properly.
I have the following query
SELECT apps.*,
SUM(IF(adtracking.appId = apps.id AND adtracking.id = transactions.adTrackingId, transactions.payoutAmount, 0)) AS 'revenue',
SUM(IF(adtracking.appId = apps.id AND adtracking.type = 'impression', 1, 0)) AS 'impressions'
FROM apps, adtracking, transactions
WHERE apps.userId = '$userId'
GROUP BY apps.id
Everything is working; however, for the 'impressions' column I am generating in the query, I am getting a WAY larger number than there should be. For example, one matching app for this query should have only 72 for 'impressions', yet it is coming up with a value of over 3,000, when there aren't even that many rows in the adtracking table. Why is this? What is wrong here?
Your problem is that you have no join conditions, so you are getting every row of every table joined to every row of every other table in your result - this is called a Cartesian product.
To fix, change your FROM clause to this:
FROM apps a
LEFT JOIN adtracking ad ON ad.appId = a.id
LEFT JOIN transactions t ON t.adTrackingId = ad.id
You haven't provided the schema for your tables, so I guessed the names of the relevant columns - you may have to adjust them. Also, your transactions table may join to adtracking differently - it's impossible to know from your question, so again you may have to alter things slightly. Hopefully you get the idea.
Edit:
Note: your GROUP BY clause is incorrect. You either need to list every column of apps (not recommended), or change your select to return only the id column from apps (recommended). Change your select to this:
SELECT apps.id,
-- rest of query the same
Otherwise you'll get weird, incorrect, results.
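Putting both fixes together, the whole query would look something like this (column names still guessed, and COUNT(DISTINCT ...) guards the impression count against one adtracking row matching several transactions):

SELECT a.id,
       COALESCE(SUM(t.payoutAmount), 0) AS revenue,
       COUNT(DISTINCT IF(ad.type = 'impression', ad.id, NULL)) AS impressions
FROM apps a
LEFT JOIN adtracking ad ON ad.appId = a.id
LEFT JOIN transactions t ON t.adTrackingId = ad.id
WHERE a.userId = '$userId'
GROUP BY a.id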
EDIT: I will leave the post here as is, but what I really needed to accomplish needed to be reposted. I didn't explain the problem well enough. After trying again with quite a different starting point, I was able to get the query that I needed. That is explained here.
ORIGINAL QUESTION:
I'm having trouble. I have looked at similar threads, and I am unable to find a solution specific to this query. The database is very large, and group by seems to slow it down immensely.
The problem is I am getting duplicate results. Here is my query which causes duplicates:
SELECT
itpitems.identifier,
itpitems.name,
itpitems.subtitle,
itpitems.description,
itpitems.itemimg,
itpitems.mainprice,
itpitems.upc,
itpitems.isbn,
itpitems.weight,
itpitems.pages,
itpitems.publisher,
itpitems.medium_abbr,
itpitems.medium_desc,
itpitems.series_abbr,
itpitems.series_desc,
itpitems.voicing_desc,
itpitems.pianolevel_desc,
itpitems.bandgrade_desc,
itpitems.category_code,
itprank.overall_ranking,
itpitnam.name AS artist,
itpitnam.type_code
FROM itpitems
INNER JOIN itprank ON ( itprank.item_number = itpitems.identifier )
INNER JOIN itpitnam ON ( itpitems.identifier = itpitnam.item_number )
WHERE mainprice >1
The results are actually not complete duplicates: itpitnam.type_code differs between the otherwise-duplicated rows.
Since adding GROUP BY to the end of the query puts too much strain on the server (it's searching through about 300,000 records), what else can I do?
Can this be re-written as a sub-query? I just can't figure out how to eliminate the second instances where type_code has changed.
Thank you for your help and assistance.
I also tried SELECT DISTINCT itpitems.identifier, but this served up the same results, still with the duplicates (where type_code was the only difference). I don't want the second instance where type_code has changed; I just want one result per identifier, regardless of whether type_code has multiple values.
Without seeing examples of the output, it's hard to say. But have you tried the exact same query with a simple DISTINCT added to the SELECT?
SELECT DISTINCT itpitems.identifier, itpitems.name, itpitems.subtitle,
       itpitems.description, itpitems.itemimg, itpitems.mainprice,
       itpitems.upc, itpitems.isbn, itpitems.weight, itpitems.pages,
       itpitems.publisher, itpitems.medium_abbr, itpitems.medium_desc,
       itpitems.series_abbr, itpitems.series_desc, itpitems.voicing_desc,
       itpitems.pianolevel_desc, itpitems.bandgrade_desc, itpitems.category_code,
       itprank.overall_ranking, itpitnam.name AS artist, itpitnam.type_code
FROM itpitems
INNER JOIN itprank ON ( itprank.item_number = itpitems.identifier )
INNER JOIN itpitnam ON ( itpitems.identifier = itpitnam.item_number )
WHERE mainprice >1
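If DISTINCT still returns near-duplicates because type_code genuinely differs between rows, another option is to pin the join to exactly one itpitnam row per item. A sketch, assuming (item_number, type_code) is unique in itpitnam; the remaining itpitems columns stay as before:

SELECT itpitems.identifier,
       -- ... the other itpitems columns as above ...
       itprank.overall_ranking,
       itpitnam.name AS artist,
       itpitnam.type_code
FROM itpitems
INNER JOIN itprank ON ( itprank.item_number = itpitems.identifier )
INNER JOIN itpitnam ON ( itpitems.identifier = itpitnam.item_number
    AND itpitnam.type_code = ( SELECT MIN(x.type_code)
                               FROM itpitnam x
                               WHERE x.item_number = itpitnam.item_number ) )
WHERE mainprice > 1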
Ok, here we go. There's this messy SELECT crossing other tables and ordering to get the one desired row. Basically I do the "math" inside the ORDER BY.
1 base table.
7 JOINs pointing to local tables.
A WHERE with 2 clauses and a NOT IN crossing another table.
You'll see in the code that the ORDER BY is pretty damn big/ugly: it sums the results of 5 different calculations. I need to order by that sum to get the worst-case row.
The problem is that once I execute the stored procedure, it takes up to 8 seconds to run. That's kind of unacceptable, so I'm starting to look at indexes.
I'm looking for advice on how to make this query run faster.
I'm indexing the columns in the WHERE clauses and the field LINEA. Should I index something else, like the columns I'm joining on? Or should I approach the query differently?
Query:
SET @LINEA = (
SELECT TOP 1
BOA.LIN
FROM
BAND_BA BOA
LEFT JOIN
TEL PAR
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(PAR.Te,2,10)
LEFT JOIN
TELP CLP
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(CLP.Numtel,2,10)
LEFT JOIN
CA C
ON REPLACE(BOA.Lin,'-','') = C.An
LEFT JOIN
RE R
ON REPLACE(BOA.Lin,'-','') = R.Lin
LEFT JOIN
PRODUCTOS2 P2
ON BOA.PRODUCTO = P2.codigo
LEFT JOIN
EN
ON REPLACE(BOA.Lin,'-','') = EN.G
LEFT JOIN
TIP ID
ON TIPID = ID.ID
WHERE
BOA.EST = 'C' AND
ID.SE = 'boA' AND
BOA.LIN NOT IN (
SELECT
LIN
FROM
BAN
)
ORDER BY (EN.VALUE + ANT.VALUE + REIT.VAL + C.VALUE + TEL.VALUE
) DESC,
I'll be frank, this is some pretty terrible SQL. Without seeing all your table structures, advice here will be incomplete. That being said, please don't post all your table structures because you are already very close to "hire a consultant" territory with this.
All the REPLACE logic should be done away with. If you need to JOIN on these fields, then add comparable fields to the tables so you don't need to manipulate the data. Every single JOIN that uses a REPLACE or SUBSTRING is a table or index scan - those are non-SARGable and a definite anti-pattern.
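For example, the normalized value can be stored once as a persisted computed column and indexed, so each join can seek instead of scan (a sketch in T-SQL; the column and index names here are made up):

ALTER TABLE BAND_BA ADD Lin_clean AS REPLACE(Lin, '-', '') PERSISTED;
CREATE INDEX IX_BAND_BA_Lin_clean ON BAND_BA (Lin_clean);

-- Each join then becomes, e.g.:
--   ON BOA.Lin_clean = PAR.Te_clean
-- instead of:
--   ON REPLACE(BOA.Lin,'-','') = SUBSTRING(PAR.Te,2,10)
-- (TEL would need the same treatment to get a Te_clean column.)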
The ORDER BY is probably the most convoluted ORDER BY I have ever seen. Some major issues there:
Subqueries should all be eliminated and materialized either in the outer query or as variables
String manipulation should be eliminated (see item 1 above)
The entire query is basically a code smell. If you need to write code like this to meet business requirements then you either have a terribly inappropriate design or some other much larger issue in the organization or data.
One thing that can kill performance is using a lot of LEFT JOINs. To improve performance of LEFT JOIN, you might want to make sure that the column(s) to which you join have an index - that can have a huge impact on performance.
I've currently got a query that selects metrics data from two tables whilst getting the projects to query from two other tables (one is owned projects, the other is projects to which the user has access).
SELECT v.`projectID`,
(SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`) AS `sessions`,
(SELECT COUNT(pb.`interact`)
FROM `interactionData` pb WHERE pb.`projectID` = v.`projectID` GROUP BY pb.`projectID`) AS `interactions`
FROM `medias` v
LEFT JOIN `projectsExt` pa ON v.`projectsExtID` = pa.`projectsExtID`
WHERE (pa.`user` = '1' OR v.`ownerUser` = '1')
GROUP BY v.`projectID`
It takes too long, 1-2 seconds. This is obviously the multi-left-join scenario. But I've got a couple of ideas to improve speed and wondered what the thoughts were in principle. Do I:
Try and select the list in the query and then get the data, rather than doing the joins. Not sure how this would work.
Do a select in a separate query to get the projectIDs and then run queries on each projectID afterwards. This may lead to hundreds or potentially thousands of requests, but may be better for the processing?
Other ideas?
There are two questions here:
how can I get my result in less than 2 seconds
how can I avoid a left join.
To answer #1 properly, there has to be more information. Technical information, such as the explain plan for this particular query, is a good start. Even better if we had the SHOW CREATE TABLE output for all tables that you access, as well as the number of rows they contain.
But I'd also appreciate more functional information: what exactly is the question you're trying to answer? Right now, it seems you're looking at two different sets of medias:
either there is no matching row in projectsExt, in which case medias.ownerUser must equal '1' (is that '1' supposed to be a string, btw?)
or there is exactly one matching row in projectsExt, for which projectsExt.user must equal '1' (same question: is '1' supposed to be a string?)
By lack of enough information to answer #1, I can answer #2 - "how to avoid a left join". Answer is: write a UNION of the two sets, one where there is a match and one where there isn't a match.
SELECT v.`projectID`
, (
SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`
) AS `sessions`
, (
SELECT COUNT(pb.`interact`)
FROM `interactionData` pb
WHERE pb.`projectID` = v.`projectID`
GROUP BY pb.`projectID`
) AS `interactions`
FROM (
SELECT v.projectID
FROM medias v
WHERE v.ownerUser = '1'
GROUP BY v.projectID
UNION ALL
SELECT v.projectID
FROM medias v
INNER JOIN projectsExt pa
ON v.projectsExtID = pa.projectsExtID
WHERE v.ownerUser != '1'
AND pa.user = '1'
GROUP BY v.`projectID`
) v
Have you tried, instead, to refactor everything into left joins? Seeing as how you're always grouping on the same field, it shouldn't be a problem. Try that and post an EXPLAIN to see what the bottlenecks are.
Subselects are less performant than joins, because the engine can optimize joins to a much higher degree. In fact, where possible, the engine will usually rewrite subselects into joins.
As a rule of thumb, there is no gain in splitting queries; all you gain is overhead and a confused optimizer. There are, as always, exceptions to this rule, but they come into play only after you've done what you can with the traditional approach and know you need to go further.
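For example, the two correlated COUNT subqueries in the SELECT list above could instead be expressed as joins against pre-aggregated derived tables (a sketch using the tables from the question; the ownerUser/projectsExt filtering would be added back as above):

SELECT v.`projectID`,
       COALESCE(m.`sessions`, 0) AS `sessions`,
       COALESCE(pb.`interactions`, 0) AS `interactions`
FROM `medias` v
LEFT JOIN ( SELECT `projectID`, COUNT(`session`) AS `sessions`
            FROM `metricData`
            GROUP BY `projectID` ) m
       ON m.`projectID` = v.`projectID`
LEFT JOIN ( SELECT `projectID`, COUNT(`interact`) AS `interactions`
            FROM `interactionData`
            GROUP BY `projectID` ) pb
       ON pb.`projectID` = v.`projectID`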