Join on 3 tables insanely slow on giant tables - mysql

I have a query which goes like this:
SELECT insanlyBigTable.description_short,
insanlyBigTable.id AS insanlyBigTable,
insanlyBigTable.type AS insanlyBigTableLol,
catalogpartner.id AS catalogpartner_id
FROM insanlyBigTable
INNER JOIN smallerTable ON smallerTable.id = insanlyBigTable.catalog_id
INNER JOIN smallerTable1 ON smallerTable1.catalog_id = smallerTable.id
AND smallerTable1.buyer_id = 'xxx'
WHERE smallerTable1.cont = 'Y' AND insanlyBigTable.type IN ('111','222','33')
GROUP BY smallerTable.id;
Now, when I run the query for the first time, it copies the giant table into a temp table... I want to know how I can prevent that. I am considering a nested query, or even reversing the join (though I'm not sure that would make it run faster), but that is, well, not nice. Any other suggestions?

To figure out how to optimize your query, we first have to boil down exactly what it is selecting so that we can preserve that information while we change things around.
What your query does
So, it looks like we need the following:
The GROUP BY clause limits the results to at most one row per catalog_id
smallerTable1.cont = 'Y', insanlyBigTable.type IN ('111','222','33'), and buyer_id = 'xxx' appear to be the filters on the query.
And we want data from insanlyBigTable and ... catalogpartner? I would guess that catalogpartner is smallerTable1, due to the id of smallerTable being linked to the catalog_id of the other tables.
I'm not sure what the purpose of putting the buyer_id filter in the ON clause was, but unless you tell me differently, I'll assume its placement there is unimportant.
The point of the query
I am unsure about the intent of the query, based on that GROUP BY clause. You will obtain just one row per catalog_id in the insanlyBigTable, but you don't appear to care which row it is. Indeed, the fact that you can run this query at all is due to a non-standard MySQL feature that lets you SELECT columns that do not appear in the GROUP BY clause... however, you don't get to choose WHICH row each value comes from. This means you could end up with information from 4 different rows, one for each of your selected columns.
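For example, this stripped-down sketch of the pattern is accepted by MySQL when the ONLY_FULL_GROUP_BY SQL mode is disabled, even though description_short is neither grouped nor aggregated:
-- Sketch: which row supplies description_short is undefined for each group.
SELECT catalog_id, description_short
FROM insanlyBigTable
GROUP BY catalog_id;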
My best guess, based on column names, is that you are trying to bring back a list of items that are in the same catalog as something that was purchased by a given buyer, but without any more than one item per catalog. In addition, you want something to connect back to the purchased item in that catalog, via the catalogpartner table's id.
So, something probably akin to Amazon's "You may like these items because you purchased these other items" feature.
The new query
We want 1 row per insanlyBigTable.catalog_id, based on which catalog_id exists in smallerTable1, after filtering.
SELECT
ibt.description_short,
ibt.id AS insanlyBigTable,
ibt.type AS insanlyBigTableLol,
(
SELECT st.id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND st.catalog_id = ibt.catalog_id
LIMIT 1
) AS catalogpartner_id
FROM insanlyBigTable ibt
WHERE ibt.id IN (
SELECT (
SELECT ibt2.id
FROM insanlyBigTable ibt2
WHERE ibt2.catalog_id = sti.catalog_id
AND ibt2.type IN ('111','222','33')
LIMIT 1
) AS ibt_id
FROM (
SELECT DISTINCT st.catalog_id FROM smallerTable1 st
WHERE st.buyer_id = 'xxx'
AND st.cont = 'Y'
AND EXISTS (
SELECT * FROM insanlyBigTable ibt
WHERE ibt.type IN ('111','222','33')
AND ibt.catalog_id = st.catalog_id
)
) AS sti
);
This query should generate the same result as your original query, but it breaks things down into smaller queries to avoid the use (and abuse) of the GROUP BY clause on the insanlyBigTable.
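If it is still slow, composite indexes along these lines should let the correlated subqueries resolve via index lookups rather than scans (a sketch; the index names are arbitrary, and you should verify the column order against EXPLAIN):
-- Assumed schema; adjust names and columns to match your tables.
ALTER TABLE smallerTable1
ADD INDEX idx_buyer_cont_catalog (buyer_id, cont, catalog_id);
ALTER TABLE insanlyBigTable
ADD INDEX idx_catalog_type (catalog_id, type);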
Give it a try and let me know if you run into problems.

Related

Complex MySQL query problems and also SQL hangs

I am trying to write an SQL query which is pretty complex. The requirements are as follows:
I need to return these fields from the query:
track.artist
track.title
track.seconds
track.track_id
track.relative_file
album.image_file
album.album
album.album_id
track.track_number
I can select a random track with the following query:
select
track.artist, track.title, track.seconds, track.track_id,
track.relative_file, album.image_file, album.album,
album.album_id, track.track_number
FROM
track, album
WHERE
album.album_id = track.album_id
ORDER BY RAND() limit 10;
Here is where I am having trouble, though. I also have tables called "trackfilters1" through "trackfilters10". Each row has an auto-incrementing ID field, so row 10 holds the data for album_id 10. The flags fields are populated with 1's and 0's. For example, if album #10 has 10 tracks, then trackfilters1.flags will contain "1111111111" if all tracks are to be included in the search; if track 10 were to be excluded, it would contain "1111111110".
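To illustrate, reading track 10's flag out of that string with MID:
SELECT MID('1111111110', 10, 1); -- returns '0', meaning track 10 is excluded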
My problem is including this clause.
The latest query I have come up with is the following:
select
track.artist, track.title, track.seconds,
track.track_id, track.relative_file, album.image_file,
album.album, album.album_id, track.track_number
FROM
track, album, trackfilters1, trackfilters2
WHERE
album.album_id = track.album_id
AND
( (album.album_id = trackfilters1.id)
OR
(album.album_id=trackfilters2.id) )
AND
( (mid(trackfilters1.flags, track.track_number,1) = 1)
OR
( mid(trackfilters2.flags, track.track_number,1) = 1))
ORDER BY RAND() limit 2;
However, this is causing MySQL to hang. I'm presuming that I'm doing something wrong. Does anybody know what it is? I would be open to suggestions if there is an easier way to achieve my end result; I am not set on repairing my broken query if there is a better way to accomplish this.
Additionally, in my trials, I have noticed that when I had a working query and added, say, trackfilters2 to the FROM clause without using it anywhere in the query, it would hang as well. This makes me wonder: is this correct behavior? I would think adding to the FROM list without making use of the data would just make the server procure more data; I wouldn't have expected it to hang.
There's not enough information here to determine what's causing the performance issue.
But here are a few suggestions and comments.
Ditch the old-school comma syntax for the join operations, and use the JOIN keyword instead. And relocate the join predicates to an ON clause.
And for heaven's sake, format the SQL so that it's decipherable by someone trying to read it.
There's some questions here... will there always be a matching row in both trackfilters1 and trackfilters2 for rows you want to return? Or could a row be missing from trackfilters2, and you still want to return the row if there's a matching row in trackfilters1? (The answer to that question determines whether you'd want to use an outer join vs an inner join to those tables.)
For best performance with large sets, having appropriate indexes defined is going to be critical.
Use EXPLAIN to see the execution plan.
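For example (a sketch; the index name is arbitrary, and I'm assuming track.album_id isn't already indexed):
-- Supports the join predicate album.album_id = track.album_id.
CREATE INDEX track_ix_album_id ON track (album_id);
-- trackfilters1.id and trackfilters2.id are auto-increment, so they are
-- presumably already primary keys and need no additional index.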
I suggest you try writing your query like this:
SELECT track.artist
, track.title
, track.seconds
, track.track_id
, track.relative_file
, album.image_file
, album.album
, album.album_id
, track.track_number
FROM track
JOIN album
ON album.album_id = track.album_id
LEFT
JOIN trackfilters1
ON trackfilters1.id = album.album_id
LEFT
JOIN trackfilters2
ON trackfilters2.id = album.album_id
WHERE MID(trackfilters1.flags, track.track_number, 1) = '1'
OR MID(trackfilters2.flags, track.track_number, 1) = '1'
ORDER BY RAND()
LIMIT 2
And if you want help with performance, provide the output from EXPLAIN, and what indexes are defined.

something wrong with the result of mysql query with joins and select

Good day,
I am trying to join 3 tables for my inventory report, but I am getting weird results out of it.
My query:
SELECT i_inventory.xid,
count(x_transaction_details.xitem) AS occurrence,
i_inventory.xitem AS itemName,
SUM(i_items_group.or_qty) AS `openingQty`,
avg(x_transaction_details.cost) AS avg_cost,
SUM(x_transaction_details.qty) AS totalNumberSold,
SUM(i_items_group.or_qty) - SUM(x_transaction_details.qty) AS totalRemQty
FROM x_transaction_details
LEFT JOIN i_inventory ON x_transaction_details.xitem = i_inventory.xid
LEFT JOIN i_items_group ON i_inventory.xid = i_items_group.xitem
WHERE (x_transaction_details.date_at BETWEEN '2015-01-18 03:14:54' AND '2015-10-18 03:14:54')
AND i_inventory.xid = 3840
GROUP BY x_transaction_details.xitem
ORDER BY occurrence DESC
This query gives me this result:
See the openingQty column. I then tried a simple query to verify the result;
here's my query for checking the openingQty, joining only 2 tables: i_items_group (where batches are stored) and i_inventory (where item information is stored).
SELECT i_inventory.xid,
i_inventory.xitem,
SUM(i_items_group.or_qty) AS openingQty,
i_items_group.cost
FROM i_inventory
INNER JOIN i_items_group ON i_inventory.xid = i_items_group.xitem
WHERE i_inventory.xid = 3840
AND (i_items_group.date_at BETWEEN '2015-01-18 03:14:54' AND '2015-10-18 03:14:54')
My result was:
which is the correct data.
I also made a query on my x_transaction_details table to verify whether it's correct or not.
Here's my query:
select xitem, qty as qtySold from x_transaction_details where xitem = 3840
AND (date_at BETWEEN '2015-01-18 03:14:54' AND '2015-10-18 03:14:54')
result:
Which would total to 15 qtySold.
I'm just confused about how I got 3269 as a result of my query, whereas the true openingQty should be only 467.
I guess the problem is in my query with the joins; it's multiplying by the number of transactions and then summing it up (I really don't know, though).
Can you please help me identify it, and help me come up with the correct query?
This is a common problem with multiple SUM statements in a single query. Keep in mind how SQL does aggregation: first it generates a set of data that is not aggregated, then it aggregates it. Try your query without the GROUP BY or aggregate functions, and you'll be surprised what you turn up. There aren't enough of the right details in your post for me to determine where the breakdown is, but I can guess.
It looks like you have an xitem corresponding to some kind of product, then you have joined that to both transactions and items groups. Suppose a particular xitem matches with 3 transactions and 5 item groups. You'll get 15 records from that join. And when you sum it, any SUM calculations based on fields from the transaction table will be 5x higher than you expect, and any SUM calculations from the item groups table will be 3x higher than you expect. The key symptom here is the aggregate result being a multiple of the correct value, but seemingly different multiples for different rows.
There are multiple ways to address this kind of error. Some developers like to calculate one of the aggregates in a subquery, then do the other aggregate in the main query and group by the already-correct result from the subquery. Others like to write inline queries to do the aggregate right in the expression:
SELECT xitem, (SELECT SUM(i_items_group.or_qty) FROM i_items_group WHERE i_inventory.xid = i_items_group.xitem) AS `openingQty`
, -- select more fields
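For the subquery approach, here's a sketch that aggregates each table in its own derived table before joining, using the tables and columns from your question (move the date filter into whichever derived table needs it):
SELECT inv.xid,
inv.xitem AS itemName,
og.openingQty,
td.totalNumberSold,
og.openingQty - td.totalNumberSold AS totalRemQty
FROM i_inventory inv
LEFT JOIN (SELECT xitem, SUM(or_qty) AS openingQty
FROM i_items_group
GROUP BY xitem) og ON og.xitem = inv.xid
LEFT JOIN (SELECT xitem, SUM(qty) AS totalNumberSold
FROM x_transaction_details
WHERE date_at BETWEEN '2015-01-18 03:14:54' AND '2015-10-18 03:14:54'
GROUP BY xitem) td ON td.xitem = inv.xid
WHERE inv.xid = 3840;
Because each SUM happens before the join, neither total can be inflated by the other table's row count.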
Find what approach works best for you. But if you want to see the evidence for yourself, run this query with the aggregates gone and you'll see why those SUMs are doing what they are doing:
SELECT i_inventory.xid,
x_transaction_details.xitem AS occurrence,
i_inventory.xitem AS itemName,
i_items_group.or_qty,
x_transaction_details.cost,
x_transaction_details.qty,
i_items_group.or_qty - x_transaction_details.qty AS RemainingQty
FROM x_transaction_details
LEFT JOIN i_inventory ON x_transaction_details.xitem = i_inventory.xid
LEFT JOIN i_items_group ON i_inventory.xid = i_items_group.xitem
WHERE (x_transaction_details.date_at BETWEEN '2015-01-18 03:14:54' AND '2015-10-18 03:14:54')
AND i_inventory.xid = 3840
ORDER BY occurrence DESC

mysql left join and order by

Hi, I have two different tables, m_mp and m_answer. Both have a date column (m_date and m_mdate, respectively) which is filled using the time() function. I want to select the entries from both, but ordered by date.
Example:
first table: 'some text', '6464647776'
second table: 'some answer', '545454545'
So I want to show the second table's entry first and then the first one.
This is the code I'm using:
SELECT r.*,u.*
FROM `mensajes` as r
LEFT JOIN `m_answer` as u on r.id = u.id_m
WHERE r.id = '1'
ORDER BY m_date
and then display the result of each table using a while loop.
I guess I get what you want to do.
You may force both grouping and ordering using ORDER BY, like this:
http://sqlfiddle.com/#!9/373d65/9
I know the solution is not optimal in terms of speed, but, given proper indices, I suspect performance will be acceptable unless you are already aiming for millions of messages; when that time comes, you would want to either do a proper GROUP BY, or make subsequent queries for just the recently answered questions from the current page.
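In case the fiddle link rots, a minimal sketch of the idea: interleave rows from both tables with UNION ALL and sort the combined set by timestamp. (The texto column is a placeholder for whatever your real text columns are, and this may not match the fiddle exactly.)
-- 'texto' is a hypothetical column name used for illustration only.
SELECT r.m_date AS event_date, r.texto, 'message' AS source
FROM mensajes r
WHERE r.id = 1
UNION ALL
SELECT u.m_mdate, u.texto, 'answer'
FROM m_answer u
WHERE u.id_m = 1
ORDER BY event_date;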

SQL Sum Query Behaving Strangely?

I'm having an issue getting this SQL query to work properly.
I have the following query
SELECT apps.*,
SUM(IF(adtracking.appId = apps.id AND adtracking.id = transactions.adTrackingId, transactions.payoutAmount, 0)) AS 'revenue',
SUM(IF(adtracking.appId = apps.id AND adtracking.type = 'impression', 1, 0)) AS 'impressions'
FROM apps, adtracking, transactions
WHERE apps.userId = '$userId'
GROUP BY apps.id
Everything is working; HOWEVER, for the 'impressions' column I am generating in the query, I am getting a WAY larger number than there should be. For example, one matching app for this query should only have 72 for 'impressions', yet it is coming up with a value of over 3,000 when there aren't even that many rows in the adtracking table. Why is this? What is wrong here?
Your problem is that you have no join conditions, so you are getting every row of every table combined with every row of every other table in your query result - this is called a Cartesian product.
To fix, change your FROM clause to this:
FROM apps a
LEFT JOIN adtracking ad ON ad.appId = a.id
LEFT JOIN transactions t ON t.adTrackingId = ad.id
You haven't provided the schema for your tables, so I guessed the names of the relevant columns - you may have to adjust them. Also, your transactions table may join to adtracking in a different way - it's impossible to know from your question, so again you may have to alter things slightly. Hopefully you get the idea.
Edit:
Note: your GROUP BY clause is incorrect. You either need to list every column of apps (not recommended), or change your select to only select the id column from apps (recommended). Change your select to this:
SELECT apps.id,
-- rest of query the same
Otherwise you'll get weird, incorrect results.
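Putting the join fix and the narrowed select list together, a sketch (column names are guessed from your query; COUNT(DISTINCT ...) keeps one tracking row from being counted once per matching transaction):
SELECT a.id,
COALESCE(SUM(t.payoutAmount), 0) AS revenue,
COUNT(DISTINCT CASE WHEN ad.type = 'impression' THEN ad.id END) AS impressions
FROM apps a
LEFT JOIN adtracking ad ON ad.appId = a.id
LEFT JOIN transactions t ON t.adTrackingId = ad.id
WHERE a.userId = '$userId'
GROUP BY a.id;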

Retrieve top 1 and 2 records from each group from table

I have a query that needs to get the first and second highest SKU in each member's wishlist. The below query works, but it takes way too long because there are about 9 million users and each user has about 10 wishlist items, so you can see that the query below will never finish.
SELECT MAX(CASE WHEN w.rank = 1 THEN w.SKU ELSE NULL END) AS [highestSku],
MAX(CASE WHEN w.rank = 2 THEN w.SKU ELSE NULL END) AS [secondHighestSku]
FROM Member m
LEFT JOIN (SELECT *
FROM (SELECT DENSE_RANK() OVER (PARTITION BY wl.MemberID ORDER BY wli.Price DESC) AS rank, wl.MemberID, wli.SKU
FROM WishListItem wli
INNER JOIN WishList wl ON wli.WishListID = wl.ID) T1) w ON w.MemberID = m.ID
My question is, is there a better way to get the top first and second records for each user? If not, is there a way I can optimize this query? Ideally, if I can restrict the number of items pulled back from the ranking query (the one with the DENSE_RANK()), that will help me out. I wanted to do something like WHERE DENSE_RANK() <= 2, but that's not possible, and doing it outside of the brackets defeats the purpose of the solution.
Also, this is just part of the query. I actually have even more left joins across more tables that have just as many items, and I need to get the top 1 and 2 records for each user.
And this needs to be done in one query, or as much as possible in one, because I'm throwing it in a data table. I can also reduce the number of records, i.e. TOP 1000, and break up the query, but I will need to be able to continue from where I left off... Also, I did try TOP 1000, and after 10 minutes I cancelled the query, because I need to get all 9 million records out.
I'd grab a relatively small subset of the data, stick it in a table variable, and run the query off that instead of the main (and likely very "busy") tables:
DECLARE @Member TABLE
(
ID int IDENTITY (1, 1) PRIMARY KEY NOT NULL,
-- add necessary columns to this definition.
)
INSERT INTO @Member (field1, field2...)
SELECT field1, field2 -- etc.
FROM YourTables
WHERE SomeCriteria = Whatever
Make sure that the WHERE clause defines a narrower subset of data than your production tables. If performance still suffers, you could create table variables for the other tables you're joining, then use those in the final query.
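As an aside, the WHERE DENSE_RANK() <= 2 filtering you wanted is possible if you wrap the ranking in a derived table and filter in the outer query. A sketch reusing your table and column names:
SELECT w.MemberID,
MAX(CASE WHEN w.rnk = 1 THEN w.SKU END) AS [highestSku],
MAX(CASE WHEN w.rnk = 2 THEN w.SKU END) AS [secondHighestSku]
FROM (SELECT wl.MemberID, wli.SKU,
DENSE_RANK() OVER (PARTITION BY wl.MemberID ORDER BY wli.Price DESC) AS rnk
FROM WishListItem wli
INNER JOIN WishList wl ON wli.WishListID = wl.ID) w
WHERE w.rnk <= 2
GROUP BY w.MemberID;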