I have a very large dataset, and I am trying to do a LEFT outer join, but keep losing some of my left table rows because of where I place my WHERE command, similar to the problem and solution here.
Example 1
SELECT *
FROM table1
LEFT JOIN table2
USING (IDvar)
WHERE table2.var IN(val1, val2,..., val100);
This only selects the rows in the first/left table (table1) that have a matching row in the second/right table (table2). The second example is what is likely to work:
Example 2
SELECT *
FROM table1
LEFT JOIN table2
USING (IDvar)
AND (table2.var = val1 OR table2.var = val2);
But I have about 200 table2.var values that I would like to include, and they are sporadic and non-contiguous (so I can't use syntax like table2.var >= val1).
An example of what I thought should work is to use "AND" and "IN" such as (because I have the values as a comma-separated list):
Example 3
SELECT *
FROM table1
LEFT JOIN table2
USING (IDvar)
AND table2.var IN(val1, val2,..., val100);
So how can I get many many values into an AND command?
I've found a working solution, but it takes far too long to run.
Example 4 - Working Example but takes too long
SELECT *
FROM table1
LEFT JOIN (SELECT IDvar, var FROM table2 WHERE var IN(val1, val2,..., val100)) AS t
USING (IDvar);
Is there any way of optimising this query? It is taking far too long.
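For what it's worth, here is a minimal sketch of the difference between filtering in WHERE and filtering in the join condition, run against an in-memory SQLite database from Python. The table contents, the `name` column and the three-value `wanted` list (standing in for val1 ... val100) are made up for illustration. As far as I know, USING only takes a column list, so the extra IN filter is written with ON instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (IDvar INTEGER, name TEXT);
    CREATE TABLE table2 (IDvar INTEGER, var INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 10), (2, 99);
""")

# Hypothetical stand-in for the long comma-separated list val1 ... val100.
wanted = [10, 20, 30]
placeholders = ", ".join("?" for _ in wanted)

# Filter in WHERE: rows whose table2.var is NULL or not in the list are dropped.
where_rows = conn.execute(f"""
    SELECT table1.IDvar, table2.var
    FROM table1
    LEFT JOIN table2 USING (IDvar)
    WHERE table2.var IN ({placeholders})
    ORDER BY table1.IDvar
""", wanted).fetchall()

# Filter in ON: every table1 row survives; non-matching table2 rows become NULL.
on_rows = conn.execute(f"""
    SELECT table1.IDvar, table2.var
    FROM table1
    LEFT JOIN table2
      ON table1.IDvar = table2.IDvar
     AND table2.var IN ({placeholders})
    ORDER BY table1.IDvar
""", wanted).fetchall()

print(where_rows)
print(on_rows)
```

With this toy data the WHERE version keeps only the one fully matching row, while the ON version keeps all three table1 rows and fills table2.var with NULL where nothing in the list matched.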
I can't understand what's happening...
I use two SQL queries which do not return the same thing...
This one:
SELECT * FROM table1 t1 JOIN table1 t2 on t1.attribute1 = t2.attribute1
I get 10 rows.
This other one:
SELECT * FROM table1 NATURAL JOIN table1
I get 8 rows.
With the NATURAL JOIN, 2 rows aren't returned... I looked for the missing rows, and they are rows that share the same value in the attribute1 column...
That seems impossible to me.
If anyone has an answer I could sleep better ^^
Best regards
Max
As was pointed out in the comments, the reason you are getting a different row count is that the natural join is connecting your self join using all columns. All columns are being compared because the same table appears on both sides of the join. To test this hypothesis, just check the column values from both tables, which should all match.
The moral of the story here is to avoid natural joins. Besides not being clear as to the join condition, the logic of the join could easily change should table structure change, e.g. if a new column gets added.
Follow the link below for a small demo which tries to reproduce your current results. In a table of 8 records, the natural join returns 8 records, whereas the inner join on one attribute returns 10 records due to some duplicate matching.
Demo
You need to 'project away' the attribute you don't want used in the join e.g. in a derived table (dt):
SELECT *
FROM table1
NATURAL JOIN ( SELECT attribute1 FROM table1 ) dt;
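A small sketch that reproduces the 10-versus-8 row counts, using Python's sqlite3 module. The `id` column and the eight rows are invented for illustration (the real table structure isn't shown in the question); the only thing that matters is that one attribute1 value appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- id is a hypothetical second column; 'a' is the duplicated attribute1 value.
    CREATE TABLE table1 (id INTEGER, attribute1 TEXT);
    INSERT INTO table1 VALUES
        (1, 'a'), (2, 'a'), (3, 'b'), (4, 'c'),
        (5, 'd'), (6, 'e'), (7, 'f'), (8, 'g');
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

# Join on attribute1 only: the two 'a' rows match each other both ways.
on_attribute = count(
    "SELECT COUNT(*) FROM table1 t1 JOIN table1 t2 "
    "ON t1.attribute1 = t2.attribute1"
)

# Natural self join: compares id AND attribute1, so each row matches only itself.
natural = count("SELECT COUNT(*) FROM table1 t1 NATURAL JOIN table1 t2")

# Project away id in a derived table: only attribute1 is left to join on.
projected = count(
    "SELECT COUNT(*) FROM table1 "
    "NATURAL JOIN (SELECT attribute1 FROM table1) dt"
)

print(on_attribute, natural, projected)
```

The six unique values contribute one row each and the duplicated value contributes four, which gives 10 for the attribute-only joins and 8 for the natural self join.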
Scenario:
In everything below, "created_date", "other_created_date" and "date" each hold a day (e.g. 2012-01-03)
Table 1:
fields:
ID | created_date
Table 2:
fields:
ID | table_1_fk | other_created_date
Table 3:
fields:
date
The Goal:
I want to do the following:
SELECT * FROM table_1
JOIN table_2
ON table_1.id = table_2.table_1_fk
FULL OUTER JOIN table_3
ON table_3.date = (
CASE
WHEN table_1.created_date > table_2.other_created_date THEN table_1.created_date
ELSE table_2.other_created_date
END
)
Basically, I'm interested in (Table_1 + Table_2) JOINed on Table_3: if Table_1's date is the later one we join on Table_1's date, and otherwise we join on Table_2's date.
Is this possible or is there a better way to go at this?
SELECT * FROM table_1
JOIN table_2
ON table_1.id = table_2.table_1_fk
FULL OUTER JOIN table_3
ON table_3.date = GREATEST(table_1.created_date,table_2.other_created_date)
I like Bernd's answer. However, without knowing anything about the content of these tables, I think it's worth your while to evaluate the performance difference between doing what you are suggesting and simply having two separate outer joins. I know I've done creative things in the joins before, and the database will manage it, but HOW it manages it may not be what I had in mind at all, especially if dealing with tens of millions of records.
As an example, this is what the SQL might look like if you used two outer joins instead of trying to merge them into one. It will potentially be a lot more code, which is why you will need to benchmark it to see if it matters.
I know I used left joins here -- I'm always a little suspicious when I see a full outer, but that's not to say it wasn't exactly what you wanted. But this is for illustration purposes only:
SELECT
case
when table_1.created_date > table_2.other_created_date then
t3a.<field_1>
else
t3b.<field_1>
end
FROM
table_1
JOIN table_2
ON table_1.id = table_2.table_1_fk
left join table_3 t3a on
t3a.date = table_1.created_date
left join table_3 t3b on
t3b.date = table_2.other_created_date
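To make the comparison concrete, here is a rough sketch in Python with sqlite3 that runs both shapes of the query on a few made-up rows. Everything about the data is an assumption: the `label` column stands in for <field_1>, SQLite's two-argument max() stands in for GREATEST, and a LEFT JOIN is used in both queries rather than the FULL OUTER JOIN from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, created_date TEXT);
    CREATE TABLE table_2 (id INTEGER, table_1_fk INTEGER, other_created_date TEXT);
    CREATE TABLE table_3 (date TEXT, label TEXT);  -- label: hypothetical payload column
    INSERT INTO table_1 VALUES (1, '2012-01-03'), (2, '2012-01-01');
    INSERT INTO table_2 VALUES (10, 1, '2012-01-02'), (11, 2, '2012-01-05');
    INSERT INTO table_3 VALUES
        ('2012-01-02', 'd2'), ('2012-01-03', 'd3'), ('2012-01-05', 'd5');
""")

# One join on the later of the two dates (scalar max() plays the role of GREATEST).
single_join = conn.execute("""
    SELECT table_1.id, table_3.label
    FROM table_1
    JOIN table_2 ON table_1.id = table_2.table_1_fk
    LEFT JOIN table_3
      ON table_3.date = max(table_1.created_date, table_2.other_created_date)
    ORDER BY table_1.id
""").fetchall()

# Two plain equality joins, with a CASE in the SELECT list picking the winner.
two_joins = conn.execute("""
    SELECT table_1.id,
           CASE WHEN table_1.created_date > table_2.other_created_date
                THEN t3a.label
                ELSE t3b.label
           END
    FROM table_1
    JOIN table_2 ON table_1.id = table_2.table_1_fk
    LEFT JOIN table_3 t3a ON t3a.date = table_1.created_date
    LEFT JOIN table_3 t3b ON t3b.date = table_2.other_created_date
    ORDER BY table_1.id
""").fetchall()

print(single_join)
print(two_joins)
```

Both queries pick the same table_3 row for each pair here. Which one is faster on real data depends on the engine and the indexes, so it still has to be benchmarked.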
-- EDIT --
Here's an example of where a compactly-coded join condition had horrible performance and a workaround that was way more code but worth it:
PostgreSQL Joining Between Two Values
Let's say I have about 25,000 records in two tables and the data in each should be the same. If I need to find any rows that are in table A but NOT in table B, what's the most efficient way to do this?
We've tried it as a subquery on one table with a NOT IN against the result, but this runs for over 10 minutes and almost crashes our site.
There must be a better way. Maybe a JOIN?
Hopefully a LEFT OUTER JOIN will do the job:
select t1.similar_ID
, case when t2.similar_ID is not null then 1 else 0 end as row_exists
from table1 t1
left outer join (select distinct similar_ID from table2) t2
on t1.similar_ID = t2.similar_ID -- your WHERE goes here
I would suggest you read the following blog post, which goes into great detail on this question:
Which method is best to select values present in one table but missing
in another one?
And after a thorough analysis, arrives at the following conclusion:
However, these three methods [NOT IN, NOT EXISTS, LEFT JOIN]
generate three different plans which are executed by three different
pieces of code. The code that executes EXISTS predicate is about 30%
less efficient than those that execute index_subquery and LEFT JOIN
optimized to use Not exists method.
That’s why the best way to search for missing values in MySQL is using a LEFT JOIN / IS NULL or NOT IN rather than NOT
EXISTS.
If the performance you're seeing with NOT IN is not satisfactory, you won't improve this performance by switching to a LEFT JOIN / IS NULL or NOT EXISTS, and instead you'll need to take a different route to optimizing this query, such as adding indexes.
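Use EXISTS and NOT EXISTS instead. The subquery has to be correlated with the outer row (the shared key is assumed here to be a column named id):
SELECT * FROM A WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.id = A.id);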
Left join. From the MySQL documentation:
If there is no matching row for the right table in the ON or USING
part in a LEFT JOIN, a row with all columns set to NULL is used for
the right table. You can use this fact to find rows in a table that
have no counterpart in another table:
SELECT left_tbl.* FROM left_tbl LEFT JOIN right_tbl ON left_tbl.id =
right_tbl.id WHERE right_tbl.id IS NULL;
This example finds all rows in left_tbl with an id value that is not
present in right_tbl (that is, all rows in left_tbl with no
corresponding row in right_tbl).
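Here is a small sketch of the three variants discussed in this thread, run on SQLite through Python's sqlite3 module rather than MySQL, so it says nothing about MySQL's timings. The rows are made up, and the NULL in right_tbl is there on purpose: it shows a case where NOT IN behaves differently from the other two. The index is an assumption about what would help, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_tbl (id INTEGER);
    CREATE TABLE right_tbl (id INTEGER);
    INSERT INTO left_tbl VALUES (1), (2), (3), (4);
    INSERT INTO right_tbl VALUES (1), (3), (NULL);
    -- Assumed index on the lookup column, so each probe is not a full scan.
    CREATE INDEX right_tbl_id ON right_tbl (id);
""")

def ids(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]

# LEFT JOIN / IS NULL: keep left rows that found no partner.
left_join = ids("""
    SELECT left_tbl.id
    FROM left_tbl
    LEFT JOIN right_tbl ON left_tbl.id = right_tbl.id
    WHERE right_tbl.id IS NULL
    ORDER BY left_tbl.id
""")

# NOT EXISTS with a correlated subquery.
not_exists = ids("""
    SELECT id FROM left_tbl
    WHERE NOT EXISTS (SELECT 1 FROM right_tbl WHERE right_tbl.id = left_tbl.id)
    ORDER BY id
""")

# NOT IN: a NULL in the subquery makes the test unknown for every non-match.
not_in = ids("""
    SELECT id FROM left_tbl
    WHERE id NOT IN (SELECT id FROM right_tbl)
    ORDER BY id
""")

print(left_join, not_exists, not_in)
```

The first two queries both report ids 2 and 4 as missing. The NOT IN version returns nothing here, because comparing against the NULL never yields true, so it is only interchangeable with the others when the subquery column cannot be NULL.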
I have a problem with my LEFT OUTER JOIN. I have a set of queries which gives me about 80,000 to 100,000 records in a #Temp table. When I LEFT OUTER JOIN this #Temp table with another table, I have to use a CASE expression: if no match is found on a particular column, take that column's value and look up its corresponding value in another table which has the matching records. The query works fine for a small amount of data, but for larger data it just keeps executing or takes far too long. My query looks like this:
SELECT * FROM #Temp
LEFT OUTER JOIN TABLE1 ON #Temp.Materialcode =
CASE WHEN TABLE1.MaterialCode LIKE 'HY%'
THEN TABLE1.MaterialCode
ELSE REPLACE(TABLE1.MaterialCode,
TABLE1.MaterialCode,
(SELECT NewMaterialCode
FROM TABLE2
WHERE OldMaterialCode = TABLE1.MaterialCode))
END
Here TABLE2 has only two columns, NewMaterialCode and OldMaterialCode. What I need is this: if the MaterialCode in TABLE1 is not of the 'HY%' type, take that material code and look up its corresponding NewMaterialCode in TABLE2, so that I get both kinds of records, 'HY' and non-'HY'. I hope I have made my problem clear. Any help would be greatly appreciated.
SELECT *
FROM #TEMP TMP
LEFT JOIN Table1 MATERIAL
ON TMP.MaterialCode = MATERIAL.MaterialCode
LEFT JOIN Table2 REPLACEMENT
ON MATERIAL.MaterialCode = REPLACEMENT.OldMaterialCode
WHERE ( COALESCE(MATERIAL.materialcode, '') LIKE 'HY%'
AND TMP.materialCode = MATERIAL.MaterialCode
)
OR MATERIAL.MaterialCode = REPLACEMENT.NewMaterialCode
I think this should do what you're trying to do, but I don't really know how the tables are related except by reverse-engineering your query.
For the record, the OUTER JOIN in your query isn't accomplishing a thing, because an outer condition would produce NULL values for the columns in TABLE1, and the CASE condition wouldn't work (a NULL is not a match for 'HY%', and the ELSE branch would only yield another NULL, which never compares as equal). That's counter-intuitive to those not used to working in the three-valued logic of the database world, but that's why we have COALESCE and ISNULL.
I have 5 tables whose structures are the same. Only the PAGEVISITS values differ.
i.e. table 1:
ITEM  | PAGEVISITS | Commodity
1813  | 50         | Griddle
1851  | 10         | Griddle
11875 | 100        | Refrigerator
2255  | 25         | Refrigerator
i.e. table 2:
ITEM  | PAGEVISITS | Commodity
1813  | 0          | Griddle
1851  | 10         | Griddle
11875 | 25         | Refrigerator
2255  | 10         | Refrigerator
I want it to add up the PAGEVISITS per Commodity to spit out:
table1 | table2 | Commodity
60     | 10     | Griddle
125    | 35     | Refrigerator
Some of the results are actually correct, but some are WAY off with the query below:
SELECT
SUM(MT.PAGEVISITS) as table1,
SUM(CT1.PAGEVISITS) as table2,
SUM(CT2.PAGEVISITS) as table3,
SUM(CT3.PAGEVISITS) as table4,
SUM(CT4.PAGEVISITS) as table5,
(COUNT(DISTINCT MT.ITEM)) + (COUNT(DISTINCT CT1.ITEM)) + (COUNT(DISTINCT CT2.ITEM)) + (COUNT(DISTINCT CT3.ITEM)) + (COUNT(DISTINCT CT4.ITEM)) as Total,
MT.Commodity
FROM table1 as MT
LEFT JOIN table2 CT1
on MT.ITEM = CT1.ITEM
LEFT JOIN table3 CT2
on MT.ITEM = CT2.ITEM
LEFT JOIN table4 CT3
on MT.ITEM = CT3.ITEM
LEFT JOIN table5 CT4
on MT.ITEM = CT4.ITEM
GROUP BY Commodity
I believe this may be caused by using the LEFT JOIN incorrectly. I have also tried the INNER JOIN with the same inconsistent results.
I would do a UNION on all five of those tables to get them as one rowset (inline view), and then run a query on that, start with something like this...
SELECT SUM(IF(t.source='MT',t.pagevisits,0)) AS table1
, SUM(IF(t.source='CT1',t.pagevisits,0)) AS table2
, t.commodity
FROM ( SELECT 'MT' as source, table1.* FROM table1
UNION ALL
SELECT 'CT1', table2.* FROM table2
UNION ALL
SELECT 'CT2', table3.* FROM table3
UNION ALL
SELECT 'CT3', table4.* FROM table4
UNION ALL
SELECT 'CT4', table5.* FROM table5
) t
GROUP BY t.commodity
(But I would specify the column list for each of those tables, rather than using the '.*' and having my query dependent on no one adding/dropping/renaming/reordering columns in any of those tables.)
I include an "extra" literal value (aliased as "source") to identify which table the row came from. I can use a conditional test in an expression in the SELECT list, to figure out whether the row came from a particular table.
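As a quick check of the idea, here is a sketch that loads the two sample tables from the question into an in-memory SQLite database via Python's sqlite3 module. Two adjustments are mine: only table1 and table2 are included because those are the only ones with sample data, and CASE WHEN replaces MySQL's IF(), which I'm not assuming SQLite provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (item INTEGER, pagevisits INTEGER, commodity TEXT);
    CREATE TABLE table2 (item INTEGER, pagevisits INTEGER, commodity TEXT);
    INSERT INTO table1 VALUES
        (1813, 50, 'Griddle'), (1851, 10, 'Griddle'),
        (11875, 100, 'Refrigerator'), (2255, 25, 'Refrigerator');
    INSERT INTO table2 VALUES
        (1813, 0, 'Griddle'), (1851, 10, 'Griddle'),
        (11875, 25, 'Refrigerator'), (2255, 10, 'Refrigerator');
""")

# Stack the tables, tag each row with its source, then sum per source.
totals = conn.execute("""
    SELECT SUM(CASE WHEN t.source = 'MT'  THEN t.pagevisits ELSE 0 END) AS table1,
           SUM(CASE WHEN t.source = 'CT1' THEN t.pagevisits ELSE 0 END) AS table2,
           t.commodity
    FROM ( SELECT 'MT' AS source, item, pagevisits, commodity FROM table1
           UNION ALL
           SELECT 'CT1', item, pagevisits, commodity FROM table2
         ) t
    GROUP BY t.commodity
    ORDER BY t.commodity
""").fetchall()

for row in totals:
    print(row)
```

On the sample data this gives 60 and 10 for Griddle and 125 and 35 for Refrigerator, which are the totals asked for in the question.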
This approach is particularly flexible, and can be used to get more complicated resultsets. For example, if I also wanted the total number of page visits from table3, 4 and 5 added together, along with the individual counts, I could add:
SUM(IF(t.source IN ('CT2','CT3','CT4'),t.pagevisits,0)) AS total_345
To get the equivalent of your COUNT(DISTINCT item) + COUNT(DISTINCT item) + ... expression...
I would use an expression that makes a single value from both the "source" and "item" columns, being careful to have some sort of guarantee that any particular "source"+"item" will not create a duplicate of some other "source"+"item". (If we just concatenate strings, for example, we don't have any way to distinguish between 'A'+'11' and 'A1'+'1'.) The most common approach I see here is a carefully chosen delimiter which is guaranteed not to appear in either value. We can distinguish between 'A::11' and 'A1::1', so something like this will work:
COUNT(DISTINCT CONCAT(t.source,'::',t.item))
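The collision is easy to see with the two pairs mentioned above. This is only a sketch in Python's sqlite3 module, with SQLite's || operator standing in for CONCAT and a throwaway table named `t` holding the 'A'/'11' and 'A1'/'1' pairs as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (source TEXT, item TEXT);
    INSERT INTO t VALUES ('A', '11'), ('A1', '1');
""")

# Plain concatenation: both rows collapse to the same string 'A11'.
plain = conn.execute(
    "SELECT COUNT(DISTINCT source || item) FROM t"
).fetchone()[0]

# With a delimiter the two rows stay distinct: 'A::11' versus 'A1::1'.
delimited = conn.execute(
    "SELECT COUNT(DISTINCT source || '::' || item) FROM t"
).fetchone()[0]

print(plain, delimited)
```

Without the delimiter the two rows are counted as one value; with it they are counted as two.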
In your current query, if item is NULL, then the row doesn't get included in the COUNT. To fully replicate that behavior, you would need something like this:
COUNT(DISTINCT IF(t.item IS NOT NULL,CONCAT(t.source,'::',t.item),NULL)) AS Total
Of course, getting a count of distinct item values over the whole set of five tables is much simpler (but then, it does return a different result):
COUNT(DISTINCT t.item)
But to answer your question about the use of the LEFT JOIN, the left side table is the "driver" so a matching row has to be in that table for a corresponding row to be retrieved from a table on the right. That is, unmatched rows from the tables on the right side will not be returned.
If what you have is basically five "partitions", and you want to process all of the rows whether or not a matching row appears in any of the other "partitions", I would go with the UNION ALL approach to simply concatenate all of the rows from all of those tables together, and process the rows as if they were from a single table.
NOTE: For very large tables, this may not be a feasible approach, since MySQL is going to have to materialize that inline view. There are other approaches which don't require concatenating all of the rows together.
Specifying a list of only the columns you need, in the SELECT from each table, may help performance, if there are columns in those tables you don't need to reference in your query.