Finding which of an array of IDs has no record with a single query - mysql

I'm generating prepared statements with PHP PDO to pull in information from two tables based on an array of IDs.
Then I realized that if an ID passed had no record I wouldn't know.
I'm locating records with
SELECT
r.`DEANumber`,
TRIM(r.`ActivityCode`) AS ActivityCode,
TRIM(r.`ActivitySubCode`) as ActivitySubCode,
-- other fields...
a.Activity
FROM
`registrants` r,
`activities` a
WHERE r.`DEAnumber` IN ( ?,?,?,?,?,?,?,? )
AND a.Code = ActivityCode
AND a.Subcode = ActivitySubCode
But I am having trouble figuring out the negative join that says which of the IDs has no record.
If two tables were involved I think I could do it like this
SELECT
r.DEAnumber
FROM registrant r
LEFT JOIN registrant2 r2 ON r.DEAnumber = r2.DEAnumber
WHERE r2.DEAnumber IS NULL
But I'm stumped as to how to use the array of IDs here. Obviously I could iterate over the array and track which queries had no result, but it seems like such a manual and wasteful way to go...

Obviously I could iterate over the array and track which queries had no result, but it seems like such a manual and wasteful way to go.
What could be a real waste is spending time solving this non-existent "problem".
Yes, you could iterate, either manually or with a convenience function like array_diff() in PHP.
I suggest that instead of making your query more complex (and therefore harder to maintain) for little gain, you just move on.
As old man Knuth once said 'premature optimization is the root of all evil'.
The only help from PDO I can think of is a fetch mode that uses the IDs as keys of the returned array, so you can do it without an [explicitly written] loop, like
$stmt->execute($ids);
$data = $stmt->fetchAll(PDO::FETCH_UNIQUE);
$notFound = array_diff($ids, array_keys($data));
Yet a manual loop would take only two extra lines, which is honestly not that big a deal.
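For readers who want to try the "fetch keyed by ID, then diff" idea outside PHP, here is a minimal sketch in Python with the standard sqlite3 module. Only the table and column names come from the question; the sample rows and IDs are made up, and the dictionary stands in for what PDO::FETCH_UNIQUE returns (rows keyed by the first selected column).

```python
import sqlite3

# Hypothetical data: only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registrants (DEANumber TEXT PRIMARY KEY, ActivityCode TEXT)")
conn.executemany(
    "INSERT INTO registrants VALUES (?, ?)",
    [("A1", "C"), ("A2", "M"), ("A4", "C")],
)

ids = ["A1", "A2", "A3", "A4", "A5"]

# One "?" placeholder per ID, as in the question's IN (...) list.
placeholders = ", ".join("?" for _ in ids)
rows = conn.execute(
    f"SELECT DEANumber, ActivityCode FROM registrants WHERE DEANumber IN ({placeholders})",
    ids,
).fetchall()

# Key the result by ID, then keep the requested IDs that are not among the keys.
data = {dea: activity for dea, activity in rows}
not_found = [i for i in ids if i not in data]
print(not_found)  # ['A3', 'A5']
```

The query stays a single round trip; the comparison happens in application code, which is the trade-off this answer argues for.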

You are on the right track - a left join that filters out matches will give you the missing joins. You just need to move all conditions on the left-joined table up into the join.
If you leave the conditions on the joined table in the where clause you effectively cause an inner join, because the where clause is executed on the rows after the join is made, which is too late if there was no join in the first place.
Change the query to use proper join syntax, specifying a left join, with the conditions on activities moved into the join's ON clause:
SELECT
r.DEANumber,
TRIM(r.ActivityCode) AS ActivityCode,
TRIM(r.ActivitySubCode) as ActivitySubCode,
-- other fields...
a.Activity
FROM registrants r
LEFT JOIN activities a ON a.Code = r.ActivityCode
AND a.Subcode = r.ActivitySubCode
WHERE r.DEAnumber IN (?,?,?,?,?,?,?,?)
In your app code, if Activity is null then you know there was no activity for that id.
This won't affect performance much, other than to return (potentially) more rows.
To just select all registrants without activities:
select r.DEAnumber
from registrants r
left join activities a on a.Code = r.ActivityCode
and a.Subcode = r.ActivitySubCode
where r.`DEAnumber` IN ( ?,?,?,?,?,?,?,? )
and a.Code is null
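The filtered left join can be checked on a toy dataset. The sketch below uses Python's sqlite3 module rather than MySQL, and the sample rows are invented; only the table and column names follow the question. Registrant A3 has an activity code with no matching row in activities, so it is the only one returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows; table and column names follow the question.
conn.executescript("""
CREATE TABLE registrants (DEANumber TEXT, ActivityCode TEXT, ActivitySubCode TEXT);
CREATE TABLE activities (Code TEXT, Subcode TEXT, Activity TEXT);
INSERT INTO registrants VALUES ('A1', 'C', '0'), ('A2', 'M', '1'), ('A3', 'X', '9');
INSERT INTO activities VALUES ('C', '0', 'Pharmacy'), ('M', '1', 'Practitioner');
""")

ids = ["A1", "A2", "A3"]
placeholders = ", ".join("?" for _ in ids)

# Join conditions live in ON; the IS NULL test in WHERE keeps only unmatched rows.
missing = [row[0] for row in conn.execute(
    f"""
    SELECT r.DEANumber
    FROM registrants r
    LEFT JOIN activities a
           ON a.Code = r.ActivityCode
          AND a.Subcode = r.ActivitySubCode
    WHERE r.DEANumber IN ({placeholders})
      AND a.Code IS NULL
    """,
    ids,
)]
print(missing)  # ['A3']
```

Note that this reports registrants that exist but have no activity. An ID that is absent from registrants altogether produces no row at all, so it would not show up here.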

Related

SQL transform id and add where statement before join

I am pretty new to SQL. Here is an operation I am sure is simple for a lot of you. I am trying to join two tables across databases on the same server: TableA (with IdA) in dbA and TableB (with IdB) in dbB. But before doing that I want to transform column IdA into a number by removing the ":XYZ" suffix from its values, and I also want to add a where statement on another column in dbA. Below I show my code for the join, but I am not sure how to convert the values of the column so that I can match idA with idB in the join. Thanks a ton in advance.
Select replace(idA, “:XYZ”, "")
from dbA.TableA guid
where event like “%2015”
left join dbB.TableB own
on guid.idA = own.idB
A few things:
Syntactically the order is FROM, JOINs, then WHERE (unless you use subqueries), and that is also the logical order of execution. SELECT is written first but is evaluated near the end.
Alias or fully qualify columns when multiple tables are involved, so it is clear which field comes from which table.
Because FROM and the JOINs are processed first, whatever you compute in the SELECT isn't in scope yet. That is why you can't use SELECT column aliases in FROM or WHERE (or, in many databases, in GROUP BY).
I don't usually like SELECT *, but as I don't know which columns you really need, I used it here.
As for putting the WHERE before the join: most SQL engines now use cost-based optimization and work out the best execution plan for your tables and data. So just put the limiting criteria in the WHERE in this case, since it limits the left table of the left join. If you needed to limit data on the right table of a left join, you'd put the limit in the join criteria, so that it filters as it joins.
You probably need to cast idA to an integer (or to the same type as idB). I used TRIM to eliminate spaces, but if there are other non-display characters you'd have issues with the left join matching.
SELECT guid.*, own.*
FROM dbA.TableA guid
LEFT JOIN dbB.TableB own
on cast(trim(replace(guid.idA, ':XYZ', '')) as int) = own.idB
WHERE guid.event like '%2015'
Or materialize the transformation first with a subquery, so that idA is already in its transformed state before the join (as in algebra, the parentheses matter and are processed from the inside out):
SELECT *
FROM (SELECT cast(trim(replace(guid.idA, ':XYZ', '')) as int) as idA
FROM dbA.TableA guid
WHERE guid.event like '%2015') B
LEFT JOIN dbB.TableB own
on B.IDA = own.idB
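A small runnable sketch of the subquery variant, using Python's sqlite3 module. Several things here are assumptions made for illustration: both tables sit in one in-memory database instead of dbA and dbB, the `owner` column and the sample rows are invented, and SQLite spells the cast target INTEGER.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; one idA has a leading space to show why TRIM is there.
conn.executescript("""
CREATE TABLE TableA (idA TEXT, event TEXT);
CREATE TABLE TableB (idB INTEGER, owner TEXT);
INSERT INTO TableA VALUES ('101:XYZ', 'Expo 2015'),
                          (' 102:XYZ', 'Gala 2015'),
                          ('103:XYZ', 'Expo 2014');
INSERT INTO TableB VALUES (101, 'alice'), (103, 'carol');
""")

# The inner query filters and transforms idA; the outer query joins on the result.
rows = conn.execute("""
    SELECT b.idA, own.owner
    FROM (SELECT CAST(TRIM(REPLACE(guid.idA, ':XYZ', '')) AS INTEGER) AS idA
          FROM TableA guid
          WHERE guid.event LIKE '%2015') b
    LEFT JOIN TableB own ON b.idA = own.idB
    ORDER BY b.idA
""").fetchall()
print(rows)  # [(101, 'alice'), (102, None)]
```

Row 103 is dropped by the WHERE on event, and 102 survives with a NULL owner because it is on the left side of the left join.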

Conditionals in WHEREs or JOINs?

Let's say I have the following query:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id)
WHERE events.event_state = 'visible'
Another way to do the same query and get the same results would be:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id
AND events.event_state = 'visible')
My question: is there a real difference? Is one way faster than the other? Why would I choose one way over the other?
For an INNER JOIN, there's no conceptual difference between putting a condition in ON and in WHERE. It's a common practice to use ON for conditions that connect a key in one table to a foreign key in another table, such as your event_id, so that other people maintaining your code can see how the tables relate.
If you suspect that your database engine is mis-optimizing a query plan, you can try it both ways. Make sure to time the query several times to isolate the effect of caching, and make sure to run ANALYZE TABLE occurs and ANALYZE TABLE events to provide more info to the optimizer about the distribution of keys. If you do find a difference, have the database engine EXPLAIN the query plans it generates. If there's a gross mis-optimization, you can create an Oracle account and file a feature request against MySQL to optimize a particular query better.
But for a LEFT JOIN, there's a big difference. A LEFT JOIN is often used to add details from a separate table if the details exist or return the rows without details if they do not. This query will return result rows with NULL values for b.* if no row of b matches both conditions:
SELECT a.*, b.*
FROM a
LEFT JOIN b
ON (condition_one
AND condition_two)
WHERE condition_three
Whereas this one will completely omit results that do not match condition_two:
SELECT a.*, b.*
FROM a
LEFT JOIN b ON some_condition
WHERE condition_two
AND condition_three
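To make the LEFT JOIN difference concrete, here is a minimal sketch with Python's sqlite3 module. It reuses the occurs and events tables from the question; the `occur_id` column and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: event 10 is visible, event 20 is hidden, event 30 does not exist.
conn.executescript("""
CREATE TABLE occurs (occur_id INTEGER, event_id INTEGER);
CREATE TABLE events (event_id INTEGER, event_state TEXT);
INSERT INTO occurs VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO events VALUES (10, 'visible'), (20, 'hidden');
""")

# Condition in ON: every row of occurs is kept, unmatched ones get NULL details.
in_on = conn.execute("""
    SELECT o.occur_id, e.event_state
    FROM occurs o
    LEFT JOIN events e ON e.event_id = o.event_id AND e.event_state = 'visible'
    ORDER BY o.occur_id
""").fetchall()

# Same condition in WHERE: rows whose joined side is NULL or hidden are filtered out.
in_where = conn.execute("""
    SELECT o.occur_id, e.event_state
    FROM occurs o
    LEFT JOIN events e ON e.event_id = o.event_id
    WHERE e.event_state = 'visible'
    ORDER BY o.occur_id
""").fetchall()

print(in_on)     # [(1, 'visible'), (2, None), (3, None)]
print(in_where)  # [(1, 'visible')]
```

With the condition in WHERE, the result is the same as an INNER JOIN would give.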
Code in this answer is dual licensed: CC BY-SA 3.0 or the MIT License as published by OSI.

SQL query to select based on many-to-many relationship

This is really a two-part question, but in order not to mix things up, I'll divide into two actual questions. This one is about creating the correct SQL statement for selecting a row based on values in a many-to-many related table:
Now, the question is: what is the absolute simplest way of getting all resources where e.g metadata.category = subject AND where that category's corresponding metadata.value ='introduction'?
I'm sure this could be done in a lot of different ways, but I'm a novice in SQL, so please provide the simplest way possible... (If you could describe briefly what the statement means in plain English that would be great too. I have looked at introductions to SQL, but none of those I have found (for beginners) go into these many-to-many selections.)
The easiest way is to use the EXISTS clause. I'm more familiar with MSSQL but this should be close
SELECT *
FROM resources r
WHERE EXISTS (
SELECT *
FROM metadata_resources mr
INNER JOIN metadata m ON (mr.metadata_id = m.id)
WHERE mr.resource_id = r.id AND m.category = 'subject' AND m.value = 'introduction'
)
Translated into English, it's 'return all records for which this subquery returns one or more rows, without returning the data from those rows'. The subquery is correlated to the outer query by the predicate mr.resource_id = r.id, which uses the outer row as the predicate value.
I'm sure you can google around for more examples of the EXISTS clause.
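The EXISTS query can be tried on a toy dataset. This is a sketch with Python's sqlite3 module, not MSSQL or MySQL; the table and column names follow the answer, while the `title` column and all sample rows are invented because the question's schema is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: resource 1 is tagged subject/introduction (and format/introduction),
# resource 2 is subject/advanced, resource 3 is only format/introduction.
conn.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE metadata (id INTEGER PRIMARY KEY, category TEXT, value TEXT);
CREATE TABLE metadata_resources (resource_id INTEGER, metadata_id INTEGER);
INSERT INTO resources VALUES (1, 'Intro to SQL'), (2, 'Advanced joins'), (3, 'Cooking');
INSERT INTO metadata VALUES (1, 'subject', 'introduction'),
                            (2, 'subject', 'advanced'),
                            (3, 'format', 'introduction');
INSERT INTO metadata_resources VALUES (1, 1), (2, 2), (3, 3), (1, 3);
""")

# The subquery is re-evaluated per outer row via mr.resource_id = r.id.
titles = [row[0] for row in conn.execute("""
    SELECT r.title
    FROM resources r
    WHERE EXISTS (
        SELECT 1
        FROM metadata_resources mr
        INNER JOIN metadata m ON mr.metadata_id = m.id
        WHERE mr.resource_id = r.id
          AND m.category = 'subject'
          AND m.value = 'introduction'
    )
""")]
print(titles)  # ['Intro to SQL']
```

Each resource appears at most once, however many metadata rows match, because EXISTS only tests whether a match exists.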

Select taking too long. Need advice for a better performance

Ok, here we go. There's this messy SELECT crossing other tables and ordering to get the one desired row. Basically I do the "math" inside the ORDER BY.
1 base table.
7 JOINs pointing to local tables.
WHERE with 2 clauses and a NOT IN crossing another table.
You'll see in the code the ORDER BY is pretty damn big/ugly, it sums the result of 5 different calculations. I need that result to order by those calculations in order to get the worst row-case.
The problem is that once I execute the stored procedure it takes up to 8 seconds to run. That's not really acceptable, so I'm starting to check indexes.
So, I'm looking for advice on how to make this query run faster.
I'm indexing the columns in the WHERE clauses and the field LINEA. Should I index something else, like the columns I'm using in the JOINs? Or should I approach the query differently?
Query:
SET @LINEA = (
SELECT TOP 1
BOA.LIN
FROM
BAND_BA BOA
LEFT JOIN
TEL PAR
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(PAR.Te,2,10)
LEFT JOIN
TELP CLP
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(CLP.Numtel,2,10)
LEFT JOIN
CA C
ON REPLACE(BOA.Lin,'-','') = C.An
LEFT JOIN
RE R
ON REPLACE(BOA.Lin,'-','') = R.Lin
LEFT JOIN
PRODUCTOS2 P2
ON BOA.PRODUCTO = P2.codigo
LEFT JOIN
EN
ON REPLACE(BOA.Lin,'-','') = EN.G
LEFT JOIN
TIP ID
ON TIPID = ID.ID
WHERE
BOA.EST = 'C' AND
ID.SE = 'boA' AND
BOA.LIN NOT IN (
SELECT
LIN
FROM
BAN
)
ORDER BY (EN.VALUE + ANT.VALUE + REIT.VAL + C.VALUE + TEL.VALUE
) DESC,
I'll be frank, this is some pretty terrible SQL. Without seeing all your table structures, advice here will be incomplete. That being said, please don't post all your table structures because you are already very close to "hire a consultant" territory with this.
All the REPLACE logic should be done away with. If you need to JOIN on these fields, then add comparable fields to the tables so you don't need to manipulate the data. Every single JOIN that uses a REPLACE or SUBSTRING is a table or index scan - those are non-SARGable and a definite anti-pattern.
The ORDER BY is probably the most convoluted ORDER BY I have ever seen. Some major issues there:
Subqueries should all be eliminated and materialized either in the outer query or as variables
String manipulation should be eliminated (see item 1 above)
The entire query is basically a code smell. If you need to write code like this to meet business requirements then you either have a terribly inappropriate design or some other much larger issue in the organization or data.
One thing that can kill performance is using a lot of LEFT JOINs. To improve performance of LEFT JOIN, you might want to make sure that the column(s) to which you join have an index - that can have a huge impact on performance.
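The advice above (store a directly comparable value instead of calling REPLACE in the join, and index the join columns) might look like the following sketch. It uses Python's sqlite3 module rather than the poster's database, and the `LinDigits` and `Score` columns, the index names, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down versions of two tables from the question.
conn.executescript("""
CREATE TABLE BAND_BA (Lin TEXT);
CREATE TABLE CA (An TEXT, Score INTEGER);
INSERT INTO BAND_BA VALUES ('555-0101'), ('555-0102');
INSERT INTO CA VALUES ('5550101', 7);
""")

# Original style: the function call on Lin has to be evaluated for every row.
slow = conn.execute("""
    SELECT b.Lin, c.Score
    FROM BAND_BA b
    LEFT JOIN CA c ON REPLACE(b.Lin, '-', '') = c.An
    ORDER BY b.Lin
""").fetchall()

# Suggested style: compute the comparable value once, store it, and index both sides.
conn.executescript("""
ALTER TABLE BAND_BA ADD COLUMN LinDigits TEXT;
UPDATE BAND_BA SET LinDigits = REPLACE(Lin, '-', '');
CREATE INDEX idx_band_ba_lindigits ON BAND_BA (LinDigits);
CREATE INDEX idx_ca_an ON CA (An);
""")
fast = conn.execute("""
    SELECT b.Lin, c.Score
    FROM BAND_BA b
    LEFT JOIN CA c ON b.LinDigits = c.An
    ORDER BY b.Lin
""").fetchall()

print(slow == fast)  # True
```

The two queries return the same rows; the difference is that the second join compares plain columns, which lets the engine consider the indexes. The stored column must be kept in sync whenever Lin changes, which this sketch does not handle.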

MySQL -- joining then joining then joining again

MySQL setup: step by step.
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this. My speakers table is also linked to a third table, "books." So in the "speakers" table, I have "book_id" and in the "books" table, the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem. Multiple books can be assigned to a single speaker.
Using the left join returns two rows! What do I need to add to return only a single row, with the two books kept separate?
Is there any chance that the books table doesn't have any matching rows for speakers.book_id?
Try using a left join which will still return the program/speaker combinations, even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: response to the OP's comment
You can use GROUP BY and GROUP_CONCAT to put all the books on one row.
e.g.
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id
LIMIT 5
Note: since I don't know the exact column names, these may be off
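Here is a small runnable version of the GROUP BY plus group_concat idea, using Python's sqlite3 module (SQLite also provides group_concat). The schema is a guess, as the answer itself warns: in particular it assumes one speakers row per speaker and book pairing, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows: Ann has two books, Bob has none.
conn.executescript("""
CREATE TABLE programs (program_id INTEGER, program_name TEXT, category_id INTEGER);
CREATE TABLE speakers (speaker_id INTEGER, speaker_name TEXT,
                       program_id INTEGER, book_id INTEGER);
CREATE TABLE books (book_id INTEGER, book_name TEXT);
INSERT INTO programs VALUES (1, 'Morning Show', 1);
INSERT INTO books VALUES (1, 'First Book'), (2, 'Second Book');
INSERT INTO speakers VALUES (7, 'Ann', 1, 1), (7, 'Ann', 1, 2), (8, 'Bob', 1, NULL);
""")

# One output row per speaker; the book names are collapsed into one comma-separated string.
rows = conn.execute("""
    SELECT s.speaker_id, s.speaker_name, p.program_name,
           group_concat(b.book_name) AS book_names
    FROM programs p
    JOIN speakers s ON p.program_id = s.program_id
    LEFT JOIN books b ON s.book_id = b.book_id
    WHERE p.category_id = 1
    GROUP BY s.speaker_id, s.speaker_name, p.program_name
    ORDER BY s.speaker_id
""").fetchall()

for row in rows:
    print(row)
```

Ann comes back as a single row with both book names in one column, and Bob comes back with NULL for the book list because of the left join. The order of names inside the concatenated string is not guaranteed unless you ask for one.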
That's typically efficient. There is some kind of assumption you are making that isn't true. Do your speakers have books assigned? If they don't that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main issue would be whether your indexes are covering (which is more likely to occur if you don't use SELECT *, but instead select only the columns you need).