How to rewrite a database subquery into a join? - mysql

I am trying to rewrite this subquery into a join. I have read the other questions on SO but can't get this one working.
create table job (
emplid int,
effdt date,
title varchar(100),
primary key (emplid, effdt)
);
insert into job set emplid=1, effdt='2010-01-01', title='Programmer';
insert into job set emplid=1, effdt='2011-01-01', title='Programmer I';
insert into job set emplid=1, effdt='2012-01-01', title='Programmer II';
insert into job set emplid=2, effdt='2010-01-01', title='Analyst';
insert into job set emplid=2, effdt='2011-01-01', title='Analyst I';
insert into job set emplid=2, effdt='2012-01-01', title='Analyst II';
# Get each employee's current job:
select *
from job a
where a.effdt=
(select max(b.effdt)
from job b
where b.emplid=a.emplid);
Results:
+--------+------------+---------------+
| emplid | effdt | title |
+--------+------------+---------------+
| 1 | 2012-01-01 | Programmer II |
| 2 | 2012-01-01 | Analyst II |
+--------+------------+---------------+
I would like to rewrite the query into a join, without a subquery. Is this possible?

Writing this as a join is perhaps a bit counterintuitive. The idea is to use a left outer join whose condition includes b.effdt > a.effdt. That condition finds a match for every row of a except the one where a.effdt is the employee's maximum, so the unmatched rows (those where the b columns come back as NULL) are exactly the current jobs. The where clause then keeps only those:
select a.*
from job a left outer join
job b
on b.emplid = a.emplid and
b.effdt > a.effdt
where b.effdt is NULL;
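To check that the join really is equivalent to the subquery, here is a minimal sketch that runs both against the sample data. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL (an assumption made only so the example is self-contained); the two SELECT statements themselves are the ones from the question and the answer, with explicit column lists and an ORDER BY added for a stable comparison.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (
        emplid INTEGER,
        effdt  TEXT,
        title  TEXT,
        PRIMARY KEY (emplid, effdt)
    );
    INSERT INTO job VALUES
        (1, '2010-01-01', 'Programmer'),
        (1, '2011-01-01', 'Programmer I'),
        (1, '2012-01-01', 'Programmer II'),
        (2, '2010-01-01', 'Analyst'),
        (2, '2011-01-01', 'Analyst I'),
        (2, '2012-01-01', 'Analyst II');
""")

# Original form: correlated subquery picking the latest effdt per employee.
subquery_sql = """
    SELECT a.emplid, a.effdt, a.title
    FROM job a
    WHERE a.effdt = (SELECT MAX(b.effdt) FROM job b WHERE b.emplid = a.emplid)
    ORDER BY a.emplid
"""

# Join form: keep the rows of a for which no later row b exists.
anti_join_sql = """
    SELECT a.emplid, a.effdt, a.title
    FROM job a
    LEFT OUTER JOIN job b
        ON b.emplid = a.emplid AND b.effdt > a.effdt
    WHERE b.effdt IS NULL
    ORDER BY a.emplid
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
anti_join_rows = conn.execute(anti_join_sql).fetchall()
print(anti_join_rows)
```

Both queries return the 2012-01-01 row for each employee. Whether the join is faster than the subquery on a real MySQL table depends on the optimizer and the indexes, so it is worth comparing the two with EXPLAIN.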

Have you considered rewriting your schema?
If you are able to, it might be better to have a history or log table that has entries for when the effective date was changed, for which employee ID and what the previous title was. That way you would just query the actual table and get the results that you want.
This can be achieved by using triggers for whenever a row in the database is changed, then everything is handled at the database level.
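A rough sketch of that trigger idea follows. The table names job_current and job_history, the trigger name, and the history columns are all hypothetical, and the code runs on SQLite through Python's sqlite3 module rather than MySQL, whose CREATE TRIGGER syntax differs in detail (for example it needs a DELIMITER change in the mysql client). It only illustrates the shape of the design: one current row per employee, with the previous title logged on every update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per employee holds the current job (hypothetical layout).
    CREATE TABLE job_current (
        emplid INTEGER PRIMARY KEY,
        effdt  TEXT,
        title  TEXT
    );

    -- Every change is recorded here by the trigger below.
    CREATE TABLE job_history (
        emplid         INTEGER,
        changed_effdt  TEXT,
        previous_title TEXT
    );

    -- Log the old title whenever a current row is updated.
    CREATE TRIGGER job_current_log
    AFTER UPDATE ON job_current
    FOR EACH ROW
    BEGIN
        INSERT INTO job_history (emplid, changed_effdt, previous_title)
        VALUES (OLD.emplid, NEW.effdt, OLD.title);
    END;

    INSERT INTO job_current VALUES (1, '2010-01-01', 'Programmer');
    UPDATE job_current SET effdt = '2011-01-01', title = 'Programmer I'  WHERE emplid = 1;
    UPDATE job_current SET effdt = '2012-01-01', title = 'Programmer II' WHERE emplid = 1;
""")

# The current job is now a plain lookup, with no subquery or self-join.
current_rows = conn.execute("SELECT emplid, effdt, title FROM job_current").fetchall()
history_rows = conn.execute(
    "SELECT emplid, changed_effdt, previous_title FROM job_history ORDER BY changed_effdt"
).fetchall()
print(current_rows)
print(history_rows)
```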

Related

Please help to optimize MySql UPDATE with two tables and M:N row mapping

I'm post-processing traces for two different kinds of events, where the data is stored in tables A and B. Both tables have a producer ID and a time index value. While the same producer can trigger a record in both tables, the times at which the different events occur are independent, and events are much more frequent for table B.
I want to update table A such that, for every row in table A, a column value from table B is taken for the most recent row in table B for the same producer.
Example mappings between two tables:
Here is a simplified example with just one producer in both tables. The goal is not to get the oldest entry in table B, but rather the most recent entry in table B relative to a row in table A. I'm showing B.tIdx < A.tIdx in this example, but <= is just as good for my purposes; just a detail.
Table A
+----+------+----------------------+
| ID | tIdx | NEW value SET FROM B |
+----+------+----------------------+
| 1  | 2    | 12.5                 |
| 1  | 4    | 4.3                  |
+----+------+----------------------+
Table B
+----+------+-------+
| ID | tIdx | value |
+----+------+-------+
| 1  | 1    | 12.5  |
| 1  | 2    | 9.0   |
| 1  | 3    | 4.3   |
| 1  | 4    | 7.8   |
| 1  | 5    | 6.2   |
+----+------+-------+
The actual tables have thousands of different IDs, millions of rows, and nearly as many distinct time index values as rows. I'm having trouble coming up with an UPDATE that doesn't take days to complete.
The following UPDATE works, but executes far too slowly; it starts off at a rate of 100s of updates/s, but soon slows to roughly 5 updates/s.
UPDATE A AS x SET value =
(SELECT value
FROM B AS y
WHERE x.ID = y.ID AND x.tIdx > y.tIdx
ORDER BY y.tIdx DESC
LIMIT 1);
I've tried creating indexes for ID and tIdx separately, but also multi-column indexes with both orders (ID,tIdx) and (tIdx,ID). But even when the multi-column indexes exist, EXPLAIN shows that it only ever indexes on ID or tIdx, but not both together.
I was wondering if the solution is to create nested SELECTs, to first get a temporary table with a particular ID, and then find the 1 row in table B that will meet the time constraint for each tIdx for that particular ID. The following SELECT, with hardcoded ID and tIdx, works and is very fast, completing in 0.00 sec.
SELECT value, ID, tIdx
FROM (
SELECT value, ID, tIdx
FROM B
WHERE ID = 5216
) y
WHERE tIdx < 1253707
ORDER BY tIdx DESC LIMIT 1;
I'd like to incorporate this into an UPDATE somehow, but replace the hardcoded ID and tIdx with the ID,tIdx pair for each row in A.
Or try any other suggestion for a more efficient UPDATE statement.
This is my first post to Stack Overflow. Sincere apologies in advance if I have violated any etiquette.
Update with Inner Join should do it, but it's going to get nasty to do this.
Update a INNER JOIN
(Select b.ID, maxb.atIdx, b.value
From b INNER JOIN (Select a.ID, a.tIdx as atIdx, max(b.tIdx) as bigb
From b INNER JOIN a
ON b.ID=a.ID
Where b.tIdx<=a.tIdx
Group By a.ID,a.tIdx) maxb
ON b.ID=maxb.ID and b.tIdx=maxb.bigb
) bestb ON a.ID=bestb.ID and a.tIdx=bestb.atIdx
Set a.value=bestb.value
To explain this it's best to start with the innermost SQL and work your way to the outermost UPDATE. To start, we need to join every record in table A to every record in table B for each ID. We can filter out the B records that are too recent and summarize that result for each table A record. That leaves us with the tIdx of the B table whose value goes into A for every record key in A. So then we join that to the B table to select the values to update, preserving the A-table's keys. That result is joined back to A to perform the update.
You'll have to see whether this is fast enough for you - I'm worried that this accesses the B table twice and the inner query creates A LOT of join combinations. I would pull out that inner query and see how long it runs by itself. On the positive side, they are all very simple, straightforward queries and they are connected by Inner Joins so there is some opportunity for efficiency in the query optimizer. I think indexes on a(ID,TIdx) [fast lookup to get the Update row] and b(ID) would be useful here.
One thing you can try is lead() (a window function, so this needs MySQL 8.0 or later) to see if that helps the performance:
UPDATE A JOIN
(SELECT b.*,
LEAD(tIDx) OVER (PARTITION BY id order by tIDx) as next_tIDx
FROM b
) b
ON a.id = b.id AND
a.tIDx >= b.tIDx AND
(b.next_tIDx IS NULL or a.tIDx < b.next_tIDx)
SET a.value = b.value;
And for this you want an index on b(id, tidx).
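The range logic behind the LEAD() join can be checked on the sample data from the question. The sketch below runs on SQLite through Python's sqlite3 module (an assumption for portability; it needs an SQLite build with window functions, 3.25 or later) and only runs the SELECT side of the join, not the UPDATE. It also shows that the join as written is inclusive (B.tIdx <= A.tIdx), while the question's expected values use the strict form (B.tIdx < A.tIdx); the strict variant just moves the equality from one bound to the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, tIdx INTEGER, value REAL);
    CREATE TABLE b (id INTEGER, tIdx INTEGER, value REAL);
    CREATE INDEX b_id_tidx ON b (id, tIdx);
    INSERT INTO a (id, tIdx) VALUES (1, 2), (1, 4);
    INSERT INTO b VALUES (1, 1, 12.5), (1, 2, 9.0), (1, 3, 4.3), (1, 4, 7.8), (1, 5, 6.2);
""")

def match_sql(lower_op, upper_op):
    """Build the join; each b row covers the range from its tIdx to the next one."""
    return f"""
        SELECT a.tIdx, r.value
        FROM a
        JOIN (SELECT id, tIdx, value,
                     LEAD(tIdx) OVER (PARTITION BY id ORDER BY tIdx) AS next_tIdx
              FROM b) r
          ON a.id = r.id
         AND a.tIdx {lower_op} r.tIdx
         AND (r.next_tIdx IS NULL OR a.tIdx {upper_op} r.next_tIdx)
        ORDER BY a.tIdx
    """

# As in the answer: latest b row with b.tIdx <= a.tIdx.
inclusive_rows = conn.execute(match_sql(">=", "<")).fetchall()
# As in the question's example table: latest b row with b.tIdx < a.tIdx.
strict_rows = conn.execute(match_sql(">", "<=")).fetchall()
print(inclusive_rows)
print(strict_rows)
```

Because every a row falls into exactly one range per id, the join produces at most one match per a row, which is what makes it usable as the source of an UPDATE.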

SQL, table join wont display proper output [duplicate]

I've got the following two tables (in MySQL):
Phone_book
+----+------+--------------+
| id | name | phone_number |
+----+------+--------------+
| 1 | John | 111111111111 |
+----+------+--------------+
| 2 | Jane | 222222222222 |
+----+------+--------------+
Call
+----+------+--------------+
| id | date | phone_number |
+----+------+--------------+
| 1 | 0945 | 111111111111 |
+----+------+--------------+
| 2 | 0950 | 222222222222 |
+----+------+--------------+
| 3 | 1045 | 333333333333 |
+----+------+--------------+
How do I find out which calls were made by people whose phone_number is not in the Phone_book? The desired output would be:
Call
+----+------+--------------+
| id | date | phone_number |
+----+------+--------------+
| 3 | 1045 | 333333333333 |
+----+------+--------------+
There are several different ways of doing this, with varying efficiency, depending on how good your query optimiser is, and the relative size of your two tables:
This is the shortest statement, and may be quickest if your phone book is very short:
SELECT *
FROM Call
WHERE phone_number NOT IN (SELECT phone_number FROM Phone_book)
alternatively (thanks to Alterlife)
SELECT *
FROM Call
WHERE NOT EXISTS
(SELECT *
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number)
or (thanks to WOPR)
SELECT *
FROM Call
LEFT OUTER JOIN Phone_Book
ON (Call.phone_number = Phone_book.phone_number)
WHERE Phone_book.phone_number IS NULL
(ignoring that, as others have said, it's normally best to select just the columns you want, not '*')
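One difference between the three forms is worth knowing: they agree on the sample data, but NOT IN behaves differently as soon as the subquery returns a NULL. The sketch below demonstrates this with Python's sqlite3 module and an in-memory database (a stand-in for MySQL, chosen only to keep the example runnable; the NULL semantics shown are standard SQL). The extra 'Unknown' phone book row is made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phone_book (id INTEGER PRIMARY KEY, name TEXT, phone_number TEXT);
    CREATE TABLE Call (id INTEGER PRIMARY KEY, date TEXT, phone_number TEXT);
    INSERT INTO Phone_book VALUES
        (1, 'John', '111111111111'),
        (2, 'Jane', '222222222222');
    INSERT INTO Call VALUES
        (1, '0945', '111111111111'),
        (2, '0950', '222222222222'),
        (3, '1045', '333333333333');
""")

queries = {
    "not_in": """
        SELECT id FROM Call
        WHERE phone_number NOT IN (SELECT phone_number FROM Phone_book)
    """,
    "not_exists": """
        SELECT id FROM Call
        WHERE NOT EXISTS (SELECT 1 FROM Phone_book
                          WHERE Phone_book.phone_number = Call.phone_number)
    """,
    "left_join": """
        SELECT Call.id FROM Call
        LEFT OUTER JOIN Phone_book
            ON Call.phone_number = Phone_book.phone_number
        WHERE Phone_book.phone_number IS NULL
    """,
}

def run_all():
    """Return the ids of the unmatched calls for each query form."""
    return {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}

before = run_all()

# Hypothetical phone book entry with no number recorded.
conn.execute("INSERT INTO Phone_book VALUES (3, 'Unknown', NULL)")
after = run_all()

print(before)
print(after)
```

With the NULL present, NOT IN returns no rows at all, because the comparison against NULL is unknown rather than false. If phone_number can be NULL in Phone_book, NOT EXISTS or the LEFT JOIN form is the safer choice.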
SELECT Call.ID, Call.date, Call.phone_number
FROM Call
LEFT OUTER JOIN Phone_Book
ON (Call.phone_number=Phone_book.phone_number)
WHERE Phone_book.phone_number IS NULL
Should remove the subquery, allowing the query optimiser to work its magic.
Also, avoid "SELECT *" because it can break your code if someone alters the underlying tables or views (and it's inefficient).
The code below would be a bit more efficient than the answers presented above when dealing with larger datasets.
SELECT *
FROM Call
WHERE NOT EXISTS (
SELECT 'x'
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number
);
SELECT DISTINCT Call.id
FROM Call
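LEFT OUTER JOIN Phone_book USING (phone_number)
WHERE Phone_book.id IS NULL
This will return the ids of the calls whose phone_number is missing from your Phone_book table. (The join has to be on phone_number; the id columns of the two tables are unrelated keys.)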
I think
SELECT CALL.* FROM CALL LEFT JOIN Phone_book ON
CALL.phone_number = Phone_book.phone_number WHERE Phone_book.name IS NULL
SELECT t1.ColumnID,
CASE
WHEN NOT EXISTS( SELECT t2.FieldText
FROM Table t2
WHERE t2.ColumnID = t1.ColumnID)
THEN t1.FieldText
ELSE t2.FieldText
END FieldText
FROM Table1 t1, Table2 t2
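SELECT a.id, a.date, a.phone_number FROM Call a
WHERE a.phone_number NOT IN (SELECT b.phone_number FROM Phone_book b)
Alternatively, on a database that supports MINUS (this is Oracle syntax; MySQL does not support it, though MySQL 8.0.31 and later offer the equivalent EXCEPT):
select phone_number from Call
minus
select phone_number from Phone_book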
Don't forget to check your indexes!
If your tables are quite large you'll need to make sure the phone book has an index on the phone_number field. With large tables the database will most likely choose to scan both tables.
SELECT *
FROM Call
WHERE NOT EXISTS
(SELECT *
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number)
You should create indexes on both Phone_Book and Call that contain phone_number. If performance is becoming an issue, try a lean index like this, with only the phone number:
The fewer fields the better, since the database will have to load the index entirely. You'll need an index on both tables.
ALTER TABLE [dbo].Phone_Book ADD CONSTRAINT [IX_Unique_PhoneNumber] UNIQUE NONCLUSTERED
(
Phone_Number
)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ONLINE = ON) ON [PRIMARY]
GO
If you look at the query plan, you can confirm your new index is actually being used. Note this is for SQL Server but should be similar for MySQL.
With the query I showed there's literally no other way for the database to produce a result other than scanning every record in both tables.
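The ALTER TABLE statement above is SQL Server syntax. In MySQL the plain `CREATE INDEX ... ON table (column)` form does the same job, and EXPLAIN shows whether the index is picked up. As a runnable approximation, the sketch below creates the two indexes on an in-memory SQLite database via Python's sqlite3 module and inspects the plan with SQLite's EXPLAIN QUERY PLAN. The index names are made up, and the plan text is SQLite's, not MySQL's, so treat it only as an illustration of the "create the index, then check the plan" workflow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Phone_book (id INTEGER PRIMARY KEY, name TEXT, phone_number TEXT);
    CREATE TABLE Call (id INTEGER PRIMARY KEY, date TEXT, phone_number TEXT);

    -- Lean single-column indexes (hypothetical names).
    CREATE INDEX ix_phone_book_number ON Phone_book (phone_number);
    CREATE INDEX ix_call_number ON Call (phone_number);

    INSERT INTO Phone_book VALUES (1, 'John', '111111111111'), (2, 'Jane', '222222222222');
    INSERT INTO Call VALUES
        (1, '0945', '111111111111'),
        (2, '0950', '222222222222'),
        (3, '1045', '333333333333');
""")

query = """
    SELECT Call.id, Call.date, Call.phone_number
    FROM Call
    WHERE NOT EXISTS (SELECT 1 FROM Phone_book
                      WHERE Phone_book.phone_number = Call.phone_number)
"""

# The last column of each plan row is the human-readable step description.
plan_text = "\n".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
result_rows = conn.execute(query).fetchall()

print(plan_text)
print(result_rows)
```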

2 inner joins between same 2 tables

I am trying to select columns from 2 tables,
The INNER JOIN conditions are $table1.idaction_url=$table2.idaction AND $table1.idaction_name=$table2.idaction.
However, the query below produces no output. It seems like the INNER JOIN can only take one condition: if I use AND to include both conditions, as shown in the query below, there is no output at all. Please advise.
$mysql=("SELECT conv(hex($table1.idvisitor), 16, 10) as visitorId,
$table1.server_time, $table1.idaction_url,
$table1.time_spent_ref_action,$table2.name,
$table2.type, $table1.idaction_name, $table2.idaction
FROM $table1
INNER JOIN $table2
ON $table1.idaction_url=$table2.idaction
AND $table1.idaction_name=$table2.idaction
WHERE conv(hex(idvisitor), 16, 10)='".$id."'
ORDER BY server_time DESC");
Short answer:
You need to use two separate inner joins, not only a single join.
E.g.
SELECT `actionurls`.`name` AS `actionUrl`, `actionnames`.`name` AS `actionName`
FROM `table1`
INNER JOIN `table2` AS `actionurls` ON `table1`.`idaction_url` = `actionurls`.`idaction`
INNER JOIN `table2` AS `actionnames` ON `table1`.`idaction_name` = `actionnames`.`idaction`
(Modify this query with any additional fields you want to select).
In depth: INNER JOIN, when done on a value unique to the second table (the table joined to the first in this operation) will only ever fetch one row. What you want to do is fetch data from the other table twice, into the same row, reading the select part of the statement.
INNER JOIN table2 ON [comparison] will, for each row selected from table1, grab any rows from table2 for which [comparison] is TRUE, then copy the row from table1 N times, where N is the number of rows found in table2. If N = 0, the row is skipped. In our case N = 1, so an INNER JOIN of idaction_name in table1 to idaction in table2, for example, will allow you to select all the action names.
In order to get the action urls as well we have to INNER JOIN a second time. Now you can't join the same table twice normally, as SQL won't know which of the two joined tables is meant when you type table2.name in the first part of your query. This would be ambiguous if both had the same name. There's a solution for this, table aliases.
The output (of my answer above) is going to be something like:
+-----+------------------------+-------------------------+
| Row | actionUrl | actionName |
+-----+------------------------+-------------------------+
| 1 | unx.co.jp/ | UNIX | Kumamoto Home |
| 2 | unx.co.jp/profile.html | UNIX | Kumamoto Profile |
| ... | ... | ... |
+-----+------------------------+-------------------------+
While if you used only a single join, you would get this kind of output (using OR):
+-----+-------------------------+
| Row | actionUrl |
+-----+-------------------------+
| 1 | unx.co.jp/ |
| 2 | UNIX | Kumamoto Home |
| 3 | unx.co.jp/profile.html |
| 4 | UNIX | Kumamoto Profile |
| ... | ... |
+-----+-------------------------+
Using AND and a single join, you only get output if idaction_name == idaction_url is TRUE. This is not the case, so there's no output.
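The difference between the two approaches can be reproduced with a tiny data set. In the sketch below the table contents and the idvisit column are invented for illustration (only the action names and URLs are taken from the sample output above), and the SQL runs on an in-memory SQLite database through Python's sqlite3 module rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- table2 holds both URLs and page names, each under its own idaction.
    CREATE TABLE table2 (idaction INTEGER PRIMARY KEY, name TEXT);
    -- table1 points at table2 twice: once for the URL, once for the name.
    CREATE TABLE table1 (idvisit INTEGER PRIMARY KEY, idaction_url INTEGER, idaction_name INTEGER);

    INSERT INTO table2 VALUES
        (1, 'unx.co.jp/'),
        (2, 'UNIX | Kumamoto Home'),
        (3, 'unx.co.jp/profile.html'),
        (4, 'UNIX | Kumamoto Profile');
    INSERT INTO table1 VALUES (1, 1, 2), (2, 3, 4);
""")

# Two joins to the same table, told apart by aliases.
two_joins_rows = conn.execute("""
    SELECT actionurls.name AS actionUrl, actionnames.name AS actionName
    FROM table1
    INNER JOIN table2 AS actionurls  ON table1.idaction_url  = actionurls.idaction
    INNER JOIN table2 AS actionnames ON table1.idaction_name = actionnames.idaction
    ORDER BY table1.idvisit
""").fetchall()

# One join with AND: a single table2 row would have to match both ids at once.
single_join_rows = conn.execute("""
    SELECT table2.name
    FROM table1
    INNER JOIN table2
        ON table1.idaction_url  = table2.idaction
       AND table1.idaction_name = table2.idaction
""").fetchall()

print(two_joins_rows)
print(single_join_rows)
```

The single join comes back empty because no row of table2 has an idaction equal to both idaction_url and idaction_name of the same table1 row.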
If you want to know more about how to use JOINS, consult the manual about them.
Sidenote
Also, I can't help but notice you're using variables (e.g. $table1) that store the names of the tables. Do you make sure that those values do not contain user input? And, if they do, do you at least whitelist a list of tables that users can access? You may have some security issues with this.
INNER JOIN does not put any restriction on the number of conditions it can have.
Zero resulting rows means that no rows satisfy both conditions simultaneously.
Make sure you are joining on the correct columns. Try going step by step to identify where the data is lost.

MySQL queries, selecting field from one of many databases

I have a remarks table which can be linked to any number of other items in a system, in the case of this example we'll use bookings, enquiries and referrals.
Thus in the remarks table we have columns
remark_id | datetime | text | booking_id | enquiry_id | referral_id
1 | 2014-06-28 | abc | 0 | 8 | 0
2 | 2014-06-27 | def | 3 | 0 | 0
3 | 2014-05-31 | ghi | 0 | 0 | 10
Etc...
Each of the item tables will have a field called name. Thus when I want to select a remark the likelihood is I'll need this name.
I'd like to achieve this with a single query, getting a 2d array as follows:
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'name'=>'Harold']
However the query I'd expect to use would be
SELECT r.remark_id,r.datetime,r.text
,b.name AS book,rr.name AS referral,e.name AS enquiry
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Leaving me with the output
['remark_id'=>1, 'datetime'=>'2014-06-28', 'text'=>'abc', 'book'=>'Harold', 'referral'=>'', 'enquiry'=>'']
And more processing to do before or during rendering it to a view.
Is there a way to write a query such that it would fill a field from the first NOT NULL string it encountered in one of the joined tables?
Please only suggest using a different database system if you know that MySQL doesn't provide any way to do what I'm asking. If it's the case it can't be done there's no business sense in rewriting the system anyway, but I'd like to ask!
Two ways I can think of:
use UNION:
SELECT remark_id, datetime, text, name
FROM remarks
JOIN bookings ON (remarks.book_id = bookings.book_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN referrals ON (remarks.referral_id = referrals.referral_id)
UNION
SELECT remark_id, datetime, text, name
FROM remarks
JOIN enquiries ON (remarks.enquiry_id = enquiries.enquiry_id)
use IFNULL (probably much slower):
SELECT r.remark_id,r.datetime,r.text,
IFNULL(b.name,IFNULL(rr.name,e.name)) AS name
FROM remarks AS r
LEFT JOIN bookings AS b ON b.book_id=r.book_id
LEFT JOIN referrals AS rr ON rr.referral_id=r.referral_id
LEFT JOIN enquiries AS e ON e.enquiry_id=r.enquiry_id
Variant 2 is really much slower because of the LEFT JOINs.
Also, I would generally not recommend using 0 as the value for non-existent links; use NULL instead. This will allow MySQL to speed up the join.
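For the "first NOT NULL value" part of the question, the nested IFNULL calls can also be written with the standard COALESCE function, which takes any number of arguments and returns the first one that is not NULL. A minimal sketch follows, using Python's sqlite3 module with an in-memory database as a stand-in for MySQL (COALESCE exists in both). The key column names, the NULL links, and every name except Harold are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings  (booking_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enquiries (enquiry_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE referrals (referral_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE remarks (
        remark_id   INTEGER PRIMARY KEY,
        datetime    TEXT,
        text        TEXT,
        booking_id  INTEGER,  -- NULL (not 0) when the remark is not linked
        enquiry_id  INTEGER,
        referral_id INTEGER
    );

    INSERT INTO bookings  VALUES (3,  'Maude');
    INSERT INTO enquiries VALUES (8,  'Harold');
    INSERT INTO referrals VALUES (10, 'Vivian');
    INSERT INTO remarks VALUES
        (1, '2014-06-28', 'abc', NULL, 8,    NULL),
        (2, '2014-06-27', 'def', 3,    NULL, NULL),
        (3, '2014-05-31', 'ghi', NULL, NULL, 10);
""")

# COALESCE picks the first non-NULL name among the three joined tables.
named_remarks = conn.execute("""
    SELECT r.remark_id, r.datetime, r.text,
           COALESCE(b.name, rr.name, e.name) AS name
    FROM remarks AS r
    LEFT JOIN bookings  AS b  ON b.booking_id   = r.booking_id
    LEFT JOIN referrals AS rr ON rr.referral_id = r.referral_id
    LEFT JOIN enquiries AS e  ON e.enquiry_id   = r.enquiry_id
    ORDER BY r.remark_id
""").fetchall()

print(named_remarks)
```

As with the nested IFNULL, the argument order sets the priority if a remark is ever linked to more than one item.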
One way to achieve this is with nested if statements:
if(b.name is not null, b.name, if(rr.name is not null, rr.name, e.name)) as name
One drawback is that this gives an implicit priority to bookings; I'm not sure whether that would be an issue.
Perhaps the main drawback, though, is that this is kind of "magical" and has goofy syntax, so it might be clearer to just handle those cases in the controller after all.
Seems quite messy that you have multiple unused columns for each entry, unless I'm not understanding correctly. If you add more tables, you'd have to adjust each of the views so that it would filter out the new table.
I'd be tempted to redesign your structure so that each of the tables has a remarkgroup_id column, then add the following remark table
remark_id, remarkgroup_id, date, message
This would clean up the extra unused columns and allow you to use simple joining logic.

Associative table with date

In my application I have association between two entities employees and work-groups.
This association usually changes over time, so in my DB I have something like:
employees
| EMPLOYEE_ID | NAME |
| ... | ... |
workgroups
| GROUP_ID | NAME |
| ... | ... |
employees_workgroups
| EMPLOYEE_ID | GROUP_ID | DATE |
| ... | ... | ... |
So suppose I have an association between employee 1 and group 1, valid from 2014-01-01 on.
When a new association is created, for example from 2014-02-01 on, the old one is no longer valid.
This structure for the associative table is a bit problematic for queries, but I would prefer to avoid adding an END_DATE field to the table, because it would be a redundant value and would also require an insert plus an update (or an update on two rows) every time an association changes.
Do you have any ideas for a more practical architecture to solve my problem, or is this the better approach?
You have what is called a slowly changing dimension. That means that you need to have dates in the employees_workgroup table in order to find the right workgroup at the right time for a set of employees.
The best way to handle this is to have two dates, which I often call effdate and enddate, on each row. This greatly simplifies queries where you are trying to find the workgroup at a particular point in time. With this structure, such a query might look like:
select ew.*
from employees_workgroup ew
where MYDATE between effdate and enddate;
Now consider the same results using only one date per row. It might be something like this:
select ew.*
from employees_workgroup ew join
(select ew2.employee_id, max(ew2.date) as maxdate
from employees_workgroup ew2
where ew2.date <= MYDATE
group by ew2.employee_id
) as rec
on ew.employee_id = rec.employee_id and ew.date = rec.maxdate;
The expense of doing an update along with the insert is minimal compared to the complexity this will introduce in the queries.
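To make the trade-off concrete, here is a small sketch of the two-date design, including the "update plus insert" step the question was worried about. It uses Python's sqlite3 module with an in-memory database in place of MySQL. The helper names assign_group and group_on, the '9999-12-31' sentinel for a still-open row, and the use of SQLite's date() function to compute the day before are all assumptions of this example, not something prescribed above.

```python
import sqlite3

OPEN_END = "9999-12-31"  # sentinel meaning "still valid" (assumption)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees_workgroup (
        employee_id INTEGER,
        group_id    INTEGER,
        effdate     TEXT,
        enddate     TEXT
    )
""")

def assign_group(employee_id, group_id, start):
    """Close the employee's open row, then insert the new association."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE employees_workgroup SET enddate = date(?, '-1 day') "
            "WHERE employee_id = ? AND enddate = ?",
            (start, employee_id, OPEN_END),
        )
        conn.execute(
            "INSERT INTO employees_workgroup VALUES (?, ?, ?, ?)",
            (employee_id, group_id, start, OPEN_END),
        )

def group_on(employee_id, day):
    """Point-in-time lookup: a plain BETWEEN, no subquery needed."""
    row = conn.execute(
        "SELECT group_id FROM employees_workgroup "
        "WHERE employee_id = ? AND ? BETWEEN effdate AND enddate",
        (employee_id, day),
    ).fetchone()
    return row[0] if row else None

assign_group(1, 1, "2014-01-01")
assign_group(1, 2, "2014-02-01")

all_rows = conn.execute(
    "SELECT employee_id, group_id, effdate, enddate "
    "FROM employees_workgroup ORDER BY effdate"
).fetchall()
print(all_rows)
print(group_on(1, "2014-01-15"), group_on(1, "2014-02-01"), group_on(1, "2013-12-31"))
```

Keeping the update and the insert in one transaction is what prevents the redundant enddate from drifting out of step with the next row's effdate.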