Situation
My goal is to have a yearly cronjob that deletes certain data from a database based on age. At my disposal I have the powers of Bash and MySQL. I started writing a Bash script, but then it struck me that maybe I could do everything with a single SQL query.
I'm more of a programmer by nature and haven't had much experience with data structures, so I would like some help.
Tables / data structure
The relevant tables and columns for this query are as follows:
Registration:
+----+-------------------+
| Id | Registration_date |
+----+-------------------+
|  2 | 2011-10-03        |
|  3 | 2011-10-06        |
|  4 | 2011-10-07        |
|  5 | 2011-10-07        |
|  6 | 2011-10-10        |
|  7 | 2011-10-13        |
|  8 | 2011-10-14        |
|  9 | 2011-10-14        |
| 10 | 2011-10-17        |
+----+-------------------+
AssociatedClient:
+-----------+-----------------+
| Client_id | Registration_id |
+-----------+-----------------+
|         2 |               2 |
|         3 |               2 |
|         3 |               4 |
|         4 |               5 |
|         3 |               6 |
|         5 |               6 |
|         3 |               8 |
|         8 |               9 |
|         7 |              10 |
+-----------+-----------------+
Client: only Id is relevant here.
As you can see, this is a simple many-to-many relationship. A client can have multiple registrations to his name, and a registration can have multiple clients.
The goal
I need to delete all registrations and client data for clients who have not had a new registration in 5 years. Sounds simple, right?
The tricky part
The data should be kept if any other client on any registration from a specific client has a new registration within 5 years.
So imagine client A having 4 registrations with just him in them, and 1 registration with himself and client B. All 5 registrations are older than 5 years. If client B did not have a new registration in 5 years, everything should be deleted: client A registrations and record. If B did have a new registration within 5 years, all client A data should be kept, including his own old registrations.
What I've tried
Building my query, I got about this far:
DELETE * FROM `Registration` AS Reg
WHERE TIMESTAMPDIFF(YEAR, Reg.`Registration_date`, NOW()) >= 5
AND
(COUNT(`Id`) FROM `Registration` AS Reg2
WHERE Reg2.`Id` IN (SELECT `Registration_id` FROM `AssociatedClient` AS Clients
WHERE Clients.`Client_id` IN (SELECT `Client_id` FROM `AssociatedClient` AS Clients2
WHERE Clients2.`Registration_id` IN -- stuck
#I need all the registrations from the clients associated with the first
# (outer) registration here, that are newer than 5 years.
) = 0 -- No newer registrations from any associated clients
Please understand that I have very limited experience with SQL. I realise that even what I got so far can be heavily optimised (with joins etc) and may not even be correct.
The reason I got stuck is that the solution I had in mind would work if I could use some kind of loop, and I only just realised that this is not something you easily do in an SQL query of this kind.
Any help
Is much appreciated.
Begin by identifying the registrations of the other clients of a registration. Here's a view:
create view groups as
select a.Client_id
, c.Registration_id
from AssociatedClient as a
join AssociatedClient as b on a.Registration_id = b.Registration_id
join AssociatedClient as c on b.Client_id = c.Client_id;
That gives us:
select Client_id
, min(Registration_id) as first
, max(Registration_id) as last
, count(distinct Registration_id) as regs
, count(*) as pals
from groups
group by Client_id;
Client_id first last regs pals
---------- ---------- ---------- ---------- ----------
2 2 8 4 5
3 2 8 4 18
4 5 5 1 1
5 2 8 4 5
7 10 10 1 1
8 9 9 1 1
You don't need a view, of course; it's just for convenience. You could use a derived table (a subquery in the FROM clause) instead. But inspect it carefully to convince yourself it produces the right range of "pal registrations" for each client. Note that the view does not reference Registration. That's significant because it produces the same results even after we use it to delete from Registration, so we can use it for the second delete statement.
Now we have a list of clients and their "pal registrations". What's the date of each pal's last registration?
select g.Client_id, max(Registration_date) as last_reg
from groups as g join Registration as r
on g.Registration_id = r.Id
group by g.Client_id;
g.Client_id last_reg
----------- ----------
2 2011-10-14
3 2011-10-14
4 2011-10-07
5 2011-10-14
7 2011-10-17
8 2011-10-14
Which ones have a latest date before a time certain?
select g.Client_id, max(Registration_date) as last_reg
from groups as g join Registration as r
on g.Registration_id = r.Id
group by g.Client_id
having max(Registration_date) < '2011-10-08';
g.Client_id last_reg
----------- ----------
4 2011-10-07
IIUC that would mean that client #4 should be deleted, and anything he registered for should be deleted. Registrations would be
select * from Registration
where Id in (
select Registration_id from groups as g
where Client_id in (
select g.Client_id
from groups as g join Registration as r
on g.Registration_id = r.Id
group by g.Client_id
having max(Registration_date) < '2011-10-08'
)
);
Id Registration_date
---------- -----------------
5 2011-10-07
And, sure enough, client #4 is in Registration #5, and is the only client subject to deletion by this test.
From there you can work out the delete statements. I think the rule is "delete the client and anything he registered for". If so, I'd probably write the Registration IDs to a temporary table, and write the deletes for both Registration and AssociatedClient by joining to it.
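Here is a rough sketch of that temporary-table approach, written against Python's built-in sqlite3 module so it can be run as-is on the sample data. The table name doomed, the fixed cutoff date, and the view name client_groups are my own choices for illustration (GROUPS is a reserved word in MySQL 8.0, so the view needs a different name or backticks there), and MySQL's date handling differs from SQLite's, so treat this as an outline rather than a drop-in script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Registration (Id INTEGER PRIMARY KEY, Registration_date TEXT);
CREATE TABLE AssociatedClient (Client_id INTEGER, Registration_id INTEGER);
INSERT INTO Registration VALUES
  (2,'2011-10-03'),(3,'2011-10-06'),(4,'2011-10-07'),(5,'2011-10-07'),
  (6,'2011-10-10'),(7,'2011-10-13'),(8,'2011-10-14'),(9,'2011-10-14'),
  (10,'2011-10-17');
INSERT INTO AssociatedClient VALUES
  (2,2),(3,2),(3,4),(4,5),(3,6),(5,6),(3,8),(8,9),(7,10);
-- Same definition as the view above, under a non-reserved name.
CREATE VIEW client_groups AS
  SELECT a.Client_id, c.Registration_id
  FROM AssociatedClient a
  JOIN AssociatedClient b ON a.Registration_id = b.Registration_id
  JOIN AssociatedClient c ON b.Client_id = c.Client_id;
""")

CUTOFF = "2011-10-08"  # assumed cutoff, same date as in the queries above

# Materialise the registration IDs first, so that both deletes work from
# the same list even though the first delete changes AssociatedClient.
conn.execute("CREATE TEMP TABLE doomed (Registration_id INTEGER PRIMARY KEY)")
conn.execute("""
INSERT INTO doomed
SELECT DISTINCT Registration_id FROM client_groups
WHERE Client_id IN (
    SELECT g.Client_id
    FROM client_groups g JOIN Registration r ON g.Registration_id = r.Id
    GROUP BY g.Client_id
    HAVING MAX(r.Registration_date) < ?
)""", (CUTOFF,))

conn.execute("DELETE FROM AssociatedClient "
             "WHERE Registration_id IN (SELECT Registration_id FROM doomed)")
conn.execute("DELETE FROM Registration "
             "WHERE Id IN (SELECT Registration_id FROM doomed)")

remaining = [row[0] for row in conn.execute("SELECT Id FROM Registration ORDER BY Id")]
links_left = conn.execute("SELECT COUNT(*) FROM AssociatedClient").fetchone()[0]
print(remaining, links_left)
```

With the sample data only registration 5 (client #4's) ends up in doomed, which matches the select above.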
You want to know all registrations that need to be kept.
So your first query returns the registrations from the previous 5 years:
SELECT
Id
FROM
Registration
WHERE
Registration_date >= '2011-10-08'
Then all registrations of the clients related to the previous query:
SELECT
a2.Registration_id as Id
FROM
AssociatedClient AS a1
INNER JOIN AssociatedClient AS a2
ON a1.Client_id = a2.Client_id
WHERE
a1.Registration_id IN
(
SELECT
Id
FROM
Registration
WHERE
Registration_date >= '2011-10-08'
)
Then you have all registrations that you must not delete by combining the previous queries in a UNION, and you want all clients that are not part of this query:
SELECT
Client_id
FROM
AssociatedClient
WHERE
Registration_id NOT IN
(
SELECT
Id
FROM
Registration
WHERE
Registration_date >= '2011-10-08'
UNION
SELECT
a2.Registration_id as Id
FROM
AssociatedClient AS a1
INNER JOIN AssociatedClient AS a2
ON a1.Client_id = a2.Client_id
WHERE
a1.Registration_id IN
(
SELECT
Id
FROM
Registration
WHERE
Registration_date >= '2011-10-08'
)
)
You can see the results in this SQL fiddle.
Then you can delete the rows of clients that have no registration matching the criteria, using the following query:
DELETE FROM
AssociatedClient
WHERE
Client_id IN (<previous query>);
And then all registrations that are no longer present in AssociatedClient:
DELETE FROM
Registration
WHERE
Id NOT IN (SELECT Registration_id FROM AssociatedClient)
Use temporary tables.
CREATE TEMPORARY TABLE LockedClient (client_id INT PRIMARY KEY);
INSERT INTO LockedClient(client_id) -- select clients that should not be deleted
SELECT DISTINCT ac.Client_id
FROM AssociatedClient ac
JOIN Registration r ON r.Id = ac.Registration_id
WHERE TIMESTAMPDIFF(YEAR, r.`Registration_date`, NOW()) < 5;
DELETE r FROM Registration r -- now delete all except locked clients
JOIN AssociatedClient ac ON ac.Registration_id = r.Id
LEFT JOIN LockedClient lc ON lc.client_id = ac.Client_id
WHERE TIMESTAMPDIFF(YEAR, r.`Registration_date`, NOW()) >= 5 AND lc.client_id IS NULL;
This should give you the proper clients information 1 level down into the linked clients. I know that this may not give you all the needed information. But, as stated in the comments, a 1 level implementation should be sufficient for now. This may not be optimal.
SELECT
AC1.Client_id,
MAX(R.Registration_date) AS [LatestRegistration]
FROM
#AssociatedClient AC1
JOIN #AssociatedClient AC2
ON AC1.Registration_id = AC2.Registration_id
JOIN #AssociatedClient AC3
ON AC2.Client_id = AC3.Client_id
JOIN #Registration R
ON AC3.Registration_id = R.Id
GROUP BY
AC1.Client_id
You should look into a function using loops. That's the only thing I can think about right now.
I'm a SQL Server guy, but I think this syntax will work for MySQL. This query will pull the clients that should not be deleted.
SELECT A3.Client_id
FROM AssociatedClient A1
#Get clients with registrations in the last 5 years
JOIN Registration R1 ON A1.Registration_id = R1.Id
AND TIMESTAMPDIFF(YEAR, R1.Registration_Date, NOW()) < 5
#get the rest of the registrations for those clients
JOIN AssociatedClient A2 ON A1.Client_id = A2.Client_id
#get other clients tied to the rest of the registrations
JOIN AssociatedClient A3 ON A2.Registration_id = A3.Registration_id
You need two SQL delete statements, because you are deleting from two tables.
Both delete statements need to distinguish between registrations which are being kept and those being deleted, so the delete from the registration table needs to happen second.
The controlling issue is the most recent registration associated with an id (a registration id or a client id). So you will be aggregating based on id and finding the maximum registration date.
When deleting client ids, you delete those where the aggregate registration date is older than five years. This deletion will disassociate registration ids which were previously linked, but that is ok, because this action will not give them a more recent associated registration date.
That said, once you have the client ids, you'll need a join on registration ids which finds associated registration ids. You'll need to join to client ids and then self join back to registration ids to get that part to work right. If you've deleted all client ids which were associated with a registration, you'll need to delete those registrations also.
My SQL is a bit rusty, and my MySQL rustier, and this is untested code, but this should be reasonably close to what I think you need to do:
delete from associatedclient where client_id in (
select client_id from (
select ac.client_id, max(r.registration_date) as dt
from associatedclient ac
inner join registration r
on ac.registration_id = r.id
group by ac.client_id
) d where d.dt < cutoff
)
The next step would look something like this:
delete from registration where id in (
select id from (
select r1.id, max(r2.registration_date) as dt
from registration r1
left join associatedclient ac1
on r1.id = ac1.registration_id
left join associatedclient ac2
on ac1.client_id = ac2.client_id
left join registration r2
on ac2.registration_id = r2.id
group by r1.id
) d
where d.dt < cutoff
or d.dt is null
)
I hope you don't mind me reminding you, but you should run the select statements without the deletes first, and inspect the result for plausibility, before you go ahead and delete stuff.
(And if you have any constraints or indices which prevent this from working you'll have to deal with those also.)
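To make the "select first, delete later" advice concrete, here is a small dry run using Python's built-in sqlite3 module and the sample data from the question. The cutoff date is an assumed value standing in for the cutoff placeholder above, and the rollback at the end is there only to show that nothing is committed until you are happy with the preview; MySQL has its own transaction syntax, so this is a sketch of the workflow rather than production code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Registration (Id INTEGER PRIMARY KEY, Registration_date TEXT);
CREATE TABLE AssociatedClient (Client_id INTEGER, Registration_id INTEGER);
INSERT INTO Registration VALUES
  (2,'2011-10-03'),(3,'2011-10-06'),(4,'2011-10-07'),(5,'2011-10-07'),
  (6,'2011-10-10'),(7,'2011-10-13'),(8,'2011-10-14'),(9,'2011-10-14'),
  (10,'2011-10-17');
INSERT INTO AssociatedClient VALUES
  (2,2),(3,2),(3,4),(4,5),(3,6),(5,6),(3,8),(8,9),(7,10);
""")

CUTOFF = "2011-10-08"  # assumed value for the `cutoff` placeholder

# Clients whose most recent registration is older than the cutoff.
stale_clients = """
    SELECT ac.Client_id, MAX(r.Registration_date) AS dt
    FROM AssociatedClient ac
    JOIN Registration r ON ac.Registration_id = r.Id
    GROUP BY ac.Client_id
    HAVING MAX(r.Registration_date) < ?
"""

# Step 1: run only the select and inspect what would be removed.
preview = conn.execute(stale_clients + " ORDER BY ac.Client_id", (CUTOFF,)).fetchall()
print("would delete:", preview)

# Step 2: run the delete inside a transaction and check the row count.
cur = conn.execute(
    "DELETE FROM AssociatedClient WHERE Client_id IN "
    "(SELECT Client_id FROM (" + stale_clients + "))",
    (CUTOFF,),
)
deleted = cur.rowcount

# Step 3: roll back here; call conn.commit() instead once the preview looks right.
conn.rollback()
rows_left = conn.execute("SELECT COUNT(*) FROM AssociatedClient").fetchone()[0]
print(deleted, rows_left)
```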
Related
My client was given the following code, and he uses it daily to count the messages sent to businesses on his website. I have looked at the MYSQL.SLOW.LOG and it has the following stats for this query, which indicate to me that it needs optimising.
Count: 183 Time=44.12s (8073s) Lock=0.00s (0s)
Rows_sent=17337923391683297280.0 (-1), Rows_examined=382885.7
(70068089), Rows_affected=0.0 (0), thewedd1[thewedd1]@localhost
The query is:
SELECT
businesses.name AS BusinessName,
messages.created AS DateSent,
messages.guest_sender AS EnquirersEmail,
strip_tags(messages.message) AS Message,
users.name AS BusinessName
FROM
messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
My SQL is not very good, but would a LEFT JOIN rather than a JOIN help to reduce the number of rows returned? I have run an EXPLAIN query and it seems to make no difference between the LEFT JOIN and the JOIN.
Basically I think it would be good to reduce the number of rows returned, as it is absurdly big.
Short answer: There is nothing "wrong" with your query, other than the duplicate BusinessName alias.
Long answer: You can add indexes on the foreign and primary key columns used in the joins, which will speed up the lookups and help more than changing the query.
If you're using SSMS (SQL management studio) you can right click on indexes for a table and use the wizard.
Just don't be tempted to index all the columns as that may slow down any inserts you do in future, stick to the ids and _ids unless you know what you're doing.
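As a rough illustration of indexing just the join columns, here is a sketch run through Python's built-in sqlite3 module. The column lists are guessed from the query in the question, and the index names idx_messages_from_to and idx_users_business_id are made up; the two CREATE INDEX statements have the same shape in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed minimal schema, reconstructed from the columns the query uses.
CREATE TABLE businesses (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, business_id INTEGER);
CREATE TABLE messages (id INTEGER PRIMARY KEY, from_to INTEGER, created TEXT,
                       guest_sender TEXT, message TEXT);

-- Index only the columns that appear in the JOIN conditions.
CREATE INDEX idx_messages_from_to ON messages (from_to);
CREATE INDEX idx_users_business_id ON users (business_id);
""")

# List the explicitly created indexes to confirm they exist.
index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
print(index_names)
```

Whether the optimiser actually uses them depends on the data, so it is worth re-running EXPLAIN on the real query afterwards.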
he uses it daily to count the messages sent to businesses
If this is done per day, why not limit this to messages sent in specific recent days?
As an example: To count messages sent per business per day, for just a few recent days (example: 3 or 4 days), try this:
SELECT businesses.name AS BusinessName
, messages.created AS DateSent
, COUNT(*) AS n
FROM messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
WHERE messages.created BETWEEN current_date - INTERVAL '3' DAY AND current_date
GROUP BY businesses.id
, DateSent
ORDER BY DateSent DESC
, n DESC
, businesses.id
;
Note: businesses.name is functionally dependent on businesses.id (in the GROUP BY terms), which is the primary key of businesses.
Example result:
+--------------+------------+---+
| BusinessName | DateSent | n |
+--------------+------------+---+
| business1 | 2021-09-05 | 3 |
| business2 | 2021-09-05 | 1 |
| business2 | 2021-09-04 | 1 |
| business2 | 2021-09-03 | 1 |
| business3 | 2021-09-02 | 5 |
| business1 | 2021-09-02 | 1 |
| business2 | 2021-09-02 | 1 |
+--------------+------------+---+
7 rows in set
This assumes your basic join logic is correct, which might not be true.
Other data could be returned as aggregated results, if necessary. Because the query is now limited to recent data, the number of rows examined should be much more reasonable.
I hope the title is not too confusing. I am learning sql by working on a database for an airline company. For the query I will explain, the following tables get involved:
1. Airplane
plane_number| type | capacity
-------------------------------
I-XX0 | boeing| 200
I-XX1 | airbus| 250
2. Route
route_id | airport1 | airport2
-------------------------------
1 | LAX | CDG
2 | FCO | LAX
3. Flight
flight_id | departure | arrival | plane_number | route_id
-----------------------------------------------------------------------------------------
AC 000 | 2020-02-11T13:10:00 |2020-02-11T15:15:00 | I-XX0 | 1
AC 001 | 2020-02-12T13:10:00 |2020-02-12T15:15:00 | I-XX1 | 2
4. employee
employee_id | name | surname
-------------------------------
1 | bob | black
2 | paul | white
5. service
employee_id | flight_id
-----------------------
1 | AC 000
2 | AC 001
Given this data, is it possible to find the employees who never worked on the same route two days in a row?
I have tried doing a self join, but I'm not sure that's the right approach.
I hope I've been clear enough, if not please comment in order to suggest an edit.
Thank you all very much in advance.
EDIT
In order to make the whole model more clear, here is the ER model:
This isn't terrible once you break it down into its parts: A query giving employee_ids for employees who have worked the same route on consecutive days, and a query of the employee table getting all employees whose employee_ids don't appear in the first query.
For the first query, we need to find consecutive flights on the same route. The first step is joining Flight to itself on the route_id and the departure date. The join condition for the departure date should check that the second flight departed one day after the first flight:
FROM Flight f1
JOIN Flight f2 ON f1.route_id = f2.route_id
AND CAST(f1.Departure AS DATE) = CAST(DATE_ADD(f2.Departure, INTERVAL -1 DAY) AS DATE)
Then join each Flight to service on flight_id, and confirm the employee_ids are the same:
JOIN service s1 ON f1.flight_id = s1.flight_id
JOIN service s2 ON f2.flight_id = s2.flight_id
WHERE
s1.employee_id = s2.employee_id
Putting it together, we want to select distinct employee_ids, and wrap it in a CTE to join to:
WITH B2BRouteEmployees
AS
(
SELECT DISTINCT s1.employee_id
FROM Flight f1
JOIN Flight f2 ON f1.route_id = f2.route_id
AND CAST(f1.Departure AS DATE) = CAST(DATE_ADD(f2.Departure, INTERVAL -1 DAY) AS DATE)
JOIN service s1 ON f1.flight_id = s1.flight_id
JOIN service s2 ON f2.flight_id = s2.flight_id
WHERE
s1.employee_id = s2.employee_id
)
Now we can do a LEFT JOIN between the employee table and our B2BRouteEmployees table, and take the employees where the join is NULL -- they do not appear in the list.
SELECT e.employee_id
FROM employee e
LEFT JOIN B2BRouteEmployees b ON e.employee_id = b.employee_id
WHERE
b.employee_id IS NULL
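To check the logic end to end, here is the same query run through Python's built-in sqlite3 module on the question's sample rows. Two things are my own additions: SQLite's date() function stands in for the MySQL CAST/DATE_ADD expressions, and flight AC 002 (with bob serving on it) is a made-up row so that one employee actually works route 1 on consecutive days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Flight (flight_id TEXT PRIMARY KEY, departure TEXT, arrival TEXT,
                     plane_number TEXT, route_id INTEGER);
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE service (employee_id INTEGER, flight_id TEXT);
INSERT INTO Flight VALUES
  ('AC 000','2020-02-11T13:10:00','2020-02-11T15:15:00','I-XX0',1),
  ('AC 001','2020-02-12T13:10:00','2020-02-12T15:15:00','I-XX1',2),
  -- Hypothetical extra flight: route 1 again, one day after AC 000.
  ('AC 002','2020-02-12T09:00:00','2020-02-12T11:05:00','I-XX0',1);
INSERT INTO employee VALUES (1,'bob','black'),(2,'paul','white');
INSERT INTO service VALUES (1,'AC 000'),(2,'AC 001'),(1,'AC 002');
""")

rows = conn.execute("""
WITH B2BRouteEmployees AS (
    SELECT DISTINCT s1.employee_id
    FROM Flight f1
    JOIN Flight f2 ON f1.route_id = f2.route_id
                  -- f2 departed exactly one calendar day after f1
                  AND date(f1.departure) = date(f2.departure, '-1 day')
    JOIN service s1 ON f1.flight_id = s1.flight_id
    JOIN service s2 ON f2.flight_id = s2.flight_id
    WHERE s1.employee_id = s2.employee_id
)
SELECT e.employee_id, e.name
FROM employee e
LEFT JOIN B2BRouteEmployees b ON e.employee_id = b.employee_id
WHERE b.employee_id IS NULL
ORDER BY e.employee_id
""").fetchall()
print(rows)
```

bob is filtered out because he served on AC 000 and AC 002, both on route 1 on consecutive days, so only paul is returned.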
I can't say this is the best way of doing it, as I'm just sketching this up in a text editor, but I think it should give you an idea of one way to approach it. Something like this should work:
-- Make a temp table / cte / or you could do this as a sub query. you want to show the next departure for each employee route combination
WITH flightEmployee AS (
SELECT s.employee_id,
f1.route_id,
f1.departure,
lead(f1.departure,1) OVER (PARTITION BY s.employee_id, f1.route_id ORDER BY f1.departure) AS nextDeparture
FROM #flight f1
INNER JOIN #service s
ON f1.flight_id = s.flight_id
)
-- you can then use this to show if it is going to be departing within 24 hours in many different ways - here is one example.
SELECT *
FROM (
SELECT employee_id,
route_id,
SUM(CASE WHEN DATEDIFF(hour,nextDeparture,departure) >= -24 THEN 1 ELSE 0 END) As No_Of_Same_Employee_and_Route_Within_24_Hours
FROM flightEmployee
GROUP BY employee_id,route_id
) x
WHERE x.No_Of_Same_Employee_and_Route_Within_24_Hours= 0
I count each instance of employee / route combinations within 24 hours - then sum them and select only those employee / routes with a count of zero.
As I said, there will be many ways of solving this, maybe some with less code or more efficient. This is simply food for thought.
SELECT e.name, e.surname
FROM employee e
LEFT JOIN service s ON s.employee_id = e.employee_id
LEFT JOIN (
SELECT DISTINCT f.departure, f.route_id, f.flight_id
FROM flight f) g ON g.flight_id = s.flight_id
Can you try this?
I've got THREE MYSQL TABLES (innoDB) :
NAMES
id nid version fname lname birth
RELATIONS
id rid version idname idperson roleid
ROLES
id role
I want to select the last version of each RELATIONS joined to the last version of their related NAMES for a particular idperson (and the name of the ROLE)
Of course, idperson will have 0, 1 or more relations and there will be one or more versions of RELATIONS and NAMES
I wrote something like :
SELECT A.id,A.nid,MAX(A.version),A.idname,A.idperson,A.roleid,B.id,B.role
FROM RELATIONS A
INNER JOIN
ROLES
ON A.roleid = B.id
INNER JOIN
(SELECT id,nid,MAX(version),fname,lname,birth FROM NAMES) C
ON A.idname = C.id
WHERE A.idperson = xx
It doesn't work maybe because MAX() seems to return only one line...
How to get the maximum value for more than one line in this joining context?
PS: how do you generate this kind of nice data set?
i.e. :
id home datetime player resource
---|-----|------------|--------|---------
1 | 10 | 04/03/2009 | john | 399
2 | 11 | 04/03/2009 | juliet | 244
5 | 12 | 04/03/2009 | borat | 555
8 | 13 | 01/01/2009 | borat | 700
Adding a GROUP BY statement, both in the subquery you have, as well as in the outer query should allow the MAX function to generate the result that you're looking for.
This is untested code, but should give you the result that you're looking for:
SELECT A.id,A.nid,MAX(A.version),A.idname,A.idperson,A.roleid,B.id,B.role
FROM RELATIONS A
INNER JOIN
ROLES B
ON A.roleid = B.id
INNER JOIN
(SELECT id,nid,MAX(version),fname,lname,birth FROM NAMES GROUP BY fname,lname) C
ON A.idname = C.id
WHERE A.idperson = xx
GROUP BY fname,lname
Alternatively, if it works better for your database architecture, you can use any unique identifier for the employees you'd like (possibly nid?).
As to the question that you've posed in your PS, I'm unsure what you're asking. I don't see a home, datetime, player, or resource field in the examples of your tables that you've provided. If you could clarify, I'd be happy to try and help you with that as well.
I have the following SQL query which queries my tickets, ticketThreads, users and threadStatus tables:
SELECT tickets.threadId, ticketThreads.threadSubject, tickets.ticketCreatedDate, ticketThreads.threadCreatedDate, threadStatus.threadStatus, users.name
FROM
tickets
INNER JOIN
ticketThreads
ON
tickets.threadId = ticketThreads.threadId
INNER JOIN
threadStatus
ON
ticketThreads.threadStatus = threadStatus.id
INNER JOIN
users
ON
users.id = ticketThreads.threadUserId
WHERE
tickets.ticketId = ticketThreads.lastMessage
AND
ticketThreads.threadStatus != 3
ORDER BY
tickets.ticketCreatedDate
DESC
The abridged version of what this returns is:
threadId |
----------
1 |
2 |
This works fine, and is what I expect, however to clean up the code and database slightly I need to remove the ticketThreads.lastMessage column.
If I remove the line WHERE tickets.ticketId = ticketThreads.lastMessage then this is an abridged version of what is returned:
threadId |
----------
1 |
2 |
1 |
What I need to do then is edit the query above to enable me to select the highest unique value for each threadId value in the tickets database.
I know about MAX() and GROUP BY but can't figure how to get them into my query above.
The relevant parts of the tables are shown below:
tickets
ticketId | ticketUserId | threadId
-------------------------------
1 | 1 | 1
2 | 1 | 2
3 | 1 | 1
ticketThreads
threadId | lastMessage | threadStatus
-------------------------------
1 | 3 | 4
2 | 2 | 1
I hope all the above is clear and makes sense
So you need the ticket with the highest id for each thread? Your problem is actually a very easy variant of the greatest-record-per-group problem. No need for any subqueries. Basically you have two options, both of which should perform much better than your query; the second should be the faster of the two (please post the actual durations in your db!):
1. Standard compliant query, but slower:
SELECT t1.threadId, ticketThreads.threadSubject, t1.ticketCreatedDate,
ticketThreads.threadCreatedDate, threadStatus.threadStatus, users.name
FROM tickets as t1
LEFT JOIN tickets as t2
ON t1.threadId = t2.threadId AND t1.ticketId < t2.ticketId
JOIN ticketThreads ON t1.threadId = ticketThreads.threadId
JOIN threadStatus ON ticketThreads.threadStatus = threadStatus.id
JOIN users ON users.id = ticketThreads.threadUserId
WHERE t2.threadId is NULL
AND ticketThreads.threadStatus != 3
ORDER BY t1.ticketCreatedDate DESC
This one joins the tickets table two times, which can make it a bit slower for big tables.
2. Faster, but uses MySQL extension to standard SQL:
set @prev_thread := NULL;
SELECT t.threadId, ticketThreads.threadSubject, t.ticketCreatedDate,
ticketThreads.threadCreatedDate, threadStatus.threadStatus, users.name
FROM tickets as t
JOIN ticketThreads ON t.threadId = ticketThreads.threadId
JOIN threadStatus ON ticketThreads.threadStatus = threadStatus.id
JOIN users ON users.id = ticketThreads.threadUserId
WHERE ticketThreads.threadStatus != 3
AND IF(IFNULL(@prev_thread, -1) = (@prev_thread := t.threadId), 0, 1)
ORDER BY t.threadId, t.ticketId DESC,
t.ticketCreatedDate DESC
Here, we perform a one-pass scan over the ordered joined data, using the auxiliary MySQL variable @prev_thread to keep only the first ticket (in the given order) for each thread, i.e. the one with the highest ticketId.
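For anyone who wants to see the first option in isolation, here is the self-join reduced to just the tickets table, run with Python's built-in sqlite3 module on the three sample rows from the question. The MAX()/GROUP BY query next to it is my own addition, included only as a cross-check that both forms pick the same ticket per thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tickets (ticketId INTEGER PRIMARY KEY, ticketUserId INTEGER,
                      threadId INTEGER);
INSERT INTO tickets VALUES (1,1,1),(2,1,2),(3,1,1);
""")

# Anti-join: keep a ticket only if no ticket in the same thread has a higher id.
anti_join = conn.execute("""
SELECT t1.threadId, t1.ticketId
FROM tickets AS t1
LEFT JOIN tickets AS t2
       ON t1.threadId = t2.threadId AND t1.ticketId < t2.ticketId
WHERE t2.ticketId IS NULL
ORDER BY t1.threadId
""").fetchall()

# Cross-check with the MAX()/GROUP BY form mentioned in the question.
grouped = conn.execute("""
SELECT threadId, MAX(ticketId)
FROM tickets
GROUP BY threadId
ORDER BY threadId
""").fetchall()

print(anti_join, grouped)
```

Thread 1 resolves to ticket 3 and thread 2 to ticket 2, which are the same values the lastMessage column held.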
I have a table of services that have been provided to clients. I'm trying to make a query that selects all the clients who received a service that WEREN'T provided by a certain user.
So consider this table...
id_client | id_service | id_user
--------------------------------
5 | 3 | 2
7 | 4 | 2
7 | 4 | 1
9 | 4 | 2
8 | 4 | 1
If I write the query like this:
SELECT id_client FROM table WHERE id_service=4 AND id_user<>1
I still end up getting id_client 7. But I don't want to get client 7 because that client HAS received that service from user 1. (They're showing up because they've also received that service from user 2)
In the example above I would only want to be returned with client 9
How can I write the query to make sure that clients that have EVER received service 4 from user 1 don't show up?
Try this:
SELECT DISTINCT id_client
FROM yourtable t
WHERE id_service = 4 AND id_client NOT IN
(SELECT DISTINCT id_client
FROM yourtable t
WHERE id_user = 1 AND id_service = 4
)
I'd write it like this:
SELECT DISTINCT t1.id_client
FROM mytable t1
LEFT OUTER JOIN mytable t2
ON t1.id_client = t2.id_client AND t2.id_service = 4 AND t2.id_user = 1
WHERE t1.id_service = 4 AND t2.id_client IS NULL
When the conditions of a left outer join are not met, the row on the left side is still returned, and all the columns for the row on the right side are null. So if you search for cases of null in a column that would be certain to be non-null if there were a match, you know the outer join found no match.
SELECT id_client
FROM table
WHERE id_service = 4
GROUP BY id_client
HAVING MAX(CASE
WHEN id_user = 1 THEN 2
ELSE 1
END) = 1
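A quick way to compare the three approaches is to run them side by side on the sample table. The sketch below uses Python's built-in sqlite3 module; the table name services is made up (TABLE itself is a reserved word), and each query is written so that both the outer filter and the exclusion are restricted to service 4, which is my reading of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE services (id_client INTEGER, id_service INTEGER, id_user INTEGER);
INSERT INTO services VALUES (5,3,2),(7,4,2),(7,4,1),(9,4,2),(8,4,1);
""")

queries = {
    # Exclude every client that appears in the "service 4 by user 1" list.
    "not_in": """
        SELECT DISTINCT id_client FROM services
        WHERE id_service = 4
          AND id_client NOT IN (SELECT id_client FROM services
                                WHERE id_service = 4 AND id_user = 1)
        ORDER BY id_client""",
    # Outer join to the unwanted rows and keep clients with no match.
    "left_join": """
        SELECT DISTINCT t1.id_client
        FROM services t1
        LEFT JOIN services t2
               ON t1.id_client = t2.id_client
              AND t2.id_service = 4 AND t2.id_user = 1
        WHERE t1.id_service = 4 AND t2.id_client IS NULL
        ORDER BY t1.id_client""",
    # Flag user 1 rows with 2; a client whose maximum is 1 never had user 1.
    "having": """
        SELECT id_client FROM services
        WHERE id_service = 4
        GROUP BY id_client
        HAVING MAX(CASE WHEN id_user = 1 THEN 2 ELSE 1 END) = 1
        ORDER BY id_client""",
}

results = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in queries.items()}
print(results)
```

All three return only client 9 for this data.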