MySQL DELETE where ID isn't present in multiple tables - best practice?

I want to delete people that aren't present in events or photos or email subscribers. Maybe they were, but the only photo they're tagged in gets deleted, or the event they were at gets purged from the database.
Two obvious options:
1)
DELETE FROM people
WHERE personPK NOT IN (
    SELECT personFK FROM attendees
    UNION
    SELECT personFK FROM photo_tags
    UNION
    SELECT personFK FROM email_subscriptions
)
2)
DELETE people FROM people
LEFT JOIN attendees A on A.personFK = personPK
LEFT JOIN photo_tags P on P.personFK = personPK
LEFT JOIN email_subscriptions E on E.personFK = personPK
WHERE attendeePK IS NULL
AND photoTagPK IS NULL
AND emailSubPK IS NULL
Both A & P are about a million rows apiece, and E a few thousand.
The first option works fine, taking 10 seconds or so.
The second option times out.
Is there a cleverer, better, faster third option?

Here is what I would do with, say, a multi-million-row, half-fictitious schema like the one above.
To the people table, I would add one count column per child table, plus a datetime, such as:
photoCount INT NOT NULL,
...
lastUpdt DATETIME NOT NULL,
When it comes time for an INSERT/UPDATE on child tables (main focus naturally being insert), I would
begin a transaction
perform a "select for update" which renders an Intention Lock on the parent (people) row
perform the child insert, such as a new picture or email
increment the parent relevant count variable and set lastUpdt=now()
commit the tran (which releases the intention lock)
A delete against a child row is like above but with a decrement.
Whether these are done client-side / Stored Procs/ Trigger is your choice.
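A minimal sketch of that insert path (assuming InnoDB; the table and column names such as photoCount, photo_tags.photoFK, and the literal ids are guesses based on the question):

START TRANSACTION;

-- Lock the parent row so concurrent child inserts for the same person serialize here.
SELECT photoCount FROM people WHERE personPK = 123 FOR UPDATE;

-- Insert the child row (a new photo tag in this example).
INSERT INTO photo_tags (photoFK, personFK) VALUES (456, 123);

-- Maintain the summary count and the timestamp on the parent.
UPDATE people
SET photoCount = photoCount + 1,
    lastUpdt = NOW()
WHERE personPK = 123;

COMMIT;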
Have an Event that fires off once a week (you choose how often) and deletes people rows whose lastUpdt is more than a week old and whose count columns are all zero.
I realize the Intention Lock is not an exact analogy, but the points about timeouts, row-level locking, and the need for speed are relevant.
As always carefully craft your indexes considering frequency of use, real benefit, and potential drags on the system.
As for any periodic cleanup Events, schedule them to run in low peak hours with the scheduler.
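For illustration, such an Event might look roughly like this (the count column names are assumptions carried over from the sketch above, and the event_scheduler must be enabled):

CREATE EVENT purge_orphan_people
ON SCHEDULE EVERY 1 WEEK
STARTS '2024-01-07 03:00:00'   -- pick a low-traffic hour
DO
  DELETE FROM people
  WHERE lastUpdt < NOW() - INTERVAL 1 WEEK
    AND attendeeCount = 0      -- assumed column names
    AND photoCount = 0
    AND emailSubCount = 0;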
There are some natural downsides to all of this. But if those summary numbers are useful for other profile pages, and fetching them on the fly is too costly, you benefit by it. Also, you certainly avoid what I see as expensive calls in your two proposed solutions.

I tried to duplicate your scenario here using PostgreSQL, but I think there is something else you didn't tell us.
Both A & P are about a million rows apiece, and E a few thousand.
table people = 10k records
I selected 9,500 records at random and inserted them into email_subscriptions.
Then I duplicated those 9,500 records 100 times for attendees and photo_tags, for a total of 950k rows in each table.
SQL FIDDLE DEMO
The first query needs 5 seconds.
The second one needs 11 milliseconds.

Related

Data design best practices for customer data

I am trying to store customer attributes in a MySQL database although it could be any type of database. I have a customer table and then I have a number of attribute tables (status, product, address, etc.)
The business requirements are to be able to A) look back at a point in time to see if a customer was active or what address they had on any given date, and B) have a customer service rep be able to enter things like future vacation holds. A customer might call today and tell the rep they will be on vacation next week.
I currently have different tables for each customer attribute. For instance, the customer status table has records like this:
CustomerID   Status     dEffectiveStart   dEffectiveEnd
--------------------------------------------------------
1            Active     2022-01-01        2022-05-01
1            Vacation   2022-05-02        2022-05-04
1            Active     2022-05-05        2099-01-01
When I join these tables the sql typically looks like this:
SELECT *
FROM customers c
JOIN customerStatus cs
on cs.CustomerID = c.CustomerID
and curdate() between cs.dEffectiveStart and cs.dEffectiveEnd
While this setup does work as designed, it is slow. The query joins themselves aren't too bad, but when I try to throw an ORDER BY on, it's done for. The typical client query would pull 5-20k records. There are 5-6 other tables similar to the one above that I join to a customer.
Do you have any suggestions for a better approach?
That ON clause is very hard to optimize. So, let me try to 'avoid' it.
If you are always (or usually) testing CURDATE(), then I recommend this schema design pattern. I call it History + Current.
The History table contains many rows per customer.
The Current table contains only "current" info about each customer -- one row per customer. Your SELECT would need only this table.
Your design is "proper" because the current status is not redundantly stored in two places. My design requires changing the status in both tables when it changes. This is a small extra cost when changing the "status", for a big gain in SELECT.
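A minimal sketch of the pattern (table and column names are assumptions based on the question's customerStatus table):

-- Current: exactly one row per customer, holding only the status in effect right now.
CREATE TABLE customerStatusCurrent (
    CustomerID INT NOT NULL PRIMARY KEY,
    Status VARCHAR(20) NOT NULL,
    dEffectiveStart DATE NOT NULL
);

-- History: many rows per customer, append-only, used for point-in-time lookups.
CREATE TABLE customerStatusHistory (
    CustomerID INT NOT NULL,
    Status VARCHAR(20) NOT NULL,
    dEffectiveStart DATE NOT NULL,
    dEffectiveEnd DATE NOT NULL,
    PRIMARY KEY (CustomerID, dEffectiveStart)
);

-- The frequent "as of today" query then needs no date-range test at all:
SELECT *
FROM customers c
JOIN customerStatusCurrent cs ON cs.CustomerID = c.CustomerID;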
More
The Optimizer will probably transform that query into
SELECT *
FROM customerStatus cs
JOIN customers c
ON cs.CustomerID = c.CustomerID
WHERE curdate() >= cs.dEffectiveStart
AND curdate() <= cs.dEffectiveEnd
(Use EXPLAIN SELECT ...; SHOW WARNINGS; to find out exactly.)
In a plain JOIN, the Optimizer likes to start with the table that is most filtered. I moved the "filtering" to the WHERE clause so we could see it; I left the "relation" in the ON.
curdate() >= cs.dEffectiveStart might use an index on dEffectiveStart. Or it might use an index to help the other part.
The Optimizer would probably notice that "too much" of the table would need to be scanned with either index, and eschew both indexes and simply do a table scan.
Then it will quickly and efficiently JOIN to the other table.

Looking for the earliest history entry of a product: should I join history/product tables or store the value in the product table?

I have a MySQL database with a table for products and a table with the buying/selling history of these products. The buying and selling history of each product is basically tracked in this history table.
I am looking for the most efficient way of creating a list of these products with the earliest transaction data from the history table joined.
At the moment my SQL query selects the products with the earliest history entry like this:
SELECT p.*
     , h.transdate
     , h.sale_price
FROM products p
LEFT JOIN
     ( SELECT MIN(transdate) AS transdate
            , product_id
       FROM history
       GROUP BY product_id
     ) hist_min
  ON hist_min.product_id = p.id
LEFT JOIN history h
  ON h.product_id = hist_min.product_id
 AND h.transdate = hist_min.transdate
Since this query is used very frequently and potentially with many products I am considering storing the first sale_price directly in the 'products' table. This way I wouldn't need the 2 additional JOINS at all. But this would mean I store redundant data.
For me the most important question is, which of these possibilities is the most efficient one.
I am not sure if I am allowed to ask this additionally, but if there is an even better way I would like to know about it.
EDIT: To clarify 'efficient', I am talking about tens of thousands of products with maybe 10 history records each, of which I only pick 20 per page with a LIMIT clause. Saving the original price with the product would mean pulling the data straight with the record, while scanning the history table for the earliest date and then scanning again to join the actual row of data would certainly require more resources, even if only for the second table involved. The use of a primary key ID or an index over product_id and transdate would certainly speed up the second join, though.
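For reference, the composite index mentioned in that edit (column names taken from the query above) would be something like:

-- Lets MIN(transdate) per product_id be read straight from the index,
-- and also covers the join back to the earliest history row.
ALTER TABLE history ADD INDEX idx_product_transdate (product_id, transdate);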
What you're describing is a question of normalization. The level of normalization is not a black-and-white area, so I don't think this site is the place to get your answer, as it's primarily opinion-based.
Check out these links to get started:
Database Normalization Explained in Simple English
Wikipedia (check out the 'See also' section, it describes level of normalization)

How to correctly lock a MySQL table in a concurrent environment

I am struggling with how to guarantee the consistency of the following operation in this scenario: we are developing a reservation portal where you can sign up for courses. Every course has a number of direct reservations: when the available reservations are used up, users can still sign up, but they fall into the waiting list.
So this is the situation: for example, let's imagine that a course has 15 slots available and I make my reservation. In a simple logic, as I make my reservation the system has to decide whether I am in the direct reservation list (one of the 15 available slots) or in the waiting list. In order to do so, we must check whether the total direct reservations are less than the total available reservations, something like this in pseudocode:
INSERT new_partecipant IN table_partecipants;
(my_id) = SELECT ##IDENTITY;
(total_reservations_already_made) = SELECT COUNT(*) FROM table_partecipants WHERE flag_direct_reservation=1;
if total_reservations_already_made <= total_reservations_available then
UPDATE table_partecipants SET flag_direct_reservation=1 WHERE id=my_id
The question is: how can I manage the concurrency in order to be sure that two subscriptions are handled correctly? If I have only one reservation left and two users apply at the same time, I think it's possible that the COUNT operation gives the same result to both requests and both users end up in the direct list.
What is the correct way of locking (or similar procedure) in order to be sure that if a user starts the subscription procedure no one can finish the same task before the request has been completed?
OK, it seems we have found a solution without locking; any comments appreciated:
INSERT
INSERT INTO table1 (field1, field2, ..., fieldN, flag_direct_reservation)
SELECT #field1, #field2, ..., #fieldN, CASE WHEN (SELECT COUNT(*) FROM sometable WHERE course=#course_id) > #max_part THEN 0 ELSE 1 END
UPDATE
(only for determining the subscription status, in case of subscription deletion)
UPDATE corsi_mytable p1
INNER JOIN
(
    SELECT COUNT(*) AS actual_subscriptions
    FROM mytable
    WHERE course = #course_id
) p2
SET p1.flag_direct_reservation = CASE WHEN p2.actual_subscriptions > #max_part THEN 0 ELSE 1 END
WHERE p1.id = #first_waiting_id;
In this way the operation is performed by a single SQL statement, and the transactional engine should ensure the consistency of the operation.
Don't lock the table. Instead try and book a row, and if it fails, throw them on the waiting list. For example:
UPDATE table_partecipants SET booking_id=? WHERE booking_id IS NULL LIMIT 1
The WHERE clause here should include any other exclusion factors, like if it's the right day. If the query successfully modifies a row the booking worked. If not, it's sold out. No lock required.
The booking_id here is some unique value that can be used to associate this course with the person booking it. You can delete any records that aren't used due to over-booking.
I would advise you against locking the full table for this. Go for a row lock.
You can have a separate table that has only 1 line with your reservation counter of 15, and start by updating this counter before adding new reservations.
So you are sure the lock is managed by this separate table, which enables your reservation table to be updated by any other scenario where you don't care about the counter.
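A rough sketch of that counter-table approach (course_slots, slots_left, and the literal ids are hypothetical; table_partecipants and flag_direct_reservation are from the question):

START TRANSACTION;

-- Try to claim a direct slot; the row lock on the counter row serializes concurrent bookings.
UPDATE course_slots
SET slots_left = slots_left - 1
WHERE course_id = 42
  AND slots_left > 0;

-- 1 if a direct slot was claimed, 0 if the course is full (waiting list).
SET @got_slot := ROW_COUNT();

INSERT INTO table_partecipants (course_id, user_id, flag_direct_reservation)
VALUES (42, 7, @got_slot);

COMMIT;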

Efficiency of Query to Select Records based on Related Records in Composite Table

Setup
I am creating an event listing where users can narrow down results by several filters. Rather than having a table for each filter (e.g. event_category, event_price) I have the following database structure (to make it easy/flexible to add more filters later):
event
event_id   title   description   [etc...]
-------------------------------------------

filter
filter_id   name       slug
-----------------------------
1           Category   category
2           Price      price

filter_item
filter_item_id   filter_id   name         slug
------------------------------------------------
1                1           Music        music
2                1           Restaurant   restaurant
3                2           High         high
4                2           Low          low

event_filter_item
event_id   filter_item_id
--------------------------
1          1
1          4
2          1
2          3
Goal
I want to query the database and apply the filters that users specify. For example, if a user searches for events in 'Music' (category) priced 'Low' (price) then only one event will show (with event_id = 1).
The URL would look something like:
www.site.com/events?category=music&price=low
So I need to query the database with the filter 'slugs' I receive from the URL.
This is the query I have written to make this work:
SELECT ev.* FROM event ev
WHERE
EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'category' AND fi.slug ='music')
AND EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'price' AND fi.slug = 'low')
This query is currently hardcoded but would be dynamically generated in PHP based on what filters and slugs are present in the URL.
And the big question...
Is this a reasonable way to go about this? Does anyone see a problem with having multiple EXISTS() with sub-queries, and those subqueries performing several joins? This query is extremely quick with only a couple records in the database, but what about when there are thousands or tens of thousands?
Any guidance is really appreciated!
Best,
Chris
While EXISTS is just a form of JOIN, the MySQL query optimizer is notoriously "stupid" about executing it optimally. In your case, it will probably do a full table scan on the outer table, then execute the correlated subquery for each row, which is bound to scale badly. People often rewrite EXISTS as an explicit JOIN for that reason. Or, just use a smarter DBMS.
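For example, the two EXISTS tests from the question could be rewritten as one join chain per filter, something like this (untested sketch; DISTINCT guards against duplicate rows if an event can ever match the same filter more than once):

SELECT DISTINCT ev.*
FROM event ev
JOIN event_filter_item efi1 ON efi1.event_id = ev.event_id
JOIN filter_item fi1 ON fi1.filter_item_id = efi1.filter_item_id AND fi1.slug = 'music'
JOIN filter f1 ON f1.filter_id = fi1.filter_id AND f1.slug = 'category'
JOIN event_filter_item efi2 ON efi2.event_id = ev.event_id
JOIN filter_item fi2 ON fi2.filter_item_id = efi2.filter_item_id AND fi2.slug = 'low'
JOIN filter f2 ON f2.filter_id = fi2.filter_id AND f2.slug = 'price'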
In addition to that, consider using a composite PK for filter_item, with the FK at the leading edge - InnoDB tables are clustered and you'd want to group items belonging to the same filter physically close together.
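In DDL terms, that suggestion might look like this (keep the extra key on filter_item_id if it is AUTO_INCREMENT):

ALTER TABLE filter_item
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (filter_id, filter_item_id),
  ADD KEY (filter_item_id);   -- needed if filter_item_id is AUTO_INCREMENT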
BTW, tens of thousands is not a "large" number of rows - to truly test the scalability use tens of millions or more.

MySQL architecture: null columns vs. joins

I have an application where I'll have repeating events. So an event can repeat by day, "every n days", by week, "every n weeks on Mon/Tue/Wed/etc", and by month, "every n months on the 1st,2nd,3rd,etc".
What is the best way to handle this from a table design perspective? I can think of two ways but I'm not sure which one is better.
1) 5 columns for the above, 1 for the day case and 2 each for week and month. Whichever ones are not being used would be null. In my application I could see the nulls and choose to ignore them.
2) Have a second table, say events_dateinfo or something, against which I'd JOIN for the query.
Seems like option 2 is probably more 'normalized' and what not, but does it strike you as overkill for such a simple thing? Also, if I were to go option 2, is there a way to translate rows into columns - that is, select the 2 week attributes for a specific event and have them treated as columns?
If I understood right, an event can have more than one schedule (this is why you want "to translate rows into columns").
You will need not 2 but 3 tables in this case; the third one must be a junction table. With this scheme you can easily add new schedules if you need to in the future.
So, something like this:
table events (event_id, event_name, description)
table schedules (sch_id, schedule)
table event_schedule (event_id, sch_id)
As far as I know there is no PIVOT in MySQL, but you can use the GROUP_CONCAT() function in the SELECT; it'll be one row per event, and all the schedules for one event will be in one column.
SELECT e.event_name AS Event, GROUP_CONCAT( s.schedule SEPARATOR ', ') AS Schedule
FROM events e
(LEFT) JOIN event_schedule es
ON e.event_id = es.event_id
JOIN schedules s
ON s.sch_id = es.sch_id
GROUP BY e.event_name;
I would prefer to handle this normalized: the events in one table, and the event recurrence in another.
Handling the indexes in an appropriate way, you can serve the requests for data through views, or, if the data gets larger, through an audit table maintained with triggers.
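If you go the view route, a rough sketch building on the answer above might be (table and column names reused from the earlier answer; the view itself is hypothetical):

CREATE VIEW event_schedule_list AS
SELECT e.event_id,
       e.event_name,
       GROUP_CONCAT(s.schedule ORDER BY s.sch_id SEPARATOR ', ') AS schedules
FROM events e
LEFT JOIN event_schedule es ON es.event_id = e.event_id
LEFT JOIN schedules s ON s.sch_id = es.sch_id
GROUP BY e.event_id, e.event_name;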