Data design best practices for customer data - MySQL

I am trying to store customer attributes in a MySQL database, although it could be any type of database. I have a customer table and then a number of attribute tables (status, product, address, etc.).
The business requirements are to be able to A) look back at a point in time to see whether a customer was active, or what address they had, on any given date, and B) let a customer service rep enter things like future vacation holds. A customer might call today and tell the rep they will be on vacation next week.
I currently have different tables for each customer attribute. For instance, the customer status table has records like this:
CustomerID | Status   | dEffectiveStart | dEffectiveEnd
-----------+----------+-----------------+--------------
1          | Active   | 2022-01-01      | 2022-05-01
1          | Vacation | 2022-05-02      | 2022-05-04
1          | Active   | 2022-05-05      | 2099-01-01
When I join these tables, the SQL typically looks like this:
SELECT *
FROM customers c
JOIN customerStatus cs
on cs.CustomerID = c.CustomerID
and curdate() between cs.dEffectiveStart and cs.dEffectiveEnd
While this setup does work as designed, it is slow. The joins themselves aren't too bad, but when I add an ORDER BY, performance falls apart. A typical client query pulls 5-20k records, and there are 5-6 other tables similar to the one above that I join to a customer.
Do you have any suggestions for a better approach?

That ON clause is very hard to optimize. So, let me try to 'avoid' it.
If you are always (or usually) testing CURDATE(), then I recommend this schema design pattern. I call it History + Current.
The History table contains many rows per customer.
The Current table contains only "current" info about each customer -- one row per customer. Your SELECT would need only this table.
Your design is "proper" because the current status is not redundantly stored in two places. My design requires changing the status in both tables when it changes. This is a small extra cost when changing the "status", for a big gain in SELECT.
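A minimal sketch of the pattern, using assumed table names (customerStatusHistory / customerStatusCurrent are illustrative, not prescriptive):

-- History: many rows per customer, exactly as you have today
CREATE TABLE customerStatusHistory (
    CustomerID      INT NOT NULL,
    Status          VARCHAR(20) NOT NULL,
    dEffectiveStart DATE NOT NULL,
    dEffectiveEnd   DATE NOT NULL,
    PRIMARY KEY (CustomerID, dEffectiveStart)
);

-- Current: exactly one row per customer; no date range needed
CREATE TABLE customerStatusCurrent (
    CustomerID INT NOT NULL PRIMARY KEY,
    Status     VARCHAR(20) NOT NULL
);

-- The everyday SELECT avoids the range test entirely:
SELECT *
FROM customers c
JOIN customerStatusCurrent cs ON cs.CustomerID = c.CustomerID;

When a status changes, do all three steps in one transaction: close the open History row (set its dEffectiveEnd), insert the new History row, and update the Current row.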
More
The Optimizer will probably transform that query into
SELECT *
FROM customerStatus cs
JOIN customers c
ON cs.CustomerID = c.CustomerID
WHERE curdate() >= cs.dEffectiveStart
AND curdate() <= cs.dEffectiveEnd
(Use EXPLAIN SELECT ...; SHOW WARNINGS; to find out exactly.)
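For example, against the original query (the exact output varies by MySQL version):

EXPLAIN
SELECT *
FROM customers c
JOIN customerStatus cs
  ON cs.CustomerID = c.CustomerID
 AND CURDATE() BETWEEN cs.dEffectiveStart AND cs.dEffectiveEnd;
SHOW WARNINGS;  -- prints the rewritten statement the Optimizer will actually run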
In a plain JOIN, the Optimizer likes to start with the table that is most filtered. I moved the "filtering" to the WHERE clause so we could see it; I left the "relation" in the ON.
curdate() >= cs.dEffectiveStart might use an index on dEffectiveStart. Or it might use an index to help the other part.
The Optimizer would probably notice that "too much" of the table would need to be scanned with either index, and eschew both indexes and simply do a table scan.
Then it will quickly and efficiently JOIN to the other table.


MySQL - When shouldn't I Join tables? Combinatorial Explosion of values

I am working on a database called classicmodels, which I found at: https://www.mysqltutorial.org/mysql-sample-database.aspx/
I realized that when I executed an Inner Join between the 'payments' and 'orders' tables, a 'cartesian explosion' occurred. I understand that these two tables are not meant to be joined. However, I would like to know if it is possible to identify this just by looking at the relational schema, or if I should check the tables one by one.
For instance, the customer number '141' appears 26 times in the 'orders table', which I found by using the following code:
SELECT
customerNumber,
COUNT(customerNumber)
FROM
orders
WHERE customerNumber=141
GROUP BY customerNumber;
And the same customer number (141) appears 13 times in the payments table:
SELECT
customerNumber,
COUNT(customerNumber)
FROM
payments
WHERE customerNumber=141
GROUP BY customerNumber;
Finally, I executed an Inner Join between 'payments' and 'orders' tables, and selected only the rows with customer number '141'. MySQL returned 338 rows, which is the result of 26*13. So, my query is multiplying the number of times this 'customer n°' appears in 'orders' table by the number of times it appears in 'payments'.
SELECT
o.customernumber,
py.amount
FROM
customers c
JOIN
orders o ON c.customerNumber=o.customerNumber
JOIN
payments py ON c.customerNumber=py.customerNumber
WHERE o.customernumber=141;
My question is the following:
1 ) Is there a way to look at the relational schema and identify whether a Join can be executed (without generating a combinatorial explosion)? Or should I check table by table to understand what the relationships between them are?
Important Note: I realized that there are two asterisks in the payments table's representation in the relational schema below. Maybe this means that this table has a composite primary key (customerNumber+checkNumber). The problem is that 'checkNumber' does not appear in any other table.
This is the database's relational schema provided by the 'MySQL Tutorial' website:
Thank you for your attention!
This is called "combinatorial explosion" and it happens when rows in one table each join to multiple rows in other tables.
(It's not "overestimation" or any sort of estimation. It's counting data items multiple times when it should only count them once.)
It's a notorious pitfall of summarizing data in one-to-many relationships. In your example each customer may have no orders, one order, or more than one. Independently, they may have no payments, one, or many.
The trick is this: use subqueries so your toplevel query avoids joining multiple one-to-many relationships in series, which is what's happening in the query you showed us.
You can use this subquery to get a resultset with just one row per customer. (Try it.)
SELECT customernumber,
SUM(amount) amount
FROM payments
GROUP BY customernumber
Likewise, you can get the value of all orders for each customer with this:
SELECT o.customernumber,
       SUM(od.quantityOrdered * od.priceEach) amount
FROM orders o
JOIN orderdetails od ON o.orderNumber = od.orderNumber
GROUP BY o.customernumber
This JOIN won't explode in your face: each order can have multiple details, but every detail row belongs to exactly one order, so it's a strict hierarchical rollup.
Now, we can use these subqueries in the main query.
SELECT c.customernumber, p.payments, o.orders
FROM customers c
LEFT JOIN (
       SELECT o.customernumber,
              SUM(od.quantityOrdered * od.priceEach) orders
       FROM orders o
       JOIN orderdetails od ON o.orderNumber = od.orderNumber
       GROUP BY o.customernumber
     ) o ON c.customernumber = o.customernumber
LEFT JOIN (
       SELECT customernumber,
              SUM(amount) payments
       FROM payments
       GROUP BY customernumber
     ) p ON c.customernumber = p.customernumber
Takehome tricks:
A subquery IS a table (a virtual table) that can be used wherever you might mention a table or a view.
The GROUP BY stuff in this query happens separately in two subqueries, so no combinatorial explosions.
All three participants in the toplevel JOIN have either one or zero rows per customernumber.
The LEFT JOINs are there so we can still see customers with (importantly for a business) no orders or no payments. With the ordinary inner JOIN, rows have to match both sides of the ON conditions or they're omitted from the resultset.
Pro tip: Format your SQL queries fanatically carefully. SQL is really verbose (Adm. Grace Hopper would be proud), so queries get quite long and nested, putting the Structured in Structured Query Language. If you, or anybody else, is going to reason about them in the future, the structure must be easy to grasp.
Pro tip 2: The data engineer who designed this database did a really good job of thinking it through and documenting it. Aspire to that level of quality. (It's rarely reached in the real world.)
In this particular case, your approach should depend on the accounting style the database supports, and this does not appear to be "open item" style accounting, i.e. where an order raised for 1000 must have a payment of 1000 against it. That may sound odd, because open item ordering is the most familiar consumer experience: on Amazon you buy a 500 dollar TV and a 500 dollar games console, the order is a thousand dollars, you pay it, and the payment goes against the order.
However, you're also familiar with "balance forward" accounting if you paid for that order with your credit card. You make similar purchases every day for a month, then you get a statement from your bank saying you owe 31000, and you pay a lump of money that doesn't even have to be 31k. You aren't expected to make 31 payments of 1000 to your bank at the end of the month. Your bank allocates the lump to the oldest items on the account (if they're nice, or the newest items if they're not) and may eventually charge you interest on unpaid transactions.
1 ) Is there a way to look at the relational schema and identify if a Join can be executed
Yes, you can tell by looking at the schema: a customer has many orders, and a customer makes many payments, but there is no relation between the order and payment tables at all, so we can see there is no attempt to directly attach a payment to an order. Customer is a parent table of both payment and order, and therefore enjoys a relationship with each of them, but they do not relate to each other. If you had Person, Car and Address tables, a person has many addresses during their life, and many cars, but that doesn't mean there is a relationship between cars and addresses.
In such a case it simply doesn't make sense to join payments to customers to orders because they do not relate that way. If you want to make such a join and not suffer a Cartesian explosion then you absolutely have to sum one side or the other (or both) to ensure that your joins are 1:1 and 1:M (or 1:1 and 1:1). You cannot arrange a join that is a pair of 1:M.
Going back to the car/person/address example: to make any meaningful joins, you have to build more information into the question and arrange the join to create the answer. Perhaps the question is "what cars did they own while they lived at..." - this flattens the Person:Address relationship to 1:1 but leaves Person:Car as 1:M, so they might have owned many cars during their time in that house. "What was the newest car they owned while living at..." might be 1:1 on both sides if there is a clear winner for "newest" (though if they bought two cars manufactured at identical times...).
Which side you sum in your orders case will depend on what you want to know, but I'd say you usually want to know "which orders haven't been paid for": that means summing all payments, computing a rolling sum of all orders, then looking at the point where the rolling sum exceeds the sum of payments. Those are the unpaid orders.
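For illustration, here's a sketch of that rolling-sum approach for a single customer. It assumes MySQL 8+ for the window function, and reuses customer 141 and the column names from the question:

SELECT t.orderNumber,
       t.orderDate,
       t.order_total,
       SUM(t.order_total) OVER (ORDER BY t.orderDate, t.orderNumber) AS running_orders,
       p.total_paid
FROM (
       -- value of each order
       SELECT o.orderNumber, o.orderDate,
              SUM(od.quantityOrdered * od.priceEach) AS order_total
       FROM orders o
       JOIN orderdetails od ON od.orderNumber = o.orderNumber
       WHERE o.customerNumber = 141
       GROUP BY o.orderNumber, o.orderDate
     ) t
CROSS JOIN (
       -- total paid by that customer
       SELECT SUM(amount) AS total_paid
       FROM payments
       WHERE customerNumber = 141
     ) p
ORDER BY t.orderDate, t.orderNumber;

Rows where running_orders exceeds total_paid are the (at least partly) unpaid orders.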
Take a look again at your database graph (the one that was present in the first iteration of your question). See how the lines between tables have 3 angled legs on one end? That's the "many" end. You can start at any table in the graph and join to other tables by walking along the relationships. If you're going from the many end to the one end, and assuming you've picked out a single row in the start table (a single order), you can always walk to any other table in the many->one direction without increasing your row count. If you walk the other way, you potentially increase your row count. If you split and walk two ways that both increase row count, you get a Cartesian explosion. Of course, you don't have to join only along relation lines, but that's out of scope for this question.
PS: this is easier to see on the DB diagram than on the ERD in the question, because the database purely concerns itself with the columns that are foreign keyed. The ERD says a customer has zero or one payments with a particular check number, but the database is only concerned with "the customer ID appears once in the customer table and multiple times in the payment table", because only part of the compound primary key of payment is keyed to the customer table. In other words, the ERD is concerned with business logic relations too, whereas the DB diagram is purely about how tables relate, and the two aren't necessarily aligned. For this reason the DB diagrams are probably easier to read when walking around for join strategies.
After seeing the answers of Caius Jard and O.Jones (please check their replies), which kindly helped me clarify this doubt, I decided to create a table identifying which customers paid for all the orders they made and which did not. This gives a pertinent reason to join the 'orders', 'orderdetails', 'payments' and 'customers' tables, because some orders may have been cancelled or may still be 'On Hold', as we can see in their corresponding 'status' in the 'orders' table. It also lets us execute this join without generating a 'combinatorial explosion'.
I did this by using a CASE statement, which registers whether py.amount and amount_in_orders match, don't match, or are both NULL (customers who made no orders or payments):
SELECT
c.customerNumber,
py.amount,
amount_in_orders,
CASE
WHEN py.amount=amount_in_orders THEN 'Match'
WHEN py.amount IS NULL AND amount_in_orders IS NULL THEN 'NULL'
ELSE 'Don''t Match'
END AS Match
FROM
customers c
LEFT JOIN(
SELECT
o.customerNumber, SUM(od.quantityOrdered*od.priceEach) AS amount_in_orders
FROM
orders o
JOIN orderdetails od ON o.orderNumber=od.orderNumber
GROUP BY o.customerNumber
) o ON c.customerNumber=o.customerNumber
LEFT JOIN(
SELECT customernumber, SUM(amount) AS amount
FROM payments
GROUP BY customerNumber
) py ON c.customerNumber=py.customerNumber
ORDER BY py.amount DESC;
The query returned 122 rows. The images below are fractions of the generated output, so you can visualize what happened:
For instance, we can see that the customers identified by the numbers '141', '124', '119' and '496' did not pay for all the orders they made. Maybe some of those orders were cancelled, or maybe the customers simply have not paid for them yet.
And this image shows some of the columns (not all of them) that are NULL:

Looking for the earliest history entry of a product, should I join the history/product tables or store the value in the product table?

I have a MySQL database with a table for products and a table with the buying/selling history of these products. The buying and selling history of each product is basically tracked in this history table.
I am looking for the most efficient way of creating a list of these products with the earliest transaction data from the history table joined.
At the moment my SQL query selects the products with the earliest history entry like this:
SELECT p.*
     , h.transdate
     , h.sale_price
FROM products p
LEFT JOIN
     ( SELECT MIN(transdate) transdate
            , product_id
       FROM history
       GROUP BY product_id
     ) hist_min
  ON hist_min.product_id = p.id
LEFT JOIN history h
  ON h.product_id = hist_min.product_id
 AND h.transdate = hist_min.transdate
Since this query is used very frequently, and potentially with many products, I am considering storing the first sale_price directly in the 'products' table. This way I wouldn't need the 2 additional JOINs at all. But it would mean storing redundant data.
For me the most important question is which of these possibilities is the most efficient.
I am not sure if I am allowed to ask this additionally, but if there is an even better way I would like to know about it.
EDIT: To clarify 'efficient': I am talking about tens of thousands of products with maybe 10 history records each, from which I only pick 20 per page with a LIMIT statement. Saving the original price with the product would pull the data straight in with the record, while scanning the history table for the earliest date and then another scan to join the actual row would certainly require more resources, even if only for the second table involved. The use of a primary key ID or an index over product_id and transdate would certainly speed up the second join, though.
What you're weighing is normalization versus denormalization, and where to draw that line is not a black and white area, so I don't think this site is the place to get your answer, as it's primarily opinion based.
Check out these links to get started:
Database Normalization Explained in Simple English
Wikipedia (check out the 'See also' section, it describes level of normalization)
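That said, if you do stay normalized, two things are worth trying before storing the price redundantly. Both are sketches: the index is a plain composite index, and the second query assumes MySQL 8+ for ROW_NUMBER():

-- Lets both the MIN(transdate) scan and the second join run from the index
ALTER TABLE history ADD INDEX idx_product_transdate (product_id, transdate);

-- Collapses the two joins into one derived table (MySQL 8+)
SELECT p.*, h.transdate, h.sale_price
FROM products p
LEFT JOIN (
       SELECT product_id, transdate, sale_price,
              ROW_NUMBER() OVER (PARTITION BY product_id
                                 ORDER BY transdate) AS rn
       FROM history
     ) h ON h.product_id = p.id AND h.rn = 1;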

To create MySQL tables for each specific user, or generalize the tables?

I'm running into all kinds of thought problems while planning my database:
Outline:
The database is a patient database with a large number of patients.
Each patient has tons of data, e.g. blood pressure values on different dates.
Questions:
Would it be easier to create tables per patient, e.g. "bob_builder_BPvalues", or to create one table for the BP values, e.g. "BP_values", and have all the patients' values in there, linked via foreign keys?
As I have so much data per patient, it does not seem to make sense to mix the blood pressure values of every patient into one single table, as this would look very messy to a human. Which approach would be faster in terms of processing and sorting through the data?
Let's say you have 10 patients:
With your first approach, you'd end up with 10 different tables always containing the same type of data.
For each query on a single patient, you would have to build a dynamic query joining to the right table:
SELECT ...
FROM patients
INNER JOIN bobby_measures ON ... -- this has to be crafted dynamically each time
WHERE patients.name = 'bobby'
And what if you want to compute some stats on some kind of data over a range of dates for all patients? Querying this becomes a nightmare, even with 10 patients. So guess what happens when you have 1000...
On the other hand, your second choice makes (arguably) human reading of the database more difficult. But being read by a human is not one of the objectives of databases.
With a single patientData table (or as many tables you want, one per datatype if needed, bloodPressure and stuff), everything becomes simpler. You can query any patient using the same query, changing only the patient id, you can make all the queries you want for a range of dates, filtering on some datatype, or whatever.
SELECT ...
FROM patients
INNER JOIN patientData ON ...
WHERE patients.name in ('bobby', 'joe'...)
AND patientData.type = 'blood pressure'
AND patientData.date BETWEEN ... AND ...
-- and so on
Using the right indices on the patientData table(s) and an appropriate presentation layer, all this data becomes totally readable by an average user.
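As a sketch, such a table might look like this (names, types and the generic value column are assumptions for illustration):

CREATE TABLE patientData (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    patient_id INT NOT NULL,
    type       VARCHAR(30) NOT NULL,   -- e.g. 'blood pressure'
    value      VARCHAR(50) NOT NULL,   -- or one typed table per datatype
    date       DATE NOT NULL,
    FOREIGN KEY (patient_id) REFERENCES patients(id),
    INDEX idx_patient_type_date (patient_id, type, date)
);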
Have a single table for all patients. This can then link to a BloodPressure table using a foreign key. The relationship between ...
Patient 1----* BloodPressureResults
So a single patient can have many blood pressure results.
You would then be able to view the blood pressure results for a specific patient by using a simple query...
SELECT * FROM BloodPressureResults
WHERE Patient_Id = '1'
This would then return you all of the blood pressure results for the patient with an Id of 1.
You would then also be able to add other tables, like WeightResults or BloodTestResults, in the same way as the BloodPressureResults table.
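As an illustration, the BloodPressureResults side of that 1----* relationship could be sketched like this (the measurement columns are assumptions):

CREATE TABLE BloodPressureResults (
    Id         INT AUTO_INCREMENT PRIMARY KEY,
    Patient_Id INT NOT NULL,
    Systolic   INT,
    Diastolic  INT,
    MeasuredOn DATE,
    FOREIGN KEY (Patient_Id) REFERENCES Patients(Id)
);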

Efficiency of Query to Select Records based on Related Records in Composite Table

Setup
I am creating an event listing where users can narrow down results by several filters. Rather than having a table for each filter (i.e. event_category, event_price) I have the following database structure (to make it easy/flexible to add more filters later):
event
event_id title description [etc...]
-------------------------------------------
filter
filter_id name slug
-----------------------------
1 Category category
2 Price price
filter_item
filter_item_id filter_id name slug
------------------------------------------------
1 1 Music music
2 1 Restaurant restaurant
3 2 High high
4 2 Low low
event_filter_item
event_id filter_item_id
--------------------------
1 1
1 4
2 1
2 3
Goal
I want to query the database and apply the filters that users specify. For example, if a user searches for events in 'Music' (category) priced 'Low' (price) then only one event will show (with event_id = 1).
The URL would look something like:
www.site.com/events?category=music&price=low
So I need to query the database with the filter 'slugs' I receive from the URL.
This is the query I have written to make this work:
SELECT ev.* FROM event ev
WHERE
EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'category' AND fi.slug ='music')
AND EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'price' AND fi.slug = 'low')
This query is currently hardcoded but would be dynamically generated in PHP based on what filters and slugs are present in the URL.
And the big question...
Is this a reasonable way to go about this? Does anyone see a problem with having multiple EXISTS() with sub-queries, and those subqueries performing several joins? This query is extremely quick with only a couple records in the database, but what about when there are thousands or tens of thousands?
Any guidance is really appreciated!
While EXISTS is just a form of JOIN, the MySQL query optimizer is notoriously "stupid" about executing it optimally. In your case, it will probably do a full table scan on the outer table, then execute the correlated subquery for each row, which is bound to scale badly. People often rewrite EXISTS as an explicit JOIN for that reason. Or just use a smarter DBMS.
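For the two filters in the question, that rewrite would look roughly like this (a sketch; add DISTINCT if the same filter item can be attached to an event twice):

SELECT ev.*
FROM event ev
JOIN event_filter_item efi1 ON efi1.event_id = ev.event_id
JOIN filter_item fi1 ON fi1.filter_item_id = efi1.filter_item_id
JOIN filter f1 ON f1.filter_id = fi1.filter_id
JOIN event_filter_item efi2 ON efi2.event_id = ev.event_id
JOIN filter_item fi2 ON fi2.filter_item_id = efi2.filter_item_id
JOIN filter f2 ON f2.filter_id = fi2.filter_id
WHERE f1.slug = 'category' AND fi1.slug = 'music'
  AND f2.slug = 'price' AND fi2.slug = 'low';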
In addition to that, consider using a composite PK for filter_item, with the FK at the leading edge - InnoDB tables are clustered, and you'd want items belonging to the same filter grouped physically close together.
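Something like this sketch; the extra UNIQUE key keeps the AUTO_INCREMENT column valid in InnoDB:

ALTER TABLE filter_item
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (filter_id, filter_item_id),
    ADD UNIQUE KEY (filter_item_id);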
BTW, tens of thousands is not a "large" number of rows - to truly test the scalability use tens of millions or more.

What is the best way to count rows in a complex MySQL table

I have a table with the following fields (for example):
id, reference, customerId.
Now, I often want to log an enquiry for a customer. BUT, in some cases, I need to filter the enquiries based on the customer's country, which is in the customer table:
id, Name, Country... for example.
At the moment, my application shows 15 enquiries per page. I am SELECTing all enquiries and, for each one, checking the country field in customerTable (via the customerId) to filter by country. I also count the enquiries this way, to find the total number and be able to display the paging (Page 1 of 4).
As the database is growing, I am starting to notice a bit of lag, and I think my methodology is a bit flawed!
My first guess at how this should be done is to add the country to the enquiryTable. Problem solved, but does anyone else have a suggestion for how this might be done? I don't like the idea of having to update each enquiry every time the country of a contact changes.
Thanks in advance!
It looks to me like this data should be spread over 3 tables:
customers
enquiries
countries
Then by using joins you can bring out the customer and country data and filter by either. Something like this:
SELECT
enquiries.enquiryid,
enquiries.enquiredetails,
customers.customerid,
customers.reference,
customers.countryid,
countries.name AS countryname
FROM
enquiries
INNER JOIN customers ON enquiries.customerid = customers.customerid
INNER JOIN countries ON customers.countryid = countries.countryid
WHERE countries.name='United Kingdom'
You should definitely be touching the database only once to do this.
Depending on how you are accessing your data, you may be able to get a row count without issuing a second COUNT(*) query. You haven't mentioned what programming language or data access strategy you use, so it's difficult to be more specific about the count. If you have no easy way of determining the row count from within the data access layer of your code, you could use a stored procedure with an output parameter to return the row count without making two round trips to the database. It all depends on your architecture, your data access strategy, and how close you are to your database.
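On MySQL 8+, one option is a window function, so the total comes back with the page itself (a sketch based on the query above):

SELECT
    enquiries.enquiryid,
    enquiries.enquiredetails,
    COUNT(*) OVER () AS total_rows   -- total matching rows, repeated on every row
FROM enquiries
INNER JOIN customers ON enquiries.customerid = customers.customerid
INNER JOIN countries ON customers.countryid = countries.countryid
WHERE countries.name = 'United Kingdom'
ORDER BY enquiries.enquiryid
LIMIT 15 OFFSET 0;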