Databases design - one link table or multiple link tables? - mysql

I'm working on a front end for a database where each table essentially has a many to many relationship with all other tables.
I'm not a DB admin; I've only taken a few basic DB courses. The typical solution in this case, as I understand it, would be multiple link tables, one joining each pair of 'real' tables. Here's what I'm proposing instead: one link table with foreign key dependencies on the PKs of all the other tables.
Is there any reason this could turn out badly in terms of scalability, flexibility, etc down the road?

So you're trying to decide whether to take a star pattern or an asterisk pattern?
I'd certainly advocate the asterisk. Just as code generally has a driver method, there should be a driver table if the schema is as you describe. Look at the total number of tables you'll need for each number of "main" tables:
Main   Junct   Total
--------------------
  2      1       3
  3      3       6
  4      6      10
  5     10      15
  6     15      21
  7     21      28!
7 is probably the most you would have in a database schema.
In addition, this way you can run complex queries involving 3 main tables without having to go through 3 junction tables; you touch only one junction table no matter how many main tables are involved.
Scalability? No.
Flexibility? Only if your schema changes dramatically.
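The junction-table counts in the table above are just "n choose 2"; a quick illustrative check (Python):

```python
from math import comb

# The number of pairwise junction tables for n "main" tables is C(n, 2),
# so the total table count is n + C(n, 2), matching the table above.
def table_counts(n_main):
    junctions = comb(n_main, 2)
    return n_main, junctions, n_main + junctions

for n in range(2, 8):
    print(table_counts(n))
```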

If I understand your proposal correctly, what you are thinking of doing is a minor variation on the theme of the 'One True Lookup Table' (OTLT), which is not a good idea. In this case, perhaps, OTLT stands for 'One True Linking Table'.
The problems come when you need to maintain the referential integrity of the OTLT. For starters, what is its schema?
ReferencingTable INTEGER (or VARCHAR(xx)?)
ReferencingId INTEGER
ReferencedTable INTEGER (or VARCHAR(xx)?)
ReferencedId INTEGER
The table IDs have to be watched. They can be copies of the values in the system catalog, but then you have to worry about what happens when you rebuild one of the tables (typically, the table ID changes). Or they can be separately controlled values - a parallel set of tables.
Next, you have to worry about the asymmetry in the naming of the columns in what should be a symmetric setup; the OTLT connects Table1 to Table2 just as much as it does Table2 to Table1 -- unless, indeed, your relationships are asymmetric. That just complicates life enormously.
Now, suppose you need to join primary tables Table1 to Table2 and Table2 to Table3, each via the OTLT, and that the table IDs are 1, 2, and 3, and that the 'ReferencingTable' is always the smaller of the two in the OTLT:
SELECT T1.*, T2.*, T3.*
FROM Table1 AS T1
JOIN OTLT AS O1 ON T1.Id = O1.ReferencingId AND O1.ReferencingTable = 1
JOIN Table2 AS T2 ON T2.Id = O1.ReferencedId AND O1.ReferencedTable = 2
JOIN OTLT AS O2 ON T2.Id = O2.ReferencingId AND O2.ReferencingTable = 2
JOIN Table3 AS T3 ON T3.Id = O2.ReferencedId AND O2.ReferencedTable = 3
So, here you have two independent sets of joins via the OTLT.
The alternative formulation uses separate joining tables for each pair. The rows in these joining tables are smaller:
ReferencingID INTEGER
ReferencedID INTEGER
And, assuming that the joining tables are named Join_T1_T2, etc, the query above becomes:
SELECT T1.*, T2.*, T3.*
FROM Table1 AS T1
JOIN Join_T1_T2 AS J1 ON T1.Id = J1.ReferencingId
JOIN Table2 AS T2 ON T2.Id = J1.ReferencedId
JOIN Join_T2_T3 AS J2 ON T2.Id = J2.ReferencingId
JOIN Table3 AS T3 ON T3.Id = J2.ReferencedId
There are just as many references to tables (5) as before, but the DBMS can automatically maintain the referential integrity on these joining tables - whereas the maintenance has to be written by hand with the OTLT. The joins are simpler (no AND clauses).
In my view, this weighs strongly against the OTLT system and in favour of specialized linking tables for each significant pairing of the primary tables.
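To make the referential-integrity point concrete, here is a minimal sketch (Python with SQLite standing in for the DBMS; table and column names follow the answer above): the per-pair joining table gets its integrity enforced declaratively, something the OTLT cannot have without hand-written checks or triggers.

```python
import sqlite3

# Per-pair joining table with declarative referential integrity.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Table1 (Id INTEGER PRIMARY KEY);
CREATE TABLE Table2 (Id INTEGER PRIMARY KEY);
CREATE TABLE Join_T1_T2 (
    ReferencingId INTEGER NOT NULL REFERENCES Table1(Id),
    ReferencedId  INTEGER NOT NULL REFERENCES Table2(Id),
    PRIMARY KEY (ReferencingId, ReferencedId)
);
""")
conn.execute("INSERT INTO Table1 VALUES (1)")
conn.execute("INSERT INTO Table2 VALUES (10)")
conn.execute("INSERT INTO Join_T1_T2 VALUES (1, 10)")  # valid link: accepted
try:
    conn.execute("INSERT INTO Join_T1_T2 VALUES (99, 10)")  # dangling link
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the DBMS rejects it; no hand-written maintenance needed
print("dangling link rejected:", rejected)
```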

My main problem with a single link table is what happens if a 'link' suddenly turns into an entity. For example, you may have 'shopper' and 'store' entities. These can be many-to-many, as a shopper can go to many stores and a store will have many shoppers.
Next month you decide you want to record how much a shopper spends in a store. Suddenly you either have to add a 'purchase' amount to your generic link table, or rebuild a big chunk of your application to use a specific link table for that link instead of the generic one.
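A minimal sketch of that scenario (Python with SQLite; all names are illustrative): with a dedicated link table, the new attribute is a one-column migration rather than a change to a shared generic table.

```python
import sqlite3

# A dedicated shopper<->store link table: when the link becomes an
# entity of its own, adding the new attribute is a one-column ALTER.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shopper (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE store   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE shopper_store (
    shopper_id INTEGER REFERENCES shopper(id),
    store_id   INTEGER REFERENCES store(id)
);
""")
# Next month: the link grows data of its own.
conn.execute("ALTER TABLE shopper_store ADD COLUMN purchase_amount NUMERIC")
cols = [r[1] for r in conn.execute("PRAGMA table_info(shopper_store)")]
print(cols)
```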

You have two options with this setup.
1. Ensure each row links to only one table. This is a degenerate form of having individual join tables.
2. Ensure you have all the link combinations, which hugely inflates the size of the table. Given a row in the primary table which joins to 4, 5, and 6 records in each of three other tables, you need 4 * 5 * 6 = 120 rows in your join table. You also need logic to handle missing joins to a table, and if you need to join to only the first table, you have to filter the 120 rows you get back down to 4.
There are cases where you will have multiple table relationships, but these will be driven by the design. Relationships frequently carry information such as start and end dates. These are problematic for the one true lookup table, as you will need to carry columns for each possible relationship.
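The row-explosion arithmetic in option 2 is easy to check (Python, illustrative):

```python
from itertools import product

# One primary-table row joining to 4, 5 and 6 rows in three other tables:
# a single combined link table must hold every combination.
a, b, c = range(4), range(5), range(6)
combined_rows = list(product(a, b, c))
print(len(combined_rows))  # combinations for ONE primary record
# Three separate join tables need only 4 + 5 + 6 = 15 rows instead.
```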

Related

A large database or several smaller?

After reading a lot about it and similar questions, I am still not clear about the following case.
I have a schema like this in one MySQL database, where I store the probabilities of matches for more than 10 sports, depending on the type of result
(it is intended for an application that shows the odds for each sport on different pages, but I will never mix sports on the same page):
Design 1: a single database
SPORT
id
name
TEAM
id
sportId
name
birth
MATCHES
id
sportId
teamId_1
teamId_2
result
date
PROBABILITIES
id
matchId
type
percentage
(table Probabilities is very long, almost a billion rows, and will grow over time)
All necessary fields are correctly indexed. Then, to see all the probabilities of the matches that have no result yet for football (sport id = 1), I would run the following query:
SELECT s.name, t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM matches m
INNER JOIN team t1 ON t1.id = m.teamId_1
INNER JOIN team t2 ON t2.id = m.teamId_2
INNER JOIN sport s ON s.id = m.sportId
INNER JOIN probabilities p ON p.matchId = m.id
WHERE result IS NULL
AND s.id = 1
This database design is great because it allows me to work comfortably with an ORM like Prisma. But for my team the most important thing is speed and performance.
Knowing this, is it a good idea to do it this way or would it be better to separate the tables into several databases?
Design 2: one database per sport
Database Football
TEAM
id
sportId
name
birth
MATCHES
id
teamId_1
teamId_2
date
PROBABILITIES
id
matchId
type
percentage
Database Basketball
TEAM
id
sportId
name
birth
MATCHES
id
teamId_1
teamId_2
date
PROBABILITIES
id
matchId
type
percentage
The probabilities table is much smaller, in some sports only thousands of rows.
So if, for example, I only need to take the football probabilities I make a query like this:
SELECT t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM football.matches m
INNER JOIN football.team t1 ON t1.id = m.teamId_1
INNER JOIN football.team t2 ON t2.id = m.teamId_2
INNER JOIN football.probabilities p ON p.matchId = m.id
WHERE result IS NULL
Or is there some other way to improve the speed and performance of the database, such as partitioning the probabilities table, given that we only query the most recent rows?
If you make one database per sport you are locking the application into that decision. If you put them all together in one you can separate them later if necessary. I doubt it will be.
But for my team the most important thing is speed and performance.
At this early stage the most important thing is getting something working so you can use it and discover what it actually needs to do. Then adapt the schema as you learn.
Your major performance problems won't come from whether you have one database or many, but from more pedestrian issues: indexing, bad queries, and schema design.
To that end...
Keep the schema simple
Keep the schema flexible
Consider a data warehouse
To the first, that means one database. Don't add the complication of maintaining multiple copies of the schema if you don't need to.
To the second, use schema migrations and keep the details of the schema out of the application code. An ORM is a good start, but also employ the Repository Pattern, Decorator Pattern, Service Pattern, and others to keep details of your tables from leaking into your code. Then, when it inevitably comes time to change your schema, you can do so without having to rewrite all the code that uses it.
Your concerns can be solved with indexing and partitioning, probably partition probabilities, but without knowing your queries I can't say on what. For example, you might want to partition by the age of the match since newer matches are more interesting than old ones. It's hard to say. Fortunately partitioning can be added later.
The rest of the tables should be relatively small, partitioning by team isn't likely to help. Poor partitioning choices can even slow things down.
Finally, what might be best for performance is to separate the statistical tables into a data warehouse optimized for big data and statistics. Do the stats there and have the application query them. This separates the runtime schema, which must have low latency and benefits from being kept small, from the statistical database, which mostly reports on pre-calculated statistical queries.
Some notes on your schema.
Remove "sport" from the matches. It's redundant. Get it from the teams. Add a constraint to ensure both teams are playing the same sport.
Don't name a column date. First, it's a keyword. Second, date of what? What if there's another date associated with the match? Third, what about the time of the match? Make it specific: scheduled_at. Use a timestamp type.
Result should be its own table. You're going to want to store a lot of information about the result of the match.
In MySQL, a "DATABASE" is a very lightweight thing. It makes virtually no difference to MySQL and queries as to whether you have one db or two. Or 20.
You might need a little bit of syntax to handle JOINs:
One db:
USE db;
SELECT a.x, b.y
FROM a
JOIN b ON ...;
Two dbs:
USE db;
SELECT a.x, b.y
FROM db1.a AS a
JOIN db2.b AS b ON ...;
The performance of those two is the same.
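This can be sketched with SQLite's ATTACH as a stand-in for MySQL's db.table qualification (the principle is the same: qualify the table name and the optimizer treats it like any local table):

```python
import sqlite3

# Cross-database JOIN, sketched in SQLite: ATTACH plays the role of
# MySQL's second database, and db2.b is qualified just like the
# "Two dbs" example above.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS db2")
conn.executescript("""
CREATE TABLE a (x INTEGER, k INTEGER);
CREATE TABLE db2.b (y INTEGER, k INTEGER);
INSERT INTO a VALUES (1, 100);
INSERT INTO db2.b VALUES (2, 100);
""")
rows = conn.execute(
    "SELECT a.x, b.y FROM a JOIN db2.b AS b ON a.k = b.k"
).fetchall()
print(rows)
```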
Bottom Line: Do what feels good to you, the developer.

SQL most efficient way to check if rows from one table are also present in another

I have two DB tables, each containing email addresses:
One is MSSQL with 1,500,000,000 entries
One is MySQL with 70,000,000 entries
I now want to check how many identical email addresses are present in both tables.
i.e. the same address is present in both tables.
Which approach would be the fastest:
1. Download both datasets as csv, load it into memory and compare in program code
2. Use the DB queries to get the overlapping resultset.
if 2 is better: What would be a suggested SQL query?
I would go with a DBQuery. Set up a linked server connection between the two DBs (probably on the MSSQL side), and use a simple inner join query to produce the list of e-mails that occur in both tables:
select a.emailAddress
from MSDBServ.DB.dbo.Table1 a
join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
Finding the set difference, that's going to take more processor power (and it's going to produce at least 1.4b results in the best-case scenario of every MySql row matching an MSSQL row), but the query isn't actually that much different. You still want a join, but now you want that join to return all records from both tables whether they could be joined or not, and then you specifically want the results that aren't joined (in which case one side's field will be null):
select a.EmailAddress, b.EmailAddress
from MSDBServ.DB.dbo.Table1 a
full join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
where a.EmailAddress IS NULL OR b.EmailAddress IS NULL
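For a sense of the shapes involved, here is the overlap query scaled down to a toy example (Python with SQLite, both tables in one database; the linked-server setup above does the same thing across servers):

```python
import sqlite3

# Counting the e-mail addresses present in BOTH tables via a plain
# inner join, as in the answer above (toy data, both tables local).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (email TEXT PRIMARY KEY);
CREATE TABLE t2 (email TEXT PRIMARY KEY);
INSERT INTO t1 VALUES ('a@x.com'), ('b@x.com'), ('c@x.com');
INSERT INTO t2 VALUES ('b@x.com'), ('c@x.com'), ('d@x.com');
""")
overlap = conn.execute(
    "SELECT COUNT(*) FROM t1 JOIN t2 ON t1.email = t2.email"
).fetchone()[0]
print(overlap)  # addresses that occur in both tables
```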
You could run a SQL query to check how many identical email addresses are present in the two tables: each result row gives an email address and how many times it matches.
SELECT A.emailAddr, COUNT(*)
FROM table1 A
INNER JOIN table2 B ON A.emailAddr = B.emailAddr
GROUP BY A.emailAddr
Table1 has the 70,000,000 email addresses; table2 has the 1,500,000,000. I use Oracle, so the UPPER function may or may not have an equivalent in MySQL.
Select EmailAddress from table1 where Upper(emailaddress) in (select Upper(emailaddress) from table2)
Quicker than comparing spreadsheets and this assumes both tables are in the same database.

What actually happens during table JOINs?

I'm trying to see if my understanding of JOINs is correct.
For the following query:
SELECT * FROM tableA
join tableB on tableA.someId = tableB.someId
join tableC on tableA.someId = tableC.someId;
Does the RDBMS basically execute pseudocode similar to the following?
List tempResults
for each A_record in tableA
    for each B_record in tableB
        if (A_record.someId = B_record.someId)
            tempResults.add(A_record)

List results
for each Temp_record in tempResults
    for each C_record in tableC
        if (Temp_record.someId = C_record.someId)
            results.add(C_record)
return results
So basically, the more records with the same someId tableA shares with tableB and tableC, the more records the RDBMS has to scan? And if all 3 tables have records with the same someId, then essentially a full table scan is done on all 3 tables?
Is my understanding correct?
Each vendor's query processor is coded somewhat differently, but they share many common techniques. A join can be implemented in a variety of ways, and which one is chosen, in any vendor's implementation, depends on the specific situation. Factors include whether the data is already sorted by the join attribute and the relative number of records in each table (a join between 20 records in one set of data and a million records in the other will be done differently than one where the two sets are of comparable size). I do not know the internals of MySQL, but SQL Server uses three join techniques: the merge join, the (nested) loop join, and the hash join.
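As an illustration of two of those techniques, here is a toy nested-loop join and hash join in Python (a sketch, not any vendor's actual implementation): the hash join trades memory for a single pass over each input instead of a scan of one table per row of the other.

```python
from collections import defaultdict

def nested_loop_join(left, right, key):
    # O(len(left) * len(right)): rescan right once per left row.
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    # O(len(left) + len(right)): build a hash table on one input,
    # then probe it while scanning the other.
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [(l, r) for l in left for r in buckets[l[key]]]

A = [{"someId": 1}, {"someId": 2}]
B = [{"someId": 2}, {"someId": 3}]
print(nested_loop_join(A, B, "someId"))
print(hash_join(A, B, "someId"))
```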

What is better way to join in mysql?

I want to join 3 or more tables:
table1 - 1 thousand records
table2 - 100 thousand records
table3 - 10 million records
Which of the following is best (speed-wise)?
Note: pk and fk are the primary and foreign keys of the respective tables, and FILTER_CONDITION1 and FILTER_CONDITION2 are the record-restricting conditions that would normally appear in the WHERE clause.
Case 1 :taking smaller tables first and joining larger one later
Select table1.*,table2.*,table3.*
from table1
join table2
on table1.fk = table2.pk and FILTER_CONDITION1
join table3
on table2.fk = table3.pk and FILTER_CONDITION2
Case 2
Select table1.*,table2.*,table3.*
from table3
join table2
on table2.fk = table3.pk and FILTER_CONDITION2
join table1
on table1.fk = table2.pk and FILTER_CONDITION1
Case 3
Select table1.*,table2.*,table3.*
from table3
join table2
on table2.fk = table3.pk
join table1
on table1.fk = table2.pk
where FILTER_CONDITION1 and FILTER_CONDITION2
The cases you show are equivalent. What you are describing is, in the end, the same query, and the database will see it as such: it will build a query plan.
The best thing you can do is use EXPLAIN and check what your query actually does: this way you can see that the cases will probably run the same, and whether there might be a bottleneck in there.
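A quick demonstration of the equivalence (Python with SQLite standing in for MySQL; SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here):

```python
import sqlite3

# Reordering the joins does not change the result; the optimizer
# chooses its own order either way (toy data for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (pk INTEGER PRIMARY KEY, fk INTEGER);
CREATE TABLE t2 (pk INTEGER PRIMARY KEY, fk INTEGER);
CREATE TABLE t3 (pk INTEGER PRIMARY KEY);
INSERT INTO t1 VALUES (1, 1);
INSERT INTO t2 VALUES (1, 1);
INSERT INTO t3 VALUES (1);
""")
q_small_first = """SELECT t1.pk, t2.pk, t3.pk FROM t1
                   JOIN t2 ON t1.fk = t2.pk
                   JOIN t3 ON t2.fk = t3.pk"""
q_large_first = """SELECT t1.pk, t2.pk, t3.pk FROM t3
                   JOIN t2 ON t2.fk = t3.pk
                   JOIN t1 ON t1.fk = t2.pk"""
print(conn.execute(q_small_first).fetchall())
print(conn.execute(q_large_first).fetchall())
# EXPLAIN QUERY PLAN shows what the optimizer actually chose:
for row in conn.execute("EXPLAIN QUERY PLAN " + q_small_first):
    print(row)
```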
As Nanne notes in his answer, MySQL normally picks the join order itself, but occasionally (in rare cases) it reads the tables in the wrong order and kills the query's performance. In that case you can take the following approach:
If you can filter the data from your bulky tables (table2 and table3) first - suppose you get only 500 records after joining those tables and applying the filters - then join that filtered data to your small table. There are various combinations, so check which join gives you the most filtering; EXPLAIN will help you see it, and indexes will help you retrieve the filtered data.
Having done that, you can tell MySQL to use exactly the ordering in your query with the SELECT STRAIGHT_JOIN ... syntax, much as you sometimes have to use FORCE INDEX when MySQL does not pick the proper index.

MySQL Join with many (fields) to one (secondary table) relationship

I have a query I need to perform on a table that is roughly 1M records. I am trying to reduce the churn, but unfortunately there is a UNION involved (after I figure this join out), so that may be a question for another day.
The records and data I need reference 3 fields in a table, each of which needs to pull a description from another table and return it in the same record. But when I do the inner join, it either returns only 1 field from the other table, or multiple records from the original table.
Here are some screen shots of the tables and their relationship:
Primary table containing records (1 each) with the physician record I want to pull, including up to 3 codes that can be listed in the "taxonomy" table.
Secondary table containing records (1 each) with the "Practice" field I want to pull.
A quick glance at the relationship I'm talking about
I presume that if I perform an inner join matching the 3 fields in the physicians table, it will have to iterate that table multiple times to pull each taxonomy code... but I still can't even figure out the syntax to easily pull all of these codes instead of just 1 of them.
I've tried this:
SELECT
taxonomy_codes.specialization,
physicians.provider_last_name,
physicians.provider_first_name,
physicians.provider_dba_name,
physicians.legal_biz_name,
physicians.biz_practice_city
FROM
taxonomy_codes
INNER JOIN physicians
    ON physicians.provider_taxonomy_code_1 = taxonomy_codes.taxonomy_codes
    OR physicians.provider_taxonomy_code_2 = taxonomy_codes.taxonomy_codes
    OR physicians.provider_taxonomy_code_3 = taxonomy_codes.taxonomy_codes
First, the query churns a lot, and it only returns one taxonomy specialty result, which I presume is because of the OR in the join statement. Any help would be greatly appreciated.
You have to join the taxonomy_codes table multiple times:
SELECT p.provider_last_name, p...., t1.specialization as specialization1, t2.specialization as specialization2, t3.specialization as specialization3
FROM physicians p
LEFT JOIN taxonomy_codes t1 ON t1.taxonomy_codes = provider_taxonomy_code_1
LEFT JOIN taxonomy_codes t2 ON t2.taxonomy_codes = provider_taxonomy_code_2
LEFT JOIN taxonomy_codes t3 ON t3.taxonomy_codes = provider_taxonomy_code_3
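A runnable sketch of that multi-alias pattern (Python with SQLite; column names follow the question, data invented for illustration): one lookup table joined three times, once per code column, yields a single row per physician.

```python
import sqlite3

# Join the same lookup table under three aliases, one per code column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomy_codes (taxonomy_codes TEXT PRIMARY KEY,
                             specialization TEXT);
CREATE TABLE physicians (provider_last_name TEXT,
                         provider_taxonomy_code_1 TEXT,
                         provider_taxonomy_code_2 TEXT,
                         provider_taxonomy_code_3 TEXT);
INSERT INTO taxonomy_codes VALUES ('A', 'Cardiology'), ('B', 'Oncology');
INSERT INTO physicians VALUES ('Smith', 'A', 'B', NULL);
""")
row = conn.execute("""
    SELECT p.provider_last_name,
           t1.specialization, t2.specialization, t3.specialization
    FROM physicians p
    LEFT JOIN taxonomy_codes t1 ON t1.taxonomy_codes = p.provider_taxonomy_code_1
    LEFT JOIN taxonomy_codes t2 ON t2.taxonomy_codes = p.provider_taxonomy_code_2
    LEFT JOIN taxonomy_codes t3 ON t3.taxonomy_codes = p.provider_taxonomy_code_3
""").fetchone()
print(row)  # one row, all three specializations alongside (NULL if unset)
```

The LEFT JOINs matter: an INNER JOIN would drop physicians whose second or third code is NULL.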