After reading a lot about this and similar questions, I am still not clear about the following case.
I have a schema like this in one MySQL database, where I store the probabilities of matches across more than 10 sports, depending on the type of result
(it is intended for an application that shows the odds for each sport on different pages, but I will never mix sports on the same page):
Design 1: a single database
SPORT
id
name
TEAM
id
sportId
name
birth
MATCHES
id
sportId
teamId_1
teamId_2
result
date
PROBABILITIES
id
matchId
type
percentage
(the Probabilities table is very large, almost a billion rows, and will grow over time)
All necessary fields are correctly indexed. Then, to see all the probabilities of the matches that have no result yet in football (the sport with id = 1), I would make the following query:
SELECT s.name, t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM matches m
INNER JOIN team t1 ON t1.id = m.teamId_1
INNER JOIN team t2 ON t2.id = m.teamId_2
INNER JOIN sport s ON s.id = m.sportId
INNER JOIN probabilities p ON p.matchId = m.id
WHERE m.result IS NULL
AND s.id = 1
This database design is great because it allows me to work comfortably with an ORM like Prisma. But for my team the most important thing is speed and performance.
Knowing this, is it a good idea to do it this way, or would it be better to separate the tables into several databases?
Design 2: one database per sport
Database Football
TEAM
id
sportId
name
birth
MATCHES
id
teamId_1
teamId_2
result
date
PROBABILITIES
id
matchId
type
percentage
Database Basketball
TEAM
id
sportId
name
birth
MATCHES
id
teamId_1
teamId_2
result
date
PROBABILITIES
id
matchId
type
percentage
The probabilities table is much smaller; in some sports it has only thousands of rows.
So if, for example, I only need the football probabilities, I make a query like this:
SELECT t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM football.matches m
INNER JOIN football.team t1 ON t1.id = m.teamId_1
INNER JOIN football.team t2 ON t2.id = m.teamId_2
INNER JOIN football.probabilities p ON p.matchId = m.id
WHERE m.result IS NULL
Or is there some other way to improve the speed and performance of the database, such as partitioning the probabilities table, given that we only query the most recent rows?
If you make one database per sport you are locking the application into that decision. If you put them all together in one you can separate them later if necessary. I doubt it will be.
But for my team the most important thing is speed and performance.
At this early stage the most important thing is getting something working so you can use it and discover what it actually needs to do. Then adapt the schema as you learn.
Your major performance problems won't come from whether you have one database or many, but more pedestrian issues of indexing, bad queries, and schema design.
To that end...
Keep the schema simple
Keep the schema flexible
Consider a data warehouse
To the first, that means one database. Don't add the complication of maintaining multiple copies of the schema if you don't need to.
To the second, use schema migrations and keep the details of the schema out of the application code. An ORM is a good start, but also employ the Repository Pattern, Decorator Pattern, Service Pattern, and others to keep details of your tables from leaking into your code. Then, when it inevitably comes time to change your schema, you can do so without rewriting all the code that uses it.
Your concerns can be solved with indexing and partitioning, most likely by partitioning probabilities, but without knowing your queries I can't say on what. For example, you might want to partition by the age of the match, since newer matches are more interesting than old ones. It's hard to say. Fortunately, partitioning can be added later.
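As a rough sketch of what date-based partitioning could look like, assuming a match_date column is denormalized into probabilities (MySQL requires the partition key to be a column of the table, and part of every unique key, including the primary key):

```sql
-- Sketch only: assumes probabilities carries a denormalized match_date
-- column and that the primary key has been extended to include it.
ALTER TABLE probabilities
    PARTITION BY RANGE (YEAR(match_date)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION p2024 VALUES LESS THAN (2025),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
```

Queries that filter on match_date can then prune old partitions instead of scanning the whole billion-row table.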
The rest of the tables should be relatively small, so partitioning by team isn't likely to help. Poor partitioning choices can even slow things down.
Finally, what might be best for performance is to separate the statistical tables into a data warehouse optimized for big data and statistics. Do the stats there and have the application query them. This separates the runtime schema, which must have low latency and benefits from being kept small, from the statistical database, which mostly reports on pre-calculated statistical queries.
Some notes on your schema.
Remove sportId from matches. It's redundant; get the sport from the teams. Add a constraint to ensure both teams are playing the same sport.
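MySQL CHECK constraints cannot reference other tables, so a cross-table rule like this is usually enforced with a trigger. A sketch, using the table and column names from the question:

```sql
DELIMITER $$
CREATE TRIGGER matches_same_sport
BEFORE INSERT ON matches
FOR EACH ROW
BEGIN
    -- Reject the row if the two teams belong to different sports.
    IF (SELECT sportId FROM team WHERE id = NEW.teamId_1)
       <> (SELECT sportId FROM team WHERE id = NEW.teamId_2) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Both teams must play the same sport';
    END IF;
END$$
DELIMITER ;
```

A matching BEFORE UPDATE trigger would be needed as well. An alternative is a composite foreign key (teamId_1, sportId) referencing a unique key (id, sportId) on team, which trades a redundant column for declarative enforcement.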
Don't name a column date. First, it's a keyword. Second, date of what? What if there's another date associated with the match? Third, what about the time of the match? Make it specific, such as scheduled_at, and use a timestamp type.
Result should be its own table. You're going to want to store a lot of information about the result of a match.
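One possible shape for such a table; all names here are illustrative, not taken from the question:

```sql
CREATE TABLE match_result (
    match_id   BIGINT PRIMARY KEY,
    score_1    INT NOT NULL,
    score_2    INT NOT NULL,
    winner_id  BIGINT NULL,          -- NULL for a draw
    decided_at TIMESTAMP NOT NULL,
    FOREIGN KEY (match_id) REFERENCES matches (id)
);
```

A match with no row here simply has no result yet, which replaces the result IS NULL check with a NOT EXISTS or an anti-join.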
In MySQL, a "DATABASE" is a very lightweight thing. It makes virtually no difference to MySQL and queries as to whether you have one db or two. Or 20.
You might need a little bit of syntax to handle JOINs:
One db:
USE db;
SELECT a.x, b.y
FROM a
JOIN b ON ...;
Two dbs:
USE db;
SELECT a.x, b.y
FROM db1.a AS a
JOIN db2.b AS b ON ...;
The performance of those two is the same.
Bottom Line: Do what feels good to you, the developer.
Related to Join vs. sub-query but a different type of situation, and just trying to understand how this works.
I had to make a view where I get a bunch of employee codes from one table, and I have to get their names from a different table - the same two tables every time. I arranged my query like this:
SELECT
(SELECT name from emptable where empcod = code1) as emp1, code1,
(SELECT name from emptable where empcod = code2) as emp2, code2,
[repeat 6 times]
FROM codetable
It is more complicated than this, and more tables are joined, but this is the element I want to ask about. My boss says joining like so is better:
SELECT e1.name, c.code1, e2.name, c.code2, e3.name, code3 [etc]
FROM codetable c
INNER JOIN emptable e1 ON e1.empcod = c.code1
INNER JOIN emptable e2 ON e2.empcod = c.code2
INNER JOIN emptable e3 ON e3.empcod = c.code3
My reasoning, aside from not having to search through the joins to see which table gets whose name and why, is that I understand the join like this:
Take whole table A
Take whole table B
Combine all the data from both tables according to the 'ON' section of the join
select the one string I need from this complete combination of two whole tables, from which I need no other data
I think it's obvious why this seems like it would take up a lot of resources. I understand the subquery as:
Get one datum from table A (the employee code)
Match this one datum to every record from table B until you find a match
As soon as you get a match, bring back this one single datum from this other table (the employee's name)
Understanding that in the employees table the employee code is a primary key and cannot be duplicated, every subquery can only ever give me one single string back.
It seems to me that comparing ONE number from one table to ONE number from another table and retrieving ONE string related to that number would be less resource-intensive than matching ALL of the data in two whole tables together in order to get this one string. But I figure I don't know what these databases are doing behind the scenes, and a lot of people seem to prefer joins. Can anyone explain this to me if I'm understanding it wrong? The similar posts I've found here tend to want more information from more tables; I haven't found anything about matching the same two tables six or seven times to retrieve one single string for every configuration.
Thanks in advance.
So, as ScaisEdge explained, a join only gets executed once, and thus only spends time and resources once, no matter how many rows you have, whereas each of the six subselects is executed once for every row. If you have 100 rows, you're executing six joins once, or you're executing six subselects 100 times each.
It makes sense that this would be more resource-intensive, and I did not explain clearly enough that my case involves only one row at a time - in which case I guess the difference would be negligible anyway.
After reading the question title you may find it silly, but I'm seriously asking this question out of curiosity.
I'm using MySQL database system.
Consider below the two tables :
Customers(CustomerID(Primary Key), CustomerName, ContactName, Address, City, PostalCode, Country)
Orders(OrderID(Primary Key), CustomerID(Foreign Key), EmployeeID, OrderDate, ShipperID)
Now I want to get the details of all orders: that is, which order was placed by which customer?
So, I did it in two ways :
First way:
SELECT o.OrderID, o.OrderDate, c.CustomerName
FROM Customers AS c, Orders AS o
WHERE c.CustomerID=o.CustomerID;
Second way:
SELECT Orders.OrderID, Orders.OrderDate, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID;
In both cases I'm getting exactly the same correct result. My question is: why does MySQL need the additional and confusing concept of INNER JOIN when we can achieve the same results without it?
Is the Inner Join more effective in any manner?
What you are looking at is ANSI-89 syntax (A,B WHERE) vs ANSI-92 syntax (A JOIN B ON).
For very simple queries, there is no difference. However, there are a number of things you can do with ANSI-92 that you cannot do or that become very difficult to implement and maintain in ANSI-89. Anything more than two tables involved, more than one condition in the same join, or separating LEFT JOIN conditions from WHERE conditions are all much harder to read and work with in the older syntax.
The old A,B WHERE syntax is generally considered obsolete and avoided, even for the simple queries where it still works.
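As a concrete illustration of where the two syntaxes diverge, consider an outer join with an extra condition, using the Customers/Orders tables from the question:

```sql
-- Keep every customer, pairing each with only their 2023 orders.
-- The date filter must live in the ON clause: moving it to WHERE
-- would discard the NULL-extended rows and silently turn this back
-- into an inner join. ANSI-89 has no clean way to express this.
SELECT c.CustomerName, o.OrderID
FROM Customers AS c
LEFT JOIN Orders AS o
       ON o.CustomerID = c.CustomerID
      AND o.OrderDate >= '2023-01-01';
```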
Hardware-level optimization trade-offs come second to users being able to maintain their queries.
Having explicit, clean code is better than having esoteric, implicit code. In actual production relational databases, most of the queries that take too long are the ones where the tables appear as a comma-separated list. These queries show that:
The user did not put effort into expressing the order in which the tables are joined.
All the join conditions are cluttered in one place instead of organized in their own space, one per join.
If all of a user's queries are in this format, the user is not taking advantage of outer joins. There are many cases where a relationship between tables is (1) to (0-many) or (many) to (many) rather than (1) to (1-many).
As in most use cases, these queries start to become a problem as the number of joins increases. Beginners choose to query tables by listing them with commas because it takes less typing. At first this does not seem to be a problem, because only two or three tables are joined, and it becomes a habit. As they start to write more complicated queries with more joins, those queries become harder to maintain, as described in the bullet points above.
Conclusion: As the number of joins within a query scales, improper indentation and categorization make the query harder to maintain.
You should use INNER JOIN and indent your query as below so it is easy for others to read:
SELECT
Orders.OrderID,
Orders.OrderDate,
Customers.CustomerName
FROM Orders
INNER JOIN Customers
ON Customers.CustomerID = Orders.CustomerID;
I'm just in the process of learning MySQL, and have something I've been wondering about.
Let's take this simple scenario: A hypothetical website for taking online courses, comprised of 4 tables: Students, Teachers, Courses and Registrations (one entry per course that a student has registered for)
You can find the DB generation code on github.
While the provided DB is tiny for clarity, to keep it relevant to what I need help with, let's assume that this is with a large enough database where efficiency would be a real issue - let's say hundreds of thousands of students, teachers, etc.
As far as I understand, with MySQL, if we want a table of students being taught by 'Charles Darwin', one possible query would be this:
Method 1
SELECT Students.name FROM Teachers
INNER JOIN Courses ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
WHERE Teachers.name = "Charles Darwin"
which does indeed return what we want.
+----------------+
| name |
+----------------+
| John Doe |
| Jamie Heineman |
| Claire Doe |
+----------------+
So Here's my question:
With my (very) limited MySQL knowledge, it seems to me that here we are JOIN-ing elements onto the Teachers table, which could be quite large, while we are ultimately only after a single teacher, whom we filter out at the very end of the query.
My 'intuition' says that it would be much more efficient to first get a single row for the teacher we need, and then join the remaining stuff onto that instead:
Method 2
SELECT Students.name FROM (SELECT Teachers.id FROM Teachers WHERE Teachers.name =
"Charles Darwin") as Teacher
INNER JOIN Courses ON Teacher.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
But is that really the case? Assuming thousands of teachers and students, is this more efficient than the first query? It could be that MySQL is smart enough to parse the Method 1 query in such a way that it runs efficiently anyway.
Also, if anyone could suggest an even more efficient query, I would be quite interested to hear it too.
Note: I've read that one should use EXPLAIN to figure out how efficient a query is, but I don't understand MySQL well enough to decipher the result. Any insight here would be much appreciated as well.
My 'intuition' says that it would be much more efficient to first get a single row for the teacher we need, and then join the remaining stuff onto that instead:
You are already getting a single row for the teacher in Method 1, via the predicate Teachers.name = "Charles Darwin". The query optimiser should determine that it is more efficient to restrict the Teachers set using this predicate before joining the other tables.
If you don't trust the optimiser, or want to lessen the work it does, you can even force the table read order by using SELECT STRAIGHT_JOIN ..., or by writing STRAIGHT_JOIN instead of INNER JOIN, to make sure that MySQL reads the tables in the order you have specified in the query.
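For example, applied to the Method 1 query from the question:

```sql
-- STRAIGHT_JOIN forces MySQL to read the tables in the written order,
-- so Teachers is filtered first, as the intuition suggested.
SELECT STRAIGHT_JOIN Students.name
FROM Teachers
INNER JOIN Courses       ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students      ON Registrations.student_id = Students.id
WHERE Teachers.name = 'Charles Darwin';
```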
Your second query gives the same answer but may be less efficient, because a temporary table is created for your teacher subquery.
The EXPLAIN documentation is a good source on how to interpret the EXPLAIN output.
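As a starting point for reading the output, prefix the query with EXPLAIN:

```sql
EXPLAIN
SELECT Students.name FROM Teachers
INNER JOIN Courses       ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students      ON Registrations.student_id = Students.id
WHERE Teachers.name = 'Charles Darwin';
```

The columns worth checking first are type (ref and eq_ref mean index lookups; ALL means a full table scan), key (which index was chosen, if any), and rows (the optimiser's estimate of rows examined per lookup).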
I'm working on a front end for a database where each table essentially has a many to many relationship with all other tables.
I'm not a DB admin, just someone with a few basic DB courses. The typical solution in this case, as I understand it, would be multiple link tables, one joining each pair of 'real' tables. Here's what I'm proposing instead: one link table that has foreign key dependencies on the PKs of all the other tables.
Is there any reason this could turn out badly in terms of scalability, flexibility, etc down the road?
So you're trying to decide whether to take a star pattern or an asterisk pattern?
I'd certainly advocate the asterisk. Just as code generally has a driver method, there should be a driver table if the schema is as you described. Look at the total number of tables you'll need for each number of "main" tables:
Main Junct Total
-------------------
2 1 3
3 3 6
4 6 10
5 10 15
6 15 21
7 21 28
Seven is probably the most you would have in a database schema. (In general, n main tables need n(n-1)/2 pairwise junction tables, which is why the Junct column grows quadratically.)
In addition, this way you can do complex queries involving three main tables without having to go through three junction tables; you only ever touch one junction table, no matter how many main tables you want.
Scalability? No.
Flexibility? Only if your schema changes dramatically.
If I understand your proposal correctly, what you are thinking of doing is a minor variation on the theme of the 'One True Lookup Table' (OTLT), which is not a good idea. In this case, perhaps, OTLT stands for 'One True Linking Table'.
The problems come when you need to maintain the referential integrity of the OTLT. For starters, what is its schema?
ReferencingTable INTEGER (or VARCHAR(xx)?)
ReferencingId INTEGER
ReferencedTable INTEGER (or VARCHAR(xx)?)
ReferencedId INTEGER
The table IDs have to be watched. They can be copies of the values in the system catalog, but then you have to worry about what happens when you rebuild one of the tables (typically, the table ID changes). Or they can be separately controlled values, a parallel set of tables.
Next, you have to worry about the asymmetry in the naming of the columns in what should be a symmetric setup; the OTLT connects Table1 to Table2 just as much as it does Table2 to Table1 -- unless, indeed, your relationships are asymmetric. That just complicates life enormously.
Now, suppose you need to join primary tables Table1 to Table2 and Table2 to Table3, each via the OTLT, and that the table IDs are 1, 2, and 3, and that the 'ReferencingTable' is always the smaller of the two in the OTLT:
SELECT T1.*, T2.*, T3.*
FROM Table1 AS T1
JOIN OTLT AS O1 ON T1.Id = O1.ReferencingId AND O1.ReferencingTable = 1
JOIN Table2 AS T2 ON T2.Id = O1.ReferencedId AND O1.ReferencedTable = 2
JOIN OTLT AS O2 ON T2.Id = O2.ReferencingId AND O2.ReferencingTable = 2
JOIN Table3 AS T3 ON T3.Id = O2.ReferencedId AND O2.ReferencedTable = 3
So, here you have two independent sets of joins via the OTLT.
The alternative formulation uses separate joining tables for each pair. The rows in these joining tables are smaller:
ReferencingID INTEGER
ReferencedID INTEGER
And, assuming that the joining tables are named Join_T1_T2, etc, the query above becomes:
SELECT T1.*, T2.*, T3.*
FROM Table1 AS T1
JOIN Join_T1_T2 AS J1 ON T1.Id = J1.ReferencingId
JOIN Table2 AS T2 ON T2.Id = J1.ReferencedId
JOIN Join_T2_T3 AS J2 ON T2.Id = J2.ReferencingId
JOIN Table3 AS T3 ON T3.Id = J2.ReferencedId
There are just as many references to tables (5) as before, but the DBMS can automatically maintain the referential integrity on these joining tables, whereas with the OTLT that maintenance has to be written by hand. The joins are also simpler (no AND clauses).
In my view, this weighs strongly against the OTLT system and in favour of specialized linking tables for each significant pairing of the primary tables.
My main problem with a single link table is what happens when a 'link' suddenly turns into an entity. For example, you may have 'shopper' and 'store' entities. That relationship can be many-to-many, as a shopper can go to many stores and a store will have many shoppers.
Next month you decide you want to record how much a shopper spends in a store. Suddenly you either have to add a 'purchase' amount to your generic link table, or rebuild a big chunk of your application to use a specific link table for that link instead of the generic one.
You have two options with this setup.
Ensure each row indicates a link to only one table. This is a degenerate model of having individual join tables.
Ensure you have all the link combinations, which highly inflates the size of the table. Given a row in the primary table that joins to 4, 5, and 6 records in each of three other tables, you need 4 * 5 * 6 = 120 rows in your join table. You also need logic to handle the absence of a join to a table. If you need joins to only the first table, you have to filter the 120 rows you get down to 4.
There are cases where you will have multiple table relationships, but these will be driven by the design. Relationships frequently carry information such as start and end dates. These are problematic for the one true lookup table, as you would need to carry columns for each possible relationship.
My MySQL setup, step by step:
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this: my speakers table is also linked to a third table, "books". In the speakers table I have book_id, and in the books table the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem. Multiple books can be assigned to a single speaker.
Using the left join returns two rows! What do I need to add to return only a single row, but keep the two books separate?
Is there any chance that the books table doesn't have a matching row for some speakers.book_id values?
Try using a left join which will still return the program/speaker combinations, even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: in response to the OP's comment
You can use GROUP BY and GROUP_CONCAT to put all the books on one row, e.g.:
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id
LIMIT 5
Note: since I don't know the exact column names, these may be off
That's typically efficient. There is probably some assumption you are making that isn't true: do all your speakers have books assigned? If they don't, that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main question would be whether your indexes are covering, which is more likely if you don't use SELECT * but instead select only the columns you need.