More efficient to use subquery before inner joins? - mysql

I'm just in the process of learning MySQL, and have something I've been wondering about.
Let's take this simple scenario: a hypothetical website for taking online courses, comprising 4 tables: Students, Teachers, Courses and Registrations (one entry per course that a student has registered for).
You can find the DB generation code on GitHub.
While the provided DB is tiny for clarity, to keep it relevant to what I need help with, let's assume that this is with a large enough database where efficiency would be a real issue - let's say hundreds of thousands of students, teachers, etc.
As far as I understand MySQL, if we want a table of students being taught by 'Charles Darwin', one possible query would be this:
Method 1
SELECT Students.name FROM Teachers
INNER JOIN Courses ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
WHERE Teachers.name = "Charles Darwin"
which does indeed return what we want.
+----------------+
| name |
+----------------+
| John Doe |
| Jamie Heineman |
| Claire Doe |
+----------------+
So here's my question:
With my (very) limited MySQL knowledge, it seems to me that here we are JOIN-ing rows onto the Teachers table, which could be quite large, while we are ultimately only after a single teacher, whom we filter for at the very end of the query.
My 'Intuition' Says that it would be much more efficient to first get a single row for the teacher we need, and then join the remaining stuff onto that instead:
Method 2
SELECT Students.name
FROM (SELECT Teachers.id FROM Teachers WHERE Teachers.name = "Charles Darwin") AS Teacher
INNER JOIN Courses ON Teacher.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
But is that really the case? Assuming thousands of teachers and students, is this more efficient than the first query? It could be that MySQL is smart enough to process the Method 1 query in such a way that it runs more efficiently.
Also, if anyone could suggest an even more efficient query, I would be quite interested to hear it too.
Note: I've read before to use EXPLAIN to figure out how efficient a query is, but I don't understand MySQL well enough to be able to decipher the result. Any insight here would be much appreciated as well.

My 'Intuition' Says that it would be much more efficient to first get
a single row for the teacher we need, and then join the remaining
stuff onto that instead:
You are getting a single row for teacher in method 1 by using the predicate Teachers.name = "Charles Darwin". The query optimiser should determine that it is more efficient to restrict the Teacher set using this predicate before joining the other tables.
If you don't trust the optimiser or want to lessen the work it does, you can even force the table read order by using SELECT STRAIGHT_JOIN ... or STRAIGHT_JOIN instead of INNER JOIN, to make sure that MySQL reads the tables in the order you have specified in the query.
Your second query returns the same answer but may be less efficient, because MySQL may materialise the teacher subquery as a temporary table (newer versions can often merge such a derived table into the outer query instead).
The EXPLAIN documentation is a good source on how to interpret the EXPLAIN output.
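As a rough sketch of how to check this yourself, you can prefix Method 1 with EXPLAIN. The table and column names below are the ones from the question; the index name idx_teachers_name is made up for illustration, and the exact output depends on your MySQL version and data.

```sql
-- Hypothetical index so the name predicate can use an index lookup
-- instead of scanning the whole Teachers table.
CREATE INDEX idx_teachers_name ON Teachers (name);

-- Show the plan MySQL chooses for Method 1.
EXPLAIN
SELECT Students.name
FROM Teachers
INNER JOIN Courses ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
WHERE Teachers.name = 'Charles Darwin';

-- The same query, forcing the tables to be read in the order written.
SELECT STRAIGHT_JOIN Students.name
FROM Teachers
INNER JOIN Courses ON Teachers.id = Courses.teacher_id
INNER JOIN Registrations ON Courses.id = Registrations.course_id
INNER JOIN Students ON Registrations.student_id = Students.id
WHERE Teachers.name = 'Charles Darwin';
```

EXPLAIN lists the tables in the order MySQL intends to read them. If Teachers comes first with a small estimate in the rows column, the filter is being applied before the joins, which is what Method 2 was trying to achieve by hand.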

Related

count students from table where join mysql

I have the database shown in the image below. I want:
to count all students from Subject with name Psychology and class with name Class5.
the percentage of students with status "Something" from subject with name Psychology and class with name Class5.
All students and the class name from Class "Class6" that are male.
I've tried for example
(in english:)
SELECT COUNT(student_name) AS NumberOfStudents FROM student_srms JOIN class_srms JOIN subject_srms WHERE class_srms.class_name='Class5' AND subject_srms.subject_name='Psychology'
But it returns NumberOfStudents = 20, and 20 is the total number of student entries.
The issue likely stems from your FROM clause. It's not enough to just say JOIN. You need to specify the relationship of the columns between the two tables being joined with an ON clause:
FROM student_srms
JOIN class_srms
ON student_srms.student_id = class_srms.student_id
JOIN subject_srms
ON class_srms.subject_id = subject_srms.subject_id
I believe MySQL has a NATURAL JOIN, which joins on the columns that have the same name in both tables without needing an ON clause, but that feels dirty to me and could cause failures later in an application's lifecycle if new columns are introduced that share names, but not relationships, so I would just steer clear of that.
I have a suspicion that your diagram showing tables/columns is incorrect based on the error you are reporting in the comments. Instead, try (and I'm totally guessing blind here at this point):
FROM student_srms
JOIN student_class
ON student_srms.student_id = student_class.student_id
JOIN class_srms
ON student_class.class_id = class_srms.class_id
JOIN subject_srms
ON class_srms.subject_id = subject_srms.subject_id
That adds in that student_class relationship table so you can make the jump from student to class tables. Fingers crossed.
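Putting the ON clauses from the first suggestion together with the original WHERE clause, the full queries might look like the sketch below. It reuses the guessed join columns from above, so it only works if those guesses match the real schema; student_srms.status in the second query is an assumed column name that does not appear in the question.

```sql
-- Number of students in Class5 taking Psychology.
SELECT COUNT(*) AS NumberOfStudents
FROM student_srms
JOIN class_srms
  ON student_srms.student_id = class_srms.student_id
JOIN subject_srms
  ON class_srms.subject_id = subject_srms.subject_id
WHERE class_srms.class_name = 'Class5'
  AND subject_srms.subject_name = 'Psychology';

-- Percentage of those students with a given status.
-- In MySQL a comparison evaluates to 1 or 0, so SUM() counts the matches.
SELECT 100 * SUM(student_srms.status = 'Something') / COUNT(*) AS PercentWithStatus
FROM student_srms
JOIN class_srms
  ON student_srms.student_id = class_srms.student_id
JOIN subject_srms
  ON class_srms.subject_id = subject_srms.subject_id
WHERE class_srms.class_name = 'Class5'
  AND subject_srms.subject_name = 'Psychology';
```

Without the ON clauses, every student row is paired with every matching class and subject row, which is why the original query counted all 20 students.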

A large database or several smaller?

After reading a lot about it and similar questions, I am still not clear about the following case.
I have a schema like this in one MySQL database, where I store the probabilities of matches of more than 10 sports depending on the type of result
(is intended for an application that shows the odds for each sport on different pages but I will never mix sports on the same page):
Design 1: a single database
SPORT
    id
    name
TEAM
    id
    sportId
    name
    birth
MATCHES
    id
    sportId
    teamId_1
    teamId_2
    result
    date
PROBABILITIES
    id
    matchId
    type
    percentage
(table Probabilities is very long, almost a billion rows, and will grow over time)
All necessary fields are correctly indexed. Then, to see all the probabilities for football matches (sport id = 1) that do not yet have a result, I would run the following query:
SELECT s.name, t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM matches m
INNER JOIN team t1 ON t1.id = m.teamId_1
INNER JOIN team t2 ON t2.id = m.teamId_2
INNER JOIN sport s ON s.id = m.sportId
INNER JOIN probabilities p ON p.matchId = m.id
WHERE result IS NULL
AND s.id = 1
This database design is great because it allows me to work comfortably with an ORM like Prisma. But for my team the most important thing is speed and performance.
Knowing this, is it a good idea to do it this way or would it be better to separate the tables into several databases?
Design 2: one database per sport
Database Football
    TEAM
        id
        sportId
        name
        birth
    MATCHES
        id
        teamId_1
        teamId_2
        date
    PROBABILITIES
        id
        matchId
        type
        percentage
Database Basketball
    TEAM
        id
        sportId
        name
        birth
    MATCHES
        id
        teamId_1
        teamId_2
        date
    PROBABILITIES
        id
        matchId
        type
        percentage
The probabilities table is much smaller, in some sports only thousands of rows.
So if, for example, I only need to take the football probabilities I make a query like this:
SELECT t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM football.matches m
INNER JOIN football.team t1 ON t1.id = m.teamId_1
INNER JOIN football.team t2 ON t2.id = m.teamId_2
INNER JOIN football.probabilities p ON p.matchId = m.id
WHERE result IS NULL
Or is there some other way to improve the speed and performance of the database such as partitioning the probabilities table when we only query the most recent rows in the database?
If you make one database per sport you are locking the application into that decision. If you put them all together in one you can separate them later if necessary. I doubt it will be.
But for my team the most important thing is speed and performance.
At this early stage the most important thing is getting something working so you can use it and discover what it actually needs to do. Then adapt the schema as you learn.
Your major performance problems won't come from whether you have one database or many, but more pedestrian issues of indexing, bad queries, and schema design.
To that end...
Keep the schema simple
Keep the schema flexible
Consider a data warehouse
To the first, that means one database. Don't add the complication of maintaining multiple copies of the schema if you don't need to.
To the second, use schema migrations and keep the details of the schema out of the application code. An ORM is a good start, but also employ the Repository Pattern, Decorator Pattern, Service Pattern, and others to keep details of your tables from leaking out into your code. Then when it inevitably comes time to change your schema, you can do so without having to rewrite all the code which uses it.
Your concerns can be solved with indexing and partitioning, most likely by partitioning probabilities, but without knowing your queries I can't say on what key. For example, you might want to partition by the age of the match, since newer matches are more interesting than old ones. It's hard to say. Fortunately, partitioning can be added later.
The rest of the tables should be relatively small, so partitioning by team isn't likely to help. Poor partitioning choices can even slow things down.
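As a rough illustration of what partitioning probabilities could look like, here is a sketch that assumes match ids grow over time, so ranges of matchId roughly correspond to the age of the match. The column types, the index name and the partition boundaries are all assumptions made for the example, not something taken from your schema.

```sql
-- Sketch only: types, boundaries and idx_match are illustrative assumptions.
CREATE TABLE probabilities (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  matchId    BIGINT UNSIGNED NOT NULL,
  `type`     VARCHAR(32)     NOT NULL,
  percentage DECIMAL(5,2)    NOT NULL,
  -- The partitioning column has to be part of every unique key,
  -- so the primary key is widened to include matchId.
  PRIMARY KEY (id, matchId),
  KEY idx_match (matchId)
)
PARTITION BY RANGE (matchId) (
  PARTITION p_old    VALUES LESS THAN (1000000),
  PARTITION p_recent VALUES LESS THAN (2000000),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- A query restricted to recent matches only needs to touch p_recent and p_future.
SELECT matchId, `type`, percentage
FROM probabilities
WHERE matchId >= 1500000;
```

One trade-off to be aware of: partitioned InnoDB tables cannot have foreign keys, so the link from probabilities to matches would have to be enforced by the application rather than by the database.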
Finally, what might be best for performance is to separate the statistical tables into a data warehouse optimized for big data and statistics. Do the stats there and have the application query them. This separates the runtime schema, which must have low latency and benefits from being kept small, from the statistical database, which is mostly reporting on pre-calculated statistical queries.
Some notes on your schema.
Remove "sport" from the matches. It's redundant. Get it from the teams. Add a constraint to ensure both teams are playing the same sport.
Don't name a column date. First, it's a keyword. Second, date of what? What if there's another date associated with the match? Third, what about the time of the match? Make it specific: scheduled_at. Use a timestamp type.
Result should be its own table. You're going to want to store a lot of information about the result of the match.
In MySQL, a "DATABASE" is a very lightweight thing. It makes virtually no difference to MySQL and queries as to whether you have one db or two. Or 20.
You might need a little bit of syntax to handle JOINs:
One db:
USE db;
SELECT a.x, b.y
FROM a
JOIN b ON ...;
Two dbs:
USE db;
SELECT a.x, b.y
FROM db1.a AS a
JOIN db2.b AS b ON ...;
The performance of those two is the same.
Bottom Line: Do what feels good to you, the developer.

mysql left join with condition on table1

I'd like to do a left join using only certain rows of the first table in mysql. Currently I do something like:
SELECT students.* FROM students
LEFT JOIN courses
ON students.id = courses.id
WHERE students.id = 6
But will mysql first select rows from table1 (students) satisfying students.id = 6, before doing the left join?
If not, is there a way to force MySQL to do so?
Thanks.
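Yes. MySQL reads the left table of a LEFT JOIN first and applies a WHERE condition on that table as it reads it, so your query already restricts students to id = 6 before joining. A HAVING clause would not help, because it is applied only after the join. If you want the order to be explicit, filter in a derived table:
SELECT s.*
FROM (SELECT * FROM students WHERE id = 6) AS s
LEFT JOIN courses
ON s.id = courses.id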
It seems to me that you are trying to optimize for the DB. If the query is not slow, and your testing data set is a reasonable approximation of the production DB, then it is not best practice to do so, for many reasons.
With that said (maybe I am not right, so ignore me),
I think you're trying to say, "How can I speed things up by limiting the number of rows that the DB looks at during the query?"
The best way to do this is to ensure you have proper indexes on the tables, on the fields that you're going to query. Index lookups are extremely fast. If you do not already index the .id fields of both tables, add those indexes to the DB, and that may solve your issue.
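A minimal sketch of that advice, assuming students.id is already the primary key (and therefore indexed) and that courses.id is not; the index name idx_courses_id is made up for the example.

```sql
-- Hypothetical index on the column used in the join condition.
CREATE INDEX idx_courses_id ON courses (id);

-- Check how MySQL plans to execute the original query.
EXPLAIN
SELECT students.*
FROM students
LEFT JOIN courses
  ON students.id = courses.id
WHERE students.id = 6;
```

If the plan shows students being read first through its primary key with a single estimated row, the filter is already applied before the join and no further forcing is needed.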

Scalable way of doing self join with many to many table

I have a table structure like the following:
user
    id
    name
profile_stat
    id
    name
profile_stat_value
    id
    name
user_profile
    user_id
    profile_stat_id
    profile_stat_value_id
My question is:
How do I evaluate a query where I want to find all users with profile_stat_id and profile_stat_value_id for many stats?
I've tried doing an inner self join, but that quickly gets crazy when searching for many stats. I've also tried doing a count on the actual user_profile table, and that's much better, but still slow.
Is there some magic I'm missing? I have about 10 million rows in the user_profile table and want the query to take no longer than a few seconds. Is that possible?
Typically databases are able to handle 10 million records in a decent manner. I have mostly used Oracle in our professional environment with large amounts of data (about 30-40 million rows), and even join queries on those tables have never taken more than a second or two to run.
One important lesson I learned whenever query performance was bad was to check whether indexes are defined properly on the join fields. Here, for example, profile_stat_id and profile_stat_value_id should have indexes defined (I am assuming user_id is already indexed as part of the primary key). This should give you a good performance increase if you have not done that already.
After defining the indexes, run the query once or twice to give the DB a chance to warm up its caches and settle on a query plan before measuring the gain.
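To make the index suggestion concrete, here is a sketch of a composite index together with the count-based query mentioned in the question, which avoids one self-join per stat. The index name and the (stat, value) id pairs are invented for the example; the column names are the ones from the question.

```sql
-- Hypothetical composite index: the two lookup columns come first,
-- and user_id is included so the query can be answered from the index alone.
CREATE INDEX idx_stat_value_user
  ON user_profile (profile_stat_id, profile_stat_value_id, user_id);

-- Users who have ALL three of the listed (stat, value) pairs.
SELECT user_id
FROM user_profile
WHERE (profile_stat_id, profile_stat_value_id) IN ((1, 10), (2, 25), (3, 7))
GROUP BY user_id
-- 3 is the number of distinct stats being searched for.
HAVING COUNT(DISTINCT profile_stat_id) = 3;
```

Adding another stat to the search only means adding a pair to the IN list and bumping the number in the HAVING clause, so the query shape stays the same however many stats are involved.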
Superficially, you seem to be asking for this, which includes no self-joins:
SELECT u.name, u.id, s.name, s.id, v.name, v.id
FROM User_Profile AS p
JOIN User AS u ON u.id = p.user_id
JOIN Profile_Stat AS s ON s.id = p.profile_stat_id
JOIN Profile_Stat_Value AS v ON v.id = p.profile_stat_value_id
Any of the joins listed can be changed to a LEFT OUTER JOIN if the corresponding table need not have a matching entry. All this does is join the central User_Profile table with each of the other three tables on the appropriate joining column.
Where do you think you need a self-join?
[I have not included anything to filter on 'the many stats'; it is not at all clear to me what that part of the question means.]

MySQL -- joining then joining then joining again

MySQL setup: step by step.
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this. My speakers table is also linked to a third table, "books." So in the "speakers" table, I have "book_id" and in the "books" table, the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem: multiple books can be assigned to a single speaker.
Using the left join returns two rows. What do I need to add to return only a single row that still lists both books?
Is there any chance that the books table doesn't have any matching rows for speakers.book_id?
Try using a left join which will still return the program/speaker combinations, even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: Response to op author comment
You can use GROUP BY and GROUP_CONCAT to put all the books on one row.
e.g.
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id, speakers.speaker_name, programs.program_id, programs.program_name
LIMIT 5
Note: since I don't know the exact column names, these may be off
That's typically efficient. You are probably making an assumption about your data that isn't true. Do your speakers have books assigned? If they don't, that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main issue would be whether your indexes are covering (which is more likely to occur if you don't use SELECT *, but instead select only the columns you need).
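As a sketch of the point about not using SELECT *, the query below names only the columns it needs. The column names program_name, speaker_name and book_name are guesses carried over from the earlier answer, and the index names are made up; adjust both to the real schema.

```sql
-- Hypothetical indexes supporting the filter and the first join.
CREATE INDEX idx_programs_category ON programs (category_id);
CREATE INDEX idx_speakers_program ON speakers (program_id);

-- Select only the columns that are actually needed,
-- returning the book name rather than the book_id.
SELECT programs.program_name,
       speakers.speaker_name,
       books.book_name
FROM programs
JOIN speakers ON programs.program_id = speakers.program_id
LEFT JOIN books ON speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5;
```

The fewer columns a query asks for, the more likely it is that an index can supply all of them without MySQL having to read the full rows.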