SQL Outer Join - improper execution - mysql

I am working with learning SQL, I have taken the basics course on pluralsight, and now I am using MySQL through Treehouse, with dummy databases they've set up, through the MySQL server. Once my training is complete I will be using SQLServer daily at work.
I ran into a two-part challenge yesterday that I had some trouble with.
The first question in the challenge was:
"We have a 'movies' table with a 'title' and 'genre_id' column and a
'genres' table which has an 'id' and 'name' column. Use an INNER JOIN
to join the 'movies' and 'genres' tables together only selecting the
movie 'title' first and the genre 'name' second."
Understanding how to properly set up JOINS has been a little confusing for me, because the concepts seem simple but like in cooking, execution is everything ---and I'm doing it wrong. I was able to figure this one out after some trial and error, work, and rewatching the Treehouse explanation a few times; here is how I solved the first question, with a Treehouse-accepted answer:
SELECT movies.title, genres.name FROM movies INNER JOIN genres ON movies.genre_id = genres.id;
--BUT--
The next question of the challenge I have not been so successful with, and I'm not sure where I'm going wrong. I would really like to get better with JOINS, and picking the brains of all you smartypantses is the best way I can think of to get an explanation for this specific (and I'm sure, pitifully simple for you guys) problem. Thanks for your help, here's where I'm stumped:
"Like before, bring back the movie 'title' and genre 'name' but use
the correct OUTER JOIN to bring back all movies, regardless of whether
the 'genre_id' is set or not."
This is the closest (?) solution that I've come up with, but I'm clearly doing something (maybe a lot) wrong here:
SELECT movies.title, genres.name FROM movies LEFT OUTER JOIN genres ON genres.id;
I had initially tried this (below) but when it didn't work, I decided to cut out the last portion of the statement, since it's mentioned in the requirement criteria that I need a dataset that doesn't care if genre_id is set in the movies table or not:
SELECT movies.title, genres.name FROM movies LEFT OUTER JOIN genres ON movies.genre_id = genres.id;
I know this is total noob stuff, but like I said, I'm learning, and the questions I researched on Stack and on the Internet at large were not necessarily geared for the same problem. I am very grateful to have your expertise and help to draw on. Thank you for taking the time to read this and help out if you choose to do so!

Your solution is correct:
SELECT movies.title, genres.name
FROM movies
LEFT OUTER JOIN genres ON movies.genre_id = genres.id
This is my interpretation:
When you tell "Left join" or "left outer join", in fact,
it's not that "You don't care if genre_id is set in the movies table or not",
but "You want all genres of each movie to be shown, however, you don't care if genre_id is not set in the movies table for some records; just show the movie in these cases [and show 'genre = NULL' for those records]"
generally, in "left join", you want:
all the records of the left table, with their corresponding records in the other table, if any. Otherwise with NULL.
In your example, these two sets of records will be shown:
1- All the movies which have been set to a genre
(give movie.title, Genres.name)
2- All other movies [which do not have a genre, i.e., genre_id = NULL]
(give movie.title, NULL)
Example (with left join):
Title, Genre
--------------
Movie1, Comedy
Movie1, Dramma
Movie1, Family
Movie2, NULL
Movie3, Comedy
Movie3, Dramma
Movie4, Comedy
Movie5, NULL
Example (with inner join):
Title, Genre
--------------
Movie1, Comedy
Movie1, Dramma
Movie1, Family
Movie3, Comedy
Movie3, Dramma
Movie4, Comedy

Your'e specific question was already answered, though:
I'd like to add another perspective about JOIN, that i think will help you understand how to use it in the future (after that, I also recommend you follow this link: SQL JOINS ).
This perspective is from the DB eyes, which is "dumb" and can't guess what you really want it to do for you.
I help it helps and won't confuse you too match:
Lets first understand what a join does (without using any SQL script), and than we'll understand better how to use it.
Say this is a movie list:
Armageddon
Batman
Cinderella
and a list of genres:
Action
Fantasy
Western
When you join both tables, the DB creates a new tables, that for each row in movies table, you'll get all possible rows in genres table, like this:
Armageddon <-> Action
Armageddon <-> Fantasy
Armageddon <-> Western
Batman <-> Action
Batman <-> Fantasy
Batman <-> Western
Cinderella <-> Action
Cinderella <-> Fantasy
Cinderella <-> Western
You can also see that the NEW table row number is 3*3 ([table 1 row number] multiply [table 2 row number]). Can you explain yourself why? If so, lets continue to our second step...
In your DB, you keep track of which movie is which genre (identifying genre by it's id), so lets talk about NEW tables, that look like this and have info about movies genre:
1 - Armageddon - 1
2 - Armageddon - 2
4 - Batman - 1
5 - Batman - 2
6 - Batman - 3
7 - Cinderella - 2
And the genre:
1 - Action
2 - Fantasy
3 - Western
As we've just explained, joining both tables will get you... 18 rows (6*3=18. why? because for each row in movies table, you'll get all possible rows from genres table). I won't write those 18 rows, I hope you get the point...
Each time you call a join (doesn't matter which kind of join: LEFT/RIGHT/OUTER/INNER), the DB creates a new table with all passible options([table 1 row number] multiply [table 2 row number]). Now, you're probably thinking: How does the DB erase the rows I don't want?
First, you define an ON condition. You tell your DB: "please mark for me all rows that meet my condition: movies.genre_id = genres.id (But don't drop any unmarked rows yet!!!)".
Second, you tell your DB which kind of rows you want to drop (or edit!!!): now comes the JOIN kind, which is a bit tricky.
INNER JOIN is easy to understand- just tell the DB: "drop all rows that don't meet my condition: movies.genre_id = genres.id" (and of course show me the updated table, after you've dropped these rows I don't need).
LEFT/RIGHT JOINs are more complicated. Lets start for example with LEFT JOIN. You're telling your DB: "well, in case a row doesn't match my condition: movies.genre_id = genres.id, mark the RIGHT part of my row (meaning, the columns that represent my 2nd table) as null, AND LEAVE THE ROW.
That way, I know you this row in table1, doesn't have a matching row in table2.
In RIGHT JOIN, it's the opposite: you tell the DB, that if your condition isn't met, mark the LEFT side with null.
FULL JOIN tells your DB: "well, from a row that doesn't meet my condition, make 2 rows: 1 that has it's RIGHT part marked as null, and a second that has it's LEFT part marked as null" (this is a bit complicated for understanding for why the hack you'll need that, and you'll hardly need to use FULL JOIN in your first steps, so drop it for now).
In conclusion, my advice for you when you design your JOIN query
first, understand what YOU want to get, see illustration in answer: SQL JOINS.
Then, comes the part when you need to explain to you DB what it should do:
first, tell it which rows it should mark,
than, tell it which rows it should drop/edit.

Related

SQL giving me lines that doesn't exist?

While using this:
SELECT borrowbook.studentusername, borrowbook.schoolbookid,borrowbook.date,borrowbook.deadline, book.title, student.email, student.fname, student.lname
FROM borrowbook, book, student
I get many lines, but in my database I just have four lines in the borrowbook table, and while using this, I get some "lines" that doesn't exist. (Note: this works through php on a website, I cannot seem to make this work in mysql so I think I have done something)
Like that a person that had borrowed one book (line 1 in my list of borrowed books) suddenly has borrowed ten different books that I have not registered anyone to borrow. With date as to when it was loaned, and deadline just taken from one of the four lines I have registered.
Even the same person that is registered to borrow one book, suddenly shows up as if they borrowed it four times with different dates. Dates and deadline are taken from "borrowbook" while different names of students are taken from another table, since they have never been used in the "borrowbook" line.
I have tried this now in different ways and with different content and different tables, but still get many "made up" lines of loans that is not registered.
I know very little, but I am grateful for all help I can get. Articles help as well.
Without joins, you duplicate records. For a better practice, you should use explicit joins instead of implicit ones. If you have student.username and book.id fields, you can do something like this:
SELECT borrowbook.studentusername, borrowbook.schoolbookid,borrowbook.date,borrowbook.deadline,
book.title,
student.email, student.fname, student.lname
FROM borrowbook
INNER JOIN student ON borrowbook.studentusername=student.username
INNER JOIN schoolbook ON borrowbook.schoolbookid=schoolbook.id
INNER JOIN book ON schoolbook.isbn=book.isbn
;
You haven't specified any JOIN conditions in your query, and because of that tables will be CROSS JOIN-ed, i.e., every record from the borrowbook table is paired with every record from the book table which is then paired with every record from the student table. So if you have X, Y and Z number of records in each table respectively, you will get X * Y * Z records as a result.
You probably want to add join conditions such as (I'm just guessing column names):
SELECT borrowbook.studentusername, borrowbook.schoolbookid,borrowbook.date,borrowbook.deadline, book.title, student.email, student.fname, student.lname
FROM borrowbook, book, student
WHERE borrowbook.book_id = book.id and borrowbook.student_id = student.id

How to use a SELECT query to output results from two tables - Query logic to limit output data from one table

I thought this would be a simple query, but I'm obviously having problems with the query logic for outputting the data as desired.
I'd like to query two tables that have a key to relate the tables, and output the results from table1 (one row/record output) and then output the results from table2 (multiple rows/records output).
I've tried UNIQUE, DISTINCT and subqueries to no avail.
Here is the table info. and fields:
tblfilm:
filmID (primary key)
film
film1 (record)
tblfilmimage:
imageID
filmID (foreign key, i.e. primary key from tblfilm)
image
image1 (record)
image2 (record)
And here is the most recent SELECT statement I've tried:
SELECT film, image FROM tblfilm, tblfilmimage
WHERE tblfilm.filmID = 2
filmID = 2 is unique to one film in tblfilm, and = to multiple images in tblfilmimages, that show actors from that film. Thus the overall goal is to be able to output multiple images for one film.
Current Undesired Output:
film1 image1
film1 image2
*I'm trying to eliminate the second film1 record from being output.
Desired Output:
film1 image1 image2
I'm pretty new to using mysql when querying multiple tables. And I'm sure there is an easy solution to this but I just can't seem to wrap my head around the logic. Thus any comments or suggestions appreciated. Thank you.
UPDATE: I now realize my original question logic was flawed ... and my example not very clear. The solution given by those that responded was correct for what I originally wrote. Thank you. I will try to clarify what I am trying to accomplish.
TableA has a list of many films (example):
Wizard of OZ
Gone with the Wind ... etc.
TableB has a list of images (example):
Wizard of Oz Image1
Wizard of Oz Image2
Wizard of Oz Image3 ...etc.
Gone with the Wind Image1
Gone with the Wind Image2 ... etc.
The desired output of the query would be:
Wizard of OZ
Wizard of Oz Image1
Wizard of Oz Image2
Wizard of Oz Image3
Gone with the Wind
Gone with the Wind Image1
Gone with the Wind Image2
... and so on with hundreds of films and hundreds more images.
The solution to my original question showed me how to prevent multiple instances of the film name record within the output, but requires knowing and coding the ImageID for every image that is associated with a film. Every film has at least one image to associate. Most films have many and a varying number of images associated with the film. But the images associated with each film are unique to one film. Thus I need a way to iterate through each film, find all the images associated with each film via the (filmID) which is a common field to each table (TableA - films & TableB - images). I think I will need a variable to store each unique film and a way to associate that film variable with a variable that stores all of the different images that have the same filmID as the film. I will keep trying to figure out how to do that, but if anyone wants to point me in the right direction it would be appreciated. Thank you.
Use correlated subqueries within your queries by common column filmID to produce new columns :
SELECT film,
(SELECT image FROM tblfilmimage WHERE imageID=1 and filmID = f.filmID ) as image1,
(SELECT image FROM tblfilmimage WHERE imageID=2 and filmID = f.filmID ) as image2
FROM tblfilm f
WHERE f.filmID = 2;
or, use conditional aggregation directly as a better option:
SELECT film,
MAX(CASE WHEN i.imageID=1 THEN i.image END ) as image1,
MAX(CASE WHEN i.imageID=2 THEN i.image END ) as image2
FROM tblfilm f
JOIN tblfilmimage i on i.filmID = f.filmID
WHERE f.filmID = 2
GROUP BY film;
Where the values for ImageID columns are just presumed by me.
In your case you had a CROSS JOIN by using comma-seperated tables. So, there were multiple rows generated without correlation through that. This style is not suggested, and considered as old-syntax for SQL Select Statements. Whenever JOIN is needed, prefer using this JOIN syntax which is called ANSI-92 standard, easier to read, understand, and maintain.

MYSQL select query on multiple tables

I'm not seeing a clean way to write this query without subselects which I avoid because they are generally not portable, and harder to read and debug than individual queries.
Table A has exactly 2 foreign keys to table B, which are always different, but always defined. Sort of like:
MARRIAGE_TABLE
M_KEY
LAST_NAME
PERSON_HUSBAND_FK
PERSON_WIFE_FK
PERSON_TABLE
PERSON_KEY
SEX
FIRST_NAME
The PERSON_HUSBAND_FK will always point at a SEX=MALE, and the WIFE_FK will always point at a female. There will always be one of each. (this is in no way a statement on same-sex marriage BTW I'm all for it)..
I want to create a result like:
MARRIAGE HUSBAND WIFE
-------- ------- ----
SMITH TOM KATHY
JONES BILL EVE
My current approach is to get all records from the MARRIAGE TABLE and store them in a hash. Then I augment the hash with names {wife_name} and {husband_name} using 2 more queries using the husband and wife FK's. Then I format and print the hash. It works, but I'm not wild about 3 queries per row.
I'm not sure I ever encountered a table having >1 FK to another table. I've done years of table-design, but I'm not really sure this design even meets normalization. It seems like no, to me. Like they created a many-many without an intermediate table; a cheat?
Just join table PERSON_TABLE twice:
SELECT m.last_name AS marriage, p1.first_name AS husband, p2.first_name AS wife
FROM marriage_table m
INNER JOIN person_table p1 ON p1.person_key = m.person_husband_fk
INNER JOIN person_table p2 ON p2.person_key = m.person_wife_fk

MySQL select users on multiple criteria

My team working on a php/MySQL website for a school project. I have a table of users with typical information (ID,first name, last name, etc). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1' and just use implode(and,array) and add the where if array is not empty, but I figured this is equivalent. If not, I can change that easy enough.
As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, then eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalablity concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.

Is there something more efficient than joining tables in MySQL?

I have a table with entity-attribute-value structure. As an example, as entities I can have different countries. I can have the following attributes: "located in", "has border with", "capital".
Then I want to find all those countries which are "located in Asia" and "has border with Russia". The straightforward way to do that is to join the table with itself using entities are the column for joining and then to use where.
However, if I have 20 rows where Russia in in the entity-column, than in the joint table I will have 20*20=400 rows with Russia as the entity. And it is so for every country. So, the joint table going to be huge.
Will it be not more efficient to use the original table to extract all countries which are located in Asia, then to extract all countries which have border with Russia and then to use those elements which are in both sets of countries?
You shouldn't end up having a huge number of records so this should work
SELECT a.entity,
a.located_in,
a.border
FROM my_table a
WHERE a.border in (SELECT b.entity FROM my_table b WHERE b.entity = 'RUSSIA' )
AND a.located_in = 'ASIA'
You are confusing join with Cartesian product. There could never be more rows in the join then there are in the actual data, the only thing being altered is which elements/rows are taken.
So if you have 20 Russian rows, the table resulting from the join could never have more than 20 Russian entries.
The operation you suggest using is exactly what a join does. Just make sure you have the appropriate indices and let MySQL do the rest.