I have the following tables,
link_books_genres, *table structure -> book_id,genre_id*
genres, *table structure -> genre_id,genre_name*
Given a set of book_ids, I want to form the following result,
result_set structure -> genre_id, genre_name, count(book_id).
I wrote this query,
SELECT one.genre_id,
one.genre_name,
two.count
FROM genres as one,(SELECT genre_id,
count(book_id) as count
FROM link_f2_books_lists GROUP BY genre_id) as two
WHERE one.genre_id = two.genre_id;
I don't know whether that's the best solution. I'd like it optimized if possible, or confirmed as well formed if it already is.
P.S. It's done with ruby on rails, so any rails oriented approach would also be fine.
Your query uses the older implicit join syntax rather than the SQL-92 JOIN syntax. The explicit syntax has been around for 20 years now, so it's time to start using it.
It's also not a good idea to use keywords like COUNT as aliases. You could use cnt or book_count instead:
SELECT one.genre_id,
one.genre_name,
two.cnt
FROM
genres AS one
INNER JOIN
( SELECT genre_id,
COUNT(book_id) AS cnt
FROM link_f2_books_lists
GROUP BY genre_id
) AS two
ON one.genre_id = two.genre_id ;
MySQL usually is a bit faster with COUNT(*), so if book_id cannot be NULL, changing COUNT(book_id) to COUNT(*) will be a small performance improvement.
Of course, you can rewrite the join without the derived table:
SELECT one.genre_id,
one.genre_name,
COUNT(*) AS cnt
FROM
genres AS one
INNER JOIN
link_f2_books_lists AS two
ON one.genre_id = two.genre_id
GROUP BY one.genre_id ;
The above versions (and yours) will not include genres without any books. In both versions, you can change INNER JOIN to LEFT OUTER JOIN so that those genres are shown with a count of 0. But then do use COUNT(two.book_id) and not COUNT(*), for correct results.
That's one good reason to use the JOIN syntax: the change needed is very simple. Try that with your WHERE version!
The LEFT JOIN versions can also be written like this:
SELECT one.genre_id,
one.genre_name,
( SELECT COUNT(*)
FROM link_f2_books_lists AS two
WHERE one.genre_id = two.genre_id
) AS cnt
FROM
genres AS one ;
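To see why COUNT(two.book_id) matters with the outer join, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the helper name counts are made up for illustration; the table and column names are the ones from the queries above.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL tables; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genres (genre_id INTEGER PRIMARY KEY, genre_name TEXT);
    CREATE TABLE link_f2_books_lists (book_id INTEGER, genre_id INTEGER);
    INSERT INTO genres VALUES (1, 'Fantasy'), (2, 'Crime'), (3, 'Poetry');
    INSERT INTO link_f2_books_lists VALUES (10, 1), (11, 1), (12, 2);
""")

def counts(join_type, count_expr):
    """Run the join-without-derived-table query with a given join type and COUNT."""
    sql = f"""
        SELECT one.genre_id, one.genre_name, {count_expr} AS cnt
        FROM genres AS one
        {join_type} link_f2_books_lists AS two
            ON one.genre_id = two.genre_id
        GROUP BY one.genre_id, one.genre_name
        ORDER BY one.genre_id
    """
    return conn.execute(sql).fetchall()

# INNER JOIN drops 'Poetry', which has no books.
inner_rows = counts("INNER JOIN", "COUNT(*)")
# LEFT JOIN keeps 'Poetry', but COUNT(*) counts its NULL-extended row as 1.
left_star_rows = counts("LEFT OUTER JOIN", "COUNT(*)")
# COUNT(two.book_id) ignores the NULL and reports 0.
left_col_rows = counts("LEFT OUTER JOIN", "COUNT(two.book_id)")
```

With this data the genre without books is missing from inner_rows, gets a count of 1 in left_star_rows, and gets the correct 0 in left_col_rows.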
Regarding performance, there is nothing better than testing it yourself. It all depends on the version of MySQL you use (newer versions have a better optimizer that can consider more options when creating an execution plan, and may identify different versions of the query as equivalent), the size of your tables, the indexes you have, the distribution of the data (how many different genres? how many books per genre on average? etc.), your memory (and other MySQL) settings, and probably many other factors that I'm forgetting now.
One piece of advice: an index on (genre_id, book_id) will be useful in most cases, for all the versions.
As general advice, it's usually good to have both a (genre_id, book_id) and a (book_id, genre_id) index on the many-to-many table.
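A rough sketch of the two-index setup, again with Python's sqlite3 standing in for MySQL. The index names idx_genre_book and idx_book_genre and the sample rows are assumptions for illustration. EXPLAIN QUERY PLAN is SQLite-specific (MySQL's counterpart is EXPLAIN), and the plan text differs between engines and versions, so treat it as something to inspect rather than rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link_books_genres (book_id INTEGER NOT NULL, genre_id INTEGER NOT NULL);
    -- One composite index per lookup direction on the many-to-many table
    -- (index names are hypothetical).
    CREATE INDEX idx_genre_book ON link_books_genres (genre_id, book_id);
    CREATE INDEX idx_book_genre ON link_books_genres (book_id, genre_id);
    INSERT INTO link_books_genres VALUES (10, 1), (11, 1), (12, 2), (10, 2);
""")

# Genre -> books direction: counting books per genre.
per_genre = conn.execute(
    "SELECT genre_id, COUNT(*) FROM link_books_genres "
    "GROUP BY genre_id ORDER BY genre_id"
).fetchall()

# Book -> genres direction: listing the genres of one book.
genres_of_book = conn.execute(
    "SELECT genre_id FROM link_books_genres WHERE book_id = ? ORDER BY genre_id",
    (10,),
).fetchall()

# Ask the engine how it would run the second query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT genre_id FROM link_books_genres WHERE book_id = 10"
).fetchall()

# Column 1 of PRAGMA index_list is the index name.
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('link_books_genres')")
)
```

Printing plan on your own data is the quickest way to check whether the optimizer actually picks one of the two indexes.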
SELECT one.genre_id, one.genre_name, count(two.book_id)
FROM genres as one, link_books_genres as two
WHERE one.genre_id=two.genre_id
GROUP BY one.genre_id
Given a table members and a table devices, where each member can have 0-many devices, what would be the fastest way to get all members that have at least one device?
select m.*, md.* from members m
left join (
SELECT count(*) as c, memberId from member_devices d GROUP BY d.memberId
) md ON m.memberId = md.memberId
WHERE md.c > 0
This works, but it seems really slow.
select m.* from members m where
EXISTS (
SELECT 1 FROM member_devices md WHERE m.memberId = md.memberId
)
Also works, and might be a little faster (?)
Anyone out there with any experience? Thanks!
The second option, that uses EXISTS with a correlated subquery, is surely the fastest option here.
Unlike the other option it does not require aggregation and joining. Aggregation is an expensive operation, that usually does not scale well (when the number of records to process increases, the performance tends to dramatically drop).
Also, you don't actually need to count how many records there are in each group. You just want to know if at least one record is available. That's exactly what EXISTS is here for.
For performance in your query, make sure that you have the following indexes (they are probably already there if you properly implemented the relationship with a foreign key):
members(memberId)
member_devices(memberId)
"INNER JOIN" returns rows when there is a match in both tables. You can do :
SELECT m.*, md.*
FROM members m
INNER JOIN devices md ON m.memberId = md.memberId
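One thing to keep in mind when comparing the two approaches: a plain INNER JOIN returns one row per matching device, while EXISTS returns each member once. A small sketch with Python's sqlite3 as a stand-in database; the columns deviceId and name, the index name, and the sample rows are assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (memberId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE member_devices (deviceId INTEGER PRIMARY KEY, memberId INTEGER);
    CREATE INDEX idx_member_devices_member ON member_devices (memberId);
    -- Ann has two devices, Bob has none, Cid has one.
    INSERT INTO members VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO member_devices VALUES (100, 1), (101, 1), (102, 3);
""")

# EXISTS only tests for at least one matching device, so no member repeats.
exists_rows = conn.execute("""
    SELECT m.memberId, m.name
    FROM members m
    WHERE EXISTS (SELECT 1 FROM member_devices md WHERE md.memberId = m.memberId)
    ORDER BY m.memberId
""").fetchall()

# INNER JOIN produces one output row per (member, device) pair.
join_rows = conn.execute("""
    SELECT m.memberId, m.name
    FROM members m
    INNER JOIN member_devices md ON m.memberId = md.memberId
    ORDER BY m.memberId
""").fetchall()
```

Here Ann shows up twice in join_rows, so the join version would need DISTINCT (or a GROUP BY) to answer "members with at least one device".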
I'm practicing some SQL (new to this),
I have the next tables:
screening_occapancy(idscreening,row,col,idclient)
screening(screeningid,idmovie,idtheater,screening_time)
I'm trying to create a query that finds which clients watched all the movies in the "screening" table and shows their ID (idclient).
This is what I've written (which doesn't work):
select idclient from screening_occapancy p where not exists
(select screeningid from screening where screeningid=p.idscreening)
I know it's probably not that good, so please also explain what I am doing wrong.
P.S. My task is to use EXISTS / NOT EXISTS while doing it...
Thanks!
Your query is basically fine, although the select distinct is unnecessary in the subquery:
select p.idclient
from screening_occapancy p
where not exists (select 1
from screening s
where s.screeningid = p.idscreening
);
Notes:
You can select anything in the exists subquery. Selecting a column is misleading.
Use table aliases and use them for all column references, particularly in a correlated subquery.
If you are designing the tables, I would advise you to give the primary key and foreign key the same name (screeningid or idscreening, but not both).
EDIT:
If you want clients who watched all movies, then I would approach this as:
select p.idclient
from screening_occapancy p
group by p.idclient
having count(distinct p.idscreening) = (select count(*) from screening);
Why don't you count the number of movies in the screening table, load it into a variable and check the results of your query against the variable?
load number of movies into variable (identified by idmovie):
SELECT count(DISTINCT(idmovie)) FROM screening INTO @number_of_movies;
check the results of your query against the variable:
SELECT A.idclient,
count(DISTINCT(idmovie)) AS number_of_movies_watched
FROM screening_occapancy A
INNER JOIN screening B
ON(A.idscreening = B.screeningid)
GROUP BY A.idclient
HAVING number_of_movies_watched = @number_of_movies ;
If you want to find all clients, that attended all screenings, replace idmovie with screeningid.
Even someone relatively new to MySQL can get his head around this query. The "not exists"-approach is more difficult to understand.
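For comparison, here is a sketch of both ways to find clients who watched every movie: the counting approach and a double NOT EXISTS ("there is no movie this client has not seen"). The double NOT EXISTS query is not from the answers above; it is one possible formulation of the requirement in the question. It runs on Python's sqlite3 with invented sample rows, and row is quoted only because it can clash with a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE screening (screeningid INTEGER PRIMARY KEY, idmovie INTEGER,
                            idtheater INTEGER, screening_time TEXT);
    CREATE TABLE screening_occapancy (idscreening INTEGER, "row" INTEGER,
                                      col INTEGER, idclient INTEGER);
    -- Two movies: movie 1 is screened twice, movie 2 once.
    INSERT INTO screening VALUES (1, 1, 1, '18:00'), (2, 1, 2, '20:00'), (3, 2, 1, '22:00');
    -- Client 7 sees both movies, client 8 only movie 1, client 9 only movie 2.
    INSERT INTO screening_occapancy VALUES
        (1, 1, 1, 7), (3, 1, 1, 7),
        (1, 1, 2, 8), (2, 1, 1, 8),
        (3, 1, 2, 9);
""")

# Counting approach: distinct movies watched must equal distinct movies offered.
by_count = conn.execute("""
    SELECT o.idclient
    FROM screening_occapancy o
    JOIN screening s ON s.screeningid = o.idscreening
    GROUP BY o.idclient
    HAVING COUNT(DISTINCT s.idmovie) = (SELECT COUNT(DISTINCT idmovie) FROM screening)
    ORDER BY o.idclient
""").fetchall()

# Double NOT EXISTS: keep a client if no movie exists that the client has not seen.
by_not_exists = conn.execute("""
    SELECT DISTINCT c.idclient
    FROM screening_occapancy c
    WHERE NOT EXISTS (
        SELECT 1 FROM screening m
        WHERE NOT EXISTS (
            SELECT 1
            FROM screening_occapancy o
            JOIN screening s ON s.screeningid = o.idscreening
            WHERE o.idclient = c.idclient AND s.idmovie = m.idmovie
        )
    )
    ORDER BY c.idclient
""").fetchall()
```

Both queries return only client 7 on this data; client 8 attended two screenings but of the same movie, which is why the count has to be over distinct idmovie values.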
I have what may be a basic performance question. I've done a lot of SQL queries, but not much in terms of complex inner joins and such. So, here it is:
I have a database with 4 tables, countries, territories, employees, and transactions.
The transactions table links up with the employees and countries tables. The employees table links up with the territories table. In order to produce a required report, I'm running a PHP script that processes a SQL query against a MySQL database.
SELECT trans.transactionDate, agent.code, agent.type, trans.transactionAmount, agent.territory
FROM transactionTable as trans
INNER JOIN
(
SELECT agent1.code as code, agent1.type as type, territory.territory as territory FROM agentTable as agent1
INNER JOIN territoryTable as territory
ON agent1.zip=territory.zip
) AS agent
ON agent.code=trans.agent
ORDER BY trans.agent
There are about 50,000 records in the agent table, and over 200,000 in the transaction table. The other two are relatively tiny. It's taking about 7 minutes to run this query. And I haven't even inserted the fourth table yet, which needs to relate a field in the transactionTable (country) to a field in the countryTable (country) and return a field in the countryTable (region).
So, two questions:
Where would I logically put the connection between the transactionTable and the countryTable?
Can anyone suggest a way to speed this up?
Thanks.
Your query should be equivalent to this:
SELECT tx.transactionDate,
a.code,
a.type,
tx.transactionAmount,
t.territory
FROM transactionTable tx,
agentTable a,
territoryTable t
WHERE tx.agent = a.code
AND a.zip = t.zip
ORDER BY tx.agent
or to this if you like to use JOIN:
SELECT tx.transactionDate,
a.code,
a.type,
tx.transactionAmount,
t.territory
FROM transactionTable tx
JOIN agentTable a ON tx.agent = a.code
JOIN territoryTable t ON a.zip = t.zip
ORDER BY tx.agent
For this to run fast, you need the following indexes on your tables:
CREATE INDEX transactionTable_agent ON transactionTable(agent);
CREATE INDEX territoryTable_zip ON territoryTable(zip);
CREATE INDEX agentTable_code ON agentTable(code);
(basically any field that is part of WHERE or JOIN constraint should be indexed).
That said, your table structure looks suspicious in the sense that it is joined by apparently non-unique fields like zip code. You really want to join by more unique entities, like agent id, transaction id and so on. Otherwise expect your queries to generate a lot of redundant data and be really slow.
One more note: INNER JOIN is equivalent to simply JOIN, so there is no reason to type the redundant keyword.
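A small sketch of both points, using Python's sqlite3 as a stand-in for MySQL. It adds the country lookup as one more JOIN on the country field mentioned in the question, and then shows how a non-unique zip multiplies the result rows. The column layout beyond the fields named in the question, and all sample rows, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactionTable (transactionDate TEXT, agent TEXT,
                                   transactionAmount REAL, country TEXT);
    CREATE TABLE agentTable (code TEXT, type TEXT, zip TEXT);
    CREATE TABLE territoryTable (zip TEXT, territory TEXT);
    CREATE TABLE countryTable (country TEXT, region TEXT);
    CREATE INDEX transactionTable_agent ON transactionTable(agent);
    CREATE INDEX territoryTable_zip ON territoryTable(zip);
    CREATE INDEX agentTable_code ON agentTable(code);
    INSERT INTO agentTable VALUES ('A1', 'field', '10001');
    INSERT INTO territoryTable VALUES ('10001', 'North');
    INSERT INTO countryTable VALUES ('US', 'Americas');
    INSERT INTO transactionTable VALUES
        ('2020-01-01', 'A1', 50.0, 'US'),
        ('2020-01-02', 'A1', 75.0, 'US');
""")

# The country lookup is just one more JOIN hanging off the transaction table.
REPORT_SQL = """
    SELECT tx.transactionDate, a.code, a.type, tx.transactionAmount,
           t.territory, c.region
    FROM transactionTable tx
    JOIN agentTable a     ON tx.agent = a.code
    JOIN territoryTable t ON a.zip = t.zip
    JOIN countryTable c   ON tx.country = c.country
    ORDER BY tx.agent, tx.transactionDate
"""

rows_unique_zip = conn.execute(REPORT_SQL).fetchall()

# A second territory row for the same zip makes the join key non-unique.
conn.execute("INSERT INTO territoryTable VALUES ('10001', 'East')")
rows_duplicate_zip = conn.execute(REPORT_SQL).fetchall()

# Summing the amount column shows how the duplicates inflate totals.
total_unique = sum(r[3] for r in rows_unique_zip)
total_duplicate = sum(r[3] for r in rows_duplicate_zip)
```

With one territory per zip the two transactions yield two rows; with two territories for the same zip they yield four, and any total computed from the report doubles.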
In the article Why Arel?, the author poses the problem:
Suppose we have a users table and a photos table and we want to select all user data and a *count* of the photos they have created.
His proposed solution (with a line break added) is
SELECT users.*, photos_aggregation.cnt
FROM users
LEFT OUTER JOIN (SELECT user_id, count(*) as cnt FROM photos GROUP BY user_id)
AS photos_aggregation
ON photos_aggregation.user_id = users.id
When I attempted to write such a query, I came up with
select users.*, if(count(photos.id) = 0, null, count(photos.id)) as cnt
from users
left join photos on photos.user_id = users.id
group by users.id
(The if() in the column list is just to get it to behave the same when a user has no photos.)
The author of the article goes on to say
Only advanced SQL programmers know how to write this (I’ve often asked this question in job interviews I’ve never once seen anybody get it right). And it shouldn’t be hard!
I don't consider myself an "advanced SQL programmer", so I assume I'm missing something subtle. What am I missing?
I believe your version would produce an error, at least in some database engines. In MSSQL your select would generate the error "[Column Name] is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause." This is because your select list can only contain columns that appear in the GROUP BY clause or inside an aggregate such as COUNT.
You could modify your version to select users.id, count(photos.id) and it would work, but it would not be the same result as his query.
I would not say you have to be particularly advanced to come up with a working solution (or the specific solution he came up with), but it is necessary to do the grouping in a separate query, either in the join or as @ron tornambe suggests.
In most DBMSs (MySQL and Postgres are exceptions) the version in your question would be invalid.
You would need to write the query which does not use the derived table as
select users.*, CASE WHEN count(photos.id) > 0 THEN count(photos.id) END as cnt
from users
left join photos on photos.user_id = users.id
group by users.id, users.name, users.email /* and so on*/
MySQL allows you to select non-aggregated items that are not in the group by list, but this is only safe if they are functionally dependent on the column(s) in the group by.
Whilst the group by list is more verbose without the derived table I would expect most optimisers to be able to transform one to the other anyway. Certainly in SQL Server if it sees you are grouping by the PK and some other columns it doesn't actually do group by comparisons on those other columns.
Some discussion about this MySQL behaviour vs standard SQL is in Debunking GROUP BY myths
Maybe the author of the article is wrong. Your solution works as well, and it may very well be faster.
Personally, I would drop the if altogether. If you want to count the number of pictures, it makes sense that 'no pictures' results in 0 rather than null.
As an alternative, you can also write a correlated sub-query:
SELECT u.*, (SELECT Count(*) FROM photos p WHERE p.user_id=u.id) as cnt
FROM users u
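To compare the variants side by side, here is a sketch that runs the derived-table join, the plain LEFT JOIN with GROUP BY, and the correlated subquery against the same data. It uses Python's sqlite3 as a stand-in engine; the name column and the sample rows are assumptions, and explicit columns are selected instead of users.* so the GROUP BY list is complete in any engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER);
    -- ann has two photos, bob has none.
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO photos VALUES (1, 1), (2, 1);
""")

# Article's version: aggregate in a derived table, then outer join to it.
derived = conn.execute("""
    SELECT users.id, users.name, photos_aggregation.cnt
    FROM users
    LEFT OUTER JOIN (SELECT user_id, COUNT(*) AS cnt FROM photos GROUP BY user_id)
        AS photos_aggregation
        ON photos_aggregation.user_id = users.id
    ORDER BY users.id
""").fetchall()

# Outer join first, then group; every selected column is in the GROUP BY list.
grouped = conn.execute("""
    SELECT users.id, users.name, COUNT(photos.id) AS cnt
    FROM users
    LEFT JOIN photos ON photos.user_id = users.id
    GROUP BY users.id, users.name
    ORDER BY users.id
""").fetchall()

# Correlated subquery in the select list.
correlated = conn.execute("""
    SELECT u.id, u.name,
           (SELECT COUNT(*) FROM photos p WHERE p.user_id = u.id) AS cnt
    FROM users u
    ORDER BY u.id
""").fetchall()
```

The only difference in the results is the user without photos: the derived-table join yields NULL for cnt, while the other two yield 0.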
I'm using H2, and I have a database of books (table Entries) and authors (table Persons), connected through a many-to-many relationship, itself stored in a table Authorship.
The database is fairly large (900'000+ persons and 2.5M+ books).
I'm trying to efficiently select the list of all books authored by at least one author whose name matches a pattern (LIKE '%pattern%'). The trick here is that the pattern should severely restrict the number of matching authors, and each author has a reasonably small number of associated books.
I tried two queries:
SELECT p.*, e.title FROM (SELECT * FROM Persons WHERE name LIKE '%pattern%') AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId;
and:
SELECT p.*, e.title FROM Persons AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId
WHERE p.name like '%pattern%';
I expected the first one to be much faster, as I'm joining a much smaller (sub)table of authors, however they both take as long. So long in fact that I can manually decompose the query into three selects and find the result I want faster.
When I try to EXPLAIN the queries, I observe that indeed they are very similar (a full join on the tables and only then a WHERE clause), so my question is: how can I achieve a fast select, that relies on the fact that the filter on authors should result in a much smaller join with the other two tables?
Note that I tried the same queries with MySQL and got results in line with what I expected (selecting first is much faster).
Thank you.
OK, here is something that finally worked for me.
Instead of running the query:
SELECT p.*, e.title FROM (SELECT * FROM Persons WHERE name LIKE '%pattern%') AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId;
...I ran:
SELECT title FROM Entries e WHERE id IN (
SELECT entryId FROM Authorship WHERE authorId IN (
SELECT id FROM Persons WHERE name LIKE '%pattern%'
)
)
It's not exactly the same query, because now I don't get the author id as a column in the result, but that does what I wanted: take advantage of the fact that the pattern restricts the number of authors to a very small value to search only through a small number of entries.
What is interesting is that this worked great with H2 (much, much faster than the join), but with MySQL it is terribly slow. (This has nothing to do with the LIKE '%pattern%' part, see comments in other answers.) I suppose queries are optimized differently.
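Another difference between the two forms, beyond the missing author columns, is how they treat books with several matching authors. A minimal sketch with Python's sqlite3 (not H2 or MySQL, so it says nothing about speed); the sample rows and the pattern value are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Entries (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Authorship (authorId INTEGER, entryId INTEGER);
    INSERT INTO Persons VALUES (1, 'Jane Austen'), (2, 'Jane Doe'), (3, 'Mark Twain');
    INSERT INTO Entries VALUES (1, 'Emma'), (2, 'Joint Work'), (3, 'Roughing It');
    -- 'Joint Work' has two authors, both matching the pattern below.
    INSERT INTO Authorship VALUES (1, 1), (1, 2), (2, 2), (3, 3);
""")

pattern = "%Jane%"

# Join form: one row per (matching author, book) pair.
join_titles = [row[0] for row in conn.execute("""
    SELECT e.title
    FROM Persons AS p
    INNER JOIN Authorship AS au ON au.authorId = p.id
    INNER JOIN Entries AS e ON e.id = au.entryId
    WHERE p.name LIKE ?
    ORDER BY e.title
""", (pattern,))]

# Nested IN form: each book appears at most once.
in_titles = [row[0] for row in conn.execute("""
    SELECT title FROM Entries e WHERE id IN (
        SELECT entryId FROM Authorship WHERE authorId IN (
            SELECT id FROM Persons WHERE name LIKE ?
        )
    )
    ORDER BY title
""", (pattern,))]
```

The join lists 'Joint Work' twice (once per matching author), while the IN version lists it once, so the two are only interchangeable if duplicates don't matter or you add DISTINCT to the join.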
SELECT * FROM Persons WHERE name LIKE '%pattern%' will always take a long time on a 900,000+ row table no matter what you do, because when your pattern '%pattern%' starts with a % MySQL can't use any indexes and has to do a full table scan. You should look into full-text indexes and functions.
Well, since the like condition starts with a wildcard it will result in a full table scan which is always slow, no internal caching can take place.
If you want to do full-text searches, MySQL is not the best bet you have. Look into other software (Solr for instance) to solve this kind of problem.