MySQL LEFT JOIN using OR is very slow

table users has about 80,000 records
table friends has about 900,000 records
There are 104 records with firstname = 'verena'
This query (heavily simplified, so its original purpose is no longer visible) is very slow (> 20 seconds):
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id OR
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
However, if I remove the OR inside the JOIN, the query is instant, so either:
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id
)
WHERE users.firstname = 'verena';
returning 1487 results or
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
returning 2849 results
execute instantly (0.001s)
If I remove everything else and go straight for
SELECT 1 FROM friends WHERE user_id = xxx OR friend_id = xxx
or
SELECT id FROM users WHERE firstname = 'verena';
these queries are also instant.
Indexes for friends.friend_id, friends.user_id and users.firstname are set.
I don't understand why the top query is slow when every part of it, taken apart and executed in isolation, is blazing fast.
My only suspicion now is that MariaDB is first joining ALL users with friends and only after that filters for WHERE firstname = 'verena', instead of the wanted behavior of first filtering for firstname = 'verena' and then joining the results with the friends table, but even then I don't see why removing the OR inside the JOIN condition would make it fast.
I tested this on 2 different machines, one running MariaDB 10.3.22 with Galera cluster and one with MariaDB 10.4.12 without Galera cluster
What is the technical reason why the top query is having such a huge slowdown and how do I fix this without having to split the SQL into several statements?
Edit:
Here is the EXPLAIN output for it, telling it's not using any index for the friends table and scanning through all records as correctly stated in Barmar's comment:
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
| 1 | SIMPLE | users | ref | firstname | firstname | 768 | const | 104 | Using where; Using index |
| 1 | SIMPLE | friends | ALL | user_id,friend_id | NULL | NULL | NULL | 902853 | Range checked for each record (index map: 0x6) |
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
Is there any way to make SQL use both indexes or do I just have to accept this limitation and work around it using for example Barmar's suggestion?

MySQL is not usually able to use an index when you use OR to join with different columns. It can only use one index per table in a join, so if it uses the friends.user_id index, it won't use friends.friend_id, and vice versa.
The solution is to do the two fast queries and combine them with UNION.
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id
)
WHERE users.firstname = 'verena'
UNION
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
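As a rough check that the UNION rewrite returns the same users, here is a minimal sketch that builds a tiny stand-in schema in SQLite through Python's sqlite3 module. This is not MariaDB, and the rows are invented for illustration. SQLite's planner differs from MariaDB's, so it only illustrates that the results match, not the timing. Note that UNION collapses duplicates, whereas the OR join returns one row per matching friends row, which is why the comparison uses DISTINCT ids.

```python
import sqlite3

# In-memory stand-in for the users/friends tables (invented sample rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT);
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
CREATE INDEX friends_user_id ON friends (user_id);
CREATE INDEX friends_friend_id ON friends (friend_id);
INSERT INTO users VALUES (1, 'verena'), (2, 'verena'), (3, 'anna'), (4, 'verena');
INSERT INTO friends VALUES (1, 3), (3, 2), (2, 1), (3, 3);
""")

# Original shape: one join with OR across two different columns.
or_join = """
SELECT DISTINCT users.id FROM users
LEFT JOIN friends ON (users.id = friends.user_id OR users.id = friends.friend_id)
WHERE users.firstname = 'verena'
"""

# Rewritten shape: two single-column joins combined with UNION.
union_join = """
SELECT users.id FROM users
LEFT JOIN friends ON users.id = friends.user_id
WHERE users.firstname = 'verena'
UNION
SELECT users.id FROM users
LEFT JOIN friends ON users.id = friends.friend_id
WHERE users.firstname = 'verena'
"""

or_ids = sorted(row[0] for row in conn.execute(or_join))
union_ids = sorted(row[0] for row in conn.execute(union_join))
print(or_ids, union_ids)
```

If you need one row per friendship rather than one row per user, the UNION has to select columns from friends as well, because UNION deduplicates on the selected columns only.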

Related

SELECT using three tables w/ 1000+ entries

For transaction listing I need to provide the following columns:
log_out.timestamp
items.description
log_out.qty
category.name
storage.name
log_out.dnr ( Representing the users id )
Table structure from log_out looks like this:
| id | timestamp | storageid | itemid | qty | categoryid | dnr |
| | | | | | | |
| 1 | ........ | 2 | 23 | 3 | 999 | 123 |
As one could guess, I only store the corresponding IDs from other tables in this table. Note: log_out.id is the primary key in this table.
To get the corresponding strings, ints, or whatever back, I tried two queries.
Approach 1
SELECT i.description, c.name, s.name as sname, l.*
FROM items i, categories c, storages s, log_out l
WHERE l.itemid = i.id AND l.storageid = s.id AND l.categoryid = c.id
ORDER BY l.id DESC
Approach 2
SELECT log_out.id, items.description, storages.name, categories.name AS cat, timestamp, dnr, qty
FROM log_out
INNER JOIN items ON log_out.itemid = items.id
INNER JOIN storages ON log_out.storageid = storages.id
INNER JOIN categories ON log_out.categoryid = categories.id
ORDER BY log_out.id DESC
They both work fine on my development machine, which has approximately 99 dummy transactions stored in log_out. The DB on the main server has something like 1100+ transactions stored in the table. And that's where the trouble begins. No matter which of these two approaches I run on the main machine, it always returns 0 rows w/o any error *sigh*.
First I thought, it's because the main machine uses MariaDB instead of MySQL. But after I imported the remote's log_out table to my dev-machine, it does the same as the main machine -> return 0 rows w/o error.
You guys got any idea what's going on ?
If the table has the data then it probably has something to do with JOIN and related records in corresponding tables. I would start with log_out table and incrementally add the other tables in the JOIN, e.g.:
SELECT *
FROM log_out;
SELECT *
FROM log_out
INNER JOIN items ON log_out.itemid = items.id;
SELECT *
FROM log_out
INNER JOIN items ON log_out.itemid = items.id
INNER JOIN storages ON log_out.storageid = storages.id;
SELECT *
FROM log_out
INNER JOIN items ON log_out.itemid = items.id
INNER JOIN storages ON log_out.storageid = storages.id
INNER JOIN categories ON log_out.categoryid = categories.id;
I would execute the queries one by one and see which one returns 0 records. The join added in that query is the one with the data discrepancy.
Your queries look fine to me, which makes me think that it is probably something unexpected in the data. Most likely the ids in your joins are not maintained properly (do all of them have a foreign key constraint?). I would dig around the data, for example with SELECT COUNT(*) FROM items WHERE id IN (SELECT itemid FROM log_out), and see whether the results make sense. Sorry I can't offer more advice, but I would be interested in hearing whether the problem is in the data itself.
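One way to dig around the data is to count, per lookup table, the log_out rows that have no matching id. The sketch below does this with LEFT JOIN ... IS NULL in SQLite via Python's sqlite3. The sample rows and the helper name count_orphans are made up, and the missing category 999 is only an assumed cause for illustration, not a diagnosis of the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE storages (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE log_out (id INTEGER PRIMARY KEY, storageid INTEGER,
                      itemid INTEGER, qty INTEGER, categoryid INTEGER, dnr INTEGER);
INSERT INTO items VALUES (23, 'widget');
INSERT INTO storages VALUES (2, 'main');
INSERT INTO categories VALUES (1, 'tools');
-- categoryid 999 has no row in categories, so an INNER JOIN drops these rows
INSERT INTO log_out VALUES (1, 2, 23, 3, 999, 123), (2, 2, 23, 1, 999, 456);
""")

def count_orphans(column, lookup_table):
    """Count log_out rows whose `column` has no matching id in `lookup_table`."""
    sql = (
        f"SELECT COUNT(*) FROM log_out l "
        f"LEFT JOIN {lookup_table} t ON l.{column} = t.id "
        f"WHERE t.id IS NULL"
    )
    return conn.execute(sql).fetchone()[0]

orphans = {
    "items": count_orphans("itemid", "items"),
    "storages": count_orphans("storageid", "storages"),
    "categories": count_orphans("categoryid", "categories"),
}

# The full three-way INNER JOIN returns nothing because every row is orphaned
# in at least one lookup table.
joined = conn.execute("""
SELECT COUNT(*) FROM log_out l
JOIN items i ON l.itemid = i.id
JOIN storages s ON l.storageid = s.id
JOIN categories c ON l.categoryid = c.id
""").fetchone()[0]
print(orphans, joined)
```

A non-zero orphan count for a lookup table points at the join that empties the result.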

How to optimize this MySQL query? (CROSS JOIN, subquery)

I have a challenging question for MySQL experts.
I have a users permissions system with 4 tables:
users (id | email | created_at)
permissions (id | responsibility_id | key | weight)
permission_user (id | permission_id | user_id)
responsibilities (id | key | weight)
Users can have any number of permissions assigned and any permission can be granted to any number of users (many to many). Responsibilities are like groups for permissions, each permission belongs to exactly one responsibility. For example, one permission is called update with responsibility of customers. Another one would be delete with orders responsibility.
I need to get a full map of permissions per user, but only for those who have at least one permission granted. Results should be ordered by:
User's number of permissions from most to least
User's created_at column, oldest first
Responsibility's weight
Permission's weight
Example result set:
user_id | responsibility | permission | granted
-----------------------------------------------
5 | customers | create | 1
5 | customers | update | 1
5 | orders | create | 1
5 | orders | update | 1
2 | customers | create | 0
2 | customers | delete | 0
2 | orders | create | 1
2 | orders | update | 0
Let's say I have 10 users in database, but only two of them have any permissions granted. There are 4 permissions in total:
create of customers responsibility
update of customers responsibility
create of orders responsibility
update of orders responsibility.
That's why we have 8 records in results (2 users with any permission × 4 permissions). User with id = 5 is displayed first, because he's got more permissions. If there were any draws, the ones with older created_at date would go first. Permissions are always sorted by the weight of their responsibility and then by their own weight.
My question is: how do I write an optimal query for this case? I have already written one myself and it works well:
SELECT `users`.`id` AS `user_id`,
`responsibilities`.`key` AS `responsibility`,
`permissions`.`key` AS `permission`,
!ISNULL(`permission_user`.`id`) AS `granted`
FROM `users`
CROSS JOIN `permissions`
JOIN `responsibilities`
ON `responsibilities`.`id` = `permissions`.`responsibility_id`
LEFT JOIN `permission_user`
ON `permission_user`.`user_id` = `users`.`id`
AND `permission_user`.`permission_id` = `permissions`.`id`
WHERE (
SELECT COUNT(*)
FROM `permission_user`
WHERE `user_id` = `users`.`id`
) > 0
ORDER BY (
SELECT COUNT(*)
FROM `permission_user`
WHERE `user_id` = `users`.`id`
) DESC,
`users`.`created_at` ASC,
`responsibilities`.`weight` ASC,
`permissions`.`weight` ASC
The problem is that I'm using the same subquery twice.
Can I do better? I count on you, MySQL experts!
--- EDIT ---
Thanks to Gordon Linoff's comment I made it use HAVING clause:
SELECT `users`.`email`,
`responsibilities`.`key`,
`permissions`.`key`,
!ISNULL(`permission_user`.`id`) as `granted`,
(
SELECT COUNT(*)
FROM `permission_user`
WHERE `user_id` = `users`.`id`
) AS `total_permissions`
FROM `users`
CROSS JOIN `permissions`
JOIN `responsibilities`
ON `responsibilities`.`id` = `permissions`.`responsibility_id`
LEFT JOIN `permission_user`
ON `permission_user`.`user_id` = `users`.`id`
AND `permission_user`.`permission_id` = `permissions`.`id`
HAVING `total_permissions` > 0
ORDER BY `total_permissions` DESC,
`users`.`created_at` ASC,
`responsibilities`.`weight` ASC,
`permissions`.`weight` ASC
I was surprised to discover that HAVING can go alone without GROUP BY.
Can it now be improved for better performance?
Probably the most efficient way to do this is:
SELECT u.email, r.`key`, p.`key`,
!ISNULL(pu.id) as `granted`
FROM (SELECT u.*,
(SELECT COUNT(*) FROM `permission_user` pu WHERE pu.user_id = u.id
) AS `total_permissions`
FROM `users` u
) u CROSS JOIN
permissions p JOIN
responsibilities r
ON r.id = p.responsibility_id LEFT JOIN
permission_user pu
ON pu.user_id = u.id AND
pu.permission_id = p.id
WHERE u.total_permissions > 0
ORDER BY u.total_permissions DESC,
u.created_at ASC,
r.weight ASC,
p.weight ASC;
This will run the subquery once per user, rather than once per user/permission combination (as both the modified query and the original query were doing). This has two costs. The first is the materialization of the subquery, so the data in the users table has to be read and written again. Probably not a big deal, given everything else in the query. The second is the loss of indexes on the users table. Once again, with a cross join, indexes are (probably) not being used, so this is also minor.
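A related variant, which is not the query from the answer above, computes the per-user counts once in a grouped derived table and inner-joins it. The inner join also drops users without any permissions, so no separate filter is needed. The sketch runs it in SQLite through Python's sqlite3 with invented sample data. MySQL may plan the derived table differently, so compare EXPLAIN output before relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
CREATE TABLE responsibilities (id INTEGER PRIMARY KEY, `key` TEXT, weight INTEGER);
CREATE TABLE permissions (id INTEGER PRIMARY KEY, responsibility_id INTEGER,
                          `key` TEXT, weight INTEGER);
CREATE TABLE permission_user (id INTEGER PRIMARY KEY, permission_id INTEGER,
                              user_id INTEGER);
INSERT INTO users VALUES (2, 'b@example.com', '2020-01-01'),
                         (5, 'e@example.com', '2021-01-01'),
                         (7, 'g@example.com', '2019-01-01');
INSERT INTO responsibilities VALUES (1, 'customers', 1), (2, 'orders', 2);
INSERT INTO permissions VALUES (1, 1, 'create', 1), (2, 1, 'update', 2),
                               (3, 2, 'create', 1), (4, 2, 'update', 2);
-- user 5 holds all four permissions, user 2 holds one, user 7 holds none
INSERT INTO permission_user (permission_id, user_id)
VALUES (1, 5), (2, 5), (3, 5), (4, 5), (3, 2);
""")

query = """
SELECT u.id, r.`key`, p.`key`, pu.id IS NOT NULL AS granted
FROM users u
JOIN (SELECT user_id, COUNT(*) AS total_permissions
      FROM permission_user
      GROUP BY user_id) t ON t.user_id = u.id
CROSS JOIN permissions p
JOIN responsibilities r ON r.id = p.responsibility_id
LEFT JOIN permission_user pu
       ON pu.user_id = u.id AND pu.permission_id = p.id
ORDER BY t.total_permissions DESC, u.created_at ASC, r.weight ASC, p.weight ASC
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

An index on permission_user(user_id, permission_id) would likely help both the grouped subquery and the LEFT JOIN.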

Normalization made my queries slower

Before normalization I had a column called genre, and it contained values like "Action, Thriller, Comedy"
Now I have normalized the genre column by creating genre and movie2genre tables.
The problem now is my queries are more complicated and are actually slower
These two queries basically search for movies that are action and thriller
Old query
select title, genre from movie where genre like '%action%' and genre like '%thriller%'
0.062 sec duration / 0.032 sec fetch
New Query
SELECT movie.title, movie.genre
FROM Movie
Where
EXISTS (
select *
from movie2genre
JOIN Genre on Genre.id = movie2genre.GenreId
where Movie.id = movie2genre.MovieId
and genre in ('action', 'thriller')
)
0.328 sec duration / 0.078 sec fetch
Am I doing something wrong?
More info:
Movie
+-------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Title | varchar(345) | YES | | NULL | |
ETC....
Genre
+---------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+----------------+
| genreid | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(50) | YES | | NULL | |
+---------+-------------+------+-----+---------+----------------+
movie2genre
+---------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| movieid | int(11) | YES | | NULL | |
| genreid | int(11) | YES | | NULL | |
+---------+---------+------+-----+---------+-------+
Try this without correlated subqueries (please check the execution plans of both queries if you are concerned about performance). Also make sure you have proper indexes on your new tables.
SELECT *
FROM movie2genre mg, Genre g, Movie m
WHERE m.id = mg.MovieId
AND g.id = mg.GenreId
AND g.genre in ('action', 'thriller')
First, your two queries are not the same. The newer version does an or rather than an and, so the difference in time could simply be returning a larger result set. In addition, your new query refers to movie.genre, a column that wouldn't exist in a normalized database.
You seem to be asking for:
select m.title
from Movie m
where exists (select 1
from movie2genre m2g JOIN
Genre g
on g.id = m2g.GenreId
where m.id = m2g.MovieId and g.genre = 'action'
) and
exists (select 1
from movie2genre m2g JOIN
Genre g
on g.id = m2g.GenreId
where m.id = m2g.MovieId and g.genre = 'thriller'
);
Admittedly, you probably will not think this solves the "complication" problem. Leaving that aside, you need to have indexes for this to work well. Do you have the "obvious" indexes of: movie2genre(MovieId, GenreId) and genre(GenreId)?
Second, your data is not particularly large (judging by the duration for the queries). So, a full table scan may be more efficient than the joining and filtering with these tables. As the database grows, the normalized approach will often be faster.
A more equivalent query is:
select m.title, group_concat(g.genre)
from movies m join
movie2genre m2g
on m.movieid = m2g.movieid join
genre g
on g.genreid = m2g.genreid
group by m.title
having sum(g.genre = 'action') > 0 and sum(g.genre = 'thriller') > 0;
Because of the nature of your particular query -- you need to fetch all genres on a movie so you cannot filter on them -- this particular query is probably going to perform less well than the unnormalized version.
By the way, normalization is more about keeping data consistent than about speeding queries. Normalized databases require more join operations. Indexes can help performance, but there is still work in doing the join. In some cases, the tables themselves are bigger than the unnormalized forms. And, normalized databases may require aggregation where none is required for non-normalized database. All of these can affect performance, which is why in many decision support architectures, the central database is normalized but the application-specific databases are not.
Indexes are vitally important when doing joins (and sub queries tend to lose the indexing).
There are 2 ways I would suggest trying.
First, join Movie to movie2genre and Genre once for each genre you are checking. Well indexed, this should be fast.
SELECT movie.title,
movie.genre
FROM Movie
INNER JOIN movie2genre MG1
ON Movie.id = MG1.MovieId
INNER JOIN Genre G1
ON G1.id = MG1.GenreId
AND G1.genre = 'action'
INNER JOIN movie2genre MG2
ON Movie.id = MG2.MovieId
INNER JOIN Genre G2
ON G2.id = MG2.GenreId
AND G2.genre = 'thriller'
An alternative is to use IN, and use the aggregate COUNT function to check that the number of genres found is the same as the number expected.
SELECT movie.title,
movie.genre
FROM Movie
INNER JOIN movie2genre
ON Movie.id = movie2genre.MovieId
INNER JOIN Genre
ON Genre.id = movie2genre.GenreId
AND Genre.genre IN ('action', 'thriller')
GROUP BY movie.title, movie.genre
HAVING COUNT(DISTINCT Genre.id) = 2
I would prefer the 1st solution, but it is a bit more complicated to set up the SQL for in code (ie, the SQL varies greatly depending on the number of genres), and potentially is limited by the max number of table joins if you are checking for lots of genres.
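To make the difference between "either genre" and "both genres" concrete, here is a small sketch in SQLite via Python's sqlite3. The table and column names follow the answers above (Genre.id, Genre.genre, movie2genre.MovieId), the movie titles are invented, and timings in SQLite will not match MySQL. It only shows which rows each form returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Genre (id INTEGER PRIMARY KEY, genre TEXT);
CREATE TABLE movie2genre (MovieId INTEGER, GenreId INTEGER);
-- the composite index suggested above for the link table
CREATE INDEX movie2genre_movie_genre ON movie2genre (MovieId, GenreId);
INSERT INTO Genre VALUES (1, 'action'), (2, 'thriller'), (3, 'comedy');
INSERT INTO Movie VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
-- Movie A: action + thriller, Movie B: comedy, Movie C: action only
INSERT INTO movie2genre VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

# EXISTS with IN matches movies that have action OR thriller.
either_sql = """
SELECT m.title FROM Movie m
WHERE EXISTS (
    SELECT 1 FROM movie2genre mg
    JOIN Genre g ON g.id = mg.GenreId
    WHERE mg.MovieId = m.id AND g.genre IN ('action', 'thriller'))
ORDER BY m.title
"""

# Counting distinct matched genres enforces action AND thriller.
both_sql = """
SELECT m.title FROM Movie m
JOIN movie2genre mg ON mg.MovieId = m.id
JOIN Genre g ON g.id = mg.GenreId AND g.genre IN ('action', 'thriller')
GROUP BY m.id, m.title
HAVING COUNT(DISTINCT g.id) = 2
ORDER BY m.title
"""

either = [row[0] for row in conn.execute(either_sql)]
both = [row[0] for row in conn.execute(both_sql)]
print(either, both)
```

The number in the HAVING clause has to match the number of genres in the IN list, which is the part that is easy to generate from application code.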

MySQL Statement extremely slow even with indexes

The following query takes around 200 seconds to complete. What I'm trying to achieve is to get users who have made 6 or more payments and who have not placed any orders yet (there are 2 orders tables for different marketplaces).
u.id, ju.id are both primary keys.
I've indexed user_id and order_status combined into one index on both orders tables. If I remove the join and COUNT() on the mp_orders table, the query takes 8 seconds to complete, but with it, it takes too long. I think I've indexed everything that I could have, but I don't understand why it takes so long to complete. Any ideas?
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count,
COUNT(o.id) as order_count,
COUNT(mi.id) as marketplace_order_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
GROUP BY u.id
HAVING COUNT(p.id) >= 6
AND COUNT(o.id) = 0
AND COUNT(mi.id) = 0
LIMIT 10
payments table
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | bigint(255) | NO | PRI | NULL | auto_increment |
| user_id | bigint(255) | NO | | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
orders table (mp_orders table pretty much the same)
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | int(255) | NO | PRI | NULL | auto_increment |
| order_number | varchar(1024) | NO | MUL | NULL | |
| user_id | int(255) | NO | MUL | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
You don't need to COUNT the rows of your orders; you need to retrieve the users who have no orders, which is not quite the same thing.
Instead of counting, filter for the users who have no orders:
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
AND o.id IS NULL -- filter happens here
AND mi.id IS NULL -- and here
GROUP BY u.id
HAVING COUNT(p.id) >= 6
LIMIT 10
This keeps the engine from counting every order for every user, which should save a lot of time.
One might think that the engine should use the index for the count, and so the count should be fast enough.
I will quote from a different site: InnoDB COUNT(id) - Why so slow?
It may be to do with the buffering, InnoDb does not cache the index it
caches into memory the actual data rows, because of this for what
seems to be a simple scan it is not loading the primary key index but
all the data into RAM and then running your query on it. This may take
some time to work - hopefully if you were running queries after this
on the same table then they would run much faster.
MyIsam loads the indexes into RAM and then runs its calculations over
this space and then returns a result, as an index is generally much
much smaller than all the data in the table you should see an
immediate difference there.
Another option may be the way that innodb stores the data on the disk
- the innodb files are a virtual tablespace and as such are not necessarily ordered by the data in your table, if you have a
fragmented data file then this could be creating problems for your
disk IO and as a result running slower. MyIsam generally are
sequential files, and as such if you are using an index to access data
the system knows exactly in what location on disk the row is located -
you do not have this luxury with innodb, but I do not think this
particular issue comes into play with just a simple count(*)
==================== http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html
explains this:
InnoDB does not keep an internal count of rows in a table. (In
practice, this would be somewhat complicated due to multi-versioning.)
To process a SELECT COUNT(*) FROM t statement, InnoDB must scan an
index of the table, which takes some time if the index is not entirely
in the buffer pool. To get a fast count, you have to use a counter
table you create yourself and let your application update it according
to the inserts and deletes it does. If your table does not change
often, using the MySQL query cache is a good solution. SHOW TABLE
STATUS also can be used if an approximate row count is sufficient. See
Section 14.2.11, “InnoDB Performance Tuning Tips”.
=================== todd_farmer:It actually does explain the difference - MyISAM understands that COUNT(ID) where ID is a PK column
is the same as COUNT(*), which MyISAM keeps precalculated while InnoDB
does not.
Try replacing the COUNT() = 0 conditions with IS NULL checks instead:
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count,
0 as order_count,
0 as marketplace_order_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE
u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
AND mi.id IS NULL
AND o.id IS NULL
GROUP BY u.id
HAVING COUNT(p.id) >= 6
LIMIT 10
But I think 8 seconds is still too much for the plain query. You should post the EXPLAIN plan of the main query without the OUTER JOINs to see what's wrong; for example, the package, enabled and chart_ban filters could be totally ruining it.
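For comparison, here is a sketch of the LEFT JOIN ... IS NULL anti-join next to an equivalent NOT EXISTS form. The NOT EXISTS form is an alternative I am adding, not something from the answers above. It runs in SQLite through Python's sqlite3 on invented rows, so it only demonstrates that both forms select the same users. Whether one is faster on MySQL depends on the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, package INTEGER,
                    enabled INTEGER, chart_ban INTEGER);
CREATE TABLE users2 (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_status INTEGER);
CREATE TABLE mp_orders (id INTEGER PRIMARY KEY, producer INTEGER, order_status INTEGER);
CREATE INDEX payments_user ON payments (user_id);
CREATE INDEX orders_user_status ON orders (user_id, order_status);
CREATE INDEX mp_orders_producer_status ON mp_orders (producer, order_status);
""")

# Invented data: number of payments per user id.
payments_per_user = {1: 6, 2: 6, 3: 2, 4: 7, 5: 6}
for user_id, n in payments_per_user.items():
    conn.execute("INSERT INTO users VALUES (?, 2, 1, 0)", (user_id,))
    conn.execute("INSERT INTO users2 VALUES (?, ?)", (user_id, f"u{user_id}"))
    conn.executemany("INSERT INTO payments (user_id) VALUES (?)", [(user_id,)] * n)

# User 2 has a completed order, user 4 only an order with status 0,
# user 5 has a completed marketplace order.
conn.execute("INSERT INTO orders (user_id, order_status) VALUES (2, 1), (4, 0)")
conn.execute("INSERT INTO mp_orders (producer, order_status) VALUES (5, 1)")

left_join_sql = """
SELECT u.id, ju.name, COUNT(p.id) AS payment_count
FROM users u
JOIN users2 ju ON u.id = ju.id
JOIN payments p ON u.id = p.user_id
LEFT JOIN orders o ON u.id = o.user_id AND o.order_status = 1
LEFT JOIN mp_orders mi ON u.id = mi.producer AND mi.order_status = 1
WHERE u.package != 1 AND u.enabled = 1 AND u.chart_ban = 0
  AND o.id IS NULL AND mi.id IS NULL
GROUP BY u.id, ju.name
HAVING COUNT(p.id) >= 6
ORDER BY u.id
"""

not_exists_sql = """
SELECT u.id, ju.name, COUNT(p.id) AS payment_count
FROM users u
JOIN users2 ju ON u.id = ju.id
JOIN payments p ON u.id = p.user_id
WHERE u.package != 1 AND u.enabled = 1 AND u.chart_ban = 0
  AND NOT EXISTS (SELECT 1 FROM orders o
                  WHERE o.user_id = u.id AND o.order_status = 1)
  AND NOT EXISTS (SELECT 1 FROM mp_orders mi
                  WHERE mi.producer = u.id AND mi.order_status = 1)
GROUP BY u.id, ju.name
HAVING COUNT(p.id) >= 6
ORDER BY u.id
"""

left_join_rows = conn.execute(left_join_sql).fetchall()
not_exists_rows = conn.execute(not_exists_sql).fetchall()
print(left_join_rows, not_exists_rows)
```

Both forms also avoid the row multiplication that the original query gets from joining payments, orders and mp_orders together before counting.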

MySQL select distinct across multiple tables

I have a query that selects all columns from multiple tables, but it's returning multiples of the same values (I only want distinct values).
How can I incorporate something like the following? When I try it, I still get duplicates:
Select Distinct A.*, B.*, C.*....
Does distinct only work when selecting the column names and not all (*) ? In this reference it says distinct in reference to column names, not across all of the tables. Is there any way that I can do this?
edit - I added more info below
Sorry guys, I just got back onto my computer. Also, I just realized that my query itself is the issue, and Distinct has nothing to do with it.
So, the overall goal of my Query is to do the following
Generate a list of friends that a user has
Go through the friends and check their activities (posting, adding friends, etc..)
Display a list of friends and their activities sorted by date (I guess like a facebook wall kind of deal).
Here are my tables
update_id | update | userid | timestamp //updates table
post_id | post | userid | timestamp //posts table
user_1 | user_2 | status | timestamp //friends table
Here is my query
SELECT U.* , P.* ,F.* FROM posts AS P
JOIN updates AS U ON P.userid = U.userid
JOIN friends AS F ON P.userid = F.user_2 or F.user_1
WHERE P.userid IN (
select user_1 from friends where user_2 = '1'
union
select user_2 from friends where user_1 = '1'
union
select userid from org_members where org_id = '1'
union
select org_id from org_members where userid = '1'
)
ORDER BY P.timestamp, U.timestamp, F.timestamp limit 30
The issue I'm having with this (that I thought was related to distinct), is that if values are found to meet the requirements in, say table Friends, a value for the Posts table will appear too. This means when I'm displaying the output of the SQL statement, it appears as if the Posts value is shown multiple times, when the actual values I'm looking for are also displayed
The output will appear something like this (notice difference between post value in the rows)
update_id | update | userid | timestamp | post_id | post | userid | timestamp | user_1 | user_2 | status | timestamp
1 | update1 | 1 | 02/01/2013 | 1 | post1| 1 | 2/02/2013| 1 | 2 | 1 | 01/30/2013
1 | update1 | 1 | 02/01/2013 | 2 | post2| 1 | 2/03/2013| 1 | 2 | 1 | 01/30/2013
So, as you can see, I thought I was having a distinct issue (because update1 appeared both times), but the query actually just selects all the values regardless. I get the results I'm looking for in the Post table, but all the other values are returned. So, when I display the table in PHP/HTML, the Post value will display, but I also get duplicates of the updates (just for this example)
When you select distinct *, you select every column, including the ones that make each row unique. If you want something better than what you are getting, you have to list the individual column names in your select clause.
It would be easier if you explained a little more about how the tables you're querying are connected, because you can use joins, unions (as mentioned above) or even GROUP BY ...
Your updated post shows one of the JOIN conditions as:
JOIN friends AS F ON P.userid = F.user_2 OR F.user_1
This is equivalent to:
JOIN friends AS F ON (P.userid = F.user_2 OR F.user_1 != 0)
and will include many rows that you did not intend to include.
You probably intended:
JOIN friends AS F ON (P.userid = F.user_2 OR P.userid = F.user_1)
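The effect of the missing comparison is easy to see on a few rows. The sketch below uses SQLite through Python's sqlite3 with invented data. SQLite, like MySQL, treats a non-zero integer as true in a boolean context, so the loose condition matches every combination of rows. The helper name count_rows is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, userid INTEGER);
CREATE TABLE friends (user_1 INTEGER, user_2 INTEGER);
INSERT INTO posts VALUES (1, 1), (2, 9);
INSERT INTO friends VALUES (1, 2), (3, 4), (5, 1);
""")

def count_rows(condition):
    """Count the rows produced by joining posts to friends on `condition`."""
    sql = f"SELECT COUNT(*) FROM posts P JOIN friends F ON {condition}"
    return conn.execute(sql).fetchone()[0]

# `OR F.user_1` is true for every non-zero user_1, so all 2 x 3 pairs match.
loose = count_rows("P.userid = F.user_2 OR F.user_1")

# Comparing both columns against P.userid keeps only real friendships.
strict = count_rows("P.userid = F.user_2 OR P.userid = F.user_1")
print(loose, strict)
```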
I think you want this:
select *
from tableA
union
select *
from tableB
union
select *
from tableC
This assumes that the tables all have the same number of columns and that the columns are of the same data types. If not, you'll have to select specific columns to make it so.