Normalization made my queries slower - mysql

Before normalization I had a column called genre, and it contained values like "Action, Thriller, Comedy".
Now I have normalized the genre column by creating genre and movie2genre tables.
The problem now is that my queries are more complicated and are actually slower.
These two queries basically search for movies that are both action and thriller.
Old query
select title, genre from movie where genre like '%action%' and genre like '%thriller%'
0.062 sec duration / 0.032 sec fetch
New Query
SELECT movie.title, movie.genre
FROM Movie
Where
EXISTS (
select *
from movie2genre
JOIN Genre on Genre.id = movie2genre.GenreId
where Movie.id = movie2genre.MovieId
and genre in ('action', 'thriller')
)
0.328 sec duration / 0.078 sec fetch
Am I doing something wrong?
More info:
Movie
+-------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Title | varchar(345) | YES | | NULL | |
ETC....
Genre
+---------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+----------------+
| genreid | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(50) | YES | | NULL | |
+---------+-------------+------+-----+---------+----------------+
movie2genre
+---------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| movieid | int(11) | YES | | NULL | |
| genreid | int(11) | YES | | NULL | |
+---------+---------+------+-----+---------+-------+

Try this version without a correlated subquery, and make sure you have proper indexes on your new tables. If you are concerned about performance, compare the execution plans of both queries.
SELECT *
FROM movie2genre mg, Genre g, Movie m
WHERE m.id = mg.MovieId
AND g.id = mg.GenreId
AND g.genre in ('action', 'thriller')

First, your two queries are not the same. The newer version does an or rather than an and, so the difference in time could simply be returning a larger result set. In addition, your new query refers to movie.genre, a column that wouldn't exist in a normalized database.
You seem to be asking for:
select m.title
from Movie m
where exists (select 1
from movie2genre m2g JOIN
Genre g
on g.id = m2g.GenreId
where m.id = m2g.MovieId and g.genre = 'action'
) and
exists (select 1
from movie2genre m2g JOIN
Genre g
on g.id = m2g.GenreId
where m.id = m2g.MovieId and g.genre = 'thriller'
);
Admittedly, you probably will not think this solves the "complication" problem. Leaving that aside, you need to have indexes for this to work well. Do you have the "obvious" indexes of: movie2genre(MovieId, GenreId) and genre(GenreId)?
Second, your data is not particularly large (judging by the duration for the queries). So, a full table scan may be more efficient than the joining and filtering with these tables. As the database grows, the normalized approach will often be faster.
A more equivalent query is:
select m.title, group_concat(g.genre)
from Movie m join
movie2genre m2g
on m.id = m2g.movieid join
genre g
on g.genreid = m2g.genreid
group by m.title
having sum(g.genre = 'action') > 0 and sum(g.genre = 'thriller') > 0;
Because of the nature of your particular query -- you need to fetch all genres on a movie so you cannot filter on them -- this particular query is probably going to perform less well than the unnormalized version.
By the way, normalization is more about keeping data consistent than about speeding queries. Normalized databases require more join operations. Indexes can help performance, but there is still work in doing the join. In some cases, the tables themselves are bigger than the unnormalized forms. And, normalized databases may require aggregation where none is required for non-normalized database. All of these can affect performance, which is why in many decision support architectures, the central database is normalized but the application-specific databases are not.
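To make the OR versus AND difference noted at the start of this answer concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL, and the three movies and their genres are made-up toy data. The table and column names follow the queries in the question rather than the DESCRIBE output.

```python
import sqlite3

# Toy data (hypothetical). SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Genre (id INTEGER PRIMARY KEY, genre TEXT);
CREATE TABLE movie2genre (
    MovieId INTEGER NOT NULL,
    GenreId INTEGER NOT NULL,
    PRIMARY KEY (MovieId, GenreId)  -- also serves as the (MovieId, GenreId) index
);
INSERT INTO Movie VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
INSERT INTO Genre VALUES (1, 'action'), (2, 'thriller'), (3, 'comedy');
-- Movie A: action + thriller, Movie B: comedy, Movie C: action only
INSERT INTO movie2genre VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

# One EXISTS with IN: matches movies that have EITHER genre.
either_sql = """
SELECT m.title FROM Movie m
WHERE EXISTS (SELECT 1 FROM movie2genre m2g
              JOIN Genre g ON g.id = m2g.GenreId
              WHERE m2g.MovieId = m.id AND g.genre IN ('action', 'thriller'))
ORDER BY m.title
"""

# Two EXISTS combined with AND: matches movies that have BOTH genres.
both_sql = """
SELECT m.title FROM Movie m
WHERE EXISTS (SELECT 1 FROM movie2genre m2g
              JOIN Genre g ON g.id = m2g.GenreId
              WHERE m2g.MovieId = m.id AND g.genre = 'action')
  AND EXISTS (SELECT 1 FROM movie2genre m2g
              JOIN Genre g ON g.id = m2g.GenreId
              WHERE m2g.MovieId = m.id AND g.genre = 'thriller')
ORDER BY m.title
"""

either = [row[0] for row in conn.execute(either_sql)]
both = [row[0] for row in conn.execute(both_sql)]
print(either)  # ['Movie A', 'Movie C']
print(both)    # ['Movie A']
```

Timings from a toy SQLite database say nothing about MySQL performance; the sketch only shows that the two query shapes return different result sets.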

Indexes are vitally important when doing joins (and sub queries tend to lose the indexing).
There are 2 ways I would suggest trying.
First, you can join Movie to movie2genre and Genre once for each genre you are checking. Well indexed, this should be fast.
SELECT movie.title,
movie.genre
FROM Movie
-- each genre needs its own movie2genre join, because a single link row cannot match both genres
INNER JOIN movie2genre MG1
ON Movie.id = MG1.MovieId
INNER JOIN Genre G1
ON G1.id = MG1.GenreId
AND G1.genre = 'action'
INNER JOIN movie2genre MG2
ON Movie.id = MG2.MovieId
INNER JOIN Genre G2
ON G2.id = MG2.GenreId
AND G2.genre = 'thriller'
An alternative is to use IN, and use the aggregate COUNT function to check that the number of genres found is the same as the number expected.
SELECT movie.title,
movie.genre
FROM Movie
INNER JOIN movie2genre
ON Movie.id = movie2genre.MovieId
INNER JOIN Genre
ON Genre.id = movie2genre.GenreId
AND Genre.genre IN ('action', 'thriller')
GROUP BY movie.title, movie.genre
HAVING COUNT(DISTINCT genreid) = 2
I would prefer the first solution, but it is a bit more complicated to generate the SQL for it in code (i.e., the SQL varies greatly depending on the number of genres), and it is potentially limited by the maximum number of table joins if you are checking for lots of genres.
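The COUNT-based variant keeps the same shape for any number of genres, which makes it easy to generate from code. The following is a rough sketch of that idea, using Python's sqlite3 module with SQLite as a stand-in for MySQL. The helper name movies_with_all_genres and the toy rows are assumptions made for illustration.

```python
import sqlite3

def movies_with_all_genres(conn, genres):
    """Return titles of movies tagged with every genre in `genres` (hypothetical helper)."""
    wanted = sorted(set(genres))  # de-duplicate so the expected count is right
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT Movie.title
        FROM Movie
        INNER JOIN movie2genre ON Movie.id = movie2genre.MovieId
        INNER JOIN Genre ON Genre.id = movie2genre.GenreId
        WHERE Genre.genre IN ({placeholders})
        GROUP BY Movie.id, Movie.title
        HAVING COUNT(DISTINCT Genre.id) = ?
        ORDER BY Movie.title
    """
    # Genre names are bound as parameters; only the "?" list is built dynamically.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

# Toy data (hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Genre (id INTEGER PRIMARY KEY, genre TEXT);
CREATE TABLE movie2genre (MovieId INTEGER, GenreId INTEGER, PRIMARY KEY (MovieId, GenreId));
INSERT INTO Movie VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
INSERT INTO Genre VALUES (1, 'action'), (2, 'thriller'), (3, 'comedy');
INSERT INTO movie2genre VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

print(movies_with_all_genres(conn, ["action", "thriller"]))  # ['Movie A']
print(movies_with_all_genres(conn, ["action"]))              # ['Movie A', 'Movie C']
```

Only the number of placeholders changes with the number of genres, so the join count stays fixed however many genres are requested.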

Related

MySQL LEFT JOIN using OR is very slow

Table users has about 80,000 records.
Table friends has about 900,000 records.
There are 104 records with firstname = 'verena'.
This query (heavily simplified, so its original purpose is no longer visible) is very slow (> 20 seconds):
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id OR
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
However, if I remove the OR inside the JOIN, the query is instant, so either:
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id
)
WHERE users.firstname = 'verena';
returning 1487 results or
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
returning 2849 results
execute instantly (0.001s)
If I remove everything else and go straight for
SELECT 1 FROM friends WHERE user_id = xxx OR friend_id = xxx
or
SELECT id FROM users WHERE firstname = 'verena';
these queries are also instant.
Indexes for friends.friend_id, friends.user_id and users.firstname are set.
I don't understand why the top query is slow while if manually taking it apart and executing the statements isolated everything is blazing fast.
My only suspicion now is that MariaDB is first joining ALL users with friends and only after that filters for WHERE firstname = 'verena', instead of the wanted behavior of first filtering for firstname = 'verena' and then joining the results with the friends table, but even then I don't see why removing the OR inside the JOIN condition would make it fast.
I tested this on 2 different machines, one running MariaDB 10.3.22 with Galera cluster and one with MariaDB 10.4.12 without Galera cluster
What is the technical reason why the top query is having such a huge slowdown and how do I fix this without having to split the SQL into several statements?
Edit:
Here is the EXPLAIN output for it, telling it's not using any index for the friends table and scanning through all records as correctly stated in Barmar's comment:
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
| 1 | SIMPLE | users | ref | firstname | firstname | 768 | const | 104 | Using where; Using index |
| 1 | SIMPLE | friends | ALL | user_id,friend_id | NULL | NULL | NULL | 902853 | Range checked for each record (index map: 0x6) |
+------+-------------+---------+------+-------------------+-----------+---------+-------+--------+------------------------------------------------+
Is there any way to make SQL use both indexes or do I just have to accept this limitation and work around it using for example Barmar's suggestion?
MySQL is not usually able to use an index when you use OR to join with different columns. It can only use one index per table in a join, so if it uses the friends.user_id index, it won't use friends.friend_id, and vice versa.
The solution is to do the two fast queries and combine them with UNION.
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.user_id
)
WHERE users.firstname = 'verena'
UNION
SELECT users.id FROM users
LEFT JOIN friends ON (
users.id = friends.friend_id
)
WHERE users.firstname = 'verena';
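One thing to watch when switching from the OR join to UNION is that UNION removes duplicate rows, so the row count can differ even though the set of user ids is the same. A small sketch on made-up rows, using Python's sqlite3 module with SQLite standing in for MariaDB (the index names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT);
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
CREATE INDEX idx_friends_user ON friends (user_id);      -- hypothetical index names
CREATE INDEX idx_friends_friend ON friends (friend_id);
INSERT INTO users VALUES (1, 'verena'), (2, 'verena'), (3, 'anna');
INSERT INTO friends VALUES (1, 3), (3, 1), (3, 2), (1, 2);
""")

or_sql = """
SELECT users.id FROM users
LEFT JOIN friends ON (users.id = friends.user_id OR users.id = friends.friend_id)
WHERE users.firstname = 'verena'
"""

union_sql = """
SELECT users.id FROM users
LEFT JOIN friends ON users.id = friends.user_id
WHERE users.firstname = 'verena'
UNION
SELECT users.id FROM users
LEFT JOIN friends ON users.id = friends.friend_id
WHERE users.firstname = 'verena'
"""

or_ids = sorted(row[0] for row in conn.execute(or_sql))
union_ids = sorted(row[0] for row in conn.execute(union_sql))
print(or_ids)     # [1, 1, 1, 2, 2]  one row per matching friends row
print(union_ids)  # [1, 2]           UNION keeps each id once
```

If the real query needs one row per friendship, selecting columns from friends in both branches keeps those rows distinct under UNION. That is a suggestion to test against the actual query, not something confirmed in this thread.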

MySQL Multiple Join with delimiting via FINDINSET

I am attempting to JOIN onto two different columns in the first table below from columns in the second and third tables.
I wish to JOIN users.id to job_listings.id to return users.username, and to also JOIN and delimit job_listings.categories to job_categories.id to return job_categories.description via FIND_IN_SET
job_listings
id | employer_id | categories
1 | 1 | 1,2
2 | 1 | 2
users
id | username | type
1 | foo | employer
2 | wat | employer
job_categories
id | description
1 | fun
2 | hak
I desire output that is of the following format:
output
username | type | category | description
foo | employer | 1 | fun
foo | employer | 2 | hak
foo | employer | 2 | hak
I have tried using various permutations of the following code:
SELECT users.username, users.type, job_listings.categories FROM users
JOIN job_listings ON users.id
JOIN job_listings AS category ON FIND_IN_SET(category.categories, job_categories.id)
ORDER BY users.username, category.categories
I know from other answers that I need to use an alias in order to use multiple JOIN operations with the same table, but despite adapting other answers I keep receiving errors related to declaring an alias, or returning output that has a column with the alias but no data returned in that column.
First, you should normalize your design. You should not store integer values in strings. You should not have foreign key references that you cannot declare as such. You should not store lists in strings. Are those enough reasons? You want a junction table for job categories, with one row per job and category pair.
Sometimes, we are stuck with other people's lousy decisions and cannot readily change them. In that case, you want a query like:
SELECT u.username, u.type, jc.id AS category, jc.description
FROM users u JOIN
job_listings jl
ON u.id = jl.employer_id and u.type = 'employer' join
job_categories jc
ON FIND_IN_SET(jc.id, jl.categories) > 0
ORDER BY u.username, jc.id;
This query cannot take advantage of indexes for the category joins. That means that it will be slow. The proper data structure -- a junction table -- would fix this performance problem.
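As an illustration of the junction-table design recommended above, here is a minimal sketch using Python's sqlite3 module (SQLite standing in for MySQL). The table name job_listing_categories and its columns are assumptions; the sample rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, type TEXT);
CREATE TABLE job_categories (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE job_listings (id INTEGER PRIMARY KEY, employer_id INTEGER, categories TEXT);
-- Hypothetical junction table: one row per (listing, category) pair.
CREATE TABLE job_listing_categories (
    listing_id INTEGER NOT NULL REFERENCES job_listings (id),
    category_id INTEGER NOT NULL REFERENCES job_categories (id),
    PRIMARY KEY (listing_id, category_id)
);
INSERT INTO users VALUES (1, 'foo', 'employer'), (2, 'wat', 'employer');
INSERT INTO job_categories VALUES (1, 'fun'), (2, 'hak');
INSERT INTO job_listings VALUES (1, 1, '1,2'), (2, 1, '2');
""")

# One-off migration: split each comma-separated list into junction rows.
listings = conn.execute("SELECT id, categories FROM job_listings").fetchall()
for listing_id, categories in listings:
    for category_id in categories.split(","):
        conn.execute(
            "INSERT INTO job_listing_categories VALUES (?, ?)",
            (listing_id, int(category_id)),
        )

# With the junction table, every join is a plain equality join that can use indexes.
rows = conn.execute("""
SELECT u.username, u.type, jc.id, jc.description
FROM users u
JOIN job_listings jl ON jl.employer_id = u.id AND u.type = 'employer'
JOIN job_listing_categories jlc ON jlc.listing_id = jl.id
JOIN job_categories jc ON jc.id = jlc.category_id
ORDER BY u.username, jc.id
""").fetchall()

for row in rows:
    print(row)
# ('foo', 'employer', 1, 'fun')
# ('foo', 'employer', 2, 'hak')
# ('foo', 'employer', 2, 'hak')
```

Once the junction table is populated, the categories string column could be dropped, which also lets the database enforce the foreign keys.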

MySQL select row with two matching joined rows from another table

Hey, I am trying to select a row from a table that has two matching entries in another one.
The structure is as follows:
-----------------      ---------------------
| messagegroups |      | user_messagegroup |
|               |      |                   |
| - id          |      | - id              |
| - status      |      | - user_id         |
|               |      | - messagegroup_id |
-----------------      |                   |
                       ---------------------
There are two rows in user_messagegroup with the ids of two users, both with the same messagegroup_id.
I would like to select the messagegroup that contains these two users.
I don't get it, so I would appreciate some help ;)
The specification you provide isn't very clear.
You say "with the ids of two users"... if we take that to mean you have two user_id values you want to supply in the query, then one way to find the messagegroups that contain these two specific users is:
SELECT g.id
, g.status
FROM messagegroups g
JOIN ( SELECT u.messagegroup_id
FROM user_messagegroup u
WHERE u.user_id IN (42, 11)
GROUP BY u.messagegroup_id
HAVING COUNT(DISTINCT u.user_id) = 2
) c
ON c.messagegroup_id = g.id
The returned messagegroups could also contain other users, besides the two that were specified.
If you want to return messagegroups that contain ONLY these two users, and no other users...
SELECT g.id
, g.status
FROM messagegroups g
JOIN ( SELECT u.messagegroup_id
FROM user_messagegroup u
WHERE u.user_id IS NOT NULL
GROUP BY u.messagegroup_id
HAVING COUNT(DISTINCT IF(u.user_id IN (42,11),u.user_id,NULL)) = 2
AND COUNT(DISTINCT u.user_id) = 2
) c
ON c.messagegroup_id = g.id
For improved performance, you'll want suitable indexes on the tables, and it may be possible to rewrite these to eliminate the inline view.
Also, if you only need the messagegroup_id value, you could get that from just the inline view query, without the need for the outer query and the join operation to the messagegroups table.
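As a sketch of that last point, both inline views can be run on their own when only messagegroup_id is needed. The example below uses Python's sqlite3 module with SQLite standing in for MySQL, and the rows are made up. Because SQLite has no IF() function, the second query uses the equivalent CASE expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_messagegroup (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    messagegroup_id INTEGER
);
INSERT INTO user_messagegroup (user_id, messagegroup_id) VALUES
    (42, 1), (11, 1),           -- group 1: exactly the two users
    (42, 2), (11, 2), (7, 2),   -- group 2: both users plus a third
    (42, 3), (7, 3);            -- group 3: only one of the two users
""")

# Groups that contain both users (other members allowed).
containing_sql = """
SELECT messagegroup_id
FROM user_messagegroup
WHERE user_id IN (42, 11)
GROUP BY messagegroup_id
HAVING COUNT(DISTINCT user_id) = 2
ORDER BY messagegroup_id
"""

# Groups that contain exactly these two users and nobody else.
only_sql = """
SELECT messagegroup_id
FROM user_messagegroup
WHERE user_id IS NOT NULL
GROUP BY messagegroup_id
HAVING COUNT(DISTINCT CASE WHEN user_id IN (42, 11) THEN user_id END) = 2
   AND COUNT(DISTINCT user_id) = 2
ORDER BY messagegroup_id
"""

containing = [row[0] for row in conn.execute(containing_sql)]
only = [row[0] for row in conn.execute(only_sql)]
print(containing)  # [1, 2]
print(only)        # [1]
```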

MySQL Statement extremely slow even with indexes

The following query takes around 200 seconds to complete. What I'm trying to achieve is to get users who have made 6 or more payments and who have not made any orders yet (there are 2 orders tables for different marketplaces).
u.id and ju.id are both primary keys.
I've indexed user_id and order_status combined into one index on both orders tables. If I remove the join and COUNT() on the mp_orders table, the query takes 8 seconds to complete, but with it, it takes too long. I think I've indexed everything that I could have, but I don't understand why it takes so long to complete. Any ideas?
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count,
COUNT(o.id) as order_count,
COUNT(mi.id) as marketplace_order_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
GROUP BY u.id
HAVING COUNT(p.id) >= 6
AND COUNT(o.id) = 0
AND COUNT(mi.id) = 0
LIMIT 10
payments table
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | bigint(255) | NO | PRI | NULL | auto_increment |
| user_id | bigint(255) | NO | | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
orders table (mp_orders table pretty much the same)
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | int(255) | NO | PRI | NULL | auto_increment |
| order_number | varchar(1024) | NO | MUL | NULL | |
| user_id | int(255) | NO | MUL | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
You don't need to COUNT the rows of your orders; you need to retrieve users who don't have orders, which is not really the same thing.
Instead of counting, filter for the users who have no orders:
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
AND o.id IS NULL -- filter happens here
AND mi.id IS NULL -- and here
GROUP BY u.id
HAVING COUNT(p.id) >= 6
LIMIT 10
This prevents the engine from counting every order for each of your users, which should save a lot of time.
You might expect the engine to use the index for the count, so that the count is fast enough.
To quote from a different site, InnoDB COUNT(id) - Why so slow?:
It may be to do with the buffering, InnoDb does not cache the index it
caches into memory the actual data rows, because of this for what
seems to be a simple scan it is not loading the primary key index but
all the data into RAM and then running your query on it. This may take
some time to work - hopefully if you were running queries after this
on the same table then they would run much faster.
MyIsam loads the indexes into RAM and then runs its calculations over
this space and then returns a result, as an index is generally much
much smaller than all the data in the table you should see an
immediate difference there.
Another option may be the way that innodb stores the data on the disk
- the innodb files are a virtual tablespace and as such are not necessarily ordered by the data in your table, if you have a
fragmented data file then this could be creating problems for your
disk IO and as a result running slower. MyIsam generally are
sequential files, and as such if you are using an index to access data
the system knows exactly in what location on disk the row is located -
you do not have this luxury with innodb, but I do not think this
particular issue comes into play with just a simple count(*)
http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html explains this:
InnoDB does not keep an internal count of rows in a table. (In
practice, this would be somewhat complicated due to multi-versioning.)
To process a SELECT COUNT(*) FROM t statement, InnoDB must scan an
index of the table, which takes some time if the index is not entirely
in the buffer pool. To get a fast count, you have to use a counter
table you create yourself and let your application update it according
to the inserts and deletes it does. If your table does not change
often, using the MySQL query cache is a good solution. SHOW TABLE
STATUS also can be used if an approximate row count is sufficient. See
Section 14.2.11, “InnoDB Performance Tuning Tips”.
todd_farmer: It actually does explain the difference - MyISAM understands that COUNT(ID) where ID is a PK column
is the same as COUNT(*), which MyISAM keeps precalculated while InnoDB
does not.
Try replacing the COUNT() = 0 conditions with IS NULL checks instead:
SELECT
u.id,
ju.name,
COUNT(p.id) as payment_count,
0 as order_count,
0 as marketplace_order_count
FROM users as u
INNER JOIN users2 as ju
ON u.id = ju.id
INNER JOIN payments as p
ON u.id = p.user_id
LEFT OUTER JOIN orders as o
ON u.id = o.user_id
AND o.order_status = 1
LEFT OUTER JOIN mp_orders as mi
ON u.id = mi.producer
AND mi.order_status = 1
WHERE
u.package != 1
AND u.enabled = 1
AND u.chart_ban = 0
AND mi.id IS NULL
AND o.id IS NULL
GROUP BY u.id
HAVING COUNT(p.id) >= 6
LIMIT 10
But I think 8 seconds is still too much for the plain query. You should post the EXPLAIN plan of the main query without the OUTER JOINs to see what's wrong; for example, the package, enabled and chart_ban filters could be what is slowing it down.
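The LEFT JOIN ... IS NULL filter used in both answers can be checked on a toy dataset. The sketch below uses Python's sqlite3 module (SQLite standing in for MySQL) and is deliberately reduced: it has a single orders table, no users2 table and no package/enabled/chart_ban filters. The index names and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    order_status INTEGER NOT NULL
);
CREATE INDEX idx_payments_user ON payments (user_id);                  -- hypothetical names
CREATE INDEX idx_orders_user_status ON orders (user_id, order_status);
INSERT INTO users VALUES (1), (2), (3);
""")

# User 1: 6 payments, no orders. User 2: 6 payments, one order. User 3: 2 payments.
for user_id, payment_count in ((1, 6), (2, 6), (3, 2)):
    conn.executemany(
        "INSERT INTO payments (user_id) VALUES (?)",
        [(user_id,)] * payment_count,
    )
conn.execute("INSERT INTO orders (user_id, order_status) VALUES (2, 1)")

rows = conn.execute("""
SELECT u.id, COUNT(p.id) AS payment_count
FROM users AS u
INNER JOIN payments AS p ON u.id = p.user_id
LEFT OUTER JOIN orders AS o ON u.id = o.user_id AND o.order_status = 1
WHERE o.id IS NULL          -- keep only users with no matching order
GROUP BY u.id
HAVING COUNT(p.id) >= 6
""").fetchall()

print(rows)  # [(1, 6)]
```

Because rows with a matching order are discarded before grouping, each surviving user contributes exactly one row per payment, so COUNT(p.id) is not inflated by the orders join.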

How can I select all rows in a table a, which have n characteristics given in a table b

I'm making a web page for renting houses.
The publications are stored in a table like this:
ta_publications
+---+-------------+------+
|id | name | date |
+---+-------------+------+
| 1 | name_001 | ... |
| 2 | name_002 | ... |
| 3 | name_003 | ... |
+---+-------------+------+
I have different publications, which have "features" such as "satellite tv", "Laundry cleaning", etc.
These features might change in the future, and I want to be able to add/remove/modify them, so I store them in the database in a table.
ta_feature_types
+---+-------------+
|id | name |
+---+-------------+
| 1 | Internet |
| 2 | Wi-Fi |
| 3 | satelital tv|
+---+-------------+
which are related to the publications using a table
ta_features
+---+-------------+----------------+
|id | type_id | publication_id |
+---+-------------+----------------+
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
+---+-------------+----------------+
I think it's pretty easy to understand: there is a publication called name_001 which has internet, wi-fi and satellite tv.
My problem is: I need to be able to efficiently search and select all publications (houses) which have certain features. For example, all publications that have the internet, wifi and "pets-allowed" features.
I just came up with another question: when the user likes one publication, say "house_003", how do I get a list of the features that it has?
If you want to get publications by feature name:
SELECT p.*
FROM ta_publications p
JOIN ta_features f ON f.publication_id = p.id
JOIN ta_feature_types t ON f.type_id = t.id
WHERE t.name = ? -- feature name
If you already know the feature ID:
SELECT p.*
FROM ta_publications p
JOIN ta_features f ON f.publication_id = p.id
WHERE f.type_id = ? -- feature ID
EDIT: To get all publications that match all of multiple feature IDs:
SELECT p.id, p.name
FROM ta_publications p
JOIN ta_features f ON f.publication_id = p.id
WHERE f.type_id IN (?, ?, ?) -- one placeholder per feature ID, e.g. (1, 2, 3)
GROUP BY p.id, p.name HAVING COUNT(*) = ? -- number of feature IDs in the list, e.g. 3
To get all the features (names, I assume) by publication ID:
SELECT t.name
FROM ta_feature_types t
JOIN ta_features f ON f.type_id = t.id
JOIN ta_publications p ON f.publication_id = p.id
WHERE p.id = ? -- publication ID
Some notes on your schema:
As I commented above, you don't need an ID column in the ta_features table unless a publication can have the same feature multiple times, e.g. "2x Wi-Fi".
Your table names are confusing. May I suggest you rename
ta_features to ta_publication_features (or ta_pub_features) and
ta_feature_types to ta_features?
For performance reasons you should create indices on all the columns used in the above JOIN conditions (using your original table names here):
ta_publications(id)
ta_features(type_id, publication_id)
ta_feature_types(id)
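One way to act on the first and last notes together is to drop the surrogate id column and use a composite primary key, which also guarantees that a publication cannot list the same feature twice (something the HAVING COUNT(*) query relies on). A minimal sketch with Python's sqlite3 module, SQLite standing in for MySQL; the index name is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ta_features (
    publication_id INTEGER NOT NULL,
    type_id INTEGER NOT NULL,
    PRIMARY KEY (publication_id, type_id)   -- replaces the surrogate id column
);
-- Second index for lookups that start from the feature type.
CREATE INDEX idx_features_type ON ta_features (type_id, publication_id);
INSERT INTO ta_features VALUES (1, 1), (1, 2), (1, 3);
""")

# Inserting the same (publication, feature) pair again is rejected by the key.
try:
    conn.execute("INSERT INTO ta_features VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM ta_features").fetchone()[0]
print(duplicate_rejected, row_count)  # True 3
```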
If the user selects multiple features use the IN keyword and a list of all features for a publication:
SELECT p.*
FROM ta_publications p
WHERE '1' in (select type_id from ta_features where publication_id = p.id)
AND '2' in (select type_id from ta_features where publication_id = p.id)
AND '3' in (select type_id from ta_features where publication_id = p.id)
You could generate the above with a loop in your server language of choice, e.g. (pseudocode):
SELECT p.*
FROM ta_publications p
WHERE 1=1
//START SERVER LANGUAGE
for (feature in featuresArray){
print("AND '$feature' in (select type_id from ta_features where publication_id = p.id)");
}
//END SERVER LANGUAGE
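A runnable version of that loop might look like the sketch below, written in Python with the sqlite3 module (SQLite standing in for MySQL). It binds the feature IDs as parameters instead of printing them into the SQL string, which avoids SQL injection. The function name publications_with_features and the sample rows are assumptions.

```python
import sqlite3

def publications_with_features(conn, feature_ids):
    """Return (id, name) of publications that have every feature in `feature_ids`."""
    sql = "SELECT p.id, p.name FROM ta_publications p WHERE 1=1"
    for _ in feature_ids:
        # One bound placeholder per requested feature, never string interpolation.
        sql += " AND ? IN (SELECT type_id FROM ta_features WHERE publication_id = p.id)"
    sql += " ORDER BY p.id"
    return conn.execute(sql, list(feature_ids)).fetchall()

# Toy data (hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ta_publications (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ta_features (id INTEGER PRIMARY KEY, type_id INTEGER, publication_id INTEGER);
INSERT INTO ta_publications VALUES (1, 'name_001'), (2, 'name_002');
INSERT INTO ta_features (type_id, publication_id) VALUES (1, 1), (2, 1), (3, 1), (1, 2);
""")

print(publications_with_features(conn, [1, 2]))  # [(1, 'name_001')]
print(publications_with_features(conn, [1]))     # [(1, 'name_001'), (2, 'name_002')]
```

With an empty list the WHERE 1=1 clause is all that remains, so every publication is returned; whether that is the desired behaviour depends on the search page.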
I think what you want is a subquery:
select a.*
from ta_publications as a
where '1' in (select type_id from ta_features where publication_id=a.id)
Replace '1' with any other feature number you want.
For your second question, a simple query should do it:
select type_id
from ta_features
where publication_id=[[[id that someone likes]]]