I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. However, for the size of the tables, the query is slower than it should be (0.17s). Based on the EXPLAIN output below, the reason is a table scan on the meta_relationships table, caused by the COUNT in the WHERE clause's subquery (the tables use the InnoDB engine).
Query:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND meta_relationships.object_id
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id
This particular query selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories that are selected, the higher the count would be.
Ideally, I'd like to get it functioning like this:
WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)
or
WHERE meta.meta_name IN ('computers','software')
AND meta_relationships.meta_order IN (0,1)
etc..
But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.
I've tried...
WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)
but it doesn't return the correct amount of rows.
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY meta ref PRIMARY,idx_meta_name idx_meta_name 602 const 1 Using where; Using index; Using temporary; Using filesort
1 PRIMARY meta_data ref PRIMARY,idx_meta_id idx_meta_id 8 database.meta.meta_id 1
1 PRIMARY meta_relationships ref idx_meta_data_id idx_meta_data_id 8 database.meta_data.meta_data_id 11 Using where
1 PRIMARY posts eq_ref PRIMARY PRIMARY 4 database.meta_relationships.object_id 1
2 MATERIALIZED meta_relationships index NULL idx_object_id 4 NULL 14679 Using index
Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)
The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
I believe the key to optimizing the query is to get away completely from doing the COUNT.
An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
The meta_relationships table will grow quickly, and with the current size (~15K rows) I'm hoping to achieve an execution time in the hundredths of a second rather than tenths of a second.
Since there needs to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
I have created an IDE with sample data.
How can I optimize this query?
EDIT:
I was never able to find an optimal solution to this problem. What helped most was a combination of things. The first was smcjones's recommendation to improve the indexes: run an EXPLAIN, read up on the EXPLAIN Output Format, then change the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.
SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'
After getting rid of the COUNT, the big performance killers were the GROUP BY and ORDER BY, but indexes are your best friend here. I learned that when doing a GROUP BY, the WHERE clause is very important: the more specific you can make it, the better.
With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.
I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.
Based on your example, I have created this SQLFiddle.
Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE index against meta.meta_name.
From my testing, Fiddle #2 is faster.
Optimizing Your Query
This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):
SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
FROM posts
LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'animals'
GROUP BY meta_relationships.object_id
HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);
HAVING sum() = min() on a GROUP BY should check to see if there is more than one record of each type. Obviously, each time the record shows up, it will add more to the sum. (Edit: On subsequent tests it seems like this has the same effect as count(meta_relationships.object_id) = 1. Oh well, the point is I believe you can remove the subquery and have the same result.)
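To sanity-check the equivalence mentioned in the edit, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The four sample rows are made up, and the helper name single_link_posts is just for illustration. It assumes object_id is always a positive number, which is what makes sum = min imply a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meta_relationships (
        relationship_id INTEGER PRIMARY KEY,
        object_id INTEGER NOT NULL,
        meta_data_id INTEGER NOT NULL
    );
    -- made-up sample rows: posts 1 and 3 have one link, post 2 has two
    INSERT INTO meta_relationships (object_id, meta_data_id) VALUES
        (1, 10), (2, 10), (2, 11), (3, 12);
""")

def single_link_posts(having):
    # run the same grouped query with a different HAVING condition
    sql = ("SELECT object_id FROM meta_relationships "
           "GROUP BY object_id HAVING " + having + " ORDER BY object_id")
    return [row[0] for row in conn.execute(sql)]

by_sum_min = single_link_posts("SUM(object_id) = MIN(object_id)")
by_count = single_link_posts("COUNT(*) = 1")
print(by_sum_min, by_count)
```

Keep in mind that in either form the aggregate only sees the rows that survive the WHERE clause, so filtering on meta_name before grouping changes what is being counted.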
I want to be clear that you won't notice much, if any, optimization from the query I provided unless the condition WHERE meta.meta_name = 'animals' is querying against an index (preferably a unique index, because I doubt you'll need more than one of these and it will prevent accidental duplication of data).
So, instead of a table that looks like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT);
You should make sure you add primary keys and indexes like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT,
PRIMARY KEY (meta_data_id,meta_id),
INDEX ix_meta_id (meta_id)
);
Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.
When indexes are not used, MySQL will walk through each row of the table until it finds what you want. In such a limited example as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, this will become extraordinarily painful.
In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or using the WHERE syntax.
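As a rough illustration of spotting a scan and then fixing it with an index, here is a sketch that uses SQLite's EXPLAIN QUERY PLAN from Python. SQLite is only a stand-in: MySQL's EXPLAIN output looks different, and the 1000 generated names are made up. The idea is the same, though: look at the plan, add the index, look again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (meta_id INTEGER PRIMARY KEY, meta_name TEXT)")
conn.executemany("INSERT INTO meta (meta_name) VALUES (?)",
                 [("name%d" % i,) for i in range(1000)])

query = "SELECT meta_id FROM meta WHERE meta_name = 'name500'"

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row is a readable description
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # no index on meta_name yet
conn.execute("CREATE INDEX idx_meta_name ON meta (meta_name)")
plan_after = plan(query)    # same query, now with the index available

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should report a scan of the table and the second should mention idx_meta_name.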
A note on the count column
I have not found putting count columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, it should be very easy to use count() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.
The problem with putting count columns into a table is that you need to update that count, using either a TRIGGER or, worse, application logic. As your program scales out that logic can either get lost or buried. Adding that column is a deviation from normalization and when something like this is to occur, there should be a VERY good reason.
Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.
Since the HAVING seems to be the issue, can you instead create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field if the post was edited.
So, consider this:
Add a new field to the posts table called "num_meta_rel". It can be an unsigned tinyint as long as you'll never have more than 255 tags to any one post.
Update the field like this:
UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);
This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.
Now, you rewrite your query like this:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id
If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx
Note that this solution requires that you update the num_meta_rel (choose a better name, that one is terrible...) if the post has a new tag associated with it. But that should be much faster than scanning your entire table over and over.
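One way to keep such a counter in step without relying on application code is a pair of triggers. Below is a minimal sketch, run through Python's sqlite3 so it can be tried anywhere. The trigger names are hypothetical, the sample rows are made up, and MySQL's trigger syntax differs slightly (it requires FOR EACH ROW, for example), so treat this as an outline rather than drop-in DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        post_name TEXT,
        num_meta_rel INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE meta_relationships (
        relationship_id INTEGER PRIMARY KEY,
        object_id INTEGER NOT NULL,
        meta_data_id INTEGER NOT NULL
    );
    -- hypothetical triggers: bump the counter whenever a link is added or removed
    CREATE TRIGGER trg_rel_insert AFTER INSERT ON meta_relationships
    BEGIN
        UPDATE posts SET num_meta_rel = num_meta_rel + 1
        WHERE post_id = NEW.object_id;
    END;
    CREATE TRIGGER trg_rel_delete AFTER DELETE ON meta_relationships
    BEGIN
        UPDATE posts SET num_meta_rel = num_meta_rel - 1
        WHERE post_id = OLD.object_id;
    END;
    -- made-up sample data
    INSERT INTO posts (post_id, post_name) VALUES (1, 'first'), (2, 'second');
    INSERT INTO meta_relationships (object_id, meta_data_id) VALUES
        (1, 10), (2, 10), (2, 11);
    DELETE FROM meta_relationships WHERE object_id = 2 AND meta_data_id = 11;
""")

# post 1 got one link; post 2 got two links and then lost one
counts = dict(conn.execute("SELECT post_id, num_meta_rel FROM posts"))
print(counts)
```

With something like this in place, the SELECT can filter on num_meta_rel directly and the counter never has to be recomputed in bulk.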
See if this gives you the right answer, possibly faster:
SELECT p.post_id, p.post_name,
GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
FROM
( SELECT object_id
FROM meta_relationships
GROUP BY object_id
HAVING count(*) = 1
) AS x
JOIN meta_relationships AS mr ON mr.object_id = x.object_id
JOIN posts AS p ON p.post_id = mr.object_id
JOIN meta_data AS md ON mr.meta_data_id = md.meta_data_id
JOIN meta ON md.meta_id = meta.meta_id
WHERE meta.meta_name = ?
GROUP BY mr.object_id
Unfortunately, I have no way to test the performance, but try my query on your real data:
http://sqlfiddle.com/#!9/81b29/13
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING count(*) < 3
) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
SELECT *
FROM meta
WHERE meta.meta_name = 'health'
) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
Use
sum(1)
instead of
count(*)
I want to get data that is spread across three tables:
app_android_devices:
id | associated_user_id | registration_id
app_android_devices_settings:
owner_id | is_user_id | notifications_receive | notifications_likes_only
app_android_devices_favorites:
owner_id | is_user_id | image_id
owner_id is either the id from app_android_devices or the associated_user_id, indicated by is_user_id.
That is because the user of my app should be able to log in to their account or use the app anonymously. If the user is logged in, they will have the same settings and likes on all devices.
associated_user_id is 0 if the device is used anonymously or the user ID from another table.
Now I've got the following query:
SELECT registration_id
FROM app_android_devices d
JOIN app_android_devices_settings s
ON ((d.id=s.owner_id AND
s.is_user_id=0)
OR (
d.associated_user_id=s.owner_id AND
s.is_user_id=1))
JOIN app_android_devices_favorites f
ON (((d.id=f.owner_id AND
f.is_user_id=0)
OR
d.associated_user_id=f.owner_id AND
f.is_user_id=1)
AND f.image_id=86)
WHERE s.notifications_receive=1
AND (s.notifications_likes_only=0 OR f.image_id=86);
This query decides whether a device should receive a push notification for a new comment. I've set the following keys:
app_android_devices: id PRIMARY, associated_user_id
app_android_devices_settings: (owner_id, is_user_id) UNIQUE, notifications_receive, notifications_likes_only
app_android_devices_favorites: (owner_id, is_user_id, image_id) UNIQUE
I've noticed that the above query is really slow. If I run EXPLAIN on that query I see that MySQL is using no keys at all, although there are possible_keys listed.
What can I do to speed this query up?
Having such complicated JOIN conditions makes life hard for everyone. It makes life hard for the developer who wants to understand your query, and for the query optimizer that wants to give you exactly what you ask for while preferring more efficient operations.
So the first thing that I want to do, when you tell me that this query is slow and not using any index, is to take it apart and put it back together with simpler JOIN conditions.
From the way you describe this query, it sounds like the is_user_id column is a sort of state variable telling you whether the user is or is not logged in to your app. This is awkward to say the least; what happens if s.is_user_id != f.is_user_id? Why store this in both tables? For that matter, why store this in your database at all, instead of in a cookie?
Perhaps there's something I'm not understanding about the functionality you're going for here. In any case, the first thing I see that I want to get rid of is the OR in your JOIN conditions. I'm going to try to avoid making too many assumptions about which values in your query represent user input; here's a slightly generic example of how you might be able to rewrite these JOIN conditions as a UNION of two SELECT statements:
SELECT ... FROM
app_android_devices d
JOIN
app_android_devices_settings s ON d.id = s.owner_id
JOIN
app_android_devices_favorites f ON d.id = f.owner_id
WHERE s.is_user_id = 0 AND f.is_user_id = 0 AND ...
UNION ALL
SELECT ... FROM
app_android_devices d
JOIN
app_android_devices_settings s ON d.associated_user_id = s.owner_id
JOIN
app_android_devices_favorites f ON d.associated_user_id = f.owner_id
WHERE s.is_user_id = 1 AND f.is_user_id = 1 AND ...
If these two queries hit your indexes and are very selective, you might not notice the additional overhead (creation of a temporary table) required by the UNION operation. It looks as though one of your result sets may even be empty, in which case the cost of the UNION should be nil.
But, maybe this doesn't work for you; here's another suggestion for an optimization you might pursue. In your original query, you have the following condition:
WHERE s.notifications_receive=1
AND (s.notifications_likes_only=0 OR f.image_id=86);
This isn't too cryptic - you want results only when the notifications_receive setting is true, and only if the notifications_likes_only setting is false or the requested image is a "favorite" image. Depending on the state of notifications_likes_only, it looks like you may not even care about the favorites table - wouldn't it be nice to avoid even reading from that table unless absolutely necessary?
This looks like a good case for EXISTS(). Instead of joining app_android_devices_favorites, try using a condition like this:
WHERE s.notifications_receive = 1
AND (s.notifications_likes_only = 0
OR EXISTS(SELECT 1 FROM app_android_devices_favorites
WHERE image_id = 86 AND owner_id = s.owner_id))
It doesn't matter what you try to SELECT in an EXISTS() subquery; some people prefer *, I like 1, but even if you gave specific columns it wouldn't affect the execution plan.
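Here is a small runnable sketch of the EXISTS idea, using Python's sqlite3 as a stand-in for MySQL. It only covers the anonymous-device branch (is_user_id = 0), the sample rows are made up, and matching is_user_id inside the subquery is my own assumption about what is needed to keep device IDs and user IDs from colliding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_android_devices (
        id INTEGER PRIMARY KEY, associated_user_id INTEGER, registration_id TEXT);
    CREATE TABLE app_android_devices_settings (
        owner_id INTEGER, is_user_id INTEGER,
        notifications_receive INTEGER, notifications_likes_only INTEGER,
        UNIQUE (owner_id, is_user_id));
    CREATE TABLE app_android_devices_favorites (
        owner_id INTEGER, is_user_id INTEGER, image_id INTEGER,
        UNIQUE (owner_id, is_user_id, image_id));
    -- made-up sample data, anonymous devices only
    INSERT INTO app_android_devices VALUES
        (1, 0, 'reg-a'), (2, 0, 'reg-b'), (3, 0, 'reg-c'), (4, 0, 'reg-d');
    INSERT INTO app_android_devices_settings VALUES
        (1, 0, 1, 0),   -- wants every notification
        (2, 0, 1, 1),   -- likes only, and has image 86 as a favorite
        (3, 0, 1, 1),   -- likes only, but image 86 is not a favorite
        (4, 0, 0, 0);   -- notifications switched off
    INSERT INTO app_android_devices_favorites VALUES (2, 0, 86), (3, 0, 5);
""")

sql = """
    SELECT d.registration_id
    FROM app_android_devices d
    JOIN app_android_devices_settings s
      ON d.id = s.owner_id AND s.is_user_id = 0
    WHERE s.notifications_receive = 1
      AND (s.notifications_likes_only = 0
           OR EXISTS (SELECT 1
                      FROM app_android_devices_favorites f
                      WHERE f.image_id = ?
                        AND f.owner_id = s.owner_id
                        AND f.is_user_id = s.is_user_id))  -- assumption, see above
    ORDER BY d.registration_id
"""
recipients = [row[0] for row in conn.execute(sql, (86,))]
print(recipients)
```

Note that device 1, which has no favorites row at all, is still returned here. An inner JOIN against the favorites table would have filtered it out.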
I have a relatively simple game. I need help, as I think this query isn't optimized correctly.
I have a standard users table. There is an expansions table, which holds general information about the expansions in the game. Each time a user beats a level in an expansion, a row is added to playlog that says their final score (so at first, there are 0 rows in the playlog table for them for the expansion).
EXPLAIN SELECT users.username, expansions.title, expansions.description,
COUNT( playlog.id ) as levels_beaten
FROM users
INNER JOIN expansions
LEFT JOIN playlog ON users.id = playlog.user_id
AND expansions.id = playlog.expansions_id
WHERE users.id = 10
GROUP BY expansions.id
ORDER BY expansions.order_hint DESC
I have the following indexes:
users id - primary, username - unique
expansions id - primary, order_hint - index
playlog expansions_id - foreign, user_id - foreign
I took a database class a while back, and I remember that "Using temporary" and "Using filesort" were supposed to be bad, but I don't really remember how to rectify them or whether they're okay in this instance. (Also, if I don't select the username, it says "Using index" in the first row of the EXPLAIN output as well.)
Your query looked mostly accurate, but the trail of comments was taking a negative spin. I've rewritten the query to more explicitly show the relationship of the tables and join criteria. You had left vs inner joins. It appears from your description that the "Expansions" table is like a master list of expansions that ARE AVAILABLE in the game (like a lookup table). The ONLY way a record gets into the PLAYLOG is IF someone completes a given expansion. That said, start with the user to their playlog history. If no records, you are done anyhow. If there IS a playlog, then join to the expansions to get the descriptions. No need to get expansion descriptions if nobody completed any such levels.
SELECT
users.username,
expansions.title,
expansions.description,
COUNT( * ) as levels_beaten
FROM
users
JOIN playlog
ON users.id = playlog.user_id
JOIN expansions
ON playlog.expansions_id = expansions.id
WHERE
users.id = 10
GROUP BY
expansions.id
ORDER BY
expansions.order_hint DESC
If the query still appears to cause an issue, I would then suggest adding the keyword "STRAIGHT_JOIN" such as
SELECT STRAIGHT_JOIN ...rest of query.
STRAIGHT_JOIN tells the engine to query in the order I've said and not let it interpret a possibly less efficient query path.
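One thing worth checking before adopting the rewritten query: it uses inner joins only, so expansions the user has not played yet drop out of the result, whereas the original LEFT JOIN reports them with a count of 0. A minimal sketch with made-up data, using Python's sqlite3 as a stand-in for MySQL (the explicit CROSS JOIN is my spelling of the original's ON-less INNER JOIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
    CREATE TABLE expansions (id INTEGER PRIMARY KEY, title TEXT, order_hint INTEGER);
    CREATE TABLE playlog (id INTEGER PRIMARY KEY, user_id INTEGER, expansions_id INTEGER);
    -- made-up sample data: user 10 has beaten two levels, both in expansion 1
    INSERT INTO users VALUES (10, 'alice');
    INSERT INTO expansions VALUES (1, 'Base', 1), (2, 'Winter', 2);
    INSERT INTO playlog (user_id, expansions_id) VALUES (10, 1), (10, 1);
""")

# shape of the original query: every expansion is listed, unplayed ones count 0
left_join_rows = conn.execute("""
    SELECT expansions.title, COUNT(playlog.id) AS levels_beaten
    FROM users
    CROSS JOIN expansions
    LEFT JOIN playlog ON users.id = playlog.user_id
                     AND expansions.id = playlog.expansions_id
    WHERE users.id = 10
    GROUP BY expansions.id
    ORDER BY expansions.order_hint DESC
""").fetchall()

# shape of the rewritten query: only expansions with playlog rows survive
inner_join_rows = conn.execute("""
    SELECT expansions.title, COUNT(*) AS levels_beaten
    FROM users
    JOIN playlog ON users.id = playlog.user_id
    JOIN expansions ON playlog.expansions_id = expansions.id
    WHERE users.id = 10
    GROUP BY expansions.id
    ORDER BY expansions.order_hint DESC
""").fetchall()

print(left_join_rows)
print(inner_join_rows)
```

If the page needs to show "0 levels beaten" for untouched expansions, the LEFT JOIN form is the one to keep.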
I have a bridging table that looks like this
clients_user_groups
id = int
client_id = int
group_id = int
I need to find the client_ids of all clients that belong to the same group as client_id 46.
I can achieve it doing a query as below which produces the correct results
SELECT client_id FROM clients_user_groups WHERE group_id = (SELECT group_id FROM clients_user_groups WHERE client_id = 46);
Basically, what I need to find out is whether there's a way of achieving the same results without using 2 queries, or a faster way. Or is the method above the best solution?
You're using a WHERE-clause subquery which, depending on the MySQL version and the plan the optimizer picks, can end up being re-evaluated for every row in your table. Use a JOIN instead:
SELECT a.client_id
FROM clients_user_groups a
JOIN clients_user_groups b ON b.client_id = 46
AND a.group_id = b.group_id
Since you plan on facilitating clients having more than one group in the future, you might want to add DISTINCT to the SELECT so that multiple of the same client_ids aren't returned when you do switch (as a result of the client being in more than one of client_id 46's groups).
If you haven't done so already, create the following composite index on:
(client_id, group_id)
With client_id at the first position in the index since it most likely offers the best initial selectivity. Also, if you've got a substantial amount of rows in your table, ensure that the index is being utilized with EXPLAIN.
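To see why DISTINCT matters once clients can be in several groups, here is a small sketch with made-up rows, run through Python's sqlite3 as a stand-in for MySQL. The index name idx_client_group is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients_user_groups (
        id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL,
        group_id INTEGER NOT NULL
    );
    -- composite index with client_id first, as suggested above
    CREATE INDEX idx_client_group ON clients_user_groups (client_id, group_id);
    -- made-up rows: client 46 is in groups 1 and 2, and client 47 shares both
    INSERT INTO clients_user_groups (client_id, group_id) VALUES
        (46, 1), (46, 2), (47, 1), (47, 2), (48, 2), (49, 3);
""")

sql = """
    SELECT {distinct} a.client_id
    FROM clients_user_groups a
    JOIN clients_user_groups b ON b.client_id = ?
                              AND a.group_id = b.group_id
    ORDER BY a.client_id
"""
plain = [r[0] for r in conn.execute(sql.format(distinct=""), (46,))]
unique = [r[0] for r in conn.execute(sql.format(distinct="DISTINCT"), (46,))]
print(plain)
print(unique)
```

Without DISTINCT, every shared group produces another copy of the same client_id. Client 46 itself is also in the result, so add a.client_id <> 46 if you only want the other members.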
You can also try a self join:
SELECT a.client_id
FROM clients_user_groups a
LEFT JOIN clients_user_groups b on b.client_id=46
Where b.group_id=a.group_id
SET @groupID = (SELECT group_id FROM clients_user_groups WHERE client_id = 46);
SELECT client_id FROM clients_user_groups WHERE group_id = @groupID;
You will have a query which gets the group ID and you store it into a variable. After this you select the client_id values where the group_id matches the value stored in your variable. You can speed up this query even more if you define an index for clients_user_groups.group_id.
Note1: I didn't test my code, hopefully there are no typos, but you've got the idea I think.
Note2: This should be done in a single request, because DB requests are very expensive if we look at the needed time.
Based on your comment that each client can only belong to one group, I would suggest a schema change to place the group_id relation into the client table as a field. Typically, one would use the sort of JOIN table you have described to express many-to-many relationships within a relational database (i.e. clients could belong to many groups and groups could have many clients).
In such a scenario, the query would be made without the need for a sub-select like this:
SELECT c.client_id
FROM clients as c
INNER JOIN clients as c2 ON c.group_id = c2.group_id
WHERE c2.client_id = ?
I've got a users table and a votes table. The votes table stores votes toward other users. And for better or worse, a single row in the votes table, stores the votes in both directions between the two users.
Now, the problem is when I wanna list for example all people someone has voted on.
I'm no MySQL expert, but from what I've figured out, thanks to the OR condition in the join statement, it needs to look through the whole users table (currently +44,000 rows), and it creates a temporary table to do so.
Currently, the query below takes about two minutes (yes, two minutes) to complete. If I remove the OR condition, and everything after it in the join statement, it runs in less than half a second, as it only needs to look through about 17 of the 44,000 user rows (explain ftw!).
In the example below, the user ID is 9834, and I'm trying to fetch his/her own "no" votes and join the info of the user who was voted on to the result.
Is there a better, and faster way to do this query? Or should I restructure the tables? I seriously hope it can be fixed by modifying the query, cause there's already a lot of users (+44,000), and votes (+130,000) in the tables, which I'd have to migrate.
thanks :)
SELECT *, votes.id as vote_id
FROM `votes`
LEFT JOIN users ON (
(
votes.user_id_1 = 9834
AND
users.uid = votes.user_id_2
)
OR
(
votes.user_id_2 = 9834
AND
users.uid = votes.user_id_1
)
)
WHERE (
(
votes.user_id_1 = 9834
AND
votes.vote_1 = 0
)
OR
(
votes.user_id_2 = 9834
AND
votes.vote_2 = 0
)
)
ORDER BY votes.updated_at DESC
LIMIT 0, 10
Instead of the OR, you could do a UNION of 2 queries. I have known instances where this is an order of magnitude faster in at least one other DBMS, and I'm guessing MySQL's query optimizer may share the same "feature".
SELECT whatever
FROM votes v
INNER JOIN
users u
ON v.user_id_1 = u.uid
WHERE v.user_id_2 = 9834
AND v.vote_2 = 0
UNION
SELECT whatever
FROM votes v
INNER JOIN
users u
ON v.user_id_2 = u.uid
WHERE v.user_id_1 = 9834
AND v.vote_1 = 0
ORDER BY updated_at DESC
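Here is a runnable sketch of the UNION approach with a handful of made-up rows, using Python's sqlite3 as a stand-in for MySQL. The column names vote_1 and vote_2 are taken from the question, and the LIMIT from the original query is added back at the end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE votes (
        id INTEGER PRIMARY KEY,
        user_id_1 INTEGER, user_id_2 INTEGER,
        vote_1 INTEGER, vote_2 INTEGER,
        updated_at TEXT
    );
    -- made-up sample data
    INSERT INTO users VALUES (9834, 'me'), (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO votes VALUES
        (1, 9834, 1, 0, 1, '2020-01-01'),  -- 9834 (as user 1) voted no on ann
        (2, 2, 9834, 1, 0, '2020-01-03'),  -- 9834 (as user 2) voted no on bob
        (3, 9834, 3, 1, 0, '2020-01-02'),  -- 9834 voted yes on cat: excluded
        (4, 1, 2, 0, 0, '2020-01-04');     -- does not involve 9834: excluded
""")

sql = """
    SELECT v.id AS vote_id, u.uid AS uid, v.updated_at AS updated_at
    FROM votes v
    INNER JOIN users u ON v.user_id_1 = u.uid
    WHERE v.user_id_2 = ? AND v.vote_2 = 0
    UNION
    SELECT v.id, u.uid, v.updated_at
    FROM votes v
    INNER JOIN users u ON v.user_id_2 = u.uid
    WHERE v.user_id_1 = ? AND v.vote_1 = 0
    ORDER BY updated_at DESC
    LIMIT 10
"""
# the same user ID is bound once per branch
rows = conn.execute(sql, (9834, 9834)).fetchall()
print(rows)
```

Each branch has a plain equality join and a plain equality filter, so an index on user_id_1 and another on user_id_2 can each serve one branch.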
You've answered your own question: yes, you should redesign the table, as it's not working for you. It's too slow, and requires overly complicated queries. Fortunately, migrating the data is just a matter of doing essentially the query you're asking about here, but for all users instead of just one. (That is, a sum or count over the unions the first answer suggested.)