I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But for the size of the tables, the query is slower than it should be (0.17s). The reason, based on the EXPLAIN provided below, is a table scan on the meta_relationships table, caused by the COUNT in the WHERE clause (the tables use the InnoDB engine).
Query:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND meta_relationships.object_id
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id
This particular query selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories a post has, the higher the count.
Ideally, I'd like to get it functioning like this:
WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)
or
WHERE meta.meta_name IN ('computers','software')
AND meta_relationships.meta_order IN (0,1)
etc..
But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.
I've tried...
WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)
but it doesn't return the correct number of rows.
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY meta ref PRIMARY,idx_meta_name idx_meta_name 602 const 1 Using where; Using index; Using temporary; Using filesort
1 PRIMARY meta_data ref PRIMARY,idx_meta_id idx_meta_id 8 database.meta.meta_id 1
1 PRIMARY meta_relationships ref idx_meta_data_id idx_meta_data_id 8 database.meta_data.meta_data_id 11 Using where
1 PRIMARY posts eq_ref PRIMARY PRIMARY 4 database.meta_relationships.object_id 1
2 MATERIALIZED meta_relationships index NULL idx_object_id 4 NULL 14679 Using index
Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to posts.post_id (object_id), a foreign key to meta_data_id, and the order of the categories (meta_order).
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)
The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
I believe the key to optimizing the query is to get away completely from doing the COUNT.
An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
The meta_relationships table will grow quickly, and at its current size (~15K rows) I'm hoping to achieve an execution time in the hundredths of a second rather than tenths of a second.
Since there needs to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
I have created an IDE with sample data.
How can I optimize this query?
EDIT:
I was never able to find an optimal solution to this problem. It was really a combination of smcjones's recommendation of improving the indexes, for which I would recommend doing an EXPLAIN, looking at EXPLAIN Output Format, and then changing the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.
SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'
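One caveat: comparing the GROUP_CONCAT result to a literal like 'category,subcategory' assumes a stable concatenation order, which MySQL does not guarantee by default. If you rely on that comparison, you can pin the order inside GROUP_CONCAT (a sketch using the meta_order column from the junction table):
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name, NULL)
             ORDER BY meta_relationships.meta_order) AS category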
After getting rid of the COUNT, the big performance killers were the GROUP BY and ORDER BY, but indexes are your best friend. I learned that when doing a GROUP BY, the WHERE clause is very important; the more specific you can get, the better.
With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.
I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.
Based on your example, I have created this SQLFiddle.
Compare that to SQLFiddle #2, in which I added indexes and a UNIQUE index on meta.meta_name.
From my testing, Fiddle #2 is faster.
Optimizing Your Query
This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):
SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
FROM posts
LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'animals'
GROUP BY meta_relationships.object_id
HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);
HAVING sum() = min() on a GROUP BY should check whether there is more than one record in each group: every time the record shows up it adds more to the sum, so a group with more than one row ends up with a sum greater than its minimum (this relies on object_id being positive). (Edit: on subsequent tests this seems to have the same impact as count(meta_relationships.object_id) = 1. Oh well, the point is I believe you can remove the subquery and get the same result.)
I want to be clear that you won't notice much, if any, optimization in the query I provided unless the condition WHERE meta.meta_name = 'animals' is querying against an index (preferably a unique index, because I doubt you'll need more than one of these and it will prevent accidental duplication of data).
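For reference, such an index could be added like this (a sketch; the index name ux_meta_name is my own):
-- Unique index so meta_name lookups are index-backed and duplicates are rejected
ALTER TABLE meta ADD UNIQUE INDEX ux_meta_name (meta_name);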
So, instead of a table that looks like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT);
You should make sure you add primary keys and indexes like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT,
PRIMARY KEY (meta_data_id,meta_id),
INDEX ix_meta_id (meta_id)
);
Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.
When indexes are not used, MySQL will walk through each row of the table until it finds what you want. In a sample as limited as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, it will become extraordinarily painful.
In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or filtering with a WHERE clause.
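For example, prefixing a query with EXPLAIN shows the access plan; a type of ALL (or an empty key column) marks a full table scan. A sketch against the tables above, mirroring the problematic subquery:
EXPLAIN
SELECT mr.object_id
FROM meta_relationships mr
GROUP BY mr.object_id
HAVING COUNT(*) > 1;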
A note on the count column
I have not found putting count columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, it should be very easy to use COUNT() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.
The problem with putting count columns into a table is that you need to keep that count updated, using either a TRIGGER or, worse, application logic. As your program scales out, that logic can get lost or buried. Adding such a column is a deviation from normalization, and when something like this occurs there should be a VERY good reason.
Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.
Since the HAVING seems to be the issue, could you create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field whenever the post is edited.
So, consider this:
Add a new field to the posts table called "num_meta_rel". It can be an unsigned TINYINT as long as you'll never have more than 255 tags on any one post.
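In DDL terms that might look like the following (a sketch; the index is my addition so the new column can be filtered cheaply):
ALTER TABLE posts
  ADD COLUMN num_meta_rel TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD INDEX ix_num_meta_rel (num_meta_rel);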
Update the field like this:
UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);
This query will take some time to run, but once it's done you have all the counts precalculated. Note that this can be done better with a join, but SQLite (Ideone) only allows subqueries.
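In MySQL itself, the join form might look like this (a hedged sketch with the same effect as the subquery above):
-- Recompute every post's relationship count in one pass
UPDATE posts p
LEFT JOIN (
    SELECT object_id, COUNT(*) AS cnt
    FROM meta_relationships
    GROUP BY object_id
) c ON c.object_id = p.post_id
SET p.num_meta_rel = COALESCE(c.cnt, 0);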
Now, you rewrite your query like this:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id
If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx
Note that this solution requires you to update num_meta_rel (choose a better name; that one is terrible...) whenever a new tag is associated with a post. But that should be much faster than scanning your entire table over and over.
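If you'd rather not rely on application logic, a pair of triggers could keep the counter in sync automatically (a sketch; the trigger names are my own):
DELIMITER //
CREATE TRIGGER trg_meta_rel_ins AFTER INSERT ON meta_relationships
FOR EACH ROW
    -- A new relationship bumps the post's counter
    UPDATE posts SET num_meta_rel = num_meta_rel + 1 WHERE post_id = NEW.object_id//
CREATE TRIGGER trg_meta_rel_del AFTER DELETE ON meta_relationships
FOR EACH ROW
    -- A removed relationship decrements it
    UPDATE posts SET num_meta_rel = num_meta_rel - 1 WHERE post_id = OLD.object_id//
DELIMITER ;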
See if this gives you the right answer, possibly faster:
SELECT p.post_id, p.post_name,
GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
FROM
( SELECT object_id
FROM meta_relationships
GROUP BY object_id
HAVING count(*) = 1
) AS x
JOIN meta_relationships AS mr ON mr.object_id = x.object_id
JOIN posts AS p ON p.post_id = mr.object_id
JOIN meta_data AS md ON mr.meta_data_id = md.meta_data_id
JOIN meta ON md.meta_id = meta.meta_id
WHERE meta.meta_name = ?
GROUP BY mr.object_id
Unfortunately I have no way to test performance, but try my query using your real data:
http://sqlfiddle.com/#!9/81b29/13
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING count(*) < 3
) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
SELECT *
FROM meta
WHERE meta.meta_name = 'health'
) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
Use sum(1) instead of count(*).
SELECT COUNT(*)
FROM song AS s
JOIN user AS u
ON(u.user_id = s.user_id)
WHERE s.is_active = 1 AND s.public = 1
s.is_active and s.public are indexed, as are u.user_id and s.user_id.
song table row count 310k
user table row count 22k
Is there a way to optimize this? We're getting 1-second query times.
Ensure that you have a compound "covering" index on song: (user_id, is_active, public). Here, we've named the index covering_index:
SELECT COUNT(s.user_id)
FROM song s FORCE INDEX (covering_index)
JOIN user u
ON u.user_id = s.user_id
WHERE s.is_active = 1 AND s.public = 1
Here, we're ensuring that the JOIN is done with the covering index instead of the primary key, so that the covering index can be used for the WHERE clause as well.
I also changed COUNT(*) to COUNT(s.user_id). Though MySQL should be smart enough to pick the column from the index, I explicitly named the column just in case.
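For reference, creating that index might look like this (a sketch; the name matches the FORCE INDEX hint above):
ALTER TABLE song ADD INDEX covering_index (user_id, is_active, public);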
Ensure that you have enough memory configured on the server so that all of your indexes can stay in memory.
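As a rough check (a sketch; what counts as "enough" depends on your workload), you can compare the InnoDB buffer pool size against the total size of your indexes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Total index size (MB) across the current schema
SELECT ROUND(SUM(index_length) / 1024 / 1024) AS total_index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE();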
If you're still having issues, please post the results of EXPLAIN.
Perhaps write it as a stored procedure or view. You could also try selecting all the IDs first and then running the count on the result; if you do it all as one query, it may be faster. Generally, optimisation is done by using nested selects or by making the server do the work, so in this context that is all I can think of.
SELECT COUNT(*) FROM
(SELECT t.user_id FROM
(SELECT * FROM song WHERE song.is_active = 1 AND song.public = 1) AS t
JOIN user AS u
ON (t.user_id = u.user_id)) AS counted
Also be sure you are using the correct kind of join.
I have two tables - one called customer_records and another called customer_actions.
customer_records has the following schema:
CustomerID (auto increment, primary key)
CustomerName
...etc...
customer_actions has the following schema:
ActionID (auto increment, primary key)
CustomerID (relates to customer_records)
ActionType
ActionTime (UNIX time stamp that the entry was made)
Note (TEXT type)
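Roughly, the two tables look like this in DDL (a sketch; the column types are approximations, and the composite index on (CustomerID, ActionTime) is my addition to support last-action lookups):
CREATE TABLE customer_records (
    CustomerID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    CustomerName VARCHAR(100)
    -- ...etc...
);

CREATE TABLE customer_actions (
    ActionID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    CustomerID INT UNSIGNED NOT NULL,  -- relates to customer_records
    ActionType VARCHAR(50) NOT NULL,
    ActionTime INT UNSIGNED NOT NULL,  -- UNIX time stamp that the entry was made
    Note TEXT,
    KEY ix_customer_time (CustomerID, ActionTime)
);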
Every time a user carries out an action on a customer record, an entry is made in customer_actions, and the user is given the opportunity to enter a note. ActionType can be one of a few values (like 'designatory update' or 'added case info' - can only be one of a list of options).
What I want to be able to do is display a list of records from customer_records where the last ActionType was a certain value.
So far, I've searched the net/SO and come up with this monster:
SELECT * FROM (
SELECT * FROM (
SELECT * FROM `customer_actions` ORDER BY `ActionID` DESC
) list1 GROUP BY `CustomerID`
) list2 WHERE `ActionType`='whatever' LIMIT 0,30
Which is great: it lists each customer ID and their last action. But the query is extremely slow on occasion (note: there are nearly 20,000 records in customer_records). Can anyone offer any tips on how I can sort out this monster of a query, or adjust my table to give faster results? I'm using MySQL. Any help is really appreciated, thanks.
Edit: To be clear, I need to see a list of customers whose last action was 'whatever'.
To filter customers by their last action, you could use a correlated sub-query...
SELECT
*
FROM
customer_records
INNER JOIN
customer_actions
ON customer_actions.CustomerID = customer_records.CustomerID
AND customer_actions.ActionTime = (
SELECT
MAX(ActionTime)
FROM
customer_actions AS lookup
WHERE
CustomerID = customer_records.CustomerID
)
WHERE
customer_actions.ActionType = 'Whatever'
You may find it more efficient to avoid the correlated sub-query as follows...
SELECT
*
FROM
customer_records
INNER JOIN
(SELECT CustomerID, MAX(ActionTime) AS ActionTime FROM customer_actions GROUP BY CustomerID) AS last_action
ON customer_records.CustomerID = last_action.CustomerID
INNER JOIN
customer_actions
ON customer_actions.CustomerID = last_action.CustomerID
AND customer_actions.ActionTime = last_action.ActionTime
WHERE
customer_actions.ActionType = 'Whatever'
I'm not sure if I understand the requirements but it looks to me like a JOIN would be enough for that.
SELECT cr.CustomerID, cr.CustomerName, ...
FROM customer_records cr
INNER JOIN customer_actions ca ON ca.CustomerID = cr.CustomerID
WHERE `ActionType` = 'whatever'
ORDER BY
ca.ActionID
Note that 20,000 records should not pose a performance problem.
Please note that I've adapted Lieven's answer (I made a separate post as this was too long for a comment). Any credit for the solution itself goes to him, I'm just trying to show you some key points for improving performance.
If speed is a concern then the following should give you some suggestions for improving it:
select
cr.CustomerID,
cr.CustomerName,
cr.MoreDetail1,
cr.Etc
from customer_records cr
inner join customer_actions ca
on ca.CustomerID = cr.CustomerID
where ca.ActionType = 'x'
order by cr.CustomerID
limit 100 -- change as required
A few notes:
In some cases I find left outer joins to be faster than inner joins; it would be worth measuring the performance of both for this query
Avoid returning * wherever possible
You don't have to reference 'cr.x' in the initial select, but it's a good habit to get into for when you start working on large queries that can have multiple joins in them (this will make a lot of sense once you start doing it)
When using joins, always join on a primary key
Maybe I'm missing something but what's wrong with a simple join and a where clause?
Select ActionType, ActionTime, Note
FROM Customer_Records CR
INNER JOIN customer_Actions CA
ON CR.CustomerID = CA.CustomerID
Where ActionType = 'added case info'
I have a big query like this and I can't totally rebuild the application because of the customer:
SELECT count(AdvertLog.id) as count,AdvertLog.advert,AdvertLog.ut_fut_tstamp_dmy as day,
AdvertLog.operation,
Advert.allow_clicks,
Advert.slogan as name,
AdvertLog.log,
(User.tx_reality_credit
+-20
-(SELECT COUNT(advert_log.id) FROM advert_log WHERE ut_fut_tstamp_dmy <= day AND operation = 0 AND advert IN (168))
+(SELECT IF(ISNULL(SUM(log)),0,SUM(log)) FROM advert_log WHERE ut_fut_tstamp_dmy <= day AND operation IN (1, 2) AND advert = 40341 )) AS points
FROM `advert_log` AS AdvertLog
LEFT JOIN `tx_reality_advert` Advert ON Advert.uid = AdvertLog.advert
LEFT JOIN `fe_users` AS User ON (User.uid = Advert.user or User.uid = AdvertLog.advert)
WHERE User.uid = 40341 and AdvertLog.id>0
GROUP BY AdvertLog.ut_fut_tstamp_dmy, AdvertLog.advert
ORDER BY AdvertLog.ut_fut_tstamp_dmy_12 DESC,AdvertLog.operation,count DESC,name
LIMIT 0, 15
It takes approximately 1.5s, which is too long.
Indexes:
User.uid
AdvertLog.advert
AdvertLog.operation
AdvertLog.ut_fut_tstamp_dmy
AdvertLog.id
Advert.user
AdvertLog.log
Output of Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY User const PRIMARY PRIMARY 4 const 1 Using temporary; Using filesort
1 PRIMARY AdvertLog range PRIMARY,advert PRIMARY 4 NULL 21427 Using where
1 PRIMARY Advert eq_ref PRIMARY PRIMARY 4 etrend.AdvertLog.advert 1 Using where
3 DEPENDENT SUBQUERY advert_log ref ut_fut_tstamp_dmy,operation,advert advert 5 const 1 Using where
2 DEPENDENT SUBQUERY advert_log index_merge ut_fut_tstamp_dmy,operation,advert advert,operation 5,2 NULL 222 Using intersect(advert,operation); Using where
Can anyone help me? I've tried different things but with no improvement.
The query is pretty large, and I'd expect this to take a fair bit of time, but you could try adding an index on Advert.uid, if it's not present. Other than that, someone with much better SQL-foo than I will have to answer this.
First, your WHERE clause is based on a specific "User.ID", yet there is an index on the Advert_Log by the Advert (user ID). So, first, change the WHERE clause to reflect this...
Where
AdvertLog.Advert = 40341
Then, change the "LEFT JOIN" to the user table into a plain "JOIN".
Finally (without a full rewrite of the query), I would tack on the "STRAIGHT_JOIN" keyword...
select STRAIGHT_JOIN
... rest of query ...
This tells the optimizer to perform the query in the order / with the relations explicitly stated.
Another area to optimize would be to pre-query the "points" (the counts and logs based on advert and operation) once and pull the answer from that as a subquery, instead of running two correlated subqueries per row. But I'd be interested to know the impact of the WHERE, JOIN, and STRAIGHT_JOIN changes above.
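A hedged sketch of that idea: aggregate the per-day counters once and turn them into running totals with a self-join, rather than running two correlated subqueries for every output row (the values 168 and 40341 are taken from the original query):
SELECT d1.day,
       SUM(d2.ops)  AS ops_to_date,   -- replaces the COUNT(...) subquery
       SUM(d2.logs) AS logs_to_date   -- replaces the SUM(log) subquery
FROM (
    SELECT ut_fut_tstamp_dmy AS day,
           SUM(operation = 0 AND advert IN (168)) AS ops,
           SUM(IF(operation IN (1, 2) AND advert = 40341, log, 0)) AS logs
    FROM advert_log
    GROUP BY ut_fut_tstamp_dmy
) AS d1
JOIN (
    SELECT ut_fut_tstamp_dmy AS day,
           SUM(operation = 0 AND advert IN (168)) AS ops,
           SUM(IF(operation IN (1, 2) AND advert = 40341, log, 0)) AS logs
    FROM advert_log
    GROUP BY ut_fut_tstamp_dmy
) AS d2 ON d2.day <= d1.day
GROUP BY d1.day;
Joined back into the main query as a derived table, this runs the heavy aggregation once instead of twice per result row.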
Additionally, look at the join to the user table: it matches on EITHER Advert_Log.Advert (a user ID) OR TX_Reality_Credit.User (another user ID), which do not appear to be the same, since the join between Advert_Log and TX_Reality_Credit (TRC) is based on TRC.UID (unless that is an incorrect assumption). This could give erroneous results, because you are testing for MULTIPLE user IDs (the advert user, plus whoever the "user" from the TRC table is), which determines whose credit is applied to the "points" calculation.
To better understand the relationship and context, can you give some more clarification of what is IN these tables from the Advert_Log to TX_Reality_Credit perspective, and the Advert vs UID vs User...
I have an optimisation question here.
Background
I have 12,000 users in a user table, one record per user. Each user can be in zero or more groups. I have a groups table with 45 groups and a groups_link table with 75,000 records (to facilitate the many-to-many relationship).
I am making a querying screen which allows a user to filter users from the user list.
Aim
Specifically, I need help with: Selecting all users that are in one group but are not in another group.
DB Structure
Query
My current query which runs too slowly...
SELECT U.user_id,U.user_email
FROM (sc_module_users AS U)
JOIN sc_module_users_groups_links AS group_join ON group_join.user_id = U.user_id
LEFT JOIN sc_module_users_groups_links AS excluded_group_join ON group_join.user_id = U.user_id
WHERE group_join.group_id IN (27) AND excluded_group_join.group_id NOT IN (19) OR excluded_group_join.group_id IS NULL AND U.user_subscribed=1 AND U.user_active=1
GROUP BY U.user_id,U.user_id
This query takes 9 minutes to complete and returns 11,000 records (out of 12,000).
Explain
Here's the EXPLAIN on that query (posted as a screenshot).
Can anyone help me optimise this to below the 1 minute mark...?
After 3 revisions, I changed it to this
SELECT U.user_id, U.user_email
FROM (sc_module_users AS U)
WHERE ( user_country LIKE '%australia%' )
AND EXISTS (SELECT group_id FROM sc_module_users_groups_links l WHERE group_id IN (31) AND l.user_id = U.user_id)
AND NOT EXISTS (SELECT group_id FROM sc_module_users_groups_links l WHERE group_id IN (27) AND l.user_id = U.user_id)
AND U.user_subscribed = 1 AND U.user_active = 1
GROUP BY U.user_id
mucccch faster
EDIT: removed my query suggestion but the index stuff should still apply:
The indexes on sc_module_users_groups_links could be improved by creating a composite index on just user_id and group_id. The order of the columns in the index can have an impact; I believe having user_id first should perform better.
You could also try removing the link_id and just using a composite primary key since the link_id doesn't seem to serve any other purpose.
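That might look like this (a sketch; it assumes link_id is the current auto-increment primary key and that nothing else references it):
ALTER TABLE sc_module_users_groups_links
    DROP PRIMARY KEY,
    DROP COLUMN link_id,
    ADD PRIMARY KEY (user_id, group_id);  -- doubles as the composite index above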
I believe the very first thing you need to do is to place parentheses:
-- should be
.. AND ( excluded_group_join.group_id NOT IN (19)
         OR excluded_group_join.group_id IS NULL ) AND ....