Which columns should I be indexing? - mysql

I have a query that takes an insanely long time to execute. Here's the query:
SELECT *
FROM `posts`
WHERE `posts`.`id` IN (... MANY MANY DOZENS OF IDs ...)
ORDER BY `created_at` DESC;
Would I create an index on just id or on both id and created_at?

For your query, an index only on posts(id) is best. If you had only one id in the list, then you could use posts(id, created_at).
If the order by is taking most of the time, you could try this version:
select p.*
from (select p.*
      from posts p
      order by created_at desc
     ) p
where p.id in (. . .);
Under some circumstances, this might avoid the sort if you have an index on posts(created_at). I'm not thrilled with this formulation, though, because it depends on the subquery returning ordered results -- something that happens to work in practice in MySQL but is not guaranteed.
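To make the original query concrete, here is a small runnable sketch. It uses SQLite via Python rather than MySQL, and the table and rows are invented, but the query shape matches the one above: the primary-key index on id resolves the IN list, and the ORDER BY sorts only the matching rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT)")
con.executemany("INSERT INTO posts VALUES (?, ?)",
                [(i, f"2023-01-{i:02d}") for i in range(1, 31)])

# The IN list is resolved through the primary-key index on id;
# the ORDER BY then sorts just the handful of matching rows.
rows = con.execute(
    "SELECT * FROM posts WHERE id IN (3, 7, 12, 25) "
    "ORDER BY created_at DESC"
).fetchall()
print(rows)  # newest matching post first
```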

Related

MySQL slow query joining two tables with fulltext condition on each

I need to join two tables (1M rows and 10M rows respectively)
Each table is filtered with a fulltext match condition :
SELECT SQL_NO_CACHE c.company_index
FROM dw.companies c INNER JOIN dw.people p
ON c.company_index = p.company_index
WHERE MATCH ( c.tag ) AGAINST ( 'ecommerce' IN BOOLEAN MODE )
AND MATCH ( p.title ) AGAINST ( 'director' IN BOOLEAN MODE )
ORDER BY c.company_index DESC ;
Both tables have fulltext indexes (on "tag" and "title")
The query takes more than a minute with both conditions.
With only one of the two conditions, the query time is below 1 sec.
How could I optimize this query?
I think the problem is that FULLTEXT is very fast if it can be performed first, but very slow if not. With both MATCH conditions in the original query, one of them can run first, but the other cannot.
Here's a messy idea on how to work around the problem.
SELECT c.company_index
FROM ( SELECT company_index FROM companies WHERE MATCH... ) AS c
JOIN ( SELECT company_index FROM people WHERE MATCH... ) AS p
    ON c.company_index = p.company_index
ORDER BY ...
What version of MySQL are you using? Newer versions will automatically create an index on one of the 'derived' tables, thereby making the JOIN quite efficient.
Here's another approach:
SELECT c.company_index
FROM ( SELECT company_index FROM companies WHERE MATCH... ) AS c
WHERE EXISTS ( SELECT 1 FROM people WHERE MATCH...
AND company_index = c.company_index )
ORDER BY ...
In both cases (I hope) one SELECT will use one FT index; the other will use the other, thereby getting the FT performance benefit.
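The filter-then-join pattern is easy to try outside MySQL too. The sketch below uses SQLite from Python, with LIKE filters standing in for the MATCH conditions (SQLite's fulltext syntax differs), and an invented companies/people dataset. Each derived table applies its own filter before the join, which is the whole point of the rewrite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE companies (company_index INTEGER, tag TEXT);
CREATE TABLE people (company_index INTEGER, title TEXT);
INSERT INTO companies VALUES (1, 'ecommerce retail'), (2, 'fintech'), (3, 'ecommerce b2b');
INSERT INTO people VALUES (1, 'sales director'), (2, 'director of ops'), (3, 'engineer');
""")

# LIKE stands in for MATCH ... AGAINST here; each derived table is
# filtered independently, then the two small results are joined.
rows = con.execute("""
    SELECT c.company_index
    FROM ( SELECT company_index FROM companies WHERE tag LIKE '%ecommerce%' ) AS c
    JOIN ( SELECT company_index FROM people WHERE title LIKE '%director%' ) AS p
      ON c.company_index = p.company_index
    ORDER BY c.company_index DESC
""").fetchall()
print(rows)  # only company 1 passes both filters
```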

Combined sql query takes too much time

One of my MySQL queries takes too much time to execute. In this query, I use the IN operator to fetch data from the MySQL database.
My query :
SELECT *
FROM databse_posts.post_feeds
WHERE
post_id IN (SELECT post_id FROM database_users.user_bookmarks where user_id=3) AND
post_date < unix_timestamp();
Individually, both queries take very little time to execute:
SELECT post_id FROM database_users.user_bookmarks where user_id=3
takes around 400 ms max
and
SELECT * FROM databse_posts.post_feeds Where post_date < unix_timestamp();
takes 300 ms max
But combining both queries into one using the IN operator takes around 6 to 7 seconds.
Why is this taking so much time?
I have also written other queries of the same type, but none of them takes that long.
Instead of a WHERE IN (subselect) you could try an INNER JOIN on the subselect:
SELECT *
FROM databse_posts.post_feeds
INNER JOIN (
    SELECT post_id
    FROM database_users.user_bookmarks
    WHERE user_id = 3
) T ON T.post_id = post_feeds.post_id
    AND post_date < unix_timestamp();
Also be sure you have proper indexes on post_feeds.post_id and on (user_bookmarks.user_id, user_bookmarks.post_id).
My approach:
You need to create indexes on the post_feeds.post_id, user_bookmarks.post_id, user_bookmarks.user_id and post_feeds.post_date fields, then use an INNER JOIN to let the MySQL engine handle the filtering and merging of rows efficiently:
SELECT
pf.*
FROM
databse_posts.post_feeds pf
INNER JOIN database_users.user_bookmarks ub
ON ( pf.post_id = ub.post_id )
WHERE
ub.user_id = 3
AND pf.post_date < unix_timestamp();
My rough guess here would be that the WHERE IN expression is doing something of which you might not be aware. Consider your full query:
SELECT *
FROM databse_posts.post_feeds
WHERE
post_id IN (SELECT post_id FROM database_users.user_bookmarks where user_id=3) AND
post_date < unix_timestamp();
MySQL has to take each record's post_id and compare it to the list of post_id values coming from the subquery. This is much more costly than just running that subquery once. There are various tricks at MySQL's disposal to speed this up, but a subquery inside a WHERE IN clause is not the same as just running that subquery once.
If this hypothesis is correct, then the following query should also take in the range of 6-7 seconds:
SELECT *
FROM databse_posts.post_feeds
WHERE
post_id IN (SELECT post_id FROM database_users.user_bookmarks where user_id=3)
If so, then we would know the source of the slow performance.
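One quick way to convince yourself that the JOIN rewrite returns the same rows as the IN form is to run both against a toy dataset. This sketch uses SQLite from Python with invented tables (one schema, no cross-database names), so it illustrates the equivalence of results, not MySQL's execution plan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE post_feeds (post_id INTEGER, post_date INTEGER);
CREATE TABLE user_bookmarks (user_id INTEGER, post_id INTEGER);
INSERT INTO post_feeds VALUES (1, 100), (2, 200), (3, 300), (4, 400);
INSERT INTO user_bookmarks VALUES (3, 1), (3, 3), (5, 2);
""")

now = 350  # stand-in for unix_timestamp()

# The original WHERE IN form.
in_rows = con.execute("""
    SELECT * FROM post_feeds
    WHERE post_id IN (SELECT post_id FROM user_bookmarks WHERE user_id = 3)
      AND post_date < ?""", (now,)).fetchall()

# The INNER JOIN rewrite.
join_rows = con.execute("""
    SELECT pf.* FROM post_feeds pf
    JOIN user_bookmarks ub ON pf.post_id = ub.post_id
    WHERE ub.user_id = 3 AND pf.post_date < ?""", (now,)).fetchall()

print(in_rows, join_rows)  # both return posts 1 and 3
```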

Do I need inner ORDER BY when there is an outer ORDER BY?

Here is my query:
( SELECT id, table_code, seen, date_time FROM events
WHERE author_id = ? AND seen IS NULL
) UNION
( SELECT id, table_code, seen, date_time FROM events
WHERE author_id = ? AND seen IS NOT NULL
LIMIT 2
) UNION
( SELECT id, table_code, seen, date_time FROM events
WHERE author_id = ?
ORDER BY (seen IS NULL) desc, date_time desc -- inner ORDER BY
LIMIT 15
)
ORDER BY (seen IS NULL) desc, date_time desc; -- outer ORDER BY
As you see, there is an outer ORDER BY, and one of the subqueries also has its own ORDER BY. I believe the ORDER BY in the subquery is useless because the final result will be sorted by the outer one. Am I right? Or does that inner ORDER BY have any effect on the sorting?
Also, a second question about the query above: in reality I just need id and table_code. I've selected seen and date_time only for that outer ORDER BY. Can I do that better?
You need the inner order by when you have a limit in the query. So, the third subquery is choosing 15 rows based on the order by.
In general, when you have limit, you should be using order by. This is particularly true if you are learning databases. You might seem to get the right answer -- and then be very surprised when it doesn't work at some later point in time. Just because something seems to work doesn't mean that it is guaranteed to work.
The outer order by just sorts all the rows returned by the subqueries.
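The interplay of inner and outer ORDER BY can be shown with a tiny example. This uses SQLite via Python and a made-up events table; without the inner ORDER BY, which two rows the LIMIT keeps would be arbitrary.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, date_time TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "2023-01-01"), (2, "2023-01-03"), (3, "2023-01-02")])

# The inner ORDER BY decides WHICH two rows survive the LIMIT;
# the outer ORDER BY only decides how the final rows are presented.
rows = con.execute("""
    SELECT id FROM (
        SELECT id, date_time FROM events
        ORDER BY date_time DESC   -- picks the 2 newest events
        LIMIT 2
    ) t
    ORDER BY id                   -- presentation order only
""").fetchall()
print(rows)  # the two newest events (ids 2 and 3), listed by id
```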

Optimizing database query with up to 10mil rows as result

I have a MySQL query that I need to optimize as much as possible (it should have a load time below 5 s, if possible).
The query is as follows:
SELECT domain_id, COUNT(keyword_id) as total_count
FROM tableName
WHERE keyword_id IN (SELECT DISTINCT keyword_id FROM tableName WHERE domain_id = X)
GROUP BY domain_id
ORDER BY total_count DESC
LIMIT ...
X is an integer that comes from an input
domain_id and keyword_id are indexed
database is on localhost, so the network speed should be max
The subquery in the WHERE clause can return up to 10 million results. Also, MySQL seems to have a really hard time calculating the COUNT and doing the ORDER BY on that count.
I tried combining this query with Solr, but without success; retrieving such a high number of rows at once is hard for both MySQL and Solr.
I'm looking for a solution that gives the same results, even if I have to use a different technology or an improved version of this MySQL query.
Thanks!
Query logic is this:
We have a domain and we search for all the keywords used on that domain (this is the subquery). Then we take all the domains that use at least one of the keywords found by the first query, grouped by domain, with the number of keywords used by each domain, and we display them ordered DESC by the number of keywords used.
I hope this makes sense.
You may try a JOIN instead of the subquery:
SELECT tableName.domain_id, COUNT(tableName.keyword_id) AS total_count
FROM tableName
INNER JOIN tableName AS rejoin
    ON rejoin.keyword_id = tableName.keyword_id
WHERE rejoin.domain_id = X
GROUP BY tableName.domain_id
ORDER BY total_count DESC
LIMIT ...
I am not 100% sure, but can you try this, please:
SELECT t1.domain_id, COUNT(t1.keyword_id) AS total_count
FROM tableName AS t1
LEFT JOIN (SELECT DISTINCT keyword_id FROM tableName WHERE domain_id = X) AS t2
    ON t1.keyword_id = t2.keyword_id
WHERE t2.keyword_id IS NOT NULL
GROUP BY t1.domain_id
ORDER BY total_count DESC
LIMIT ...
The goal is to replace the WHERE IN clause with an INNER JOIN, which will make it a lot quicker. A WHERE IN clause always makes the MySQL server struggle, and it is even more noticeable with huge amounts of data. Use WHERE IN only if it makes your query easier to read/understand, you have a small data set, or there is no other way (but you will probably have another way to do it anyway :) )
In terms of MySQL, all you can do is minimize disk IO for the query using covering indexes and rewrite it a little more efficiently so that it benefits from them.
Since keyword_id has a match in another copy of the table, COUNT(keyword_id) becomes COUNT(*).
The kind of subquery you use is known to be the worst case for MySQL (it executes the subquery for each row), but I am not sure whether it should be replaced with a JOIN here, because it might be a proper strategy for your data.
As you probably understand, the query like:
SELECT domain_id, COUNT(*) as total_count
FROM tableName
WHERE keyword_id IN (X,Y,Z)
GROUP BY domain_id
ORDER BY total_count DESC
would have the best performance with a covering composite index (keyword_id, domain_id [,...]), so it is a must. From the other side, the query like:
SELECT DISTINCT keyword_id FROM tableName WHERE domain_id = X
will have the best performance on a covering composite index (domain_id, keyword_id [,...]). So you need both of them.
Hopefully (though I am not sure), when you have the latter index, MySQL can understand that it does not need to materialize all those keyword_id values in the subquery, but only check whether an entry exists in the index; I am sure that intent is better expressed if you do not use DISTINCT.
So, I would try to add those two indexes and rewrite the query as:
SELECT domain_id, COUNT(*) as total_count
FROM tableName
WHERE keyword_id IN (SELECT keyword_id FROM tableName WHERE domain_id = X)
GROUP BY domain_id
ORDER BY total_count DESC
Another option is to rewrite the query as follows:
SELECT domain_id, COUNT(*) as total_count
FROM (
SELECT DISTINCT keyword_id
FROM tableName
WHERE domain_id = X
) as kw
JOIN tableName USING (keyword_id)
GROUP BY domain_id
ORDER BY total_count DESC
Once again you need those two composite indexes.
Which one of the queries is quicker depends on the statistics in your tableName.
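Here is a rough sketch of the two-index setup, using SQLite from Python with invented data (SQLite's planner differs from MySQL's, so this shows the indexes and the rewritten query, not MySQL's actual costs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableName (domain_id INTEGER, keyword_id INTEGER);
-- the two covering composite indexes suggested above
CREATE INDEX idx_kw_dom ON tableName (keyword_id, domain_id);
CREATE INDEX idx_dom_kw ON tableName (domain_id, keyword_id);
INSERT INTO tableName VALUES
    (1, 10), (1, 20), (2, 10), (2, 30), (3, 10), (3, 20);
""")

# Domains sharing at least one keyword with domain 1, with overlap counts.
rows = con.execute("""
    SELECT domain_id, COUNT(*) AS total_count
    FROM tableName
    WHERE keyword_id IN (SELECT keyword_id FROM tableName WHERE domain_id = 1)
    GROUP BY domain_id
    ORDER BY total_count DESC
""").fetchall()
print(rows)  # domains 1 and 3 share two keywords, domain 2 shares one
```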

MySQL: query the top n aggregations

I have a table which counts occurrences of one specific action by different users on different objects:
CREATE TABLE `Actions` (
`object_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`actionTime` datetime
);
Every time a user performs this action, a row is inserted. I can count how many actions were performed on each object, and order objects by 'activity':
SELECT object_id, count(object_id) AS action_count
FROM `Actions`
GROUP BY object_id
ORDER BY action_count;
How can I limit the results to the top n objects? The LIMIT clause is applied before the aggregation, so it leads to wrong results. The table is potentially huge (millions of rows) and I probably need to count tens of times per minute, so I'd like to do this as efficient as possible.
edit: Actually, Machine is right, and I was wrong with the time at which LIMIT is applied. My query returned the correct results, but the GUI presenting them to me threw me off...this kind of makes this question pointless. Sorry!
Actually... LIMIT is applied last, after an eventual HAVING clause. So it should not give you incorrect results. However, since LIMIT is applied last, it will not make your query execute any faster, since a temporary table has to be created and sorted in order of action count before chopping off the result. Also, remember to sort in descending order:
SELECT object_id, count(object_id) AS action_count
FROM `Actions`
GROUP BY object_id
ORDER BY action_count DESC
LIMIT 10;
You could try adding an index on object_id as an optimization. That way only the index needs to be scanned instead of the Actions table.
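A minimal check that LIMIT really is applied after the grouping and sorting, using SQLite via Python with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Actions (object_id INTEGER, user_id INTEGER, actionTime TEXT)")
con.executemany("INSERT INTO Actions (object_id, user_id) VALUES (?, 1)",
                [(1,), (1,), (1,), (2,), (2,), (3,)])

# LIMIT runs after GROUP BY and ORDER BY, so this really is the top 2.
rows = con.execute("""
    SELECT object_id, COUNT(object_id) AS action_count
    FROM Actions
    GROUP BY object_id
    ORDER BY action_count DESC
    LIMIT 2
""").fetchall()
print(rows)  # → [(1, 3), (2, 2)]
```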
How about:
SELECT * FROM
(
    SELECT object_id, count(object_id) AS action_count
    FROM `Actions`
    GROUP BY object_id
    ORDER BY action_count DESC
) AS t
LIMIT 15
Also, if you have some measure of the minimum number of actions needed to be included (e.g. the top n ones surely have more than 1000), you can increase the efficiency by adding a HAVING clause:
SELECT * FROM
(
    SELECT object_id, count(object_id) AS action_count
    FROM `Actions`
    GROUP BY object_id
    HAVING action_count > 1000
    ORDER BY action_count DESC
) AS t
LIMIT 15
I know this thread is 2 years old, but Stack Overflow still finds it relevant, so here goes my $0.02. ORDER BY clauses are computationally very expensive, so they should be avoided on large tables. A trick I used (in part from Joe Celko's SQL for Smarties) is something like:
SELECT t0.object_id, t0.action_count
FROM (SELECT object_id, COUNT(*) AS action_count
      FROM actions GROUP BY object_id) AS t0,
     (SELECT object_id, COUNT(*) AS action_count
      FROM actions GROUP BY object_id) AS t1
WHERE t1.action_count >= t0.action_count
GROUP BY t0.object_id, t0.action_count
HAVING COUNT(*) <= 15
This will give you the top 15 objects without sorting: for each object, it counts how many objects have at least as many actions, and keeps those for which that count is 15 or fewer. Note that as of v5, MySQL will only cache result sets for exactly duplicate (whitespace included) queries, so the nested query will not get cached. Using a view would resolve that problem.
Yes, it's three queries instead of two, and the only gain is not having to sort the grouped query, but if you have a lot of groups, it will be faster.
Side note: this kind of query is really handy for median functions without sorts.
SELECT * FROM (SELECT object_id, count(object_id) AS action_count
               FROM `Actions`
               GROUP BY object_id
               ORDER BY action_count DESC) AS t
LIMIT 10;