Fetch latest row for a specific identifier - mysql

I have a table that looks like this
ID | identifier | data | created_at
------------------------------------
1 | 500 | test1 | 2011-08-30 15:27:29
2 | 501 | test1 | 2011-08-30 15:27:29
3 | 500 | test2 | 2011-08-30 15:28:29
4 | 865 | test3 | 2011-08-30 15:29:29
5 | 501 | test2 | 2011-08-30 15:31:29
6 | 500 | test3 | 2011-08-30 15:31:29
What I need is the most up to date entry for each identifier, that could be decided by either the ID or the date in created_at. I assumed ID is the better choice due to the indexing.
I would expect this result set:
4 | 865 | test3 | 2011-08-30 15:29:29
5 | 501 | test2 | 2011-08-30 15:31:29
6 | 500 | test3 | 2011-08-30 15:31:29
The result should be ordered by either date or ID in ascending order.
It's important that this is a table that contains ~ 8 millions of rows.
I tried quite some approaches now with self joining and sub queries. Unfortunately all of those came out with either wrong results or half a decade of run time.
To provide an example:
SELECT lo1.*
FROM table lo1
INNER JOIN
(
SELECT MAX(id) MaxID, identifier, id
FROM table
GROUP BY identifier
) lo2
ON lo1.identifier= lo2.identifier
AND lo1.id = lo2.MaxID
ORDER BY lo1.id DESC
LIMIT 10
The above query takes very long and does sometimes not return the latest result for an identifier, not quite sure why though.
Does anyone have an approach that is able to fetch the required result sets and preferably does not take a decade?
As asked, here is the create code:
CREATE TABLE `table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`identifier` int(11) NOT NULL,
`data` varchar(200) COLLATE latin1_bin NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `identifier` (`identifier`),
KEY `created_at` (`created_at`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin

The correct query that gives the correct results but it won't scale on larger tables.
Query
SELECT
`table`.*
FROM
`table`
INNER JOIN
(
SELECT
MAX(id) AS MaxID
, identifier
FROM
`table`
GROUP BY
identifier
#disables GROUP BY Sorting might make the query faster.
ORDER BY
NULL
) `table_group`
ON
`table`.ID = `table_group`.MaxID
ORDER BY
`table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/4
But when you check "View Execution Plan" you can see "Using where; Using temporary; Using filesort" in the extra column meaning MySQL needs to use a quicksort algorithm "Using temporary;" means the quicksort algorithm first will be run on a memory temporary table.
If the memory temporary table becomes to large it will be converted to a MyISAM on disk temporary table.
Meaning the quicksort will need disk based random i/o to sort which is slow on disks.
So this method will not scale on the table with ~8 millions of rows.
This query below also gives the same results but it should be more optimized
Query
SELECT
`table`.*
FROM
`table`
INNER JOIN (
SELECT
`table`.ID
FROM
`table`
INNER JOIN
(
SELECT
MAX(id) AS MaxID
, identifier
FROM
`table`
GROUP BY
identifier
#disables GROUP BY Sorting might make the query faster.
ORDER BY
NULL
)
AS `table_group`
ON
`table`.ID = `table_group`.MaxID
)
AS `table_group_max`
ON
`table`.ID = `table_group_max`.ID
ORDER BY
`table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/21
When you check "View Execution Plan" there is no more "Using temporary; Using filesort" meaning the query should be more optimal then the previous query and should in theory execute faster.
Because the combination "Using temporary; Using filesort" can really be a performance killer like explained.

Related

MySQL query slow with OR in WHERE statement

I have a SQL query which looks simple but runs very slow ~4s:
SELECT tblbooks.*
FROM tblbooks LEFT JOIN
tblauthorships ON tblbooks.book_id = tblauthorships.book_id
WHERE (tblbooks.added_by=3 OR tblauthorships.author_id=3)
GROUP BY tblbooks.book_id
ORDER BY tblbooks.book_id DESC
LIMIT 10
EXPLAIN result:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
| 1 | SIMPLE | tblbooks | index | fk_books__users_1 | PRIMARY | 62 | NULL | 10 | Using where |
| 1 | SIMPLE | tblauthorships | ref | book_id | book_id | 62 | tblbooks.book_id | 1 | Using where |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
2 rows in set (0.000 sec)
If I run the above query individually on each part of OR in WHERE statement, both queries return result in less than 0.01s.
Simplified schema:
tblbooks (~1 million rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------------------+----------------+
| id | int(10) unsigned | NO | MUL | NULL | auto_increment |
| book_id | varchar(20) | NO | PRI | NULL | |
| added_by | int(11) unsigned | NO | MUL | NULL | |
+---------------+-----------------------+------+-----+---------------------+----------------+
tblauthorships (< 100 rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------------------+----------------+
| authorship_id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| book_id | varchar(20) | NO | MUL | NULL | |
| author_id | int(11) unsigned | NO | MUL | NULL | |
+---------------+------------------+------+-----+---------------------+----------------+
Both book_id and author_id columns in tblauthorships have their index created.
Can anyone point me to the right direction?
Note: I'm aware of book_id varchar issue.
My usual analogy for indexing is a telephone book. It's sorted by last name then by first name. If you look up a person by last name, you can find them efficiently. If you look up a person by last name AND first name, it's also efficient. But if you look up a person by first name only, the sort order of the book doesn't help, and you have to search every page the hard way.
Now what happens if you need to search a telephone book for a person by last name OR first name?
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas' OR first_name = 'Thomas';
This is just as bad as searching only by first name. Since all entries matching the first name you searched should be included in the result, you have to find them all.
Conclusion: Using OR in an SQL search is hard to optimize, given that MySQL can use only one index per table in a given query.
Solution: Use two queries and UNION them:
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas'
UNION
SELECT * FROM TelephoneBook WHERE first_name = 'Thomas';
The two individual queries each use an index on the respective column, then the results of both queries are unified (by default UNION eliminates duplicates).
In your case you don't even need to do the join for one of the queries:
(SELECT b.*
FROM tblbooks AS b
WHERE b.added_by=3)
UNION
(SELECT b.*
FROM tblbooks AS b
INNER JOIN tblauthorships AS a USING (book_id)
WHERE a.author_id=3)
ORDER BY book_id DESC
LIMIT 10
The two answers so far are not very optimal. Since they have both UNION and LIMIT, let me further optimize their answers:
( SELECT ...
ORDER BY ...
LIMIT 10
) UNION DISTINCT
( SELECT ...
ORDER BY ...
LIMIT 10
)
ORDER BY ...
LIMIT 10
This gives each SELECT a chance to optimize the ORDER BY and LIMIT, making them faster. Then the UNION DISTINCT dedups. Finally, the first 10 are peeled off to make the resultset.
If there will be pagination via OFFSET, this optimization gets trickier. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Also... Your table needs two indexes:
INDEX(added_by)
INDEX(author_id)
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)

MySQL - nested select queries running many time slower than sequential queries (on a large table)

I have a MySQL query that I am having performance problems with that I do not understand. When I try to debug and run the overall query as a sequence of separate subqueries they seem to perform reasonably well, given the volume of data. When I combine them into a single nested query I get much much much longer execution times.
The main ratings table mentioned below is approx 30 million rows (4GB of disk space), with a couple of foreign keys (it's a many-to-many table linking users and items with a small amount of additional supplementary user specific item information - approx 13 fields and 30 bytes).
Query 1 - approx 23s
SELECT COUNT(1) FROM (SELECT fields FROM ratings WHERE (id >= 0 AND id < 10000)
AND item_type = 1) AS t1;
Query 1 saved to table - approx 65s if I save the results to a temporary table
CREATE TABLE temp_table SELECT fields FROM ratings WHERE (id >= 0 AND id < 10000)
AND item_type = 1;
Query 2 - approx 3s
SELECT COUNT(1) FROM temp_table WHERE id IN (SELECT id from item_stats WHERE
ratings_count > 1000);
Bases on this I would expect a combined query to be approx 30s or so, and not more than approx 70s.
Combined query (Query 1 + Query 2) - indeterminate time (10s of minutes before I give up and cancel)
SELECT COUNT(1) from (SELECT * FROM (SELECT fields FROM ratings WHERE (id >= 0
AND id < 10000) AND item_type = 1) AS t1 WHERE t1.id IN (SELECT id FROM
item_stats WHERE ratings_count > 1000)) as t2;
Can anyone help explain this difference and guide me in creating a query that works? If I need to I can rely on the sequential queries (which would take approx 70s), but that is cumbersome and does not seem the right way to go.
I have tried using INNER JOIN instead of IN but this did not seem to make much difference. The ID count from the item_stats table is about 2700 IDs.
It's using MySQL 8.0 on a laptop (16GB RAM, SSD).
Response to suggestions / questions:
Query 1
EXPLAIN select user_id, game_id, item_type_id, rating, plays, own, bgg_last_modified from collections where (user_id >= 0 and user_id < 10000) and item_type_id = 1;
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | collections | NULL | ALL | user_id | NULL | NULL | NULL | 32898400 | 1.31 | Using where |
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Query 2
EXPLAIN select * from temp_coll where game_id in (select game_id from games_ratings_stats where (ratings_count > 1000) or (ratings_count > 500 and ratings_avg >= 7.0));
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | NULL |
| 1 | SIMPLE | temp_coll | NULL | ALL | NULL | NULL | NULL | NULL | 1674386 | 10.00 | Using where; Using join buffer (hash join) |
| 2 | MATERIALIZED | games_ratings_stats | NULL | ALL | NULL | NULL | NULL | NULL | 81585 | 40.74 | Using where |
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
Combined query
EXPLAIN select * from (select user_id, game_id, item_type_id, rating, plays, own, bgg_last_modified from collections where (user_id >= 0 and user_id < 10000) and item_type_id = 1) as t1 where t1.game_id in (select game_id from games_ratings_stats where (ratings_count > 1000) or (ratings_count > 500 and ratings_avg >= 7.0));
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
| 1 | SIMPLE | <subquery3> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where |
| 1 | SIMPLE | collections | NULL | ref | user_id,game_id | game_id | 5 | <subquery3>.game_id | 199 | 1.31 | Using where |
| 3 | MATERIALIZED | games_ratings_stats | NULL | ALL | NULL | NULL | NULL | NULL | 81585 | 40.74 | Using where |
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
Your query appears to be functionally identical to the following (rather implausible) query:
SELECT COUNT(*) total
FROM ratings r
JOIN item_stats s
ON s.id = r.id
WHERE r.id >= 0
AND r.id < 10000
AND r.item_type = 1
AND s.ratings_count > 1000
r.id is, presumably, the PRIMARY KEY, so it's automatically included in any INNODB index, which leaves just item_type and ratings_count requiring indexes.
You would benefit a lot from an online tutorial on learning how to read the EXPLAIN plan. The EXPLAINS you shared clearly show missing indexes.
As a general rule, queries should not take 23 seconds or 65 seconds, even with millions of rows. Proper indexes + partitioning should resolve the slowness.
Query 1: The user_id index on that table is not helping performance, as 99% of users are within the range in the where clause. You can add an index on item_type_id
ALTER TABLE collections ADD KEY (item_type_id)
Query 2: The temp_coll table is missing a game_id index. Also, I'm not sure if the underlying code for games_ratings_stats has an index on ratings_count and if that would help. I dont have experience with MySQL materialized tables.
ALTER TABLE temp_coll ADD KEY (game_id)
Query 3:
Would benefit from above indexes.
Increasing the InnoDB Buffer Pool Size (now set to 8GB) seems to have made a significant improvement. If anyone has any further setup or tuning advice on MySQL then that would be appreciated!

Subquery for faster result

I have this query which takes me more than 117 seconds on a mysql database.
select users.*, users_oauth.* FROM users LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id WHERE (
(MATCH (user_email) AGAINST ('sometext')) OR
(MATCH (user_firstname) AGAINST ('sometext')) OR
(MATCH (user_lastname) AGAINST ('sometext')) )
ORDER BY user_date_accountcreated DESC LIMIT 1400, 50
How can I use a subquery in order to optimize it ?
The 3 fields are fulltext :
ALTER TABLE `users` ADD FULLTEXT KEY `email_fulltext` (`user_email`);
ALTER TABLE `users` ADD FULLTEXT KEY `firstname_fulltext` (`user_firstname`);
ALTER TABLE `users` ADD FULLTEXT KEY `lastname_fulltext` (`user_lastname`);
There is only one search input in a website to search in different table users fields.
If the limit is for example LIMIT 0,50, the query will run in less than 3 seconds but when the LIMIT increase the query becomes very slow.
Thanks.
Use a single FULLTEXT index:
FULLTEXT(user_email, user_firstname, user_lastname)
And change the 3 matches to just one:
MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
Here's another issue: ORDER BY ... DESC LIMIT 1400, 50. Read about the evils of pagination via OFFSET . That has a workaround, but I doubt if it would apply to your statement.
Do you really have thousands of users matching the text? Does someone (other than a search engine robot) really page through 29 pages? Think about whether it makes sense to really have such a long-winded UI.
And a 3rd issue. Consider "lazy eval". That is, find the user ids first, then join back to users and users_oauth to get the rest of the columns. It would be a single SELECT with the MATCH in a derived table, then JOIN to the two tables. If the ORDER BY an LIMIT can be in the derived table, it could be a big win.
Please indicate which table each column belongs to -- my last paragraph is imprecise because of not knowing about the date column.
Update
In your second attempt, you added OR, which greatly slows things down. Let's turn that into a UNION to try to avoid the new slowdown. First let's debug the UNION:
( SELECT * -- no mention of oauth columns
FROM users -- No JOIN
WHERE users.user_id LIKE ...
ORDER BY user_id DESC
LIMIT 0, 50
)
UNION ALL
( SELECT * -- no mention of oauth columns
FROM users
WHERE MATCH ...
ORDER BY user_id DESC
LIMIT 0, 50
)
Test it by timing each SELECT separately. If one of the is still slow, then let's focus on it. Then test the UNION. (This is a case where using the mysql commandline tool may be more convenient than PHP.)
By splitting, each SELECT can use an optimal index. The UNION has some overhead, but possibly less than the inefficiency of OR.
Now let's fold in users_oauth.
First, you seem to be missing a very important INDEX(oauth_user_id). Add that!
Now let's put them together.
SELECT u.*
FROM ( .... the entire union query ... ) AS u
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
ORDER BY user_id DESC -- yes, repeat
LIMIT 0, 50 -- yes, repeat
Yes #Rick
I changed the index fulltext to:
ALTER TABLE `users`
ADD FULLTEXT KEY `fulltext_adminsearch` (`user_email`,`user_firstname`,`user_lastname`);
And now there is some php conditions, $_POST['search'] can be empty:
if(!isset($_POST['search'])) {
$searchId = '%' ;
} else {
$searchId = $_POST['search'] ;
}
$searchMatch = '+'.str_replace(' ', ' +', $_POST['search']);
$sqlSearch = $dataBase->prepare(
'SELECT users.*, users_oauth.*
FROM users
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
WHERE ( users.user_id LIKE :id OR
(MATCH (user_email, user_firstname, user_lastname)
AGAINST (:match IN BOOLEAN MODE)) )
ORDER BY user_id DESC LIMIT 0,50') ;
$sqlSearch->execute(array('id' => $searchId,
'match' => $searchMatch )) ;
The users_oauth table has a column with user_id:
Table users:
+--------------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+-----------------+------+-----+---------+----------------+
| user_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| user_activation_key | varchar(40) | YES | | NULL | |
| user_email | varchar(40) | NO | UNI | | |
| user_login | varchar(30) | YES | | NULL | |
| user_password | varchar(40) | YES | | NULL | |
| user_firstname | varchar(30) | YES | | NULL | |
| user_lastname | varchar(50) | YES | | NULL | |
| user_lang | varchar(2) | NO | | en
+--------------------------+-----------------+------+-----+---------+----------------+
Table users_oauth:
+----------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+-----------------+------+-----+---------+----------------+
| oauth_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| oauth_user_id | int(8) unsigned | NO | | NULL | |
| oauth_google_id | varchar(30) | YES | UNI | NULL | |
| oauth_facebook_id | varchar(30) | YES | UNI | NULL | |
| oauth_windowslive_id | varchar(30) | YES | UNI | NULL | |
+----------------------+-----------------+------+-----+---------+----------------+
The Left Join is long, the request takes 3 seconds with, 0,0158 seconds wihtout.
It would be more rapid to make a sql request for each 50 rows.
Would it be more rapid with a subquery ? How to make it with a subquery ?
Thanks

Optimizing a large keyword table?

I have a huge table like
CREATE TABLE IF NOT EXISTS `object_search` (
`keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
with around 39 million rows (using over 1 GB space) containing the indexed data for 1 million records in the object table (where object_id points at).
Now searching through this with a query like
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2
is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.
It's explain looks like:
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| 1 | SIMPLE | search | ref | PRIMARY | PRIMARY | 42 | const | 345180 | 100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
The full explain with joined object and object_color and object_locale table, while the above query is run in a subquery to avoid overhead, looks like:
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 182544 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | search.object_id | 1 | 100.00 | |
| 2 | DERIVED | search | ref | PRIMARY | PRIMARY | 42 | | 345180 | 100.00 | Using where; Using index |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
My top goal would be to be able to scan through this within 1 or 2 seconds.
So, are there further techniques to improve search speed for keywords?
Update 2013-08-06:
Applying most of Neville K's suggestion I now have the following setup:
CREATE TABLE `object_search_keyword` (
`keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
CREATE TABLE `object_search` (
`keyword_id` int(10) unsigned NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword_id`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The new query's explain looks like this:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0 | | 1 | 100.00 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | object_search | ref | PRIMARY | PRIMARY | 4 | object_keyword.keyword_id | 2190225 | 100.00 | Using index |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
The many derives are coming from the keyword comparing subquery being nested into another subquery which does nothing but count the amount of rows returned:
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.object_id, COUNT(turbo.object_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
GROUP BY search.object_id HAVING hits = 2
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC
The above query will actually run within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search goes down.
Any way to further optimize this?
Update 2013-08-07
The blocking thing seems almost certainly to be the appended ORDER BY statement. Without it, the query executes in less than a second.
So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.
Update 2013-08-07 later that day
Alright ladies and gentlemen, nesting the WHERE and ORDER BY statements in another layer of subquery to not let it bother with tables it doesn't need roughly doubled it's performance again:
SELECT wowrapper.*, locale.title
FROM (
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.media_id, COUNT(search.media_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
GROUP BY search.media_id HAVING hits = 1
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS search
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
WHERE 1
ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (jfwrapper.object_id = locale.object_id)
LIMIT 0,48
Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.
Now this is already a massive improvement, but is there any chance to push this further?
Update 2013-08-08 early that day
Undid that last nested variation of the query, since it actually slowed down other variations of it...
I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).
Update 2013-08-08
Alright, a plain fulltext index doesnt really help.
Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to using a temporary table and filesort). Without it a search is complete within less than a second!
So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?
Full text search will be much faster than using the standard SQL string comparison features.
Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:
Keywords
--------
keyword_id
keyword
keyword_object
-------------
keyword_id
object_id
objects
-------
object_id
......
If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.
The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can setup a mirror table and update it with some events and triggers or if you have a slave replicating from your server you can change its table to MyISAM and use it for searches.
For this query the only thing I can come up with is to rewrite it as:
SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'
and I'm not sure it will be faster this way.
Also you will need to have an index on object_id (assuming your PK is (key_word, object_id)).
If you have seldom INSERTs and often SELECTs you could optimize your data for the reads i.e. recalculate the number of object_ids per keyword and directly store it in the database. The SELECTs would then be very fast, the INSERTs would take some seconds though,.

delete duplicate rows from large dB

This is a second post to my original question posted here.
My setup:
amazon RDS using MySQL Workbench with connection timeout set to max
I am trying to DELETE duplicate rows from my dB which has close to 1MIL rows.
the table looks like this, mytext is a mediumtext blob. id is AUTO_INCREMENT
+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc | 123 |
| 2 | joe | min | abc | 123 |
| 3 | mar | kam | def | 789 |
| 4 | kel | smi | ghi | 456 |
+------------------------------+
I would like to end up with a table like this
+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc | 123 |
| 3 | mar | kam | def | 789 |
| 4 | kel | smi | ghi | 456 |
+------------------------------+
This solution started woking but after about 10,000 rows the process takes longer and eventualy hangs.
I let this run for over 20 hours, settings at 10 thousand rows with a WHERE condition (i thought deleting in chunks would be safer).
But even with the WHERE clause the system hangs then I have to Reboot RDS to access the dB.
DELETE
FROM yourTable
WHERE id>40000
AND id<=50000
AND id NOT IN
(
SELECT MAXID FROM
(
SELECT MAX(id) as MAXID
FROM yourTable
GROUP BY mytext
) as temp_table
)
heres the create statement
CREATE TABLE `yourTable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fname` varchar(45) DEFAULT NULL,
`lname` varchar(45) DEFAULT NULL,
`mytext` mediumtext,
`morevar` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1$$
Question
Is this sql command ok for handeling large amounts of rows and what I am trying to achieve? Or is there a better solution.
How long would it normally take to process 1MIL rows?
Is there a setting like in php.ini inside amazon for large data set manipulation?
Or would it make more sense to create a new table and insert all rows excluding duplicates?
I really wouldn't use NOT IN.
I would ensure that there is an index on myText, id and then try this...
DELETE
FROM
yourTable
WHERE
id > 40000
AND id <= 50000
AND EXISTS (SELECT *
FROM yourTable AS lookup
WHERE lookup.myText = yourTable.myText
AND lookup.id > yourTable.id
)
This way you only check the myText values that you are potentially deleting.
Where as your sub-query will return ids for myTexts that don't even appear in the range you are checking.