I have precomputed some similarities (about 70 million) and want to find the similarities from one track to all other tracks. I only need the top 100 tracks with the highest similarities. For my calculations I run this query about 15,000 times with different tracks as input. After a reboot of the machine one calculation needs over 600 seconds for all 15k queries. After several runs, MySQL has - I think - cached the indices, so the complete run needs about 15 seconds. My only worry is that I have a very high "Handler_read_rnd_next" value.
I have a MySQL table with this structure:
CREATE TABLE `similarity` (
`similarityID` int(11) NOT NULL AUTO_INCREMENT,
`trackID1` int(11) NOT NULL,
`trackID2` int(11) NOT NULL,
`tracksim` double DEFAULT NULL,
`timesim` double DEFAULT NULL,
`tagsim` double DEFAULT NULL,
`simsum` double DEFAULT NULL,
PRIMARY KEY (`similarityID`),
UNIQUE KEY `trackID1` (`trackID1`,`trackID2`),
KEY `trackID1sum` (`trackID1`,`simsum`),
KEY `trackID2sum` (`trackID2`,`simsum`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I want to run a very large number of queries against this table. The queries look like this:
-- simsum is a sum over tracksim, timesim, tagsim
(
SELECT similarityID, trackID2, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID1 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
UNION
(
SELECT similarityID, trackID1, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID2 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
ORDER BY simsum DESC
LIMIT 0,100
The query is quite fast, under 0.1 sec (see my previous question), but I'm worried about the very large numbers on the status page. I thought I had set every index that the query uses.
Handler_read_rnd 88.0 M
Handler_read_rnd_next 20.0 G
Is there anything "wrong"? Could I get the query even faster? Do I have to worry about the 20G?
Thanks in advance
The first thing that is obviously wrong here is that you seem to be storing a directional relationship between pairs: if f(a,b) === f(b,a), you could simplify your system a lot by swapping trackID1 and trackID2 wherever trackID1 is greater than trackID2, while retaining the existing primary key (and ignoring collisions). A sketch of that normalization is shown below.
You're only halving the amount of data, though, so it won't be a huge performance increase.
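A minimal sketch of that normalization, assuming f(a,b) === f(b,a) and that colliding pairs can simply be discarded (similarity_norm is a made-up name for the normalized copy):
CREATE TABLE similarity_norm LIKE similarity;

-- INSERT IGNORE silently drops rows that collide on the UNIQUE (trackID1, trackID2) key,
-- i.e. pairs that already exist in the other direction
INSERT IGNORE INTO similarity_norm
    (trackID1, trackID2, tracksim, timesim, tagsim, simsum)
SELECT LEAST(trackID1, trackID2),
       GREATEST(trackID1, trackID2),
       tracksim, timesim, tagsim, simsum
FROM   similarity;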
There may be further scope for improving performance, but this is very much dependent on how frequently the data changes; more specifically, you could prune the records whose similarity is not in the top 100 for either track.
I'm having trouble understanding my options for how to optimize this specific query. Looking online, I find various resources, but all for queries that don't match my particular one. From what I could gather, it's very hard to optimize a query when you have an order by combined with a limit.
My use case is that I would like to have a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
& table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that displays the latest 10 records. I also have pagination that allows one to navigate pages (only able to go to the next page, not to the last page or make any big jumps).
To further help with getting the full picture I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - this fiddle does not reproduce the same results as my table, but I suspect this is because of the data size I'm dealing with.
The table in question has 120 GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' index without any extra prompting -- since it is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column, optimal, index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that there is at least one index that is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
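For example, if byCampaign (visible in possible_keys above) turns out to be just (fk_campaign_id), it is a prefix of activityOverview and is a candidate to drop -- this is only a guess, since the other ~12 indexes aren't shown:
ALTER TABLE xyz DROP INDEX byCampaign;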
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
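In short, the technique from that link looks roughly like this (run in one session):
FLUSH STATUS;
-- run the SELECT under test here
SHOW SESSION STATUS LIKE 'Handler%';
-- with the optimal index the Handler_read_* values should total around 10,
-- not thousands of rows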
I have geoip data in a table: network_start_ip and network_last_ip are varbinary(16) columns holding the result of INET6_ATON(ip_start/end). Two other columns are latitude and longitude.
CREATE TABLE `ipblocks` (
`network_start_ip` varbinary(16) NOT NULL,
`network_last_ip` varbinary(16) NOT NULL,
`latitude` double NOT NULL,
`longitude` double NOT NULL,
KEY `network_start_ip` (`network_start_ip`),
KEY `network_last_ip` (`network_last_ip`),
KEY `idx_range` (`network_start_ip`,`network_last_ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
As you can see I have created 3 indexes for testing. Why does my (quite simple) query
SELECT
latitude, longitude
FROM
ipblocks b
WHERE
INET6_ATON('82.207.219.33') BETWEEN b.network_start_ip AND b.network_last_ip
not use any of these indexes?
The query takes ~3 seconds which is way too long to use it in production.
It doesn't work because there are two columns referenced -- and that is really hard to optimize. Assuming that there are no overlapping IP ranges, you can restructure the query as:
SELECT b.*
FROM (SELECT b.*
FROM ipblocks b
WHERE b.network_start_ip <= INET6_ATON('82.207.219.33')
ORDER BY b.network_start_ip DESC
LIMIT 1
) b
WHERE INET6_ATON('82.207.219.33') <= network_last_ip;
The inner query should use an index on ipblocks(network_start_ip). The outer query is only comparing one row, so it does not need any index.
Or as:
SELECT b.*
FROM (SELECT b.*
FROM ipblocks b
WHERE b.network_last_ip >= INET6_ATON('82.207.219.33')
ORDER BY b.network_last_ip ASC
LIMIT 1
) b
WHERE network_start_ip <= INET6_ATON('82.207.219.33');
This would use an index on (network_last_ip). MySQL (and I think MariaDB) does a better job with ascending sorts than descending sorts.
Thanks to Gordon Linoff I found the optimal query for my question.
SELECT b.* FROM
(SELECT b.* FROM ipblocks b WHERE b.network_start_ip <= INET6_ATON('82.207.219.33')
ORDER BY b.network_start_ip DESC LIMIT 1 )
b WHERE INET6_ATON('82.207.219.33') <= network_last_ip
Now we select the blocks whose start is smaller than INET6_ATON('82.207.219.33') in the inner query, but we order them descending, which enables us to use LIMIT 1 again.
Query response time is now .002 to .004 seconds. Great!
Does this query give you correct results? Your start/end IPs seem to be stored as a binary string while you're searching for an integer representation.
I would first make sure that network_start_ip and network_last_ip are unsigned INT fields with the integer representation of the IP addresses. This is assuming that you work with IPv4 only:
CREATE TABLE ipblocks_int AS
SELECT
INET_ATON(network_start_ip) as network_start_ip,
INET_ATON(network_last_ip) as network_last_ip,
latitude,
longitude
FROM ipblocks
Then use (network_start_ip,network_last_ip) as primary key.
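For example, a sketch of pinning down the types and the key (assuming no duplicate ranges):
ALTER TABLE ipblocks_int
  MODIFY network_start_ip INT UNSIGNED NOT NULL,
  MODIFY network_last_ip  INT UNSIGNED NOT NULL,
  ADD PRIMARY KEY (network_start_ip, network_last_ip);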
It's a tough problem. There is no simple solution.
The reason it is tough is that it is effectively
start <= 123 AND
last >= 123
Regardless of what indexes are available, the Optimizer will work with one or the other of those clauses. With INDEX(start, ...), it will pick start <= 123 and scan the first part of the index. Similarly for the other clause. One of those scans more than half the index, the other scans less -- but not enough less to be worth using an index. Moving it into the PRIMARY KEY will help with some cases, but it is hardly worth the effort.
Bottom line: no matter what you do in the way of INDEX or PRIMARY KEY, most IP constants will lead to more than 1.5 seconds for the query.
Do your start/last IP ranges overlap? If so, that adds complexity. In particular, overlaps would probably invalidate Gordon's LIMIT 1.
My solution requires non-overlapping regions; any gaps in the IP space need explicit 'unowned' ranges. This is because there is only a start_ip; the last_ip is implied by the start of the next item in the table. See http://mysql.rjweb.org/doc.php/ipranges (it includes code for IPv4 and for IPv6).
Meanwhile, DOUBLE for lat/lng is overkill: http://mysql.rjweb.org/doc.php/latlng#representation_choices
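As a rough illustration of that point (a hedged sketch, not a requirement): a 4-byte FLOAT gives roughly meter-level precision, which is usually plenty for geoip, at half the storage of DOUBLE.
ALTER TABLE ipblocks
  MODIFY latitude  FLOAT NOT NULL,
  MODIFY longitude FLOAT NOT NULL;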
I have a biggish table of events. (5.3 million rows at the moment). I need to traverse this table mostly from the beginning to the end in a linear fashion. Mostly no random seeks. The data currently includes about 5 days of these events.
Due to the size of the table I need to paginate the results, and the internet tells me that "seek pagination" is the best method.
However, this method works great and fast for traversing the first 3 days; after that, MySQL really begins to slow down. I've figured out it must be something I/O-bound, as my CPU usage actually falls as the slowdown starts.
I believe this has something to do with the 2-column sorting I do and the use of filesort; maybe MySQL needs to read all the rows to sort my results or something. Indexing correctly might be a proper fix, but so far I've been unable to find an index that solves my problem.
The complicating part of this database is the fact that the ids and timestamps are NOT perfectly in order. The software requires the data to be ordered by timestamp. However, when adding data to this database, some events are added 1 minute after they have actually happened, so the auto-incremented ids are not in chronological order.
As of now, the slowdown is so bad that my 5-day traversal never finishes. It just gets slower and slower...
I've tried indexing the table in multiple ways, but MySQL does not seem to want to use those indexes, and EXPLAIN keeps showing "filesort". An index is used for the WHERE clause, though.
The workaround I'm currently using is to first do a full table traversal and load all the row ids and timestamps into memory. I sort the rows on the Python side of the software and then load the full data in smaller chunks from MySQL as I traverse (by ids only). This works fine, but it is quite inefficient because it traverses the same data twice.
The schema of the table:
CREATE TABLE `events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`server` varchar(45) DEFAULT NULL,
`software` varchar(45) DEFAULT NULL,
`timestamp` bigint(20) DEFAULT NULL,
`data` text,
`event_type` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index3` (`timestamp`,`server`,`software`,`id`),
KEY `index_ts` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=7410472 DEFAULT CHARSET=latin1;
The query (one possible line):
SELECT software,
server,
timestamp,
id,
event_type,
data
FROM events
WHERE ( server = 'a58b'
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) ) )
AND ( timestamp, id ) > ( 100, 100 )
AND timestamp <= 200
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
The query is based on https://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-the-seek-method/ (and some other postings with the same idea). I believe it is called "seek pagination with a seek predicate". The basic gist is that I have a starting timestamp and an ending timestamp, and I need to get all the events with the specified software on the servers I've specified, OR only the server-specific events (software = NULL). The weird-looking parentheses are due to Python constructing the queries from the parameters it is given. I left them visible in case they have some effect.
I'm expecting the traversal to finish before the heat death of the universe.
First change
AND ( timestamp, id ) > ( 100, 100 )
to
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
This optimisation is suggested in the official documentation: Row Constructor Expression Optimization
Now the engine will be able to use the index on (timestamp). Depending on the cardinality of the columns server and software, that could already be fast enough.
An index on (server, timestamp, id) should improve the performance further.
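A sketch of adding it (the index name is arbitrary):
ALTER TABLE events
  ADD INDEX idx_server_ts_id (server, `timestamp`, id);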
If that is still not fast enough, I would suggest a UNION optimization for
AND (software IS NULL OR software IN ('ASD', 'WASD'))
That would be:
(
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software IS NULL
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'ASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'WASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
)
ORDER BY timestamp ASC, id ASC
LIMIT 100
You will need to create an index on (server, software, timestamp, id) for this query.
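For example (again, the index name is arbitrary):
ALTER TABLE events
  ADD INDEX idx_server_sw_ts_id (server, software, `timestamp`, id);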
There are multiple complications going on.
The quick fix is
INDEX(server, timestamp, id) -- in this order
together with
WHERE server = 'a58b'
AND timestamp BETWEEN 100 AND 200
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) )
AND ( timestamp, id ) > ( 100, 100 )
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
Note that server needs to be first in the index, not after the thing you are doing a range on (timestamp). Also, I broke out timestamp BETWEEN ... to make it clear to the optimizer that the next column of the ORDER BY might make use of the index.
You said "pagination", so I assume you have an OFFSET, too? Add it back in so we can discuss the implications. My blog on "remembering where you left off" instead of using OFFSET may (or may not) be practical.
I am diagnosing an intermittent slow query, and have found a strange behaviour in MySQL I cannot explain. It's choosing a different, non-optimal key strategy for one specific case, only when doing a LIMIT 1.
Table (some unreferenced data columns removed for brevity)
CREATE TABLE `ch_log` (
`cl_id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`cl_unit_id` INT(11) NOT NULL DEFAULT '0',
`cl_date` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`cl_type` CHAR(1) NOT NULL DEFAULT '',
`cl_data` TEXT NOT NULL,
`cl_event` VARCHAR(255) NULL DEFAULT NULL,
`cl_timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`cl_record_status` CHAR(1) NOT NULL DEFAULT 'a',
PRIMARY KEY (`cl_id`),
INDEX `cl_type` (`cl_type`),
INDEX `cl_date` (`cl_date`),
INDEX `cl_event` (`cl_event`),
INDEX `cl_unit_id` (`cl_unit_id`),
INDEX `log_type_unit_id` (`cl_unit_id`, `cl_type`),
INDEX `unique_user` (`cl_user_number`, `cl_unit_id`)
)
ENGINE=InnoDB
AUTO_INCREMENT=419582094;
This is the query, which only runs slow for one specific cl_unit_id:
EXPLAIN
SELECT *
FROM `ch_log`
WHERE `cl_type` = 'I' and `cl_event` = 'G'
AND cl_unit_id=1234
ORDER BY cl_date DESC
LIMIT 1;
id|select_type|table |type |possible_keys |key |key_len|ref|rows|Extra
1 |SIMPLE |ch_log|index|cl_type,cl_event,cl_unit_id,log_type_unit_id|cl_date|8 |\N |5295|Using where
For all other values of cl_unit_id it uses the log_type_unit_id key which is much faster.
id|select_type|table |type|possible_keys |key |key_len|ref |rows|Extra
1 |SIMPLE |ch_log|ref |ch_log_type,ch_log_event,ch_log_unit_id,log_type_unit_id|log_type_unit_id|5 |const,const|3804|Using where; Using filesort
All queries take about 0.01 seconds.
The "slow unit" query takes 10-15 minutes!
I can't see anything strange about the data for this 'unit':
Unit 1234 only has 6 records of type I and event G.
Other units have many more.
Unit 1234 only has 32,000 logs in total which is typical.
the data itself is normal, no bigger or older.
There are around 3,000 "units" in the database, which represent devices logging stuff. The cl_unit_id is their unique PK (although no constraint).
General info
There are 30m records in total, around 12GB
mysql 5.1.69-log
Centos 64bit
The data is gradually changing (30m = 3months of logs) but I don't know if this has happened before
Things I've tried, and can "solve" the problem with:
Removing the LIMIT 1 - the query runs in milliseconds and returns the data.
Changing to LIMIT 2 or other combinations e.g. 2,3 - runs in milliseconds.
Adding an index hint solves it:
FROM `ch_log` USE INDEX (log_type_unit_id)
but... I don't want to hard-code this into the application.
Adding a second order by on the primary key also "solves" it:
ORDER BY cl_id, cl_date DESC
giving explain:
id|select_type|table |type|possible_keys |key |key_len|ref |rows|Extra
1 |SIMPLE |ch_log|ref |ch_log_type,ch_log_event,ch_log_unit_id,log_type_unit_id|log_type_unit_id|5 |const,const|6870|Using where
which is slightly different from the index-hinted one, with more records examined (6,870), but it still runs in tens of milliseconds.
Again I could do this, but I don't like using side-effects I don't understand.
So I think my main questions are:
a) why does it only happen for LIMIT 1?
b) how can the data itself affect the key-strategy so much? And what aspect of the data, seeing as the quantity and spread in the indexes seems typical.
MySQL will pick an explain plan and use different indexes depending on what it thinks is statistically the best choice. For all of your first observations, this is the answer:
Removing the LIMIT 1 - the query runs in milliseconds and returns the data.
and -> Yes, check it, the explain plan is good
Changing to LIMIT 2 or other combinations e.g. 2,3 - runs in milliseconds. -> the same applies. The optimizer chooses a different index because suddenly, the expected block reads became twice as big as with LIMIT 1 (that's just one possibility)
Adding a index hint solves it -> Of course, you force a good explain plan
Adding a second order by on the primary key also "solves" it -> yes, because by coincidence, the result is a better explain plan
Now, that only answers half of the questions.
a) why does it only happen for LIMIT 1?
It actually happens not only because of LIMIT 1, but because of
The statistical distribution of your data (which orients the optimizer's decisions)
Your ORDER BY DESC clause. Try with ORDER BY ... ASC and you will probably see an improvement too.
This phenomenon is well acknowledged. Please read on.
One of the accepted solutions (near the bottom of the article) is to force the index the same way you did. Yes, sometimes it is justified; otherwise, the hint mechanism would have been wiped out long ago. Robots cannot always be perfect :-)
b) how can the data itself affect the key-strategy so much? And what
aspect of the data, seeing as the quantity and spread in the indexes
seems typical.
You said it: the spread is what usually causes trouble. Not only might the optimizer make a wrong decision even with accurate statistics, it could also be completely off simply because the delta on the table is just below 1/16th of the total row count...
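One practical aside (not from the answer above, just a common sanity check): if stale or skewed index statistics are suspected, refreshing them and re-running the EXPLAIN shows whether the plan flips back to log_type_unit_id:
ANALYZE TABLE ch_log;
-- then re-run the EXPLAIN from the question and compare the chosen key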
This feels like a "do my homework for me" kind of question, but I'm really stuck here trying to make this query run quickly against a table with many, many rows. Here's a SQLFiddle that shows the schema (more or less).
I've played with the indexes, trying to get something that will cover all the required columns, but haven't had much success. Here's the create:
CREATE TABLE `AuditEvent` (
`auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
`eventTime` datetime NOT NULL,
`target1Id` int(11) DEFAULT NULL,
`target1Name` varchar(100) DEFAULT NULL,
`target2Id` int(11) DEFAULT NULL,
`target2Name` varchar(100) DEFAULT NULL,
`clientId` int(11) NOT NULL DEFAULT '1',
`type` int(11) not null,
PRIMARY KEY (`auditEventId`),
KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)
And (a version of) the select:
select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;
I end up with 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using SELECT DISTINCT instead, which avoids the 'Using filesort'. This would probably be okay if there were a way to join back to get the counts.
Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).
Currently the query (above, without the target1Name and target2Name columns) runs in about 5 seconds on ~24 million records. Our target is in the hundreds of millions, and we'd like the query to keep performing along those lines (hoping to stay under 1-2 minutes, ideally much better), but my fear is that once we hit that larger amount of data it won't (work to simulate the additional rows is underway).
I'm not sure of the best strategy for getting the additional fields. If I add the columns straight into the select, I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.
I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.
As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as always) breaks efficient usage of the Transactions index: the index is actually used only for the clientId equality and the first part of the range condition, and it is not used for the grouping.
Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods, and replace the BETWEEN clause with something like period IN (1,2,3,4,5)). But this might add overhead to your table. A hedged sketch of the idea follows below.
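Sketch only (eventPeriod is a made-up column name; periods here are calendar months encoded as YYYYMM, and to pay off it would also need a matching index such as (clientId, eventPeriod, target1Id, type)):
ALTER TABLE AuditEvent ADD COLUMN eventPeriod INT NULL;
UPDATE AuditEvent SET eventPeriod = EXTRACT(YEAR_MONTH FROM eventTime);

-- the BETWEEN range becomes an IN list of whole months
-- (partial boundary days would still need a small range check on eventTime)
SELECT target1Id, `type`, COUNT(*)
FROM   AuditEvent
WHERE  clientId = 4
  AND  eventPeriod IN (201109, 201110, 201111)  -- ... and so on through 201209
GROUP  BY target1Id, `type`;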
Another solution that you might try is to add another index (probably replacing Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:
SELECT
ae.target1Id,
ae.type,
COUNT(
NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
AND '2012-09-30 23:57:00', 0)
) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;
That way, you will: a) move the range condition to the end; b) allow using the index for the grouping; c) make the index a covering index for the query (that is, the query does not need extra disk I/O to read the rows).
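A sketch of the index change being described (the name is arbitrary; drop Transactions only after confirming nothing else needs it):
ALTER TABLE AuditEvent
  ADD INDEX clientTargetTypeTime (clientId, target1Id, `type`, eventTime);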
UPD1:
I am sorry, yesterday I did not read your post carefully and did not notice that your problem is retrieving target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index: its absence does not mean that no index is used for the query. Using index means that the index itself contains enough data to execute the subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not show Using index.
If your question is just how to add those two fields to your query (which you consider fast enough), then just try the following:
SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
from AuditEvent ae
where ae.clientId=4
and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;