I have GeoIP data in a table: network_start_ip and network_last_ip are varbinary(16) columns holding the result of INET6_ATON(ip_start/end). Two other columns are latitude and longitude.
CREATE TABLE `ipblocks` (
`network_start_ip` varbinary(16) NOT NULL,
`network_last_ip` varbinary(16) NOT NULL,
`latitude` double NOT NULL,
`longitude` double NOT NULL,
KEY `network_start_ip` (`network_start_ip`),
KEY `network_last_ip` (`network_last_ip`),
KEY `idx_range` (`network_start_ip`,`network_last_ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
As you can see I have created 3 indexes for testing. Why does my (quite simple) query
SELECT
latitude, longitude
FROM
ipblocks b
WHERE
INET6_ATON('82.207.219.33') BETWEEN b.network_start_ip AND b.network_last_ip
not use any of these indexes?
The query takes ~3 seconds which is way too long to use it in production.
No index gets used because the condition references two different columns -- and that is really hard to optimize. Assuming that there are no overlapping IP ranges, you can restructure the query as:
SELECT b.*
FROM (SELECT b.*
FROM ipblocks b
WHERE b.network_start_ip <= INET6_ATON('82.207.219.33')
ORDER BY b.network_start_ip DESC
LIMIT 1
) b
WHERE INET6_ATON('82.207.219.33') <= network_last_ip;
The inner query should use an index on ipblocks(network_start_ip). The outer query is only comparing one row, so it does not need any index.
Or as:
SELECT b.*
FROM (SELECT b.*
FROM ipblocks b
WHERE b.network_last_ip >= INET6_ATON('82.207.219.33')
ORDER BY b.network_last_ip ASC
LIMIT 1
) b
WHERE network_start_ip <= INET6_ATON('82.207.219.33');
This would use an index on (network_last_ip). MySQL (and I think MariaDB) does a better job with ascending sorts than descending sorts.
Thanks to Gordon Linoff I found the optimal query for my question.
SELECT b.*
FROM (SELECT b.*
      FROM ipblocks b
      WHERE b.network_start_ip <= INET6_ATON('82.207.219.33')
      ORDER BY b.network_start_ip DESC
      LIMIT 1
     ) b
WHERE INET6_ATON('82.207.219.33') <= network_last_ip
In the inner query we select the blocks whose start is not greater than INET6_ATON('82.207.219.33'), but we order them descending, which enables us to use LIMIT 1 again; the outer query then only checks network_last_ip of that single candidate.
Query response time is now .002 to .004 seconds. Great!
Does this query give you correct results? Your start/end IPs seem to be stored as a binary string while you're searching for an integer representation.
I would first make sure that network_start_ip and network_last_ip are unsigned INT fields with the integer representation of the IP addresses. This is assuming that you work with IPv4 only:
CREATE TABLE ipblocks_int AS
SELECT
INET_ATON(network_start_ip) as network_start_ip,
INET_ATON(network_last_ip) as network_last_ip,
latitude,
longitude
FROM ipblocks
Then use (network_start_ip,network_last_ip) as primary key.
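For example, a hedged sketch of that step (the explicit column types are assumptions, since CREATE TABLE ... AS SELECT infers them from the expressions):
-- Sketch only: force the range columns to INT UNSIGNED, then make the range the primary key.
ALTER TABLE ipblocks_int
  MODIFY network_start_ip INT UNSIGNED NOT NULL,
  MODIFY network_last_ip INT UNSIGNED NOT NULL,
  ADD PRIMARY KEY (network_start_ip, network_last_ip);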
It's a tough problem. There is no simple solution.
The reason it is tough is that it is effectively
start <= 123 AND
last >= 123
Regardless of what indexes are available, the Optimizer will work with one or the other of those clauses. With INDEX(start, ...), it will pick start <= 123 and scan the first part of the index. Similarly for the other clause. One of those scans more than half the index, the other scans less -- but not enough less to be worth using an index. Moving it into the PRIMARY KEY will help with some cases, but it is hardly worth the effort.
Bottom line: no matter what you do in the way of INDEX or PRIMARY KEY, most IP constants will lead to more than 1.5 seconds for the query.
Do your start/last IP ranges overlap? If so, that adds complexity. In particular, overlaps would probably invalidate Gordon's LIMIT 1.
My solution requires non-overlapping regions. Any gaps in the IP space necessitate 'unowned' ranges of IPs. This is because there is only a start_ip; the last_ip is implied by being just less than the start of the next row in the table. See http://mysql.rjweb.org/doc.php/ipranges (it includes code for IPv4 and for IPv6.)
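A hedged sketch of that lookup, assuming the table is restructured so that each row stores only the start of its block (the end being implied by the next row's start):
-- With non-overlapping, gap-free ranges the owning block is simply the row
-- with the largest start that does not exceed the target address.
SELECT latitude, longitude
FROM ipblocks
WHERE network_start_ip <= INET6_ATON('82.207.219.33')
ORDER BY network_start_ip DESC
LIMIT 1;
A single descending probe on the (network_start_ip) index answers the question; no second column needs to be checked.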
Meanwhile, DOUBLE for lat/lng is overkill: http://mysql.rjweb.org/doc.php/latlng#representation_choices
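For example, a sketch (the exact precision to choose depends on how accurate the GeoIP data really is):
-- DECIMAL(6,4)/DECIMAL(7,4) gives roughly 10-16 m of resolution in 7 bytes per
-- row instead of the 16 bytes used by two DOUBLEs.
ALTER TABLE ipblocks
  MODIFY latitude DECIMAL(6,4) NOT NULL,
  MODIFY longitude DECIMAL(7,4) NOT NULL;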
I'm having trouble understanding my options for how to optimize this specific query. Looking online, I find various resources, but all for queries that don't match my particular one. From what I could gather, it's very hard to optimize a query when you have an order by combined with a limit.
My use case is a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
& table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that is to display the latest 10 records. I also have pagination that allows one to navigate pages (only able to go to the next page, not to the last or make any big jumps).
To help give the full picture, I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - it does not reproduce the same behaviour as my table, but I suspect that is because of the difference in data size.
The table in question has 120 GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' index without any extra prompting -- since that index is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column, optimal, index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that there is at least one index that is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
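For example (the column layout of byCampaign is a guess here; check SHOW CREATE TABLE xyz for the real definitions before dropping anything):
-- If, say, byCampaign is just (fk_campaign_id), it is a prefix of
-- activityOverview and only gives the Optimizer a chance to pick wrong.
ALTER TABLE xyz DROP INDEX byCampaign;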
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
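Roughly, that technique applied to this query:
FLUSH STATUS;  -- reset the session's Handler_% counters

SELECT xyz.*
FROM xyz
WHERE xyz.fk_campaign_id = 95870
  AND xyz.voided = 0
ORDER BY registration_id DESC
LIMIT 10 OFFSET 0;

SHOW SESSION STATUS LIKE 'Handler%';
-- Handler_read_* totals near 10 mean the index handed over exactly the rows
-- needed; totals in the thousands or millions mean rows were scanned and filtered.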
I am trying to find a way to improve performance for my MySQL table containing IP ranges (it's going to get up to 500 SELECT queries per second (!) in peak hours, so I am a little worried).
I have a table of this structure:
id smallint(5) Auto Increment
ip_start char(16)
ip_end char(16)
The collation is utf8_general_ci (on the whole table and every column except id), and the table is MyISAM (only SELECT queries, no insert/delete needed here). The only index on this table is the PRIMARY KEY on id.
At this moment the table has almost 2000 rows, all of them containing IP ranges.
For example:
ip_start 128.6.230.0
ip_end 128.6.238.255
When user comes to a website I am checking if his ip is in some of those ranges in my table. I use this query (dibi sql library):
SELECT COUNT(*)
FROM ip_ranges
WHERE %s", $user_ip, " BETWEEN ip_start AND ip_end
If the result of the query is not zero, then the user's IP is in one of the ranges in the table - which is all I need it to do.
I was thinking about maybe putting some indexes on that table, but I am not quite sure how that works and whether it's a good idea (since there is maybe nothing to really index, right? Most of those IP ranges are different).
I also had the varchar type on the ip_start and ip_end columns but switched it to char (I guess it's faster?).
Anyone any ideas about how to improve this table/queries even further?
You don't want to use aggregation. Instead, check whether the following returns any rows:
SELECT 1
FROM ip_ranges
WHERE %s", $user_ip, " BETWEEN ip_start AND ip_end
LIMIT 1;
The LIMIT 1 says to stop at the first match, so it is faster.
For this query, you want an index on ip_ranges(ip_start, ip_end).
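For example (the index name is an assumption):
-- Covers the probe on ip_start and lets ip_end be checked from the index alone.
ALTER TABLE ip_ranges ADD INDEX idx_start_end (ip_start, ip_end);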
This still has a performance problem when there is no match. The entire index after the ip being tested has to be scanned. I think the following should be an improvement:
SELECT COUNT(*)
FROM (SELECT i.ip_start, i.ip_end
FROM ip_ranges i
WHERE %s", $user_ip, " >= ip_start
ORDER BY ip_start DESC
LIMIT 1
) i
WHERE $user_ip <= ip_end;
The inner subquery should use the index but pull back only the first match. The outer query should then check the end of the range. Here the count(*) is okay, because there is at most one row.
I have a query that is forcing a full index scan against an innodb table - which is expected, however, the performance is still much slower than expected. The table has a structure like:
Field Type Null Key
CUSTOMER_ID int(11) NO MUL
CustLatitude decimal(15,12) YES
CustLongitude decimal(15,12) YES
StoreLatitude decimal(15,12) NO
StoreLongitude decimal(15,12) NO
StoreID int(11) NO MUL
Distance double YES MUL
For each CUSTOMER_ID I am selecting the row that contains the minimum Distance value as follows:
select
distinct(CUSTOMER_ID) as incustid,
(select StoreID from CustomerStoreDistance where CUSTOMER_ID = incustid
order by Distance ASC limit 1) as closeststoreid
from
CustomerStoreDistance;
As shown above, there are indexes on CUSTOMER_ID, Distance and StoreID. There are approximately 43M rows in the CustomerStoreDistance table and running on RDS with a db.cr1.8xlarge class machine with 244 GB of RAM and 32vCPUs.
The parameters have been optimized to the best of my knowledge for sorting, temp space, etc. however, am curious if there is a better way and/or more optimizations.
Thanks,
Chad
I think you would do much better if you had a separate customer table. In any case, try this version of the query:
select incustid,
(select StoreId
from CustomerStoreDistance csd2
where csd2.CustomerId = csd.incustid
order by distance
limit 1
) as ClosestStoreId
from (select distinct CustomerId as incustid
from CustomerStoreDistance
) csd;
The derived table is there to keep the distinct and the correlated subquery separate. I think MySQL would otherwise execute the subquery before applying the distinct, and that is just wasted effort.
To optimize this query, you want a composite index on CustomerStoreDistance(CustomerId, Distance, StoreId).
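For example, using the column names from the question (the index name is an assumption):
-- Covering index: each customer's nearest store can be read from the index
-- alone, already ordered by Distance.
ALTER TABLE CustomerStoreDistance
  ADD INDEX idx_cust_dist_store (CUSTOMER_ID, Distance, StoreID);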
EDIT:
Because these queries already require an aggregation to eliminate duplicates, this might work better:
select CustomerId,
substring_index(group_concat(StoreID order by Distance), ',', 1) as ClosestStoreId
from CustomerStoreDistance
group by CustomerId;
I have a large table with about 100 million records, with fields start_date and end_date, with DATE type. I need to check the number of overlaps with some date range, say between 2013-08-20 AND 2013-08-30, So I use.
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-08-20'
AND start_date <= '2013-08-30'
The date columns are indexed.
The important point is that the date ranges I am checking for overlaps are always in the future, while the main part of the records in the table is in the past (say about 97-99 million).
So, will this query be faster if I add a column is_future - TINYINT - so that by checking that condition first, like this
SELECT COUNT(*) FROM myTable WHERE is_future = 1
AND end_date >= '2013-08-20' AND start_date <= '2013-08-30'
it will exclude the other 97 million or so records and check the date condition only for the remaining 1-3 million records?
I use MySQL
Thanks
EDIT
The MySQL engine is InnoDB, but would it matter considerably if it were, say, MyISAM?
here is the create table
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title`
`start_date` date DEFAULT NULL,
`end_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
EDIT 2 after #Robert Co answer
Partitioning looks like a good idea for this case, but it does not allow me to create a partition based on the is_future field unless I make that field part of the primary key; otherwise I would have to drop my main primary key, id, which I cannot do. And if I do make that field part of the primary key, is there still a point to partitioning - wouldn't the search already be fast if I filter on is_future as part of the primary key?
EDIT 3
The actual query where I need to use this selects restaurants that have some free tables for that date range:
SELECT r.id, r.name, r.table_count
FROM restaurants r
LEFT JOIN orders o
ON r.id = o.restaurant_id
WHERE o.id IS NULL
OR (r.table_count > (SELECT COUNT(*)
FROM orders o2
WHERE o2.restaurant_id = r.id AND
end_date >= '2013-08-20' AND start_date <= '2013-08-30'
AND o2.status = 1
)
)
SOLUTION
After a lot more research and testing, the fastest way to count the rows in my case turned out to be simply adding one more condition - that start_date is later than the current date (because the date ranges for the search are always in the future):
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-09-01'
AND start_date >= '2013-08-20' AND start_date <= '2013-09-30'
It is also necessary to have one index covering the start_date and end_date fields (thank you #symcbean).
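Something like this (the index name is an assumption; put first whichever column your query constrains with the narrower range):
ALTER TABLE orders ADD INDEX idx_dates (start_date, end_date);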
As a result, the execution time on a table with 10m rows went from 7 seconds to 0.050 seconds.
SOLUTION 2 (#Robert Co)
Partitioning worked in this case as well! Perhaps it is a better solution than indexing, or the two can be applied together.
Thanks
This is a perfect use case for table partitioning. If the Oracle INTERVAL feature makes it to MySQL, then it will just add to the awesomeness.
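A hedged sketch of what that could look like here (partition boundaries are made-up examples; MySQL requires every unique key, including the primary key, to contain the partitioning column, hence the widened primary key):
-- Sketch only: widen the PK so start_date may be used for partitioning,
-- then split the table so future rows live in small, cheap partitions.
ALTER TABLE orders
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, start_date);

ALTER TABLE orders
  PARTITION BY RANGE (TO_DAYS(start_date)) (
    PARTITION p_past VALUES LESS THAN (TO_DAYS('2013-08-01')),
    PARTITION p_201308 VALUES LESS THAN (TO_DAYS('2013-09-01')),
    PARTITION p_201309 VALUES LESS THAN (TO_DAYS('2013-10-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
  );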
The date columns are indexed
What type of index? A hash-based index is no use for range queries. If it's not a BTREE index then change it now. And you've not shown us how they are indexed. Are both columns in the same index? Is there other stuff in there too? What order (end_date must appear as the first column)?
There are implicit type conversions in the script - this should be handled automatically by the optimizer, but it's worth checking....
SELECT COUNT(*) FROM myTable WHERE end_date >= 20130820000000
AND start_date <= 20130830235959
if I add a column is_future - TINYINT
First, in order to be of any use, this would require that the future dates be a small proportion of the total data stored in the table (less than 10%). And that's just to make it more efficient than a full table scan.
Secondly, it's going to require very frequent updates to the index to maintain it, which, in addition to the overhead of the initial population, is likely to lead to fragmentation of the index and degraded performance (depending on how the index is constructed).
Thirdly, if this still has to process 3 million rows of data (and specifically, via an index lookup) then it's going to be very slow even with the data pegged in memory.
Further, the optimizer is never likely to use this index without being forced to (due to the low cardinality).
I have done a simple test, just created an index on the tinyint column. The structures may not be the same, but with an index it seems to work.
http://www.sqlfiddle.com/#!2/514ab/1/0
and for count
http://www.sqlfiddle.com/#!2/514ab/2/0
View the execution plan there to see that the select scans just one row, which means it would process only the smaller set of records in your case.
So the simple answer is yes, with an index it would work.
I have a database of measurements that indicate a sensor, a reading, and the timestamp the reading was taken. The measurements are only recorded when there's a change. I want to generate a result set that shows the range each sensor is reading a particular measurement.
The timestamps are in milliseconds but I'm outputting the result in seconds.
Here's the table:
CREATE TABLE `raw_metric` (
`row_id` BIGINT NOT NULL AUTO_INCREMENT,
`sensor_id` BINARY(6) NOT NULL,
`timestamp` BIGINT NOT NULL,
`angle` FLOAT NOT NULL,
PRIMARY KEY (`row_id`)
)
Right now I'm getting the results I want using a subquery, but it's fairly slow when there's a lot of datapoints:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`timestamp` > rm1.`timestamp`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
Essentially, to get the range, I need to get the very next reading (or use the current time if there isn't another reading). The subquery finds the minimum timestamp that is later than the current one but is from the same sensor.
This query isn't going to occur very often so I'd prefer to not have to add an index on the timestamp column and slow down inserts. I was hoping someone might have a suggestion as to an alternate way of doing this.
UPDATE:
The row_ids should increase along with the timestamps, but that can't be guaranteed due to network latency issues. So it's possible, though unlikely, that an entry with a lower row_id occurs AFTER one with a higher row_id.
This is perhaps more appropriate as a comment than as a solution, but it is too long for a comment.
You are trying to implement the lead() function in MySQL, and MySQL does not, unfortunately, have window functions. You could switch to Oracle, DB2, Postgres, SQL Server 2012 and use the built-in (and optimized) functionality there. Ok, that may not be realistic.
So, given your data structure you need to do either a correlated subquery or a non-equijoin (actually a partial equi-join, because there is a match on sensor_id). These are going to be expensive operations, unless you add an index. Unless you are adding measurements tens of times per second, the additional overhead of the index should not be a big deal.
You could also change your data structure. If you had a "sensor counter" that was a sequential number enumerating the readings, then you could use this as an equijoin (although for good performance you might still want an index). Adding this to your table would require a trigger -- and that is likely to perform even worse than an index when inserting.
If you only have a handful of sensors, you could create a separate table for each one. Oh, I can feel the groans at this suggestion. But, if you did, then an auto-incremented id would perform the same role. To be honest, I would only do this if I could count the number of sensors on each hand.
In the end, I might suggest that you take the hit during insertion and have "effective" and "end" times on each record (as well as an index on the sensor id and either the timestamp or the id). With these additional columns, you will probably find more uses for the table.
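For instance, a sketch (the index name is an assumption):
-- Lets the correlated "next reading" lookup seek straight to
-- (same sensor, next timestamp) instead of scanning.
ALTER TABLE raw_metric ADD INDEX idx_sensor_ts (sensor_id, `timestamp`);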
If you are doing this for just one sensor, then create a temporary table for the information and use an auto-incremented id column. Then insert the data into it:
insert into temp_rawmetric (orig_row_id, sensor_id, timestamp, angle)
select row_id, sensor_id, timestamp, angle
from raw_metric
order by sensor_id, timestamp;
Be sure your table has a temp_rawmetric_id column that is auto-incremented and the primary key (creates an index automatically). The order by makes sure this is incremented according to the timestamp.
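A minimal sketch of that table, with the column names taken from the INSERT above and the types copied from raw_metric (adjust as needed):
CREATE TEMPORARY TABLE temp_rawmetric (
  temp_rawmetric_id BIGINT NOT NULL AUTO_INCREMENT,  -- assigned in timestamp order by the INSERT ... ORDER BY
  orig_row_id BIGINT NOT NULL,
  sensor_id BINARY(6) NOT NULL,
  `timestamp` BIGINT NOT NULL,
  angle FLOAT NOT NULL,
  PRIMARY KEY (temp_rawmetric_id)
);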
Then you can do your query as:
select trm.sensor_id, trm.angle,
trm.timestamp as startTime, trmnext.timestamp as endTime
from temp_rawmetric trm left outer join
temp_rawmetric trmnext
on trmnext.temp_rawmetric_id = trm.temp_rawmetric_id+1;
This will require a pass through the original data to extract the data, and then a primary key join on the temporary table. The first might take some time. The second should be pretty quick.
Select rm1.row_id
,HEX(rm1.sensor_id)
,rm1.angle
,(COALESCE(rm2.timestamp, UNIX_TIMESTAMP() * 1000) - rm1.timestamp) as duration
from raw_metric rm1
left outer join
raw_metric rm2
on rm2.sensor_id = rm1.sensor_id
and rm2.timestamp = (
select min(timestamp)
from raw_metric rm3
where rm3.sensor_id = rm1.sensor_id
and rm3.timestamp > rm1.timestamp
)
If you use an auto_increment primary key, you can replace the timestamp with row_id in the query's condition part, like this:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
It should run somewhat faster.
You can also add one more subquery to quickly select the row id of the next sensor value. See:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT timestamp FROM raw_metric AS rm1a
WHERE row_id =
(
SELECT MIN(`row_id`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
)
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1