Selecting the oldest updated set of entries - mysql

I have the following table my_entry:
Id int(11) AI PK
InternalId varchar(30)
UpdatedDate datetime
IsDeleted bit(1)
And I have the following query:
SELECT
    `Id`, `InternalId`
FROM
    `my_entry` `x`
WHERE
    (`IsDeleted` = FALSE)
    AND (`UpdatedDate` IS NULL
         OR DATE(`UpdatedDate`) != DATE(STR_TO_DATE('17/10/2019', '%d/%m/%Y')))
ORDER BY `x`.`UpdatedDate`
LIMIT 200;
The table has around 3M records. I have a program that executes the above query and returns 200 entries that weren't updated today; the program then changes those 200 entries and updates them, setting UpdatedDate to today's date. On the next execution those 200 entries are ignored and a new 200 entries get selected. This keeps running until all the entries in the table have been selected and updated for today.
This way I can ensure that all the entries are updated at least once every day.
This works perfectly fine for the first few thousand entries: the SELECT executes in a couple of milliseconds. But as soon as more entries are updated and carry today's date in UpdatedDate, the query keeps slowing down, reaching execution times of up to 20 seconds.
I'm wondering if I can do something to optimize the query, or if there is a better approach to take without using the UpdatedDate.
I was thinking of using the Id and paginating the entries, but I'm afraid this way I might miss some of them.
What I already tried:
Adding indexes to both the UpdatedDate and IsDeleted.
Changing the UpdatedDate type from datetime to date.
Edit:
MySql version: 5.6.45
The table in hand:
CREATE TABLE `my_entry` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`InternalId` varchar(30) NOT NULL,
`UpdatedDate` date DEFAULT NULL,
`IsDeleted` bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (`Id`),
UNIQUE KEY `InternalId` (`InternalId`),
KEY `UpdatedDate` (`UpdatedDate`),
KEY `entry_isdeleted_index` (`IsDeleted`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=8204626 DEFAULT CHARSET=utf8mb4
The output of the EXPLAIN query:
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys                     | key         | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
|  1 | SIMPLE      | x     | index | UpdatedDate,entry_isdeleted_index | UpdatedDate | 4       | NULL |  400 | Using where |
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
Example of data in the table:
+------------+--------+---------------------+-----------+
| InternalId | Id | UpdatedDate | IsDeleted |
+------------+--------+---------------------+-----------+
| 328044773 | 552990 | 2019-10-17 10:11:29 | 0 |
| 330082707 | 552989 | 2019-10-17 10:11:29 | 0 |
| 329701688 | 552988 | 2019-10-17 10:11:29 | 0 |
| 329954358 | 552987 | 2019-10-16 10:11:29 | 0 |
| 964227577 | 552986 | 2019-10-16 12:33:29 | 0 |
| 329794593 | 552985 | 2019-10-16 12:33:29 | 0 |
| 400015773 | 552984 | 2019-10-16 12:33:29 | 0 |
| 330674329 | 552983 | 2019-10-16 12:33:29 | 0 |
+------------+--------+---------------------+-----------+
Example expected output of the query:
+------------+--------+
| InternalId | Id |
+------------+--------+
| 329954358 | 552987 |
| 964227577 | 552986 |
| 329794593 | 552985 |
| 400015773 | 552984 |
| 330674329 | 552983 |
+------------+--------+

First, simplify the date arithmetic. Then take the following approach:
Take NULL values in one subquery
Take rows not updated on the given date in another
Then order and select the results
Start by writing the query as:
SELECT Id, InternalId
FROM ((SELECT Id, InternalId, 2 as priority
FROM my_entry
WHERE NOT IsDeleted AND UpdatedDate IS NULL
LIMIT 200
) UNION ALL
(SELECT Id, InternalId, 1 as priority
FROM my_entry
WHERE NOT IsDeleted AND UpdatedDate <> '2019-10-17'
LIMIT 200
)
) t
ORDER BY priority
LIMIT 200;
The index that you want is either (UpdatedDate, IsDeleted) or (IsDeleted, UpdatedDate). You can append Id and InternalId to make it a covering index.
The idea is to select at most 200 rows from the two subqueries without sorting. Then the outer query is sorting at most 400 rows -- and that should not take multiple seconds.
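For completeness, such a covering index could be created like this; a sketch with an illustrative index name (note that InnoDB already appends the primary key Id to every secondary index):
ALTER TABLE `my_entry`
    ADD INDEX `idx_isdeleted_updateddate` (`IsDeleted`, `UpdatedDate`, `InternalId`);
With that index in place, each subquery can read its 200 rows directly from the index without touching the table rows.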


Extract records by amount summary from second table

My query:
SELECT fd.*
FROM `fin_document` as fd
LEFT JOIN `fin_income` as fi ON fd.id=fi.document_id
WHERE fd.dt_payment < NOW()
HAVING SUM(fi.amount) < fd.total_amount
which is obviously not correct. It has to retrieve all records from fin_document where dt_payment is earlier than NOW(); that part is OK. But I also have to filter them by the payments made on those documents. One document can have more than one payment (2, 3, 4, 5, ...), and those payments are stored in fin_income. The column fin_income.document_id is a foreign key referencing fin_document.id. The problem (at least for me) is that I don't have a specific id criterion, and the amount is the sum of all matching records in the fin_income table. I also have to find documents that still don't have any payments on them (they have no rows in fin_income).
fin_document:
+--------------+---------------+------+-----+---------+----------------+
| Field        | Type          | Null | Key | Default | Extra          |
+--------------+---------------+------+-----+---------+----------------+
| id           | int(11)       | NO   | PRI | NULL    | auto_increment |
| dt_payment   | date          | YES  |     | NULL    |                |
| total_amount | decimal(10,2) | NO   |     | 0.00    |                |
+--------------+---------------+------+-----+---------+----------------+
fin_income:
+------------------+---------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+---------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| document_id | int(11) | YES | MUL | NULL | |
| amount | decimal(10,2) | YES | | 0.00 |
+------------------+---------------+------+-----+-------------------+----------------+
I'm not sure if I understand you correctly, but you can try this:
SELECT fd.*, SUM(IFNULL(fi.amount, 0)) as sum_amount, COUNT(fi.amount) as count_amount
FROM `fin_document` as fd
LEFT JOIN `fin_income` as fi ON fd.id=fi.document_id
WHERE fd.dt_payment < NOW()
GROUP BY fd.id
HAVING sum_amount < fd.total_amount    # condition for searching by sum of payments
   AND count_amount = {needed_count};  # condition for searching by count of payments
# documents without payments will have
# sum and count equal to 0
All aggregations are done in the SELECT part; then all documents are grouped by id to avoid duplicates in the result and to make it possible to use the aggregation results (SUM, COUNT). Finally you can apply the needed conditions (on the date, the paid sum, or the count of payments).
Note: be aware that GROUP BY can significantly increase execution time on large amounts of data.
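If the GROUP BY over the joined rows ever becomes the bottleneck, one variation (a sketch, not tested against the original data) is to aggregate fin_income first in a derived table and LEFT JOIN the result, so the grouping happens only on the smaller payments table:
SELECT fd.*,
       IFNULL(p.sum_amount, 0)   AS sum_amount,
       IFNULL(p.count_amount, 0) AS count_amount
FROM `fin_document` AS fd
LEFT JOIN (
    SELECT document_id,
           SUM(amount)   AS sum_amount,
           COUNT(amount) AS count_amount
    FROM `fin_income`
    GROUP BY document_id
) AS p ON p.document_id = fd.id
WHERE fd.dt_payment < NOW()
  AND IFNULL(p.sum_amount, 0) < fd.total_amount;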
You may just need a correlated subquery to test the income:
drop table if exists fin_document,fin_income;
create table fin_document
(id int(11),
dt_payment date ,
total_amount decimal(10,2)
) ;
create table fin_income
( id int(11) ,
document_id int(11) ,
amount decimal(10,2)
);
insert into fin_document values
(1,'2019-05-31',1000),
(2,'2019-06-10',1000),
(3,'2019-07-10',1000);
insert into fin_income values
(1,1,5),(2,1,5);
SELECT fd.*,(select coalesce(sum(fi.amount),0) from fin_income fi where fd.id=fi.document_id) income
FROM `fin_document` as fd
WHERE fd.dt_payment < NOW() and
fd.total_amount > (select coalesce(sum(fi.amount),0) from fin_income fi where fd.id=fi.document_id);
+------+------------+--------------+--------+
| id | dt_payment | total_amount | income |
+------+------------+--------------+--------+
| 1 | 2019-05-31 | 1000.00 | 10.00 |
| 2 | 2019-06-10 | 1000.00 | 0.00 |
+------+------------+--------------+--------+
2 rows in set (0.00 sec)
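If the correlated subqueries need to stay fast as fin_income grows, a composite index lets MySQL compute each SUM from the index alone; a sketch, with an illustrative index name:
ALTER TABLE fin_income ADD INDEX idx_income_doc_amount (document_id, amount);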

MySQL: Strange behavior of UPDATE query (ERROR 1062 Duplicate entry)

I have a MySQL database that stores news articles with their publication date (just day information), source, and category. Based on these I want to generate a table that holds the article counts w.r.t. these 3 parameters.
Since for some combinations of these 3 parameters there might be no article, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters and a default article_count of 0 -- like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source | category | article_count |
+--------------+------------+----------+---------------+
| 2016-08-05 | 1826089206 | 0 | 0 |
| 2016-08-05 | 1826089206 | 1 | 0 |
| 2016-08-05 | 1826089206 | 2 | 0 |
| 2016-08-05 | 1826089206 | 3 | 0 |
| 2016-08-05 | 1826089206 | 4 | 0 |
| ... | ... | ... | ... |
+--------------+------------+----------+---------------+
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05 | 1826089206 | 3 | 1 |
| 2003-09-19 | 1826089206 | 4 | 1 |
| 2005-08-08 | 1826089206 | 3 | 1 |
| 2008-07-22 | 1826089206 | 4 | 1 |
| 2008-11-26 | 1826089206 | 8 | 1 |
| ... | ... | ... | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
news_article_counts c
INNER JOIN
tmp t
ON
c.published_at = t.published_at AND
c.source = t.source AND
c.category = t.category
SET
c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, category of table news_article_counts. But this shouldn't be a problem since I do not -- as far as I can tell -- update any of those 3 values, only article_count.
What confuses me most is that the error mentions the timestamp at which I executed the query (here: 2018-04-07 14:46:17). I have absolutely no idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as the value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
`published_at` TIMESTAMP NOT NULL,
`source` INT UNSIGNED NOT NULL,
`category` INT UNSIGNED NOT NULL,
`article_count` INT UNSIGNED NOT NULL DEFAULT 0,
UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database. And there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source | int(10) unsigned | NO | | NULL | |
| category | int(10) unsigned | NO | | NULL | |
| article_count | int(10) unsigned | NO | | 0 | |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set? I double- and triple-checked my CREATE TABLE statement. I removed the joint index and added an artificial primary key (auto_increment). Nothing helped. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.
It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't want to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table; then MySQL won't automatically add ON UPDATE.
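For example, declaring the default explicitly when redefining the column should stop the implicit ON UPDATE from being added; a sketch, assuming DEFAULT CURRENT_TIMESTAMP is acceptable for published_at:
ALTER TABLE `news_article_counts`
    MODIFY `published_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
Afterwards, SHOW COLUMNS FROM news_article_counts should no longer list on update CURRENT_TIMESTAMP in the Extra column.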

MySql calendar table and performances

For a project I'm working on, I have a single table with two dates representing a range of dates, and I needed a way to "multiply" my rows into one row for every day in between the two dates.
So for instance I have start 2017-07-10 and end 2017-07-14.
I needed to have 4 lines with 2017-07-10, 2017-07-11, 2017-07-12, 2017-07-13.
In order to do this I found someone here mentioning the use of a "calendar table" with all the dates for years to come.
So I built it; now I have these two simple tables:
CREATE TABLE `time_sample` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` varchar(16) DEFAULT NULL,
`end` varchar(16) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `start_idx` (`start`),
KEY `end_idx` (`end`)
) ENGINE=MyISAM AUTO_INCREMENT=222 DEFAULT CHARSET=latin1;
This table contains my date ranges; start and end are indexed, and the primary key is an auto-increment int.
Sample Row:
id start end
1 2015-05-13 2015-05-18
Second table:
CREATE TABLE `time_dimension` (
`id` int(11) NOT NULL,
`db_date` date NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `td_dbdate_idx` (`db_date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This has a date indexed for every day for many years to come.
Sample row:
id db_date
20120101 2012-01-01
Now, I made the join:
select * from time_sample s join time_dimension t on (t.db_date >= start and t.db_date < end);
This takes 3 ms. Even if my first table is HUGE, this query will always be very quick (the max I've seen was 50 ms with a lot of records).
The issue I have is with grouping the results (I need them grouped for my application):
select * from time_sample s join time_dimension t on (t.db_date >= start and t.db_date < end) group by db_date;
This takes more than one second even with not so many rows in the first table, and it increases dramatically. Why is this happening, and how can I avoid it?
Changing the data types doesn't help, and having the second table with just one column doesn't help either.
Can I have some suggestions, please? :(
I cannot replicate this result...
I have a calendar table with lots of dates: calendar(dt) where dt is a PRIMARY KEY DATE data type.
DROP TABLE IF EXISTS time_sample;
CREATE TABLE time_sample (
id int(11) NOT NULL AUTO_INCREMENT,
start date not NULL,
end date null,
PRIMARY KEY (id),
KEY (start,end)
);
INSERT INTO time_sample (start,end) VALUES ('2010-03-13','2010-05-09');
SELECT *
FROM calendar x
JOIN time_sample y
ON x.dt BETWEEN y.start AND y.end;
+------------+----+------------+------------+
| dt | id | start | end |
+------------+----+------------+------------+
| 2010-03-13 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-14 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-15 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-16 | 1 | 2010-03-13 | 2010-05-09 |
...
| 2010-05-09 | 1 | 2010-03-13 | 2010-05-09 |
+------------+----+------------+------------+
58 rows in set (0.10 sec)
EXPLAIN
SELECT * FROM calendar x JOIN time_sample y ON x.dt BETWEEN y.start AND y.end;
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | y | system | start | NULL | NULL | NULL | 1 | |
| 1 | SIMPLE | x | range | PRIMARY | PRIMARY | 3 | NULL | 57 | Using where; Using index |
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)
Even with a GROUP BY, I'm struggling to reproduce the problem. Here's a simple COUNT...
SELECT SQL_NO_CACHE dt, COUNT(1) FROM calendar x JOIN time_sample y WHERE x.dt BETWEEN y.start AND y.end GROUP BY dt ORDER BY COUNT(1) DESC LIMIT 3;
+------------+----------+
| dt | COUNT(1) |
+------------+----------+
| 2010-04-03 | 2 |
| 2010-05-05 | 2 |
| 2010-03-13 | 2 |
+------------+----------+
3 rows in set (0.36 sec)
EXPLAIN
SELECT SQL_NO_CACHE dt, COUNT(1) FROM calendar x JOIN time_sample y WHERE x.dt BETWEEN y.start AND y.end GROUP BY dt ORDER BY COUNT(1) DESC LIMIT 3;
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | y | index | start | start | 7 | NULL | 2 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | x | index | PRIMARY | PRIMARY | 3 | NULL | 1000001 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
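If the original query's plan also shows Using temporary; Using filesort, one small thing worth trying on MySQL 5.x (where GROUP BY sorts implicitly) is suppressing that implicit sort with ORDER BY NULL; a sketch against the tables from the question, not verified:
select t.db_date, count(*) as cnt
from time_sample s
join time_dimension t on (t.db_date >= s.start and t.db_date < s.end)
group by t.db_date
order by null;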

Why doesn't MySQL automatically optimize the BETWEEN query?

I have two queries for the same output.
Slow Query:
SELECT
*
FROM
account_range
WHERE
is_active = 1 AND '8033576667466317' BETWEEN range_start AND range_end;
Execution Time: ~800 ms.
Explain:
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | account_range | NULL | ALL | range_start,range_end,range_se_active_idx | NULL | NULL | NULL | 940712 | 2.24 | Using where |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
Very Fast Query: learnt from here
SELECT
*
FROM
account_range
WHERE
is_active = 1 AND
range_start = (SELECT
MAX(range_start)
FROM
account_range
WHERE
range_start <= '8033576667466317') AND
range_end = (SELECT
MIN(range_end)
FROM
account_range
WHERE
range_end >= '8033576667466317')
Execution Time: ~1ms
Explain:
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
| 1 | PRIMARY | account_range | NULL | ref | range_start,range_end,range_se_active_idx | range_se_active_idx | 125 | const,const,const | 1 | 100.00 | NULL |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| 2 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
Table Structure:
CREATE TABLE account_range (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
range_start varchar(20) NOT NULL,
range_end varchar(20) NOT NULL,
is_active tinyint(1) NOT NULL,
bank_name varchar(100) DEFAULT NULL,
addedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
updatedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
description text,
PRIMARY KEY (id),
KEY range_start (range_start),
KEY range_end (range_end),
KEY range_se_active_idx (range_start , range_end , is_active)
) ENGINE=InnoDB AUTO_INCREMENT=946132 DEFAULT CHARSET=utf8;
Please do explain: why doesn't MySQL automatically optimize the BETWEEN query?
Update:
I realised my mistake after reading kordirko's answer. My table contains only non-overlapping ranges, so both queries return the same results.
Such a comparison doesn't make sense, since you are comparing apples to oranges.
These two queries are not equivalent: they give different results, so MySQL optimises them in different ways and their plans can differ.
See this simple example: http://sqlfiddle.com/#!9/98678/2
create table account_range(
is_active int,
range_start int,
range_end int
);
insert into account_range values
(1,-20,100), (1,10,30);
First query gives 2 rows:
select * from account_range
where is_active = 1 and 25 between range_start AND range_end;
| is_active | range_start | range_end |
|-----------|-------------|-----------|
| 1 | -20 | 100 |
| 1 | 10 | 30 |
Second query gives only 1 row:
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start = (SELECT MAX(range_start)
FROM account_range
WHERE range_start <= 25
) AND
range_end = (SELECT MIN(range_end)
FROM account_range
WHERE range_end >= 25
)
| is_active | range_start | range_end |
|-----------|-------------|-----------|
| 1 | 10 | 30 |
To speed this query up (the first one), two bitmap indexes could be used together with a "bitmap AND" operation - but MySQL doesn't have such a feature.
Another option is a specialised index type (for example GIN indexes in PostgreSQL: http://www.postgresql.org/docs/current/static/textsearch-indexes.html).
And another option is a star transformation (a star schema) - you need to "divide" this table into two "dimension" (or "measure") tables and one "fact" table ... but this is too broad a topic; if you want to know more you can start from here: https://en.wikipedia.org/wiki/Star_schema
The second query is fast because MySQL is able to use an available index.
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start = a_constant_value_1 AND
range_end = a_constant_value_2
The above query is fast because the range_se_active_idx index can satisfy the search criteria, so it is used.
Both subqueries are also fast (see Select tables optimized away in the EXPLAIN output):
SELECT MAX(range_start) FROM account_range
WHERE range_start <= '8033576667466317'
SELECT MIN(range_end) FROM account_range
WHERE range_end >= '8033576667466317'
because range_start and range_end are both indexed, their values are ordered.
With ordered data, for the first subquery MySQL basically just picks the one record whose range_start equals 8033576667466317 or the closest value below it (MAX(range_start)). For the second subquery, MySQL picks the one record whose range_end equals 8033576667466317 or the closest value above it (MIN(range_end)).
For the BETWEEN ... AND ... query, MySQL cannot use the indexes effectively, because each condition on its own matches a huge part of the table. The query is basically the same as:
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start <= '8033576667466317' AND
range_end >= '8033576667466317';
It has to examine records with range_start from the smallest value up to 8033576667466317, and also records with range_end from 8033576667466317 up to the largest value. No single index can satisfy both criteria at once, so it has to scan the table.
I believe it can be optimized if you can rewrite it into something like this:
SELECT * FROM account_range
WHERE
is_active = 1 AND
(range_start BETWEEN a_min_value AND a_max_value) AND
(range_end BETWEEN a_min_value AND a_max_value);

Query against two integer columns takes an absurd amount of time

I have a query that gets generated (by Django) like this:
SELECT `geo_ip`.`id`, `geo_ip`.`start_ip`,
`geo_ip`.`end_ip`, `geo_ip`.`start`,
`geo_ip`.`end`, `geo_ip`.`cc`, `geo_ip`.`cn`
FROM `geo_ip`
WHERE (`geo_ip`.`start` <= 2084738290 AND `geo_ip`.`end` >= 2084738290 )
LIMIT 1
It queries a geolocation table with 134189 entries in it. Each query takes >100 ms to perform even when indexes are added, which makes it unusable for anything more than one-off lookups. I'm going to cache the response so I only have to do the IP lookup once, but I'm curious whether I'm missing some obvious way of making it an order of magnitude faster. My table:
CREATE TABLE `geo_ip` (
`start_ip` char(15) NOT NULL,
`end_ip` char(15) NOT NULL,
`start` bigint(20) NOT NULL,
`end` bigint(20) NOT NULL,
`cc` varchar(6) NOT NULL,
`cn` varchar(150) NOT NULL,
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=134190 DEFAULT CHARSET=latin1
Creating an index on both columns like so:
ALTER TABLE geo_ip ADD INDEX (start, end);
Gives the following explain:
EXPLAIN SELECT geo_ip.id, geo_ip.start_ip, geo_ip.end_ip,
geo_ip.start, geo_ip.end, geo_ip.cc, geo_ip.cn
FROM geo_ip
WHERE (geo_ip.end >= 2084738290 AND geo_ip.start < 2084738290)
LIMIT 1;
+----+-------------+--------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+-------+---------------+-------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | geo_ip | range | start | start | 8 | NULL | 67005 | 100.00 | Using where |
+----+-------------+--------+-------+---------------+-------+---------+------+-------+----------+-------------+
It takes well over 100ms to complete selects:
SELECT geo_ip.id, geo_ip.start_ip, geo_ip.end_ip,
geo_ip.start, geo_ip.end, geo_ip.cc,
geo_ip.cn
FROM geo_ip
WHERE (geo_ip.end >= 2084738290 and geo_ip.start < 2084738290)
LIMIT 1;
+-------+--------------+----------------+------------+------------+----+-----------+
| id | start_ip | end_ip | start | end | cc | cn |
+-------+--------------+----------------+------------+------------+----+-----------+
| 51725 | 124.66.128.0 | 124.66.159.255 | 2084732928 | 2084741119 | SG | Singapore |
+-------+--------------+----------------+------------+------------+----+-----------+
1 row in set (0.18 sec)
This is more expensive than having individual indexes on each column:
ALTER TABLE geo_ip ADD INDEX (`start`);
ALTER TABLE geo_ip ADD INDEX (`end`);
+----+-------------+--------+-------+---------------+-------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+-------+---------+------+-------+-------------+
| 1 | SIMPLE | geo_ip | range | start,end | start | 8 | NULL | 68017 | Using where |
+----+-------------+--------+-------+---------------+-------+---------+------+-------+-------------+
It takes around 100ms to complete these requests:
SELECT geo_ip.id, geo_ip.start_ip, geo_ip.end_ip, geo_ip.start, geo_ip.end, geo_ip.cc, geo_ip.cn FROM geo_ip
WHERE (geo_ip.end >= 2084738290 AND geo_ip.start < 2084738290) limit 1;
+-------+--------------+----------------+------------+------------+----+-----------+
| id | start_ip | end_ip | start | end | cc | cn |
+-------+--------------+----------------+------------+------------+----+-----------+
| 51725 | 124.66.128.0 | 124.66.159.255 | 2084732928 | 2084741119 | SG | Singapore |
+-------+--------------+----------------+------------+------------+----+-----------+
1 row in set (0.11 sec)
But both of these methods take way too long. Is it possible to do anything about this?
The time is always consumed in the "where" clause.
And because you are working on two different fields with "lower than" or "greater than" conditions, it has to read a lot of index entries to find out which record is the one you want.
I would have designed my table this way:
+-------+-------+----------------+------------+----+-----------+
| id    | type  | ip             | geo        | cc | cn        |
+-------+-------+----------------+------------+----+-----------+
| 51725 | start | 124.66.128.0   | 2084732928 | SG | Singapore |
+-------+-------+----------------+------------+----+-----------+
| 51726 | end   | 124.66.159.255 | 2084741119 | SG | Singapore |
+-------+-------+----------------+------------+----+-----------+
so that I can select this :
select * from table where geo between '2084732927' and '2084732928'
with an index on geo.
It should be much, much faster. But sorry, I have no time to try it.
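For reference, when the ranges are known not to overlap (as is typical for GeoIP data), the same trick shown in the account_range question above can be applied here; a sketch, not verified against this exact schema:
SELECT `id`, `start_ip`, `end_ip`, `start`, `end`, `cc`, `cn`
FROM `geo_ip`
WHERE `start` <= 2084738290
ORDER BY `start` DESC
LIMIT 1;
With an index on start, this reads a single row from the top of the matching index range; the caller then only has to check that end >= 2084738290 to confirm the address really falls inside the returned range.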