Efficient way to compute number of matchings between two columns in MySQL - mysql

Description
I have a MySQL table like the following one:
CREATE TABLE `ticket` (
`ticket_id` int(11) NOT NULL AUTO_INCREMENT,
`ticket_number` varchar(30) DEFAULT NULL,
`pick1` varchar(2) DEFAULT NULL,
`pick2` varchar(2) DEFAULT NULL,
`pick3` varchar(2) DEFAULT NULL,
`pick4` varchar(2) DEFAULT NULL,
`pick5` varchar(2) DEFAULT NULL,
`pick6` varchar(2) DEFAULT NULL,
PRIMARY KEY (`ticket_id`)
) ENGINE=InnoDB AUTO_INCREMENT=19675 DEFAULT CHARSET=latin1;
Let's also assume we have the following values already stored in the DB:
+-----------+-------------------+-------+-------+-------+-------+-------+-------+
| ticket_id | ticket_number | pick1 | pick2 | pick3 | pick4 | pick5 | pick6 |
+-----------+-------------------+-------+-------+-------+-------+-------+-------+
| 655 | 08-09-21-24-46-52 | 8 | 9 | 21 | 24 | 46 | 52 |
| 658 | 08-23-24-40-42-45 | 8 | 23 | 24 | 40 | 42 | 45 |
| 660 | 07-18-19-20-22-31 | 7 | 18 | 19 | 20 | 22 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 19674 | 06-18-33-43-49-50 | 6 | 18 | 33 | 43 | 49 | 50 |
+-----------+-------------------+-------+-------+-------+-------+-------+-------+
Now, my goal is to compare every ticket with every other ticket in the table (except itself) in terms of the elements of their respective ticket_number fields (6 elements per set, split by -). For instance, if I compare ticket_id = 655 with ticket_id = 658, I find that the elements 08 and 24 appear in both sets. If I compare ticket_id = 660 with ticket_id = 19674, there is only one coincidence: 18.
What I am actually using to carry out these comparisons is the following query:
select A.ticket_id, A.ticket_number, P.ticket_id, P.ticket_number, count(P.ticket_number) as cnt
from ticket A
inner join ticket P on A.ticket_id != P.ticket_id
where
  ((A.ticket_number like concat("%", lpad(P.pick1,2,0), "%"))
   + (A.ticket_number like concat("%", lpad(P.pick2,2,0), "%"))
   + (A.ticket_number like concat("%", lpad(P.pick3,2,0), "%"))
   + (A.ticket_number like concat("%", lpad(P.pick4,2,0), "%"))
   + (A.ticket_number like concat("%", lpad(P.pick5,2,0), "%"))
   + (A.ticket_number like concat("%", lpad(P.pick6,2,0), "%")) > 3)
group by A.ticket_id
having cnt > 5;
That is, I first build an INNER JOIN pairing all rows with different ticket_id values, then compare each P.pickX (X = 1..6) against A.ticket_number of the joined row, counting the number of matches between the two sets.
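For reference, the comparison the query is trying to express (counting shared elements between two 6-number sets) can be sketched in plain Python; the ticket data below is a hypothetical subset of the table:

```python
# Sketch of the pairwise comparison the query performs, using set
# intersection instead of LIKE matching. The ticket data is hypothetical.
tickets = {
    655: {8, 9, 21, 24, 46, 52},
    658: {8, 23, 24, 40, 42, 45},
    660: {7, 18, 19, 20, 22, 31},
    19674: {6, 18, 33, 43, 49, 50},
}

def overlaps(tickets, min_matches):
    """Return (id_a, id_b, shared) for every ordered pair of distinct
    tickets sharing at least min_matches numbers."""
    return [
        (a, b, len(picks_a & picks_b))
        for a, picks_a in tickets.items()
        for b, picks_b in tickets.items()
        if a != b and len(picks_a & picks_b) >= min_matches
    ]

print(overlaps(tickets, min_matches=2))  # → [(655, 658, 2), (658, 655, 2)]
```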
Finally, after executing, I obtain something like this:
+-------------+-------------------+-------------+-------------------+-----+
| A.ticket_id | A.ticket_number | P.ticket_id | P.ticket_number | cnt |
+-------------+-------------------+-------------+-------------------+-----+
| 8489 | 14-21-28-32-48-49 | 2528 | 14-21-33-45-48-49 | 6 |
| 8553 | 02-14-17-38-47-53 | 2364 | 02-30-38-44-47-53 | 6 |
| 8615 | 05-12-29-33-36-43 | 4654 | 12-21-29-33-36-37 | 6 |
| 8686 | 09-13-29-34-44-48 | 6038 | 09-13-17-29-33-44 | 6 |
| 8693 | 01-10-14-17-42-50 | 5330 | 01-10-37-42-48-50 | 6 |
| ... | ... | ... | ... | ... |
| 19195 | 05-13-29-41-46-51 | 5106 | 07-13-14-29-41-51 | 6 |
+-------------+-------------------+-------------+-------------------+-----+
Problem
The problem is that I execute this on a table of 10,476 rows, resulting in more than 100 million ticket_number vs. pickX comparisons, which takes around 172 seconds in total. This is too slow.
GOAL
My goal is to make this execution as fast as possible, ideally completing in less than a second, since this must work in real time.
Is that possible?

If you want to keep the current structure, then change pick1..6 to the TINYINT type instead of VARCHAR.
A signed TINYINT stores values between -128 and 127 (0 to 255 if unsigned). Your query then won't need the CONCAT with %, which is the cause of the slow run.
Then, these two queries will give you the same result:
select * FROM ticket where pick1 = '8';
select * FROM ticket where pick1 = '08';
This is the sql structure:
CREATE TABLE `ticket` (
`ticket_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`ticket_number` varchar(30) DEFAULT NULL,
`pick1` tinyint(1) unsigned zerofill DEFAULT NULL,
`pick2` tinyint(1) unsigned zerofill DEFAULT NULL,
`pick3` tinyint(1) unsigned zerofill DEFAULT NULL,
`pick4` tinyint(1) unsigned zerofill DEFAULT NULL,
`pick5` tinyint(1) unsigned zerofill DEFAULT NULL,
`pick6` tinyint(1) unsigned zerofill DEFAULT NULL,
PRIMARY KEY (`ticket_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
I think you can even remove the zerofill.
If this doesn't work, change the table design.

How big can the numbers be? Looks like 50. If the answer is 63 or less, then change the format to this:
All 6 numbers are stored in a single SET ('0','1','2',...,'50') and use suitable operations to set the nth bit.
Then comparing two sets becomes BIT_COUNT(x & y) to find out how many numbers match; a simple equality comparison tests whether two sets are identical.
If your goal is to see whether a particular lottery guess is already in the table, then index that column so that a lookup will be fast: not minutes or even seconds, but a few milliseconds, even for a billion rows.
The bit arithmetic can be done in SQL or in your client language. For example, to build the SET for (11, 33, 7), the code would be
INSERT INTO t SET picks = '11,33,7' -- order does not matter
Also this would work:
... picks = (1 << 11) |
(1 << 33) |
(1 << 7)
A quick example:
CREATE TABLE `setx` (
`picks` set('1','2','3','4','5','6','7','8','9','10') NOT NULL
) ENGINE=InnoDB;
INSERT INTO setx (picks) VALUES ('2,10,6');
INSERT INTO setx (picks) VALUES ('1,3,5,7,9'), ('2,4,6,8,10'), ('9,8,7,6,5,4,3,2,1,10');
SELECT picks, HEX(picks+0) FROM setx;
+----------------------+--------------+
| picks | HEX(picks+0) |
+----------------------+--------------+
| 2,6,10 | 222 |
| 1,3,5,7,9 | 155 |
| 2,4,6,8,10 | 2AA |
| 1,2,3,4,5,6,7,8,9,10 | 3FF |
+----------------------+--------------+
4 rows in set (0.00 sec)
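The same BIT_COUNT(x & y) idea can be sketched outside SQL. This illustrative Python snippet (with made-up picks) encodes each ticket as an integer with bit n representing number n, so counting shared picks is one AND plus a popcount:

```python
# Illustrative sketch of the SET/bitmask idea with made-up picks: bit n of
# the mask represents number n, mirroring SET('0','1',...,'50') in MySQL,
# so BIT_COUNT(x & y) becomes a popcount of the ANDed masks.
def to_mask(picks):
    mask = 0
    for n in picks:
        mask |= 1 << n
    return mask

a = to_mask([8, 9, 21, 24, 46, 52])
b = to_mask([8, 23, 24, 40, 42, 45])

shared = bin(a & b).count("1")  # popcount, i.e. BIT_COUNT(a & b)
print(shared)  # → 2 (numbers 8 and 24 appear in both tickets)
```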

Related

Retyping alias column in mysql query

I am trying to convert some data from old tables to a new structure, where I need to convert a single-key ID to a composite one, and it is giving me some trouble:
My table (simplified):
CREATE TABLE `tmplt_spoergsmaal` (
`SpID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`lbnr` int(10) unsigned DEFAULT NULL,
`SpTekst` text,
`SpTitel` varchar(100) NOT NULL DEFAULT '',
`fk_Naeste_Sp` int(10) unsigned DEFAULT NULL,
`kontrol` tinyint(3) unsigned NOT NULL DEFAULT 0,
`kontrol_kilde` int(10) unsigned NOT NULL DEFAULT 0,
FOREIGN KEY (kontrol_kilde)
REFERENCES tmplt_spoergsmaal(SpID)
ON DELETE ignore
)
Sample data:
Note that SpID is sequential
+--------+--------+-----------+-----------+------------+-----------------+
| SpID | lbnr | SpTekst | SpTitel | kontrol | kontrol_kilde |
+--------+--------+-----------+-----------+------------+-----------------+
| 9000 | 100 | blablabla | title1 | 0 | null |
+--------+--------+-----------+-----------+------------+-----------------+
| 9001 | 101 | blablabla | title2 | 0 | null |
+--------+--------+-----------+-----------+------------+-----------------+
| 9002 | 102 | blablabla | title3 | 0 | null |
+--------+--------+-----------+-----------+------------+-----------------+
| 9003 | 103 | blablabla | title4 | 1 | 9000 |
+--------+--------+-----------+-----------+------------+-----------------+
| 9004 | 104 | blablabla | title5 | 1 | 9001 |
+--------+--------+-----------+-----------+------------+-----------------+
I am redesigning the database, and using the lbnr column instead of the kontrol_kilde column. My preliminary query is this:
SELECT spid, lbnr, kontrol, kontrol_kilde, (lbnr- (spid - kontrol_kilde)* kontrol)* kontrol AS k
FROM tmplt_spoergsmaal;
This solves my issue, but at one point an issue cropped up (because of a flipped subtraction: spid - kontrol_kilde had become kontrol_kilde - spid), which made part of the expression negative. Since the column is unsigned, this caused an error:
Error Code: 1690. BIGINT UNSIGNED value is out of range in
My question:
Can I "cast" the columns in the alias column k so that it is an INT instead of an unsigned INT?
Well, you can cast() as signed:
SELECT spid, lbnr, kontrol, kontrol_kilde,
cast((lbnr - (spid - kontrol_kilde) * kontrol) * kontrol as signed) AS k
FROM tmplt_spoergsmaal;

For each option in a matchmaker, make sure that at least one (but no more than one) matches per option

This will take a little explaining (moreso because I can't use the word "question" in the title of a question):
I have a matchmaker quiz with the following tables (simplified):
CREATE TABLE `Quiz` (
`quiz_id` int(10) unsigned NOT NULL,
`code` varchar(20) DEFAULT NULL,
`title` varchar(50) DEFAULT NULL,
PRIMARY KEY (`quiz_id`),
UNIQUE KEY `Quiz_1` (`code`)
);
CREATE TABLE `Quiz_Question` (
`quiz_id` int(10) unsigned NOT NULL,
`question_id` int(10) unsigned NOT NULL,
`question` varchar(250) DEFAULT NULL,
`type` int(10) unsigned NOT NULL, -- Lookup table of type of question: boolean, radio, select, multiselect
PRIMARY KEY (`question_id`)
);
CREATE TABLE `Quiz_Answer` (
`question_id` int(10) unsigned NOT NULL,
`answer_id` int(10) unsigned NOT NULL,
`answer` varchar(250) DEFAULT NULL,
PRIMARY KEY (`answer_id`)
);
CREATE TABLE `Quiz_Response` (
`user_id` int(10) unsigned NOT NULL,
`quiz_id` int(10) unsigned NOT NULL,
`question_id` int(10) unsigned NOT NULL,
`answer_id` int(10) unsigned DEFAULT NULL,
UNIQUE KEY `Response_1` (`user_id`,`question_id`,`answer_id`),
KEY `Response_2` (`question_id`,`answer_id`)
);
All pretty straightforward so far.
Previously, the query went like this (simplified):
SELECT u.login, COUNT( u.user_id ) AS matches, ...
FROM User u
INNER JOIN Quiz_Response rep ON u.user_id = rep.user_id
WHERE u.active = 1
AND (
(rep.question_id = 3 AND rep.answer_id IN (20, 24)) OR
(rep.question_id = 10 AND rep.answer_id IN (83,84,85))
)
GROUP BY u.user_id
HAVING matches >= 2
ORDER BY u.login
Note: I've removed things like whether something is active or not, display order, blocked users, date ranges, etc from the CREATE TABLE and query to focus on the core problem.
So if a user answered question 3 with either 20 or 24, they show up in the results once, and if they answered question 10 with either 83, 84, or 85, they show up a second time. The query then counts how many times any given user shows up; if that count is equal to or greater than the number of questions being matched, it is considered a match (in this case the matchmaker checked two possible questions, so there should be at least 2 entries).
My issue is that I'm introducing multiple-choice matches. As a result, a single question can produce multiple matches, which throws off the counting.
So, if a searcher says they are looking for people who answered question 5 with either A, B, or C, and a user says they like A, B, and C, that becomes three matches, essentially nullifying two other questions (they searched for three things and got back three matches, all from the same question).
So the question I'm asking is: how do I ensure that each question scores at most 1 match, even if multiple answers for a single question match?
Hope that all makes sense.
Instead of counting on u.user_id, count on distinct rep.question_id:
SELECT u.login, u.user_id, COUNT(distinct rep.question_id) AS matches
FROM User u
INNER JOIN Quiz_Response rep ON u.user_id = rep.user_id
WHERE u.active = 1
AND (
(rep.question_id = 3 AND rep.answer_id IN (20, 24)) OR
(rep.question_id = 10 AND rep.answer_id IN (83,84,85))
)
GROUP BY u.user_id
HAVING matches >= 2
ORDER BY u.login;
So if my Quiz_Response table looks like this:
+-------------+---------+-------------+-----------+---------+
| response_id | quiz_id | question_id | answer_id | user_id |
+-------------+---------+-------------+-----------+---------+
| 1 | 1 | 1 | 4 | 3 |
| 2 | 2 | 3 | 20 | 2 |
| 3 | 2 | 3 | 24 | 2 |
| 4 | 4 | 10 | 83 | 1 |
| 5 | 4 | 10 | 84 | 1 |
| 6 | 4 | 10 | 85 | 1 |
| 7 | 2 | 3 | 20 | 4 |
| 8 | 1 | 1 | 1 | 4 |
| 9 | 2 | 3 | 24 | 4 |
| 10 | 4 | 10 | 83 | 4 |
+-------------+---------+-------------+-----------+---------+
Output of the above query will be:
+---------------------+---------+---------+
| login | user_id | matches |
+---------------------+---------+---------+
| 2018-01-01 00:00:00 | 4 | 2 |
+---------------------+---------+---------+
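As a rough cross-check, the same logic can be reproduced with Python's built-in sqlite3 module; the data mirrors the sample Quiz_Response table above (the User join and login column are omitted to keep the sketch self-contained):

```python
import sqlite3

# Minimal sqlite3 reproduction of the COUNT(DISTINCT question_id) fix.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Quiz_Response (quiz_id INT, question_id INT, answer_id INT, user_id INT);
INSERT INTO Quiz_Response VALUES
  (1, 1, 4, 3),
  (2, 3, 20, 2), (2, 3, 24, 2),
  (4, 10, 83, 1), (4, 10, 84, 1), (4, 10, 85, 1),
  (2, 3, 20, 4), (1, 1, 1, 4), (2, 3, 24, 4), (4, 10, 83, 4);
""")
rows = conn.execute("""
    SELECT user_id, COUNT(DISTINCT question_id) AS matches
    FROM Quiz_Response
    WHERE (question_id = 3 AND answer_id IN (20, 24))
       OR (question_id = 10 AND answer_id IN (83, 84, 85))
    GROUP BY user_id
    HAVING matches >= 2
""").fetchall()
print(rows)  # → [(4, 2)]: only user 4 matched two distinct questions
```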

MySQL: Strange behavior of UPDATE query (ERROR 1062 Duplicate entry)

I have a MySQL database that stores news articles with the publication date (just day information), the source, and the category. Based on these, I want to generate a table that holds the article counts w.r.t. these 3 parameters.
Since for some combinations of these 3 parameters there might be no article, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters and a default article_count of 0, like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source | category | article_count |
+------------- +------------+----------+---------------+
| 2016-08-05 | 1826089206 | 0 | 0 |
| 2016-08-05 | 1826089206 | 1 | 0 |
| 2016-08-05 | 1826089206 | 2 | 0 |
| 2016-08-05 | 1826089206 | 3 | 0 |
| 2016-08-05 | 1826089206 | 4 | 0 |
| ... | ... | ... | ... |
+--------------+------------+----------+---------------+
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05 | 1826089206 | 3 | 1 |
| 2003-09-19 | 1826089206 | 4 | 1 |
| 2005-08-08 | 1826089206 | 3 | 1 |
| 2008-07-22 | 1826089206 | 4 | 1 |
| 2008-11-26 | 1826089206 | 8 | 1 |
| ... | ... | ... | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
news_article_counts c
INNER JOIN
tmp t
ON
c.published_at = t.published_at AND
c.source = t.source AND
c.category = t.category
SET
c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, category of table news_article_counts. But this shouldn't be a problem since I do not -- as far as I can tell -- update any of those 3 values, only article_count.
What confuses me most is that the error mentions the timestamp at which I executed the query (here: 2018-04-07 14:46:17). I have absolutely no idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as the value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
`published_at` TIMESTAMP NOT NULL,
`source` INT UNSIGNED NOT NULL,
`category` INT UNSIGNED NOT NULL,
`article_count` INT UNSIGNED NOT NULL DEFAULT 0,
UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database. And there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source | int(10) unsigned | NO | | NULL | |
| category | int(10) unsigned | NO | | NULL | |
| article_count | int(10) unsigned | NO | | 0 | |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set? I double- and triple-checked my CREATE TABLE statement. I removed the joint index, I added an artificial primary key (auto_increment). Nothing helped. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.
It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't want to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table; then it won't automatically add ON UPDATE.

How to improve an indexed inner join query Mysql?

This is my first question ever on this forum, so do not hesitate to tell me if there is anything to improve in my question.
I have a big database with two tables
"visit" (6M rows) which basically stores each visit on a website
| visitdate | city |
----------------------------------
| 2014-12-01 00:00:02 | Paris |
| 2015-01-03 00:00:02 | Marseille|
"cityweather" (1M rows) that stores weather infos 3 times a day for a lot of cities
| weatherdate | city |
------------------------------------
| 2014-12-01 09:00:02 | Paris |
| 2014-12-01 09:00:02 | Marseille|
Note that there can be cities in the visit table that are not in cityweather and vice versa, and I need to take only the cities common to both tables.
I first had a big query that I tried to run without success, so I am going back to the simplest possible query joining those two tables, but the performance is terrible.
SELECT COUNT(DISTINCT(t.city))
FROM visit t
INNER JOIN cityweather d
ON t.city = d.city;
Note that both tables are indexed on the city column, and I already ran COUNT(DISTINCT(city)) on each table independently; both take less than one second.
You will find below the result of the EXPLAIN on this query:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
----------------------------------
| 1 | SIMPLE | d | index | idx_city | idx_city | 303 | NULL | 1190553 | Using where; Using index |
| 1 | SIMPLE | t | ref | Idxcity | Idxcity | 303 | meteo.d.city | 465 | Using index |
You will find below the table information and especialy the engine for both tables :
visit
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
--------------------------------------------------------------------------------------------------------------------
| visit | InnoDB | 10 | Compact | 6208060 | 85 | 531628032 | 0 | 0 | 0 |
The SHOW CREATE TABLE output :
CREATE TABLE
`visit` (
`productid` varchar(8) DEFAULT NULL,
`visitdate` datetime DEFAULT NULL,
`minute` int(2) DEFAULT NULL,
`hour` int(2) DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`quotation` int(10) unsigned DEFAULT NULL,
`amount` int(10) unsigned DEFAULT NULL,
`city` varchar(100) DEFAULT NULL,
`weathertype` varchar(30) DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxvisitdate` (`visitdate`),
KEY `Idxcity` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
cityweather
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
------------------------------------------------------------------------------------------------------------------------------
| cityweather | InnoDB | 10 | Compact | 1190553 | 73 | 877670784 | 0 | 0 | 30408704 |
The SHOW CREATE TABLE output :
CREATE TABLE `cityweather` (
`city` varchar(100) DEFAULT NULL,
`lat` decimal(13,9) DEFAULT NULL,
`lon` decimal(13,9) DEFAULT NULL,
`weatherdate` datetime DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxweatherdate` (`weatherdate`),
KEY `idx_city` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have the feeling that the problem comes from the type = index and the ref = NULL but I have no idea how to fix it...
You can find here a close question that did not help me solve my problem
Thanks !
Your query is so slow because the index it uses can't narrow the row count down far enough. See your EXPLAIN output: it tells you that using the index on city (idx_city) in table cityweather requires processing 1,190,553 rows, and joining by city to your visit table requires another 465 rows from that table for each of them.
As a result your database has to process on the order of 1,190,553 x 465 row combinations.
As written, you can't improve the performance of this query. But you can modify it, e.g. by adding a condition on your visit data to narrow the results down. Try all kinds of EXISTS queries as well.
Update
Perhaps this helps:
CREATE TEMPORARY TABLE tmpTbl
SELECT distinct city as city from cityweather;
ALTER TABLE tmpTbl Add index adweerf (city);
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
Since IN ( SELECT ... ) optimizes poorly, change
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
to
SELECT COUNT(*)
FROM ( SELECT DISTINCT city FROM cityweather ) x
WHERE EXISTS( SELECT * FROM visit
WHERE city = x.city );
Both tables need (and have) an index on city. I'm pretty sure it is better to put the smaller table (cityweather) in the SELECT DISTINCT.
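A toy sqlite3 version (with made-up city data) shows the shape of the DISTINCT-plus-EXISTS rewrite; only cities present in both tables are counted:

```python
import sqlite3

# Toy sqlite3 check of the DISTINCT + EXISTS rewrite; city data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (visitdate TEXT, city TEXT);
CREATE TABLE cityweather (weatherdate TEXT, city TEXT);
INSERT INTO visit VALUES
  ('2014-12-01 00:00:02', 'Paris'),
  ('2015-01-03 00:00:02', 'Marseille'),
  ('2015-01-04 00:00:02', 'Lyon');           -- no weather data for Lyon
INSERT INTO cityweather VALUES
  ('2014-12-01 09:00:02', 'Paris'),
  ('2014-12-01 09:00:02', 'Marseille'),
  ('2014-12-01 09:00:02', 'Nice');           -- no visits for Nice
""")
(common,) = conn.execute("""
    SELECT COUNT(*)
    FROM ( SELECT DISTINCT city FROM cityweather ) x
    WHERE EXISTS( SELECT * FROM visit WHERE city = x.city )
""").fetchone()
print(common)  # → 2: only Paris and Marseille appear in both tables
```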
Other points:
Every InnoDB table really should have a PRIMARY KEY.
You could save a lot of space by using TINYINT UNSIGNED (1 byte), etc, instead of using 4-byte INT always.
9 decimal places for lat/lng is excessive for cities, and takes 12 bytes. I vote for DECIMAL(4,2)/(5,2) (1.6km / 1mi resolution; 5 bytes) or DECIMAL(6,4)/(7,4) (16m/52ft, 7 bytes).

GeoIP table join with table of IP's in MySQL

I am having an issue finding a fast way of joining tables that look like this:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in finding the needed rows in geo_ip for only ONE IP address at a time; I need an entity_ip LEFT JOIN geo_ip (or a similar/analogous approach).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(On higher number of rows in the search - there is no difference)
Currently I cannot get any faster performance from these queries, and 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than "greater than", so we'll need to build an index like that from the data you have available. This way MySQL will have much fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (= 1 in 1.2.3.4) of the IP addresses. The idea is that for each look-up, you can ignore all geolocation IPs which do not begin with the same first octet as the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
Since someone here requested a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DELETE FROM ip_geolocation_lookup;
WHILE i < 256 DO
INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
( ip_numeric_end & 0xFF000000 ) >> 24 >= i;
SET i = i + 1;
END WHILE;
END;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table will satisfy two conditions:
Find all rows which match the first octet of your IP address
Of that subset: find the row whose range contains your IP address
Because the step two is carried out on a subset of data, it is considerably faster than doing the range tests on the entire data. This is the key to this optimization strategy.
There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
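For illustration, here is the same bucketing strategy sketched in Python with a hypothetical range list; build_lookup expands each range over every first octet it covers, and lookup scans only the matching bucket:

```python
from collections import defaultdict

# Python sketch of the first-octet bucketing strategy with a hypothetical
# range list: each (start, end, country) range is copied into the bucket of
# every first octet it spans, so a lookup scans one small bucket instead of
# the whole table.
def build_lookup(ranges):
    buckets = defaultdict(list)
    for start, end, country in ranges:
        for octet in range(start >> 24, (end >> 24) + 1):
            buckets[octet].append((start, end, country))
    return buckets

def lookup(buckets, ip_numeric):
    for start, end, country in buckets.get(ip_numeric >> 24, ()):
        if start <= ip_numeric <= end:
            return country
    return None

# 72.255.119.248 - 74.3.127.255, as in the conversion example above;
# 'XX' is a placeholder country code.
buckets = build_lookup([(1224701944, 1241743359, 'XX')])
print(sorted(buckets))               # → [72, 73, 74]
print(lookup(buckets, 1230000000))   # an address inside the range → 'XX'
```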
Now, admittedly I did get a little lazy in the end: You could easily get rid of ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
WHILE i < 1048576 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
It builds much faster than his solution and drills down more precisely because we're keying on the first 12 bits (startIpNum DIV 1048576, i.e. DIV 2^20 on a 32-bit address) rather than just the first 8. Join performance: 100,000 rows in 158 ms. You might have to adapt the table and field names to your version.
Query by using
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
I can't comment yet, but user1281376's answer is wrong and doesn't work. The reason you only use the first octet is that you aren't going to match all IP ranges otherwise: there are plenty of ranges that span multiple second octets which user1281376's changed query isn't going to match. And yes, this actually happens if you use the MaxMind GeoIP data.
With Aleksi's suggestion you can do a simple comparison on the first octet, thus reducing the matching set.
I found an easy way. I noticed that the first IP of every group satisfies ip % 256 = 0,
so we can add an ip_index table:
CREATE TABLE `t_map_geo_range` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM
How to fill the index table (pseudocode):
FOR EACH row (ipGroupStart, ipGroupEnd) OF ip_geo
{
    FOR ip FROM ipGroupStart/256 TO ipGroupEnd/256
    {
        INSERT INTO ip_geo_index (ip, ipGroupStart);
    }
}
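The fill loop above can be sketched in Python (with a made-up, 256-aligned range) to show what ends up in the index table:

```python
# Sketch of the fill loop with a hypothetical, 256-aligned range: each geo
# group produces one index row per /24 block (ip DIV 256) it covers, all
# pointing back at the group's starting address.
def build_index(ip_geo):
    index = {}
    for group_start, group_end in ip_geo:
        for block in range(group_start // 256, group_end // 256 + 1):
            index[block] = group_start   # maps _ip -> _ipStart
    return index

index = build_index([(256, 1023)])
print(index)  # → {1: 256, 2: 256, 3: 256}
```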
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B.ipStart;
More than 1000 times faster.