Storing email lists with a variable number of fields - mysql

I am looking to allow my users to store their email lists and I am using MySQL.
I want to allow users to decide what fields their email list may have. eg. Name, Email, Location, Birthday
Each user may want different fields, and each list may have different fields from the others.
I'm really stuck as to how I should structure my databases to allow for this. Any help would be appreciated.

Since the number of fields and the types of fields are possibly unknown, and could change depending on the user, it might not make sense to hard-code them as columns. Instead, I would recommend a key-value pair approach here. First, define a table email_fields looking something like this:
user_id | id | field
1 | 1 | Name
1 | 2 | Email
1 | 3 | Location
1 | 4 | Birthday
Above user 1 has configured his email lists to have four fields, which are the ones you gave as an example in your question. Adding more fields for this user, or adding more users, just means adding more records to this table, without changing the actual columns.
Then, in another table email_lists, you would store the actual data for each email address and user:
id | user_id | field_id | value
1 | 1 | 1 | Sam Plan
1 | 1 | 2 | sam.plan@somewhere.com
1 | 1 | 3 | Gonesville, AL
1 | 1 | 4 | 1967-03-28
In other words, the basic idea is that every email, for every user, would be represented by a set of records corresponding to a bunch of key-value pairs.
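The idea can be sketched with a small script. Python's sqlite3 is used here purely for illustration (the same SQL works in MySQL); the table and sample data follow the examples above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email_fields (user_id INTEGER, id INTEGER, field TEXT);
CREATE TABLE email_lists  (id INTEGER, user_id INTEGER, field_id INTEGER, value TEXT);

INSERT INTO email_fields VALUES (1,1,'Name'),(1,2,'Email'),(1,3,'Location'),(1,4,'Birthday');
INSERT INTO email_lists  VALUES (1,1,1,'Sam Plan'),(1,1,2,'sam.plan@somewhere.com'),
                                (1,1,3,'Gonesville, AL'),(1,1,4,'1967-03-28');
""")

# Re-assemble one "record" (id = 1) for user 1 by joining each value
# back to its field name.
rows = conn.execute("""
    SELECT f.field, l.value
    FROM email_lists l
    JOIN email_fields f ON f.user_id = l.user_id AND f.id = l.field_id
    WHERE l.user_id = 1 AND l.id = 1
""").fetchall()
record = dict(rows)
print(record)
```

The pivot from rows back into a single record happens in application code (the `dict(rows)` step), which is the usual trade-off of the key-value approach.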

I am looking to allow my users to store their email lists and I am using MySQL.
This means that a given user may have more than one email list.
So the appropriate table for storing the email lists should be like this one:
+---------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| user_id | int(10) unsigned | NO | | NULL | |
| title | varchar(255) | NO | | NULL | |
+---------+------------------+------+-----+---------+----------------+
Each user may want different fields, and each list may have different fields from the others.
The fields will be text or numerical. The user could create his own fields:
+---------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| email_list_id | int(10) unsigned | NO | | NULL | |
| field | varchar(255) | NO | | NULL | |
| value | longtext | YES | | NULL | |
+---------------+------------------+------+-----+---------+----------------+
Here is the SQL query to create those tables:
CREATE TABLE `email_list` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`user_id` int unsigned NOT NULL,
`title` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `email_list_data` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`email_list_id` int unsigned NOT NULL,
`field` varchar(255) NOT NULL,
`value` longtext DEFAULT NULL,
PRIMARY KEY (`id`)
);
In another approach, you would use one table for storing the fields and another one for storing the values:
CREATE TABLE `email_list` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`user_id` int unsigned NOT NULL,
`title` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `email_list_fields` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`email_list_id` int unsigned NOT NULL,
`field` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `email_list_field_values` (
`email_list_id` int unsigned NOT NULL,
`field_id` int unsigned NOT NULL,
`field_value` longtext DEFAULT NULL
);
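As a sketch of how the second approach reads back, here is a minimal demo (Python's sqlite3 stands in for MySQL; the list title and sample contact data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email_list (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
CREATE TABLE email_list_fields (id INTEGER PRIMARY KEY, email_list_id INTEGER, field TEXT);
CREATE TABLE email_list_field_values (email_list_id INTEGER, field_id INTEGER, field_value TEXT);

INSERT INTO email_list VALUES (1, 1, 'Newsletter');
INSERT INTO email_list_fields VALUES (1, 1, 'Name'), (2, 1, 'Email');
INSERT INTO email_list_field_values VALUES (1, 1, 'Ann'), (1, 2, 'ann@example.com');
""")

# Fetch every field/value pair belonging to list 1, in field order.
pairs = conn.execute("""
    SELECT f.field, v.field_value
    FROM email_list_fields f
    JOIN email_list_field_values v
      ON v.email_list_id = f.email_list_id AND v.field_id = f.id
    WHERE f.email_list_id = 1
    ORDER BY f.id
""").fetchall()
print(pairs)
```

Splitting fields from values (instead of repeating the field name per row, as in the first variant) keeps the field definitions in one place per list.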

The other answers revolve around metadata. This is great, though it isn't meant for large amounts of data or for relationships based on a "record", because there would be no Contact model: you would get back 30 records if there were 30 fields specified for the contacts in the list.
My suggestion takes 2 parts that are highly complicated to implement, but the result would be more structured in the end.
1) Tenant DB Structure
There are some tutorials out there that can help with this. Essentially, you would have 1 master database that stores your teams, etc. Then, each of your teams would get their own database. We have a project that implements this structure and it works well. You can be connected to both the main and the specific tenant database at the same time.
2) Generating migration files
I have never done this part before, but I assume it is possible.
Try making a flexible schema-file builder. You can execute system commands via PHP to save a file you generated into a directory structured by team, such as /migrations/teams/1/37284_user_submitted_migration or something like that.
Execute
You can then run the artisan migration command on that specific file, for that specific tenant database (use the Artisan::call() facade helper). Each of your teams could have whatever structure they wish, as if you had built it for them.
This can all be executed within the same job.

Can this SQL query be optimized?

This is a query for a Postfix table lookup (smtpd_sender_login_maps) in MariaDB (MySQL). Given an email address, it returns the users allowed to use that address. I am using two SQL tables to store the accounts and aliases that need to be searched. Postfix requires a single query that returns a single result set, hence the UNION SELECT. I know there is unionmap:{} in Postfix, but I do not want to go that route and prefer the union select.
The emails.email column is the username that is returned for Postfix SASL authentication. The %s in the query is where Postfix inserts the email address to search for. The reason for matching everything back to emails.postfixPath is that it is the physical inbox: if two accounts share the same inbox, they should both have access to use all the same emails, including aliases.
Table: emails
+-------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| email | varchar(100) | NO | PRI | NULL | |
| postfixPath | varchar(100) | NO | MUL | NULL | |
| password | varchar(50) | YES | | NULL | |
| acceptMail | tinyint(1) | NO | | 1 | |
| allowLogin | tinyint(1) | NO | | 1 | |
| mgrLogin | tinyint(1) | NO | | 0 | |
+-------------+--------------+------+-----+---------+-------+
Table: aliases
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| email | varchar(100) | NO | PRI | NULL | |
| forwardTo | varchar(100) | NO | | NULL | |
| acceptMail | tinyint(1) | NO | | 1 | |
+------------+--------------+------+-----+---------+-------+
SELECT email
FROM emails
WHERE postfixPath=(
SELECT postfixPath
FROM emails
WHERE email='%s'
AND acceptMail=1
LIMIT 1)
AND password IS NOT NULL
AND allowLogin=1
UNION SELECT email
FROM emails
WHERE postfixPath=(
SELECT postfixPath
FROM emails
WHERE email=(
SELECT forwardTo
FROM aliases
WHERE email='%s'
AND acceptMail=1)
LIMIT 1)
AND password IS NOT NULL
AND allowLogin=1
AND acceptMail=1
This query works; it just looks heavy to me, and I feel like it should be more streamlined / efficient. Does anyone have a better way to write this, or is this as good as it gets?
I added CREATE INDEX index_postfixPath ON emails (postfixPath) per @The Impaler's suggestion.
@Rick James, here is the additional table info:
Table: emails
Create Table: CREATE TABLE `emails` (
`email` varchar(100) NOT NULL,
`postfixPath` varchar(100) NOT NULL,
`password` varchar(50) DEFAULT NULL,
`acceptMail` tinyint(1) NOT NULL DEFAULT 1,
`allowLogin` tinyint(1) NOT NULL DEFAULT 1,
`mgrLogin` tinyint(1) NOT NULL DEFAULT 0,
PRIMARY KEY (`email`),
KEY `index_postfixPath` (`postfixPath`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Table: aliases
Create Table: CREATE TABLE `aliases` (
`email` varchar(100) NOT NULL,
`forwardTo` varchar(100) NOT NULL,
`acceptMail` tinyint(1) NOT NULL DEFAULT 1,
PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Part 1:
SELECT email
FROM emails
WHERE postfixPath=
(
SELECT postfixPath
FROM emails
WHERE email='%s'
AND acceptMail = 1
LIMIT 1
)
AND password IS NOT NULL
AND allowLogin = 1
With indexes:
emails: (email, acceptMail, password)
I assume acceptMail has only 2 values? The Optimizer cannot know that, so it sees AND acceptMail as a range test. AND acceptMail = 1 fixes that. (No, > 0, != 0, etc, can't be optimized.)
Part 2:
This has 3 layers, and is probably where the inefficiency is.
SELECT e.email
FROM ( SELECT forwardTo ... ) AS c
JOIN ( SELECT postfixPath ... ) AS d ON ...
JOIN emails AS e ON e.postfixPath = d.postfixPath
This is how the Optimizer might optimize your version. But I am not sure it did, so I changed it to encourage it to do so.
Again, use =1 when testing for "true". Then have these indexes:
aliases: (email, acceptMail, forwardTo)
emails: (email, postfixPath)
emails: (postfixPath, allowLogin, acceptMail, password, email)
Finally, the UNION:
( SELECT ... part 1 ... )
UNION ALL
( SELECT ... part 2 ... )
I added parentheses to avoid ambiguities about what clauses belong to the Selects versus to the Union.
UNION ALL is faster than UNION (which is UNION DISTINCT), but you might get the same email twice. However, that may be nonsense -- forwarding an email to yourself??
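The deduplication difference is easy to see with a toy example (Python's sqlite3 used purely for illustration; the behavior of UNION vs. UNION ALL is the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two SELECTs that both return the same value: UNION (i.e. UNION DISTINCT)
# collapses the duplicate, UNION ALL keeps both rows.
distinct = conn.execute(
    "SELECT 'a@x' AS email UNION SELECT 'a@x'").fetchall()
all_rows = conn.execute(
    "SELECT 'a@x' AS email UNION ALL SELECT 'a@x'").fetchall()
print(len(distinct), len(all_rows))  # 1 2
```

So UNION ALL skips the deduplication pass entirely, which is where its speed advantage comes from.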
The order of columns in each index is important. (However, some variants are equivalent.)
I think all the indexes I provided are "covering", thereby giving an extra performance boost.
Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE. "MUL" is especially ambiguous.
(Caveat: I threw this code together rather hastily; it may not be quite correct, but principles should help.)
For further optimization, do as I did and split it into 3 steps; check the performance of each.
The following three indexes will make the query faster:
create index ix1 on emails (allowLogin, postfixPath, acceptMail, password, email);
create index ix2 on emails (email, acceptMail);
create index ix3 on aliases (email, acceptMail);

MySQL - insert rows in one table and then update another with the auto increment ID

I have 2 tables called applications and filters. The structure of the tables are as follows:
mysql> DESCRIBE applications;
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| id | tinyint(3) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| filter_id | int(3) | NO | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
3 rows in set (0.01 sec)
mysql> DESCRIBE filters;
+----------+----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+----------------------+------+-----+---------+----------------+
| id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(100) | NO | | NULL | |
| label | varchar(255) | NO | | NULL | |
| link | varchar(255) | NO | | NULL | |
| anchor | varchar(100) | NO | | NULL | |
| group_id | tinyint(3) unsigned | NO | MUL | NULL | |
| comment | varchar(255) | NO | | NULL | |
+----------+----------------------+------+-----+---------+----------------+
7 rows in set (0.02 sec)
What I want to do is select all the records in applications and make a corresponding record in filters (so that filters.name is the same as applications.name). When the record is inserted in filters I want to get the primary key (filters.id) of the newly inserted record - which is an auto increment field - and update applications.filter_id with it. I should clarify that applications.filter_id is a field I've created for this purpose and contains no data at the moment.
I am a PHP developer and have written a script which can do this, but want to know if it's possible with a pure MySQL solution. In pseudo-code the way my script works is as follows:
1) Select all the records in applications
2) Do a foreach loop on (1)
3) Insert a record in filters (filters.name == applications.name)
4) Store the inserted ID (filters.id) in a variable and then update applications.filter_id with the variable's data.
I'm unaware of how to do the looping (2) and storing the auto increment ID (4) in MySQL.
I have read "Get the new record primary key ID from MySQL insert query?", so I am aware of LAST_INSERT_ID(), but I am not sure how to reference this in some kind of "loop" which goes through each of the applications records.
Please can someone advise if this is possible?
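For reference, the loop described in the pseudo-code can be sketched like this (Python with sqlite3 standing in for the PHP/MySQL version; sqlite's cursor.lastrowid plays the role of MySQL's LAST_INSERT_ID()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE applications (id INTEGER PRIMARY KEY, name TEXT, filter_id INTEGER);
CREATE TABLE filters (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO applications (name) VALUES ('app one'), ('app two');
""")

# 1) select all applications, 2) loop over them, 3) insert a filter with
# the same name, 4) capture the new auto-increment id and write it back.
for app_id, name in conn.execute(
        "SELECT id, name FROM applications ORDER BY id").fetchall():
    cur = conn.execute("INSERT INTO filters (name) VALUES (?)", (name,))
    conn.execute("UPDATE applications SET filter_id = ? WHERE id = ?",
                 (cur.lastrowid, app_id))

print(conn.execute("SELECT name, filter_id FROM applications").fetchall())
```

The key point is that the auto-increment id must be captured immediately after each insert, inside the loop, before the next insert overwrites it.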
I don't think it is possible to do this with only one request to MySQL.
But I think this is a good use case for MySQL triggers.
I think you should write it like this:
CREATE TRIGGER before_insert_create_application_filter BEFORE INSERT
ON applications FOR EACH ROW
BEGIN
INSERT INTO filters (name) VALUES (NEW.name);
SET NEW.filter_id = LAST_INSERT_ID();
END
This trigger is not tested, but it should show the way to write it. Note that it has to be a BEFORE INSERT trigger that sets NEW.filter_id directly: MySQL does not allow a trigger to UPDATE the same table it is defined on, so an AFTER INSERT trigger that updates applications would fail.
If you don't know MySQL triggers, you can read this part of the documentation.
This isn't an answer to your question, more a comment on your database design.
First of all, if the two name fields need to contain the same information, they should be the same type and size (varchar(255)).
Overall though, I think the schema you're using for your tables is wrong. Your description says that each record in applications can only hold one filter_id. If that is the case, there's no point in using two separate tables.
If there is a chance that there will be a many-to-one relationship, link the records via the relevant primary key. If multiple records in application can relate to a single filter, store filters.id in the applications table. If there are multiple filters for a single application, store applications.id in the filters table.
If there is a many-to-many relationship, create another table to store it:
CREATE TABLE `application_filters_mappings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`application_id` int(10) unsigned NOT NULL,
`filters_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
);

How to improve an indexed inner join query Mysql?

This is my first question ever on a forum, so do not hesitate to tell me if there is anything to improve in it.
I have a big database with two tables
"visit" (6M rows) which basically stores each visit on a website
| visitdate | city |
----------------------------------
| 2014-12-01 00:00:02 | Paris |
| 2015-01-03 00:00:02 | Marseille|
"cityweather" (1M rows), which stores weather info 3 times a day for a lot of cities
| weatherdate | city |
------------------------------------
| 2014-12-01 09:00:02 | Paris |
| 2014-12-01 09:00:02 | Marseille|
Note that there can be cities in the table visit that are not in cityweather and vice versa, and I need to take only the cities that are common to both tables.
I first had a big query that I tried to run without success, so I am going back to the simplest possible query joining those two tables, but its performance is terrible.
SELECT COUNT(DISTINCT(t.city))
FROM visit t
INNER JOIN cityweather d
ON t.city = d.city;
Note that both tables are indexed on the column city, and I already ran the COUNT(DISTINCT(city)) on both tables independently; it takes less than one second for each.
You can find below the result of the EXPLAIN on this query:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
----------------------------------
| 1 | SIMPLE | d | index | idx_city | idx_city | 303 | NULL | 1190553 | Using where; Using index |
| 1 | SIMPLE | t | ref | Idxcity | Idxcity | 303 | meteo.d.city | 465 | Using index |
You will find below the table information and especialy the engine for both tables :
visit
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
--------------------------------------------------------------------------------------------------------------------
| visit | InnoDB | 10 | Compact | 6208060 | 85 | 531628032 | 0 | 0 | 0 |
The SHOW CREATE TABLE output :
CREATE TABLE
`visit` (
`productid` varchar(8) DEFAULT NULL,
`visitdate` datetime DEFAULT NULL,
`minute` int(2) DEFAULT NULL,
`hour` int(2) DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`quotation` int(10) unsigned DEFAULT NULL,
`amount` int(10) unsigned DEFAULT NULL,
`city` varchar(100) DEFAULT NULL,
`weathertype` varchar(30) DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxvisitdate` (`visitdate`),
KEY `Idxcity` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
cityweather
| Name | Engine | Version | Row_Format | Rows | Avg_row_len | Data_len | Max_data_len | Index_len | Data_free |
------------------------------------------------------------------------------------------------------------------------------
| cityweather | InnoDB | 10 | Compact | 1190553 | 73 | 877670784 | 0 | 0 | 30408704 |
The SHOW CREATE TABLE output :
CREATE TABLE `cityweather` (
`city` varchar(100) DEFAULT NULL,
`lat` decimal(13,9) DEFAULT NULL,
`lon` decimal(13,9) DEFAULT NULL,
`weatherdate` datetime DEFAULT NULL,
`temp` int(11) DEFAULT NULL,
`pressure` int(11) DEFAULT NULL,
`humidity` int(11) DEFAULT NULL,
KEY `Idxweatherdate` (`weatherdate`),
KEY `idx_city` (`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have the feeling that the problem comes from the type = index and the ref = NULL but I have no idea how to fix it...
You can find here a close question that did not help me solve my problem
Thanks !
Your query is so slow because the index you use can't get the number of rows down far enough. See your EXPLAIN output: it tells you that using the index on city (idx_city) in table cityweather will require 1,190,553 rows to process, and joining by city to your visit table will then require roughly 465 rows from that table for each of them.
As a result, your database has to process on the order of 1,190,553 x 465 row combinations.
As the query stands, you can't improve its performance. But you can modify the query, e.g. by adding a condition on your visit data to narrow the results down. Try all kinds of EXISTS queries as well.
Update
Perhaps this helps:
CREATE TEMPORARY TABLE tmpTbl
SELECT distinct city as city from cityweather;
ALTER TABLE tmpTbl Add index adweerf (city);
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
Since IN ( SELECT ... ) optimizes poorly, change
SELECT COUNT(DISTINCT(city)) FROM visit WHERE city in (SELECT city from tmpTbl);
to
SELECT COUNT(*)
FROM ( SELECT DISTINCT city FROM cityweather ) x
WHERE EXISTS( SELECT * FROM visit
WHERE city = x.city );
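A toy version of the rewritten query, with a handful of made-up rows standing in for the real tables (Python's sqlite3 for illustration; the EXISTS form is the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (visitdate TEXT, city TEXT);
CREATE TABLE cityweather (weatherdate TEXT, city TEXT);
INSERT INTO visit VALUES ('2014-12-01', 'Paris'), ('2015-01-03', 'Marseille'),
                         ('2015-01-04', 'Lyon');
INSERT INTO cityweather VALUES ('2014-12-01', 'Paris'), ('2014-12-01', 'Marseille'),
                               ('2014-12-01', 'Nice');
""")

# Count cities common to both tables: DISTINCT on the smaller table,
# then an EXISTS probe into the bigger one for each candidate city.
(common,) = conn.execute("""
    SELECT COUNT(*)
    FROM ( SELECT DISTINCT city FROM cityweather ) x
    WHERE EXISTS( SELECT * FROM visit WHERE visit.city = x.city )
""").fetchone()
print(common)  # Paris and Marseille appear in both tables
```

The EXISTS probe can stop at the first matching row per city, which is why this shape tends to beat a full join followed by COUNT(DISTINCT ...).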
Both tables need (and have) an index on city. I'm pretty sure it is better to put the smaller table (cityweather) in the SELECT DISTINCT.
Other points:
Every InnoDB table really should have a PRIMARY KEY.
You could save a lot of space by using TINYINT UNSIGNED (1 byte), etc, instead of using 4-byte INT always.
9 decimal places for lat/lng is excessive for cities, and takes 12 bytes. I vote for DECIMAL(4,2)/(5,2) (1.6km / 1mi resolution; 5 bytes) or DECIMAL(6,4)/(7,4) (16m/52ft, 7 bytes).

Optimizing Slow, Indexed Select MySql Query

I am trying to execute a simple select query using a table indexed on src_ip like so:
SELECT * FROM netflow_nov2 WHERE src_IP=3111950672;
However, this does not complete even after 4 or 5 hours. I need the response to be in the range of a few seconds, and I am wondering how I can optimize the query so that it is.
Also note that the source IPs were converted to integers using MySQL's built-in function.
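For reference, the dotted-quad to integer mapping (MySQL provides it as the built-in INET_ATON() function) is just base-256 arithmetic; a minimal Python sketch:

```python
def inet_aton(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer form,
    the same mapping MySQL's INET_ATON() performs."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# The constant used in the query above corresponds to this address:
print(inet_aton("185.124.153.80"))  # 3111950672
```

Storing the integer form (as the table does with int(10) unsigned) is what makes an ordinary B-tree index on src_IP usable for equality lookups.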
Other information about the table:
The table contains netflow data parsed from nfdump. I am using the table to get information about specific IP addresses. In other words, basically only queries like the above will be used.
Here is the relevant info as given by SHOW TABLE STATUS for this table:
Rows: 4,205,602,143 (4 billion)
Data Length: 426,564,911,104 (426 GB)
Index Length: 57,283,706,880 (57 GB)
Information about the system:
Hard disk: ~2TB, using close to maximum
RAM: 64GB
my.cnf file:
see gist: https://gist.github.com/ashtonwebster/e0af038101e1b42ca7e3
Table structure:
mysql> DESCRIBE netflow_nov2;
+-----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| date | datetime | YES | MUL | NULL | |
| duration | float | YES | | NULL | |
| protocol | varchar(16) | YES | | NULL | |
| src_IP | int(10) unsigned | YES | MUL | NULL | |
| src_port | int(2) | YES | | NULL | |
| dest_IP | int(10) unsigned | YES | MUL | NULL | |
| dest_port | int(2) | YES | | NULL | |
| flags | varchar(8) | YES | | NULL | |
| Tos | int(4) | YES | | NULL | |
| packets | int(8) | YES | | NULL | |
| bytes | int(8) | YES | | NULL | |
| pps | int(8) | YES | | NULL | |
| bps | int(8) | YES | | NULL | |
| Bpp | int(8) | YES | | NULL | |
| Flows | int(8) | YES | | NULL | |
+-----------+------------------+------+-----+---------+-------+
15 rows in set (0.02 sec)
I have additional info about the indexes and the results of explain, but briefly:
- The indexes are B-trees, and there are indexes for date, src_ip, and dest_ip, but only src_ip will really be used
- Based on the output of EXPLAIN, the src_ip index is being used for that particular query mentioned at the top
And the output of mysqltuner:
see gist: https://gist.github.com/ashtonwebster/cbfd98ee1799a7f6b323
SHOW CREATE TABLE output:
| netflow_nov2 | CREATE TABLE `netflow_nov2` (
`date` datetime DEFAULT NULL,
`duration` float DEFAULT NULL,
`protocol` varchar(16) DEFAULT NULL,
`src_IP` int(10) unsigned DEFAULT NULL,
`src_port` int(2) DEFAULT NULL,
`dest_IP` int(10) unsigned DEFAULT NULL,
`dest_port` int(2) DEFAULT NULL,
`flags` varchar(8) DEFAULT NULL,
`Tos` int(4) DEFAULT NULL,
`packets` int(8) DEFAULT NULL,
`bytes` int(8) DEFAULT NULL,
`pps` int(8) DEFAULT NULL,
`bps` int(8) DEFAULT NULL,
`Bpp` int(8) DEFAULT NULL,
`Flows` int(8) DEFAULT NULL,
KEY `src_IP` (`src_IP`),
KEY `dest_IP` (`dest_IP`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
Thanks in advance
Your current table structure is optimized for random writes: records are placed on disk in the order of writes.
Unfortunately the only read pattern that is well supported by such a structure is a full-table scan.
Usage of non-covering secondary indices still results in a lot of random disk seeks which are killing performance.
The best reading performance is obtained when data is read in the same order as it is located on disk, which for InnoDB means in the primary key order.
A materialized view (another InnoDB table that has an appropriate primary key) could be a possible solution. In this case a primary key that starts with src_IP is required.
upd: The idea is to achieve data locality and avoid random disk IO, aiming for sequential reading. This means that your materialized view will look like this:
CREATE TABLE `netflow_nov2_view` (
`row_id` bigint not null, -- see below
`date` datetime DEFAULT NULL,
`duration` float DEFAULT NULL,
`protocol` varchar(16) DEFAULT NULL,
`src_IP` int(10) unsigned DEFAULT NULL,
`src_port` int(2) DEFAULT NULL,
`dest_IP` int(10) unsigned DEFAULT NULL,
`dest_port` int(2) DEFAULT NULL,
`flags` varchar(8) DEFAULT NULL,
`Tos` int(4) DEFAULT NULL,
`packets` int(8) DEFAULT NULL,
`bytes` int(8) DEFAULT NULL,
`pps` int(8) DEFAULT NULL,
`bps` int(8) DEFAULT NULL,
`Bpp` int(8) DEFAULT NULL,
`Flows` int(8) DEFAULT NULL,
PRIMARY KEY (`src_IP`, `row_id`) -- you won't need other keys
) ENGINE=InnoDB DEFAULT CHARSET=latin1
where row_id has to be maintained by your materializing logic, since you don't have it in the original table (or you can introduce an explicit auto_increment field to your original table, it's how InnoDB handles it anyway).
The crucial difference is that now all data on the disk is placed in the primary key order, which means that once you locate the first record with a given 'src_IP' all other records can be obtained as sequentially as possible.
Depending on the way your data is written and adjacent application logic it can be accomplished either via triggers or by some custom external process.
If it is possible to sacrifice current write performance (or use some async queue as a buffer) then probably having a single table optimized for reading would suffice.
More on InnoDB indexing:
http://dev.mysql.com/doc/refman/5.6/en/innodb-index-types.html
I would think that reading the table without an index would take less than 5 hours. But you do have a big table. There are two "environmental" possibilities that would kill the performance:
The table is locked by another process.
The result set is huge (tens of millions of rows) and the network latency/processing time for returning the result set is causing the problem.
My first guess, though, is that the query is not using the index. I missed this at first, but you have one multi-part index. The only index this query can take advantage of is one where the first key is src_IP. So, if your index is either netflow_nov2(src_IP, date, dest_ip) or netflow_nov2(src_IP, dest_ip, date), then you are ok. If either of the other columns is first, then this index will not be used. You can easily see what is happening by putting EXPLAIN in front of the query to see if the index is being used.
If this is a problem, create an index with src_IP as the first (or only) key in the index.

GeoIP table join with table of IP's in MySQL

I am having an issue finding a fast way of joining tables that look like this:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in looking up rows in geo_ip for only ONE IP address at a time; I need an entity_ip LEFT JOIN geo_ip (or a similar/analogous approach).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(With a higher number of rows in the search, there is no difference.)
Currently I cannot get any faster performance out of these queries, and 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than "greater than", so we'll need to build an index like that from the data you have available. This way MySQL will have much fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (= 1 for 1.2.3.4) of the IP addresses. The idea is that for each look-up, you can ignore all geolocation IPs which do not begin with the same octet as the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
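That expansion is just "one row per first octet the range touches"; a minimal Python sketch of the same logic (the function name is mine, not part of the original solution):

```python
def lookup_rows(ip_numeric_start: int, ip_numeric_end: int):
    """Expand one geolocation range into one lookup row per first octet
    it covers, mirroring the ip_geolocation_lookup rows shown above."""
    first = ip_numeric_start >> 24   # first octet of the range start
    last = ip_numeric_end >> 24      # first octet of the range end
    return [(octet, ip_numeric_start, ip_numeric_end)
            for octet in range(first, last + 1)]

# The 72.255.119.248 - 74.3.127.255 example above:
print(lookup_rows(1224701944, 1241743359))
```

A range spanning octets 72 through 74 yields three rows, each repeating the full numeric bounds, exactly as in the table above.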
Since someone here requested a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DELETE FROM ip_geolocation_lookup;
WHILE i < 256 DO
INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
( ip_numeric_end & 0xFF000000 ) >> 24 >= i;
SET i = i + 1;
END WHILE;
END;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table will satisfy two conditions:
Find all rows which match the first octet of your IP address
Of that subset: find the row whose range matches your IP address
Because step two is carried out on a subset of the data, it is considerably faster than running the range tests against the entire table. This is the key to this optimization strategy.
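To make the two-step idea concrete, here is a small in-memory Python model of the look-up table (names and sample data are mine, for illustration only): ranges are bucketed by first octet, and a query only range-tests the rows in one bucket.

```python
from collections import defaultdict

# Toy model of ip_geolocation_lookup: ranges bucketed by first octet.
lookup = defaultdict(list)
ranges = [
    (1224701944, 1241743359, "US"),   # spans first octets 72-74
    (3232235520, 3232301055, "LAN"),  # 192.168.0.0 - 192.168.255.255
]
for start, end, tag in ranges:
    for octet in range(start >> 24, (end >> 24) + 1):
        lookup[octet].append((start, end, tag))

def find(ip_numeric):
    # Step 1: narrow to the first-octet bucket.
    # Step 2: range-test only that subset.
    return [tag for start, end, tag in lookup[ip_numeric >> 24]
            if start <= ip_numeric <= end]

print(find(1230000000))  # an IP inside the first range → ['US']
```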
There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
Now, admittedly, I did get a little lazy in the end: you could easily get rid of the ip_geolocation table altogether if you made the ip_geolocation_lookup table contain the country data as well. I'm guessing dropping one table from this query would make it a bit faster.
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE IF EXISTS `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
WHILE i < 1048576 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
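The procedure fills the table in three passes: one INSERT for each range's starting bucket, one for its ending bucket, and a loop that copies rows forward through the buckets in between. A Python sketch of the equivalent expansion (function name and test values are mine):

```python
def bucket_rows(start, end, loc_id, bucket_bits=20):
    """One row per 2^bucket_bits bucket that the range [start, end] touches.

    With bucket_bits=20, `ip >> 20` matches the SQL `ip DIV 1048576`.
    """
    rows = {
        (start >> bucket_bits, start, end, loc_id),  # first INSERT: start bucket
        (end >> bucket_bits, start, end, loc_id),    # second INSERT: end bucket
    }
    # WHILE loop: fill every bucket strictly between the first and the last.
    for b in range((start >> bucket_bits) + 1, end >> bucket_bits):
        rows.add((b, start, end, loc_id))
    return sorted(rows)
```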
It builds much faster than his solution and drills down more precisely, because it buckets on the first 20 bits rather than just the first 8. Join performance: 100000 rows in 158 ms. You might have to rename the table and field names to match your version.
Query by using
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
Can't comment yet, but user1281376's answer is wrong and doesn't work. The reason you only use the first octet is that you aren't going to match all IP ranges otherwise: there are plenty of ranges that span multiple second octets, which user1281376's changed query is not going to match. And yes, this actually happens if you use the MaxMind GeoIP data.
With Aleksi's suggestion you can do a simple comparison on the first octet, thus reducing the matching set.
I found an easy way. I noticed that the first IP of every group satisfies ip % 256 = 0,
so we can add an ip index table:
CREATE TABLE `ip_geo_index` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM
How to fill the index table
FOR EACH row (ipGroupStart, ipGroupEnd) IN ip_geo
{
    FOR block FROM ipGroupStart / 256 TO ipGroupEnd / 256
    {
        INSERT INTO ip_geo_index (_ip, _ipStart) VALUES (block, ipGroupStart);
    }
}
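The fill step above can be sketched in Python (a minimal sketch assuming numeric IP pairs and 256-aligned group starts, as the observation above states; the function name is mine):

```python
def build_ip_geo_index(ip_geo_rows):
    """Fill the index: one row per 256-address block a group covers.

    ip_geo_rows: iterable of (ipGroupStart, ipGroupEnd) numeric pairs,
    where every ipGroupStart % 256 == 0.
    Returns (_ip, _ipStart) pairs as in the ip_geo_index table.
    """
    index = []
    for group_start, group_end in ip_geo_rows:
        for block in range(group_start // 256, group_end // 256 + 1):
            index.append((block, group_start))
    return index
```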
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B._ipStart;
More than 1000 times faster.