Efficient lookup in a range table - mysql

I have a table of 1.6M IP ranges with organization names.
The IP addresses are converted to integers. The table is in the form of:
I have a list of 2000 unique ip addresses (e.g. 321223, 531223, ....) that need to be translated to an organization name.
I loaded the translation table as a mysql table with an index on IP_from and IP_to. I looped through the 2000 IP addresses, running one query per ip address, and after 15 minutes the report was still running.
The query I'm using is
select organization from iptable where ip_addr BETWEEN ip_start AND ip_end
Is there a more efficient way to do this batch look-up? I'll use my fingers if it's a good solution. And in case someone has a Ruby-specific solution, I want to mention that I'm using Ruby.

Given that you already have an index on ip_start, this is how to use it best, assuming that you want to make one access per IP (1234 in this example):
select organization from (
select ip_end, organization
from iptable
where ip_start <= 1234
order by ip_start desc
limit 1
) subqry where 1234 <= ip_end
This will use your index to start a scan which stops immediately because of the limit 1. The cost should only be marginally higher than the one of a simple indexed access. Of course, this technique relies on the fact that the ranges defined by ip_start and ip_end never overlap.
The problem with your original approach is that mysql, being unaware of this constraint, can only use the index to determine where to start or stop the scan that (it thinks) it needs in order to find all matches for your query.
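To try the pattern out without a MySQL server, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. The sample ranges, the index name idx_ip_start and the helper names are made up for illustration, and SQLite's planner is not MySQL's, so this only demonstrates the logic, not the timing.

```python
import sqlite3

# Hypothetical sample data: non-overlapping ranges with gaps between them.
RANGES = [
    (100, 199, "Org A"),
    (300, 399, "Org B"),
    (1000, 1999, "Org C"),
]

# Same shape as the query above: take the closest range start at or below
# the address, then check that the address does not lie past that range's end.
LOOKUP_SQL = """
SELECT organization FROM (
    SELECT ip_end, organization
    FROM iptable
    WHERE ip_start <= :ip
    ORDER BY ip_start DESC
    LIMIT 1
) AS subqry
WHERE :ip <= ip_end
"""


def build_db():
    """Create an in-memory table shaped like the one in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE iptable (ip_start INTEGER, ip_end INTEGER, organization TEXT)"
    )
    conn.execute("CREATE INDEX idx_ip_start ON iptable (ip_start)")
    conn.executemany("INSERT INTO iptable VALUES (?, ?, ?)", RANGES)
    return conn


def lookup(conn, ip):
    """Return the organization owning ip, or None if ip falls in a gap."""
    row = conn.execute(LOOKUP_SQL, {"ip": ip}).fetchone()
    return row[0] if row else None


conn = build_db()
print(lookup(conn, 1234))  # inside the third range
print(lookup(conn, 250))   # between two ranges
```

An address that falls in a gap still costs only one index probe: the inner query finds the nearest preceding range and the outer condition rejects it.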

Possibly the most efficient way of doing a lookup of this kind is loading the list of addresses you want to look up into a temporary table in the database and finding the intersection with an SQL join, rather than checking each address with a separate SQL statement.
In any case you'll need to have an index on (IP_from, IP_to).
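A rough sketch of the temporary-table idea, again using Python with an in-memory SQLite database as a stand-in for MySQL. The temporary table name lookup_ips, the helper batch_lookup and the sample data are assumptions for illustration. A LEFT JOIN is used here so that addresses without a matching range still show up in the result, with no organization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iptable (ip_start INTEGER, ip_end INTEGER, organization TEXT)"
)
conn.execute("CREATE INDEX idx_range ON iptable (ip_start, ip_end)")
conn.executemany(
    "INSERT INTO iptable VALUES (?, ?, ?)",
    [(100, 199, "Org A"), (300, 399, "Org B"), (1000, 1999, "Org C")],
)


def batch_lookup(conn, addresses):
    """Resolve all addresses with one join instead of one query per address."""
    # Load the (unique) addresses into a temporary table.
    conn.execute("CREATE TEMP TABLE lookup_ips (ip_addr INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO lookup_ips VALUES (?)", [(a,) for a in addresses]
    )
    # One statement finds the matching range for every address at once.
    rows = conn.execute(
        """
        SELECT l.ip_addr, t.organization
        FROM lookup_ips AS l
        LEFT JOIN iptable AS t
          ON l.ip_addr BETWEEN t.ip_start AND t.ip_end
        ORDER BY l.ip_addr
        """
    ).fetchall()
    conn.execute("DROP TABLE lookup_ips")
    return dict(rows)


result = batch_lookup(conn, [150, 250, 1234])
print(result)
```

Whether the join beats 2000 single-row lookups on a real server depends on how the optimizer executes the range join, so it is worth checking with EXPLAIN on the actual data.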

Related

mysql table performance upgrade (indexes

I am trying to find a way to improve performance for my MySQL table containing IP ranges (it's going to get up to 500 SELECT queries per second (!) in peak hours, so I am a little worried).
I have a table of this structure:
id smallint(5) Auto Increment
ip_start char(16)
ip_end char(16)
The collation is utf8_general_ci (on the whole table and every column except id), and the table type is MyISAM (only SELECT queries, no insert/delete needed here). The only index on this table is the PRIMARY key on id.
At the moment the table has almost 2000 rows. All of them contain IP ranges.
For example:
ip_start 128.6.230.0
ip_end 128.6.238.255
When a user comes to the website, I check whether their IP is in any of the ranges in my table. I use this query (dibi SQL library):
SELECT COUNT(*)
FROM ip_ranges
WHERE %s", $user_ip, " BETWEEN ip_start AND ip_end
If the result of the query is not zero, then the user's IP is in one of the ranges in the table, which is all I need it to do.
I was thinking about adding some indexes to that table, but I am not quite sure how that works or whether it's a good idea (since there is maybe nothing to really index, right? Most of those IP ranges are different).
I also had the varchar type on the ip_start and ip_end columns, but I switched to char (I guess it's faster?).
Does anyone have ideas about how to improve this table and its queries even further?
You don't want to use aggregation. Instead, check whether the following returns any rows:
SELECT 1
FROM ip_ranges
WHERE %s", $user_ip, " BETWEEN ip_start AND ip_end
LIMIT 1;
The LIMIT 1 says to stop at the first match, so it is faster.
For this query, you want an index on ip_ranges(ip_start, ip_end).
This still has a performance problem when there is no match: every index entry whose ip_start is at or below the IP being tested has to be scanned. I think the following should be an improvement:
SELECT COUNT(*)
FROM (SELECT ip_start, ip_end
FROM ip_ranges i
WHERE %s", $user_ip, " >= ip_start
ORDER BY ip_start DESC
LIMIT 1
) i
WHERE %s", $user_ip, " <= ip_end;
The inner subquery should use the index and pull back only the range with the closest ip_start at or below the IP (hence the descending order). The outer query should then check the end of that range. Here the count(*) is okay, because there is at most one row.
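The intended two-step logic (take the closest range start at or below the address, then check that range's end) can be illustrated outside SQL with a binary search. This is only a sketch of the idea, not the database's implementation. It assumes the ranges are stored as integers, sorted by start and non-overlapping, and the sample values and the name in_any_range are invented.

```python
from bisect import bisect_right

# Hypothetical ranges as (start, end) integer pairs, sorted by start,
# with no overlaps.
RANGES = [(10, 19), (30, 39), (100, 199)]
STARTS = [start for start, _ in RANGES]


def in_any_range(ip):
    """Return True if ip lies inside one of RANGES."""
    # Step 1 (the inner subquery): locate the last range whose start <= ip.
    pos = bisect_right(STARTS, ip)
    if pos == 0:
        return False  # ip is below every range start
    _, end = RANGES[pos - 1]
    # Step 2 (the outer query): check the end of that single candidate.
    return ip <= end


print(in_any_range(35))  # inside (30, 39)
print(in_any_range(25))  # in the gap between two ranges
```

Only one candidate range is ever inspected, which is why the no-match case is as cheap as the match case.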

optimize table and query

I have a DB in MySQL Server with information about IP ranges and locations. It has the following structure:
id (integer not null, auto inc)
from_ (bigint(20))
to_ (bigint(20))
region(integer)
The field region is a foreign key of a table cities (id, city_name).
As we know, to find which country an IP address belongs to, we have to execute something like the following query:
select region from ipcountry where ip >= from_ and ip <= to_
Due to the number of records, the query is too slow for what I need.
Any ideas on how to optimize this?
Do you have an index on (from_, to_)? That is the place to start.
Then, the next idea is to have the index and change the query to:
select region
from ipcountry
where ip >= from_
order by from_ desc
limit 1;
If that doesn't give the performance boost, then you are going to have to think about how to optimize the data structure. The extreme approach here is to list out all ip addresses with their region. But, the billions of resulting rows may actually hinder performance.
If you go down this path, you need to be smarter. One idea is to have separate tables for Type A, Type B, and Type C addresses which have constant regions. Then a separate table for ranges of Type D addresses.
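One caveat about the order by from_ desc limit 1 query above: it returns the nearest range that starts at or below the address without testing to_, so if the table has gaps between ranges, the caller still needs to compare against to_. A small sketch of that check, using Python with in-memory SQLite as a stand-in for MySQL; the sample rows, the index name idx_from_to and the helper region_for are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipcountry ("
    "id INTEGER PRIMARY KEY, from_ INTEGER, to_ INTEGER, region INTEGER)"
)
conn.execute("CREATE INDEX idx_from_to ON ipcountry (from_, to_)")
# Hypothetical data with a gap between 199 and 300.
conn.executemany(
    "INSERT INTO ipcountry (from_, to_, region) VALUES (?, ?, ?)",
    [(100, 199, 1), (300, 399, 2)],
)


def region_for(conn, ip):
    """Return the region for ip, or None when ip is not covered by any range."""
    row = conn.execute(
        "SELECT to_, region FROM ipcountry "
        "WHERE from_ <= ? ORDER BY from_ DESC LIMIT 1",
        (ip,),
    ).fetchone()
    # The query alone would happily return the (100, 199) row for ip = 250,
    # so the end of the range is verified here.
    if row is None or ip > row[0]:
        return None
    return row[1]


print(region_for(conn, 150))
print(region_for(conn, 250))
```

If the ranges are known to cover the whole address space without gaps, the extra comparison is unnecessary.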

MySql Explain ignoring the unique index in a particular query

I started looking into Index(es) in depth for the first time and started analyzing our db beginning from the users table for the first time. I searched SO to find a similar question but was not able to frame my search well, I guess.
I was going through a particular concept and this first observation left me wondering - The difference in these Explain(s) [Difference : First query is using 'a%' while the second query is using 'ab%']
[Total number of rows in users table = 9193]:
1) explain select * from users where email_address like 'a%';
(Actually matching rows = 1240)
2) explain select * from users where email_address like 'ab%';
(Actually matching rows = 109)
The index looks like this :
My question:
Why is the index totally ignored in the first query? Does mySql think that it is a better idea not to use the index in the case 1? If yes, why?
If the estimated share of matching rows, based on the statistics MySQL collects on the distribution of the values, is above a certain ratio of the total rows (typically 1/11 of the total), MySQL deems it more efficient to simply scan the whole table, reading the disk pages sequentially, rather than use the index and jump around the disk pages in random order.
You could try your luck with this query, which may use the index:
where email_address between 'a' and 'az'
Although doing the full scan may actually be faster.
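Plugging the numbers from the question into the ratio quoted above gives a quick sanity check. The 1/11 threshold is taken from this answer as-is and is an assumption here; the real optimizer works from cost estimates rather than a single fixed cutoff, and the function name prefers_full_scan is invented.

```python
TOTAL_ROWS = 9193
THRESHOLD = 1 / 11  # ratio quoted in the answer above; treated as an assumption


def prefers_full_scan(matching_rows, total_rows=TOTAL_ROWS, threshold=THRESHOLD):
    """Return True if the matching share exceeds the quoted threshold."""
    return matching_rows / total_rows > threshold


# Row counts reported in the question for the two LIKE patterns.
for pattern, matches in (("a%", 1240), ("ab%", 109)):
    share = matches / TOTAL_ROWS
    print(f"{pattern}: {share:.1%} of rows, "
          f"full scan expected: {prefers_full_scan(matches)}")
```

The 'a%' pattern matches roughly 13.5% of the rows, above 1/11 (about 9.1%), while 'ab%' matches roughly 1.2%, which is consistent with the index being skipped in the first case only.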
This is not a direct answer to your question, but I still want to point it out (in case you don't already know):
Try:
explain select email_address from users where email_address like 'a%';
explain select email_address from users where email_address like 'ab%';
MySQL would now use indexes in both the queries above since the columns of interest are directly available from the index.
Probably in the case where you do a "select *", index access is more costly since the optimizer has to go through the index records, find the row ids and then go back to the table to retrieve the other column values.
But in the query above, where you only do a "select email_address", the optimizer knows all the information desired is available right from the index, and hence it would use the index irrespective of the 30% rule.
Experts, please correct me if I am wrong.

Is my mySQL query as efficient as it could be?

I have a MySQL query which takes a long time to process. I am querying a large table of IP ranges which relate to country codes to discover the country of origin for each IP in the url_click table. (IP database from hxxp://ip-to-country.webhosting.info/)
It works brilliantly, albeit slowly.
Is there a more efficient way to write this query?
Table and output JPG: http://tiny.cx/a4e00d
SELECT ip_addr AS IP, geo_ip.ctry, count(ip_addr) as count
FROM `admin_adfly`.`url_click`,admin_adfly.geo_ip
WHERE INET_ATON (ip_addr)
BETWEEN geo_ip.ipfrom AND geo_ip.ipto
AND url_id = 165
GROUP BY ip_addr;
The use of a function in the join between the two tables is going to be slower than a normal join, so you probably want to defer that particular operation as long as possible. So, I'd summarize the data and then join it:
SELECT S.IP_Addr, G.Ctry AS Country, S.Count
FROM (SELECT ip_addr, COUNT(ip_addr) AS Count
FROM admin_adfly.url_click
WHERE url_id = 165
GROUP BY ip_addr) AS S
JOIN admin_adfly.geo_ip AS G
ON INET_ATON(S.ip_addr) BETWEEN G.ipfrom AND G.ipto;
If you can redesign the schema and are going to be doing a lot of this analysis, rework one of the two tables so that the join condition doesn't need to use INET_ATON().
Presumably, you have an index on the url_id column; that is the only one that will give you much benefit here.
IP addresses have a tree like structure and the ranges you have in your geo_ip table most probably respect that structure.
If your IP begins with 193.167, then you should have an index helping you filter the geo_ip table very quickly so that only the lines related to a subrange of 193.167 are manipulated.
I think that you should be able to dramatically improve the response time with this approach.
I hope this will help you
That INET_ATON worries me just a bit. It'd make any index on the ip_addr column useless. If you have a way of putting the info all in the same format, say by converting the data to a number before putting it in the DB, that might help.
Other than that, the standard advice about judicious use of indexes applies. You might want indexes on ipfrom and ipto, and/or url_id columns.
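As an illustration of converting the data to a number before it goes into the DB, here is a small Python sketch using the standard ipaddress module. It mirrors what INET_ATON() and INET_NTOA() do for IPv4 addresses; the helper names ip_to_int and int_to_ip are made up.

```python
import ipaddress


def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(number):
    """Convert a 32-bit integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(number))


print(ip_to_int("127.0.0.1"))
print(int_to_ip(2130706433))
```

Storing the integer in an indexed numeric column lets the join compare plain numbers, so no function call is needed in the join condition.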
MySQL does not optimize queries like this well.
You would need to convert your ipfrom-ipto ranges into LineStrings, thus allowing building an R-Tree index over them:
ALTER TABLE
geo_ip
ADD range LINESTRING;
UPDATE geo_ip
SET range = LINESTRING(POINT(-1, ipfrom), POINT(1, ipto));
ALTER TABLE
geo_ip
MODIFY range LINESTRING NOT NULL;
CREATE SPATIAL INDEX
sx_geoip_range
ON geo_ip (range);
SELECT ip_addr AS IP, geo_ip.ctry, COUNT(*)
FROM `admin_adfly`.`url_click`
JOIN admin_adfly.geo_ip
ON MBRContains
(
Point(0, INET_ATON (ip_addr)),
range
)
WHERE url_id = 165
GROUP BY
ip_addr
geo_ip should be a MyISAM table.
See here for more details:
Banning IPs

The most efficient way to query ip ranges in mysql

I have a geoencoding database with ranges of integers (ip addresses equivalent) in each row
fromip(long) toip (long). the integers are created from ip addresses by php ip2long
I need to find the row in which a given ip address (converted to long) is within the range.
What would be the most efficient way to do it? (keys and query)
If I do (the naive solution) select * from ipranges where fromip <= givenip and toip >= givenip limit 1, and the key is (fromip, toip), then for the case where the IP address is not in any of the ranges the search goes through all the rows.
SOME MORE INFO:
explain select * from ipranges where
ipfrom <= 2130706433 and ipto >= 2130706433
order by ipfrom asc
limit 1;
gives me 2.5M rows (total 3.6M in the table).
The key is:
PRIMARY KEY (ipfrom,ipto)
That does not seem to be efficient at all. (The IP above is in none of the ranges.)
Your query is fine, put an index on (fromip, toip) which will be a covering index for the query. The table won't have to be examined at all, only the sorted index gets searched, which is as fast as you can be.
The search will not actually go through all the rows. Not only will it go through none of the rows, only the index, but it won't examine every entry in the index either. The index is stored as a sorted tree, and only one path through that tree will have to be followed to determine that your IP is not in the table.