Thee is one table Mysql Table On which simple sql where date = 'Some date' is not working
Have checked logs.
Reload the tables several times & tried.
This is proof that record exists :-
select * from TRN_RP_CONSUMPTION_DAILY limit 1;
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
| TRCD_DATE | TRCD_SPREAD_START_DATE | TRCD_SPREAD_END_DATE | TRCD_SOURCE_CHENNEL_ID | TRCD_CIRCLE_ID | TRCD_CATALOUGE_ID | TRCD_CONTENT_ID | TRCD_SONG_CODE_ID | TRCD_YT_CP_POLICY_ID | TRCD_YT_CHANNEL_ID | TRCD_PRODUCT_NAME_ID | TRCD_TRXN_TYPE_ID | TRCD_TRXN_COUNT | TRCD_TOP_LINE | TRCD_REVENUE |
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
| 2018-01-01 | 2018-01-01 | 2018-01-04 | 5 | 1 | 1 | 945723 | 1 | 1 | 1 | 211 | 180 | 1.75 | 0 | 0 |
+------------+------------------------+----------------------+------------------------+----------------+-------------------+-----------------+-------------------+----------------------+--------------------+----------------------+-------------------+-----------------+---------------+--------------+
This is proof that index on date exists :-
select * from TRN_RP_CONSUMPTION_DAILY limit 1;
CREATE TABLE `TRN_RP_CONSUMPTION_DAILY` (
`TRCD_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SPREAD_START_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SPREAD_END_DATE` date NOT NULL DEFAULT '0000-00-00',
`TRCD_SOURCE_CHENNEL_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_CIRCLE_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_CATALOUGE_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_CONTENT_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_SONG_CODE_ID` int(10) unsigned NOT NULL DEFAULT '1',
`TRCD_YT_CP_POLICY_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_YT_CHANNEL_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_PRODUCT_NAME_ID` smallint(5) unsigned NOT NULL DEFAULT '1',
`TRCD_TRXN_TYPE_ID` tinyint(3) unsigned NOT NULL DEFAULT '1',
`TRCD_TRXN_COUNT` double NOT NULL DEFAULT '0',
`TRCD_TOP_LINE` double NOT NULL DEFAULT '0',
`TRCD_REVENUE` double NOT NULL DEFAULT '0',
KEY `IDX_TRCD_DATE` (`TRCD_DATE`),
KEY `IDX_TRCD_SOURCE_CHENNEL_ID` (`TRCD_SOURCE_CHENNEL_ID`),
KEY `IDX_TRCD_CATALOUGE_ID` (`TRCD_CATALOUGE_ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the issue, it should give some count but its not coming (See 1st query result above data is present):-
select count(*) from TRN_RP_CONSUMPTION_DAILY where TRCD_DATE='2018-01-01';
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Proof that it is using index :-
explain select count(*) from TRN_RP_CONSUMPTION_DAILY where TRCD_DATE='2018-01-01';
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
| 1 | SIMPLE | TRN_RP_CONSUMPTION_DAILY | ref | IDX_TRCD_DATE | IDX_TRCD_DATE | 3 | const | 1 | Using index |
+----+-------------+--------------------------+------+---------------+---------------+---------+-------+------+-------------+
Force index also not working :-
select count(*) from TRN_RP_CONSUMPTION_DAILY FORCE INDEX(IDX_TRCD_DATE) where TRCD_DATE='2018-01-01';
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Yes, its huge table :-
select count(*) from TRN_RP_CONSUMPTION_DAILY;
+------------+
| count(*) |
+------------+
| 2006275044 |
+------------+
Table & Index Size :-
103G = TRN_RP_CONSUMPTION_DAILY.MYD
52G = TRN_RP_CONSUMPTION_DAILY.MYI
Surpriseingly this is working, but I can not use alway like this :-
select count(*) from TRN_RP_CONSUMPTION_DAILY where date(TRCD_DATE)='2018-01-01';
+----------+
| count(*) |
+----------+
| 1235523 |
+----------+
I know This is working because index will not come in account when using function on that indexed column.
Which Percona Server :-
Server version: 5.5.60-38.12-log Percona Server (GPL), Release 38.12,
Revision 26ef816
Nothing comes in Error or warning mysql log when query is giving zero count.
Where clouse on other column on which there is index that is working properly.
Can someone help why its not working on that date column?
I want to add some more index on this table but this is not working so I am stopping here.
Move from MyISAM to InnoDB.
Meanwhile, one of these should work. (Go down the list until you get a usable table. Most options are slow because they involve copying the table.)
CHECK TABLE TRN_RP_CONSUMPTION_DAILY; reports an error, do REPAIR TABLE TRN_RP_CONSUMPTION_DAILY;.
OPTIMIZE TABLE TRN_RP_CONSUMPTION_DAILY;
REPAIR TABLE TRN_RP_CONSUMPTION_DAILY USE_FRM;
DROP INDEX ... (for each index), then ADD INDEX ...
copy table over:
CREATE TABLE new LIKE real;
INSERT INTO new SELECT * FROM TRN_RP_CONSUMPTION_DAILY;
RENAME TABLE TRN_RP_CONSUMPTION_DAILY TO old,
new TO TRN_RP_CONSUMPTION_DAILY;
DROP TABLE old;
Restore from backup?
Those are things that are sometimes needed for MyISAM tables; InnoDB is more robust.
Have you tried STR_TO_DATE MySQL function in your where clause?
In your case: TRCD_DATE = STR_TO_DATE('2018-01-01','%Y,%m,%d') (or inverse %m and %d)
For more information: https://www.w3schools.com/sql/func_mysql_str_to_date.asp
I hope this answers to your problem/question.
You have the table as a date, but is it a date or date/time. May be failing on EXACT?
How about changing your where clause to
where TRCD_DATE >='2018-01-01' AND TRCD_DATE < '2018-01-02'
So you are getting anything from 12:00am (morning) all the way up to 11:59:59pm before the next day. Don't try to convert the data column, that will prevent use of an index.
Related
I'm faced with a MySQL database which contains an events table with ~70 million rows which has foreign keys to other tables and is used to generate reports. Constructing a performant query to select (while counting/summing values) and grouping data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The query which is fine on an events table with a few thousand rows to demonstrate the desired outcome is as follows:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed and although I've tried indexing with things like alter table events add index idx_test (venue_id, class_id, start_time); to improve performance it's still incredibly slow (I tend to abort them when they're past the 10 minute mark so don't know for sure how long they'd take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide) so that I could hold a separate set of summarized data broken down into day with occurrences and total duration calculated/incremented with each addition to the table (IODKU). However I'm then also up against creating rows per day in a summary table based on what is considered a day in the database (UTC) which may not match with the application's "day" due to timezone offset.
Short of converting the start_time column to a timestamp type (which is then inconsistent with all other date types in the database) is there any way round this or is there any other optimization I could be making to the original events table resulting in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various ordering of columns to test indexes and found idx_event_venueid_classid_starttime (below), to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query, using JOIN syntax and now uses CONVERT_TZ to convert from UTC to the timezone required for reporting and then group that by the date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query takes ~1m 20s to run now so I figured I could prepend that with an insert into to populate a summary table with the dates being timezone specific and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
No further indexes are likely to help. It need to scan all the events table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only of the older data is not being modified. Please provide the SHOW CREATE TABLE and query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
I have a fairly simple table, containing 500 million+ rows. Very simple queries are taking over 2-3.5 minutes. I have an index on the field in the WHERE statement.
I am wondering what I can do to optimize this table and/or query?
THE QUERY & RESULTS
mysql> SELECT COUNT(emails_id) AS count FROM person_deliveries WHERE DATE(date) = '2021-08-23' ;
+--------+
| count |
+--------+
| 539438 |
+--------+
1 row in set (2 min 20.05 sec)
EXPLAIN QUERY
mysql> EXPLAIN SELECT COUNT(emails_id) AS count FROM person_deliveries WHERE DATE(date) = '2021-08-23' ;
+----+-------------+-------------------+------------+-------+---------------+----------------------+---------+------+-----------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------------+-------+---------------+----------------------+---------+------+-----------+----------+--------------------------+
| 1 | SIMPLE | person_deliveries | NULL | index | NULL | campaigns_id | 4 | NULL | 454956815 | 100.00 | Using where; Using index |
+----+-------------+-------------------+------------+-------+---------------+----------------------+---------+------+-----------+----------+--------------------------+
SHOW CREATE TABLE
-------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| person_deliveries | CREATE TABLE `person_deliveries` (
`emails_id` int unsigned NOT NULL,
`campaigns_id` int NOT NULL,
`date` datetime NOT NULL,
`vmta` varchar(255) DEFAULT NULL,
`ip_address` varchar(15) DEFAULT NULL,
`domain` varchar(255) DEFAULT NULL,
UNIQUE KEY `person_campaign_date` (`emails_id`,`campaigns_id`,`date`),
KEY `ip_address` (`ip_address`),
KEY `domain` (`domain`),
KEY `campaigns_id` (`campaigns_id`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Thank you in advance!
As Silvanu & DRapp commented, my use of the date() function was slowing the query down.
mysql> SELECT COUNT(emails_id) AS count FROM person_deliveries WHERE date >= '2021-08-23 00:00:00' AND date <= '2021-08-23 23:59:59' ;
+--------+
| count |
+--------+
| 539438 |
+--------+
1 row in set (***0.47 sec***)
I have two databases one for dev and one for staging, and they're both on the same machine too. I'm having a problem with a query for two tables. here are the schema for the tables
Table 1 schema:
Table: import_schedule_t
Create Table: CREATE TABLE `import_schedule_t` (
`id` int(11) NOT NULL,
`theater_id` int(11) NOT NULL,
`movie_code` varchar(20) NOT NULL,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`pc_url` varchar(250) NOT NULL,
`mb_url` varchar(250) NOT NULL,
`url_type` int(11) DEFAULT '0',
`active` int(11) DEFAULT '1',
`intime` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`utime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`schedule_date` datetime NOT NULL,
`movie_name` text NOT NULL,
`screen_name` text NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
and Table 2 schema:
Table: wp_postmeta
Create Table: CREATE TABLE `wp_postmeta` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) DEFAULT NULL,
`meta_value` longtext,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=MyISAM AUTO_INCREMENT=1399270 DEFAULT CHARSET=utf8
both of the tables are present in both of the databases i've mentioned. When i try to run this query:
SELECT DISTINCT movie_code,post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
ON wp_postmeta.meta_value = import_schedule_t.movie_code
AND wp_postmeta.meta_key='update_movie_id'
WHERE DATE_FORMAT(start_time, '%Y-%m-%d')>= DATE_FORMAT(NOW(),'%Y-%m-%d')
dev database would finish the query in 20 seconds but the staging database would only run it for 1.4 seconds.
here's a sample data:
wp_postmeta table
+---------+---------+-----------------+------------+
| meta_id | post_id | meta_key | meta_value |
+---------+---------+-----------------+------------+
| 45150 | 74572 | update_movie_id | 74572 |
+---------+---------+-----------------+------------+
import_schedule_t table (omitted some of the fields)
+--------+------------+---------------------+---------------------+
| id | movie_code | start_time | end_time |
+--------+------------+---------------------+---------------------+
| 120884 | 74572 | 2015-07-04 12:50:00 | 2015-07-04 15:05:00 |
+--------+------------+---------------------+---------------------+
i already tried looking at the indexes and optimizing the tables but with no success, the query time on the dev database is still 20 seconds.
EXPLAIN EXTENDED on dev
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 23597 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461731 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
EXPLAIN EXTENDED on staging
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 9311 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461384 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
If both DBs are running on the same machine, with the same MySQL version, in the same harddrive, with the very same structure and data then it might be a fragmentation issue on the OS level. Take the servers down and defrag your disk.
On a side note: don't compare dates as strings, since dates are numbers internally in the DB, and they are compared much more efficiently (WHERE start_time >= curdate() ).
Also you can save some storage space if you define smaller ints for some fields (like the 'active' field). An int is a 4 byte number while a tinyint is 1 byte.
Sorry, can't comment cos I don't have enough reputation, BUT, I would expect the dev system has a lot more data in its tables.
On another point you should not use DATE_FORMAT - I would guess that is turning your dates into strings which are really inefficient to compare. Dates are just integers (internal to MySQL) so they can be compared in one cycle.. the string comparison could easily be 1000 (or more) cycles. You should probably index the start_time field as well to save it having to scan the table.
Anytime you have a query taking 20 seconds you should be suspicious you are doing something wrong! MySQL can do A LOT in 20 seconds.
I am having a issue finding a fast way of joining the tables looking like that:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in finding the needed rows in geo_ip by only ONE IP address at once, I need a entity_ip LEFT JOIN geo_ip (or similar/analogue way).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(On higher number of rows in the search - there is no difference)
Currently I cannot get any faster performance from these queries as 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than "greater than", so we'll need to build an index like that from the data you have available. This way MySQL will have much fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (=1 from 1.2.3.4) of the IP addresses. The idea is that for each look-up you have to do, you can ignore all geolocation IPs which do not begin with the same octet than the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
Since someone here requested for a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DELETE FROM ip_geolocation_lookup;
WHILE i < 256 DO
INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
( ip_numeric_end & 0xFF000000 ) >> 24 >= i;
SET i = i + 1;
END WHILE;
END;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table will satisfy two conditions:
Find all rows which match the first octet of your IP address
Of that subset: Find the row which has the the range that matches your IP address
Because the step two is carried out on a subset of data, it is considerably faster than doing the range tests on the entire data. This is the key to this optimization strategy.
There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
Now, admittedly I did get a little lazy in the end: You could easily get rid of ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
WHILE i < 1048576 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
It builds way faster than his solution and drills down more easily because we're not just taking the first 8, but the first 20 bits. Join performance: 100000 rows in 158ms. You might have to rename the table and field names to your version.
Query by using
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
Can't comment yet, but user1281376's answers is wrong and doesn't work. the reason you only use the first octet is because you aren't going to match all ip ranges otherwise. there's plenty of ranges that span multiple second octets which user1281376s changed query isn't going to match. And yes, this actually happens if you use the Maxmind GeoIp data.
with aleksis suggestion you can do a simple comparison on the fîrst octet, thus reducing the matching set.
I found a easy way. I noticed that all first ip in the group % 256 = 0,
so we can add a ip_index table
CREATE TABLE `t_map_geo_range` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM
How to fill the index table
FOR_EACH(Every row of ip_geo)
{
FOR(Every ip FROM ipGroupStart/256 to ipGroupEnd/256)
{
INSERT INTO ip_geo_index(ip, ipGroupStart);
}
}
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B.ipStart;
More than 1000 times Faster.
Given this table on local MySQL instance 5.1 with query caching off:
show create table product_views\G
*************************** 1. row ***************************
Table: product_views
Create Table: CREATE TABLE `product_views` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`dateCreated` datetime NOT NULL,
`dateModified` datetime DEFAULT NULL,
`hibernateVersion` bigint(20) DEFAULT NULL,
`brandName` varchar(255) DEFAULT NULL,
`mfrModel` varchar(255) DEFAULT NULL,
`origin` varchar(255) NOT NULL,
`price` float DEFAULT NULL,
`productType` varchar(255) DEFAULT NULL,
`rebateDetailsViewed` tinyint(1) NOT NULL,
`rebateSearchZipCode` int(11) DEFAULT NULL,
`rebatesFoundAmount` float DEFAULT NULL,
`rebatesFoundCount` int(11) DEFAULT NULL,
`siteSKU` varchar(255) DEFAULT NULL,
`timestamp` datetime NOT NULL,
`uiContext` varchar(255) DEFAULT NULL,
`siteVisitId` bigint(20) NOT NULL,
`efficiencyLevel` varchar(255) DEFAULT NULL,
`siteName` varchar(255) DEFAULT NULL,
`clicks` varchar(1024) DEFAULT NULL,
`rebateFormDownloaded` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `siteVisitId` (`siteVisitId`,`siteSKU`),
KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
KEY `FIND_UNPROCESSED_IDX` (`siteSKU`,`siteVisitId`,`timestamp`),
CONSTRAINT `FK52C29B1E3CAB9CC4` FOREIGN KEY (`siteVisitId`) REFERENCES `site_visits` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=32909504 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
This query takes ~3s:
SELECT pv.id, pv.siteSKU
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
But this one (non-indexed column added to SELECT) takes ~65s:
SELECT pv.id, pv.siteSKU, pv.hibernateVersion
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
Nothing in 'where' or 'from' clauses is different. All the extra time is spent in 'sending data':
mysql> show profile for query 1;
+--------------------+-----------+
| Status | Duration |
+--------------------+-----------+
| starting | 0.000155 |
| Opening tables | 0.000029 |
| System lock | 0.000007 |
| Table lock | 0.000019 |
| init | 0.000072 |
| optimizing | 0.000032 |
| statistics | 0.000316 |
| preparing | 0.000034 |
| executing | 0.000002 |
| Sending data | 63.530402 |
| end | 0.000044 |
| query end | 0.000005 |
| freeing items | 0.000091 |
| logging slow query | 0.000002 |
| logging slow query | 0.000109 |
| cleaning up | 0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)
I understand that using a non-indexed column in where clause would slow things down, but why here? What can be done to improve the latter case - given that I will actually want to SELECT(*) from product_views?
EXPLAIN Output
explain extended select pv.id, pv.siteSKU from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.sit eId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filt ered | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ | 1 | SIMPLE | pv | ref | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258 | const | 41810 | 10
0.00 | Using where; Using index | | 1 | SIMPLE | sv | eq_ref | PRIMARY,post_processed_idx | PRIMARY | 8 | clabs.pv.siteVisitId | 1 | 10
0.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ 2 rows in set, 1 warning (0.00 sec)
mysql> explain extended select pv.id, pv.siteSKU, pv.hibernateVersion from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU= 'foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filt ered | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ | 1 | SIMPLE | pv | ref | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258 | const | 41810 | 10
0.00 | Using where | | 1 | SIMPLE | sv | eq_ref | PRIMARY,post_processed_idx | PRIMARY | 8 | clabs.pv.siteVisitId | 1 | 10
0.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ 2 rows in set, 1 warning (0.00 sec)
UPDATE1: Splitting into 2 queries brings total time down to ~30s range
Not sure why, but splitting the latter query into the following reduces lat. from 65s to ~30s:
1) SELECT pv.id .... //from, where clauses same as above
2) SELECT * FROM product_views where id in (idList); //idList
UPDATE2: TABLE SIZE
table has on the order of 10M rows
query returns about 3k rows
When you select only indexed columns, MySQL does read only the index, and does not need to read the table data. This, as far as I remember, is called index-covered query. However, when there are columns, that are not present in the used index, MySQL needs to open the table and read the data from it. This is the reason index-covered queries to be much faster.
See Using Covering Indexes to Improve Query Performance.
As for the improvement, how many rows are in the table, how much the query returns and what is your buffer pool size, how much RAM is available, etc.?
From what I have read about show profile, 'sending data' is a portion of execution process, and has almost nothing to do with sending actual data to the client. You can take a look on this thread
Also, mysql docs says about "Sending data" :
The thread is reading and processing rows for a SELECT statement, and sending data to the client. Because operations occurring during this state tend to perform large amounts of disk access (reads), it is often the longest-running state over the lifetime of a given query.
In my opinion, mysql would better not mix together "reading and processing rows for a SELECT statement" and "sending data" in one state, especially in state called "sending" data" which causes lots of confusion.
I'm don't know MySQL internals at all, but Darhazer's explanation looks like the winner to me. When the non-indexed field is added, the entire row must be retrieved. And your rows are very wide. I can't quite tell from the names how (if at all) it is denormalized, but I suspect it is. site name and site sku smell like they belong in a site lookup table with an FK. rebates found amount and rebates found count sound like statistics that should be coming from a join to a separate product rebate table. etc.