Can this SQL query be optimized?

This is a query for a Postfix table lookup (smtpd_sender_login_maps) in MariaDB (MySQL). Given an email address, it returns the users allowed to use that address. I am using two SQL tables to store the accounts and aliases that need to be searched. Postfix requires a single query returning a single result set, hence the UNION SELECT. I know there is unionmap:{} in Postfix, but I do not want to go that route and prefer the union select. The emails.email column is the username that is returned for Postfix SASL authentication. The %s in the query is where Postfix inserts the email address to search for. The reason for matching everything back to emails.postfixPath is that it identifies the physical inbox: if two accounts share the same inbox, they should both be allowed to use all the same addresses, including aliases.
Table: emails
+-------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| email | varchar(100) | NO | PRI | NULL | |
| postfixPath | varchar(100) | NO | MUL | NULL | |
| password | varchar(50) | YES | | NULL | |
| acceptMail | tinyint(1) | NO | | 1 | |
| allowLogin | tinyint(1) | NO | | 1 | |
| mgrLogin | tinyint(1) | NO | | 0 | |
+-------------+--------------+------+-----+---------+-------+
Table: aliases
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| email | varchar(100) | NO | PRI | NULL | |
| forwardTo | varchar(100) | NO | | NULL | |
| acceptMail | tinyint(1) | NO | | 1 | |
+------------+--------------+------+-----+---------+-------+
SELECT email
FROM emails
WHERE postfixPath = (
        SELECT postfixPath
        FROM emails
        WHERE email = '%s'
          AND acceptMail = 1
        LIMIT 1)
  AND password IS NOT NULL
  AND allowLogin = 1
UNION
SELECT email
FROM emails
WHERE postfixPath = (
        SELECT postfixPath
        FROM emails
        WHERE email = (
                SELECT forwardTo
                FROM aliases
                WHERE email = '%s'
                  AND acceptMail = 1)
        LIMIT 1)
  AND password IS NOT NULL
  AND allowLogin = 1
  AND acceptMail = 1
This query works; it just looks heavy to me, and I feel like it should be more streamlined / efficient. Does anyone have a better way to write this, or is this as good as it gets?
I added CREATE INDEX index_postfixPath ON emails (postfixPath) per @The Impaler's suggestion.
@Rick James, here is the additional table info:
Table: emails
Create Table: CREATE TABLE `emails` (
`email` varchar(100) NOT NULL,
`postfixPath` varchar(100) NOT NULL,
`password` varchar(50) DEFAULT NULL,
`acceptMail` tinyint(1) NOT NULL DEFAULT 1,
`allowLogin` tinyint(1) NOT NULL DEFAULT 1,
`mgrLogin` tinyint(1) NOT NULL DEFAULT 0,
PRIMARY KEY (`email`),
KEY `index_postfixPath` (`postfixPath`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Table: aliases
Create Table: CREATE TABLE `aliases` (
`email` varchar(100) NOT NULL,
`forwardTo` varchar(100) NOT NULL,
`acceptMail` tinyint(1) NOT NULL DEFAULT 1,
PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Part 1:
SELECT email
FROM emails
WHERE postfixPath=
(
SELECT postfixPath
FROM emails
WHERE email='%s'
AND acceptMail = 1
LIMIT 1
)
AND password IS NOT NULL
AND allowLogin = 1
With indexes:
emails: (email, acceptMail, password)
I assume acceptMail has only 2 values? The Optimizer cannot know that, so it sees AND acceptMail as a range test. AND acceptMail = 1 fixes that. (No, > 0, != 0, etc, can't be optimized.)
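For reference, a minimal sketch of adding that composite index (the index name is illustrative):
CREATE INDEX ix_email_accept_pw ON emails (email, acceptMail, password);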
Part 2:
This has 3 layers, and is probably where the inefficiency is.
SELECT e.email
FROM ( SELECT forwardTo ... ) AS c
JOIN ( SELECT postfixPath ... ) AS d ON ...
JOIN emails AS e ON e.postfixPath = d.postfixPath
This is how the Optimizer might optimize your version. But I am not sure it did, so I changed it to encourage it to do so.
Again, use =1 when testing for "true". Then have these indexes:
aliases: (email, acceptMail, forwardTo)
emails: (email, postfixPath)
emails: (postfixPath, allowLogin, acceptMail, password, email)
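For reference, those three indexes and a fully written-out Part 2 could look like this. This is a sketch based on the original query, not necessarily the exact rewrite intended above; the index names are illustrative, and since email is the PRIMARY KEY of both tables the nested subqueries can be expressed as plain joins:
CREATE INDEX ix_alias_fwd ON aliases (email, acceptMail, forwardTo);
CREATE INDEX ix_email_path ON emails (email, postfixPath);
CREATE INDEX ix_path_login ON emails (postfixPath, allowLogin, acceptMail, password, email);

SELECT e.email
FROM aliases AS c
JOIN emails AS d ON d.email = c.forwardTo
JOIN emails AS e ON e.postfixPath = d.postfixPath
WHERE c.email = '%s'
  AND c.acceptMail = 1
  AND e.password IS NOT NULL
  AND e.allowLogin = 1
  AND e.acceptMail = 1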
Finally, the UNION:
( SELECT ... part 1 ... )
UNION ALL
( SELECT ... part 2 ... )
I added parentheses to avoid ambiguities about what clauses belong to the Selects versus to the Union.
UNION ALL is faster than UNION (which is UNION DISTINCT), but you might get the same email twice. However, that may be nonsense -- forwarding an email to yourself??
The order of columns in each index is important. (However, some variants are equivalent.)
I think all the indexes I provided are "covering", thereby giving an extra performance boost.
Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE. "MUL" is especially ambiguous.
(Caveat: I threw this code together rather hastily; it may not be quite correct, but principles should help.)
For further optimization, please do as I did and split it into 3 steps. Check the performance of each.

The following three indexes will make the query faster:
create index ix1 on emails (allowLogin, postfixPath, acceptMail, password, email);
create index ix2 on emails (email, acceptMail);
create index ix3 on aliases (email, acceptMail);

Related

Optimize and speed up MySQL query selection

I'm trying to figure out the best way to optimize my current selection query on a MySQL database.
I have 2 MySQL tables with a one-to-many relationship. One is the user table, which contains the unique list of users and has around 22k rows. The other is the linedata table, which contains all the possible coordinates for each user and has around 490k rows.
In this case we can assume the foreign key between the 2 tables is the id value. In the user table the id is also the auto-increment primary key, while in the linedata table it is not a primary key, because we can have multiple rows for the same user.
The CREATE STMT structure
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isActive` tinyint(4) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
`age` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`timestamp` datetime NOT NULL,
`x` float NOT NULL,
`y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The selection query
SELECT
u.id,
u.isActive,
u.userId,
u.name,
u.gender,
u.age,
GROUP_CONCAT(CONCAT_WS(', ',timestamp,x, y)
ORDER BY timestamp ASC SEPARATOR '; '
) as linedata_0
FROM user u
JOIN linedata l
ON u.id=l.id
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
GROUP BY userId;
The EXPLAIN output
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | EXTRA |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| 1 | SIMPLE | l | ALL | NULL | NULL | NULL | NULL | 491157 | "Using where; Using temporary; Using filesort" |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | l.id | 1 | NULL |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
The selection query works if, for example, I add another WHERE condition to filter single users. Let's say I select just 200 users: then I get around 14 seconds of execution time, and around 7 seconds if I select just the first 100 users. But with only the datetime range condition it seems to load without ever finishing. Any suggestions?
UPDATE
After following Rick's suggestions, the query benchmark is now around 14 seconds. Here below is the EXPLAIN EXTENDED:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,u,index,PRIMARY,PRIMARY,4,NULL,21959,100.00,NULL
1,PRIMARY,l,ref,id_timestamp_index,id_timestamp_index,4,u.id,14,100.00,"Using index condition"
2,"DEPENDENT SUBQUERY",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"No tables used"
I have changed some values in the tables:
The id in the user table can now be joined with userId in the linedata table, and both are integers now. We keep a string type only for the userId value in the user table, because it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0.
So, just to make a quick example, we will have in the user and linedata tables:
+-------+-----------+-----------+-------------------+--------+---+
| id | isActive | userId | name | gender |age|
+-------+-----------+-----------+-------------------+--------+---+
| 1 | 1 | x4by4d | john | m | 22|
| 2 | 1 | 3ub3ub | bob | m | 50|
+-------+-----------+-----------+-------------------+--------+---+
+-------+-----------+-----------+------+---+
| id | userId |timestamp | x | y |
+-------+-----------+-----------+------+----+
| 1 | 1 | somedate | 30 | 10 |
| 2 | 1 | somedate | 45 | 15 |
| 3 | 1 | somedate | 50 | 20 |
| 4 | 2 | somedate | 20 | 5 |
| 5 | 2 | somedate | 25 | 10 |
+-------+-----------+-----------+------+----+
I have added a compound index made of the userId and timestamp values in the linedata table.
Maybe instead of having an auto-increment id value as the primary key of the linedata table, I could add a composite primary key made of userId+timestamp? Would that increase performance or not?
I need to help you fix several bugs before discussing performance.
First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000' and something needs to be done about the timezone.
Then, don't "hide a column in a function". That is DATEDIFF(l.timestamp ...) cannot use any indexing on timestamp.
So, instead of
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
do something like
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
I'm confused about the two tables. Both have id and userid, yet you join on id. Perhaps instead of
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
...
you meant
CREATE TABLE `linedata` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- (the id for `linedata`)
`userId` int NOT NULL, -- to link to the other table
...
PRIMARY KEY(id)
...
Then there could be several linedata rows for each user.
At that point, this
JOIN linedata l ON u.id=l.id
becomes
JOIN linedata l ON u.id=l.userid
Now, for performance: linedata needs INDEX(userid, timestamp) - in that order.
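A sketch of adding that index (the index name is illustrative):
ALTER TABLE linedata ADD INDEX userid_timestamp (userId, `timestamp`);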
Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?
And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for about 50 points. If a 'user' can be in more than 50 spots in the selected date range, consider increasing group_concat_max_len before running the query.
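For example (a session-scoped setting; the value below is an arbitrary example, size it to your longest expected result):
SET SESSION group_concat_max_len = 1000000;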
To make it work even faster, reformulate it this way:
SELECT u.id, u.isActive, u.userId, u.name, u.gender, u.age,
       ( SELECT GROUP_CONCAT(CONCAT_WS(', ', timestamp, x, y)
                    ORDER BY timestamp ASC
                    SEPARATOR '; ')
           FROM linedata
          WHERE userId = u.id
            AND timestamp >= '2018-02-28 20:00:00.000'
            AND timestamp <  '2018-11-20 09:20:08.218'
       ) as linedata_0
FROM user u;
Another thing. You probably want to be able to look up a user by name; so add INDEX(name)
Oh, what the heck is the VARCHAR(255) for userID?? Ids are normally integers.

Subquery for faster result

I have this query which takes me more than 117 seconds on a mysql database.
SELECT users.*, users_oauth.*
FROM users
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
WHERE ( (MATCH (user_email) AGAINST ('sometext')) OR
        (MATCH (user_firstname) AGAINST ('sometext')) OR
        (MATCH (user_lastname) AGAINST ('sometext')) )
ORDER BY user_date_accountcreated DESC
LIMIT 1400, 50
How can I use a subquery in order to optimize it ?
The 3 fields are fulltext:
ALTER TABLE `users` ADD FULLTEXT KEY `email_fulltext` (`user_email`);
ALTER TABLE `users` ADD FULLTEXT KEY `firstname_fulltext` (`user_firstname`);
ALTER TABLE `users` ADD FULLTEXT KEY `lastname_fulltext` (`user_lastname`);
There is only one search input on the website, which searches across the different users table fields.
If the limit is, for example, LIMIT 0,50, the query runs in less than 3 seconds, but as the LIMIT offset increases the query becomes very slow.
Thanks.
Use a single FULLTEXT index:
FULLTEXT(user_email, user_firstname, user_lastname)
And change the 3 matches to just one:
MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
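For example, the combined index could replace the three single-column ones like this (a sketch; the combined index name matches the one added later in this thread):
ALTER TABLE `users`
DROP KEY `email_fulltext`,
DROP KEY `firstname_fulltext`,
DROP KEY `lastname_fulltext`,
ADD FULLTEXT KEY `fulltext_adminsearch` (`user_email`,`user_firstname`,`user_lastname`);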
Here's another issue: ORDER BY ... DESC LIMIT 1400, 50. Read about the evils of pagination via OFFSET. That has a workaround, but I doubt it would apply to your statement.
Do you really have thousands of users matching the text? Does someone (other than a search engine robot) really page through 29 pages? Think about whether it makes sense to have such a long-winded UI.
And a 3rd issue. Consider "lazy eval". That is, find the user ids first, then join back to users and users_oauth to get the rest of the columns. It would be a single SELECT with the MATCH in a derived table, then a JOIN to the two tables. If the ORDER BY and LIMIT can be in the derived table, it could be a big win.
Please indicate which table each column belongs to -- my last paragraph is imprecise because of not knowing about the date column.
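For illustration only, a sketch of that "lazy eval" shape, assuming user_date_accountcreated is a column of users and the combined FULLTEXT index exists:
SELECT u.*, o.*
FROM (
    SELECT user_id, user_date_accountcreated
    FROM users
    WHERE MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
    ORDER BY user_date_accountcreated DESC
    LIMIT 1400, 50
) AS ids
JOIN users u ON u.user_id = ids.user_id
LEFT JOIN users_oauth o ON o.oauth_user_id = u.user_id
ORDER BY ids.user_date_accountcreated DESC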
Update
In your second attempt, you added OR, which greatly slows things down. Let's turn that into a UNION to try to avoid the new slowdown. First let's debug the UNION:
( SELECT * -- no mention of oauth columns
FROM users -- No JOIN
WHERE users.user_id LIKE ...
ORDER BY user_id DESC
LIMIT 0, 50
)
UNION ALL
( SELECT * -- no mention of oauth columns
FROM users
WHERE MATCH ...
ORDER BY user_id DESC
LIMIT 0, 50
)
Test it by timing each SELECT separately. If one of them is still slow, then let's focus on it. Then test the UNION. (This is a case where using the mysql commandline tool may be more convenient than PHP.)
By splitting, each SELECT can use an optimal index. The UNION has some overhead, but possibly less than the inefficiency of OR.
Now let's fold in users_oauth.
First, you seem to be missing a very important INDEX(oauth_user_id). Add that!
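For example (a sketch; the index name is illustrative):
ALTER TABLE users_oauth ADD INDEX idx_oauth_user_id (oauth_user_id);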
Now let's put them together.
SELECT u.*
FROM ( .... the entire union query ... ) AS u
LEFT JOIN users_oauth ON u.user_id = users_oauth.oauth_user_id
ORDER BY user_id DESC -- yes, repeat
LIMIT 0, 50 -- yes, repeat
Yes @Rick
I changed the index fulltext to:
ALTER TABLE `users`
ADD FULLTEXT KEY `fulltext_adminsearch` (`user_email`,`user_firstname`,`user_lastname`);
And now there are some PHP conditions; $_POST['search'] can be empty:
if(!isset($_POST['search'])) {
    $searchId = '%' ;
} else {
    $searchId = $_POST['search'] ;
}
$searchMatch = '+'.str_replace(' ', ' +', $_POST['search']);
$sqlSearch = $dataBase->prepare(
'SELECT users.*, users_oauth.*
FROM users
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
WHERE ( users.user_id LIKE :id OR
(MATCH (user_email, user_firstname, user_lastname)
AGAINST (:match IN BOOLEAN MODE)) )
ORDER BY user_id DESC LIMIT 0,50') ;
$sqlSearch->execute(array('id' => $searchId,
'match' => $searchMatch )) ;
The users_oauth table has a column with user_id:
Table users:
+--------------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+-----------------+------+-----+---------+----------------+
| user_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| user_activation_key | varchar(40) | YES | | NULL | |
| user_email | varchar(40) | NO | UNI | | |
| user_login | varchar(30) | YES | | NULL | |
| user_password | varchar(40) | YES | | NULL | |
| user_firstname | varchar(30) | YES | | NULL | |
| user_lastname | varchar(50) | YES | | NULL | |
| user_lang | varchar(2) | NO | | en | |
+--------------------------+-----------------+------+-----+---------+----------------+
Table users_oauth:
+----------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+-----------------+------+-----+---------+----------------+
| oauth_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| oauth_user_id | int(8) unsigned | NO | | NULL | |
| oauth_google_id | varchar(30) | YES | UNI | NULL | |
| oauth_facebook_id | varchar(30) | YES | UNI | NULL | |
| oauth_windowslive_id | varchar(30) | YES | UNI | NULL | |
+----------------------+-----------------+------+-----+---------+----------------+
The LEFT JOIN is slow: the request takes 3 seconds with it and 0.0158 seconds without.
It would be faster to make a separate SQL request for each batch of 50 rows.
Would it be faster with a subquery? How would I write it with a subquery?
Thanks

Need help optimizing outer join SQL query

I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.
I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.
The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer each day), one row (activity was only on one server for that day), or multiple rows (activity was on multiple servers for that day).
We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).
Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this required an outer join.
Here is the query I am using:
SELECT
Accounts.accountID as AccountID,
IFNULL(Accounts.name,Accounts.accountID) as AccountName,
AccountPlans.plantype as AccountType,
Accounts.status as AccountStatus,
date(Accounts.created_at) as Created,
sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts`
LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
(`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01')
or `datausage`.`LogDate` is null
)
GROUP BY Accounts.accountID
ORDER BY `AccountName` asc
This query takes about 2 seconds to run. However, it only takes 0.3 seconds if the "or datausage.LogDate is NULL" clause is removed. But it seems I must have that clause, because accounts with no usage are excluded from the result set without it.
Here is the table data:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+------- +----------------------------------------------------+
| 1 | SIMPLE | Accounts | ALL | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL | NULL | NULL | 57 | Using temporary; Using filesort |
| 1 | SIMPLE | datausage | ALL | NULL | NULL | NULL | NULL | 96805 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | AccountPlans | eq_ref | PRIMARY | PRIMARY | 4 | mydb.Accounts.planID | 1 | NULL |
The indexes on Accounts table are as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts | 0 | PRIMARY | 1 | accountID | A | 57 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_planid_foreign | 1 | planID | A | 5 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_cardid_foreign | 1 | cardID | A | 0 | NULL | NULL | YES | BTREE | | |
The index on the datausage table is as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage | 0 | PRIMARY | 1 | UsageID | A | 96805 | NULL | NULL | | BTREE | | |
I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on AccountID, an index on (AccountID, LogDate), an index on (LogDate, AccountID), and an index on LogDate. None of these made any difference.
I also tried using a UNION ALL of one query with the LogDate range and another query with just LogDate is null, but the result was about the same (actually a bit worse).
Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!
UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.
CREATE TABLE `Accounts` (
`accountID` varchar(25) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` int(11) NOT NULL,
`planID` int(10) unsigned NOT NULL DEFAULT '1',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`accountID`),
KEY `accounts_planid_foreign` (`planID`),
KEY `acctname_id_ndx` (`name`,`accountID`),
CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `datausage` (
`UsageID` int(11) NOT NULL AUTO_INCREMENT,
`Core` int(11) DEFAULT NULL,
`CoreDeluxe` int(11) DEFAULT NULL,
`AccountID` varchar(25) DEFAULT NULL,
`LogDate` date DEFAULT NULL,
PRIMARY KEY (`UsageID`),
KEY `acctusage` (`AccountID`,`LogDate`)
) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1
CREATE TABLE `AccountPlans` (
`planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`params` text COLLATE utf8_unicode_ci NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`planID`),
KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
First, you can simplify the query by moving the where clause to the on clause:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID
ORDER BY AccountName asc ;
(I also introduced table aliases to make the query easier to read.)
This version should make better uses of indexes because it eliminates the or in the where clause. However, it still won't use an index for the outer sort. The following might be better:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.PlanID = a.PlanID
GROUP BY a.accountID
ORDER BY a.name, a.accountID ;
For this, I would recommend the following indexes:
Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)
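For reference, they could be created as follows (the names below are reused from the equivalent keys already shown in the updated table definitions above, so only add whichever are actually missing):
CREATE INDEX acctname_id_ndx ON Accounts (name, accountID);
CREATE INDEX acctusage ON datausage (AccountID, LogDate);
CREATE INDEX acctplans_id_type_ndx ON AccountPlans (planID, plantype);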
When you left join with datausage you should restrict the output as much as possible right there. (JOIN means AND means WHERE means ON. Put the conditions in essentially whatever order will be clear and/or optimize when necessary.) The result will be a null-extended row when there was no usage; you want to leave that row in.
When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.
The version below has the AccountPlans join as an inner join, and puts it first. The (indexed) Accounts FK on PlanID to AccountPlans means the DBMS knows the inner join will only ever generate one row per Accounts PK. So the output has key AccountID. That result can then be immediately joined to datausage. (An index on its AccountID should help, eg for a merge join.) The other way around, there is no PlanID key/index on the outer join result to join with AccountPlans.
SELECT
a.accountID as AccountID,
IFNULL(a.name,a.accountID) as AccountName,
ap.plantype as AccountType,
a.status as AccountStatus,
date(a.created_at) as Created,
sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
JOIN AccountPlans ap ON ap.PlanID = a.PlanID
LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID

Optimizing an InnoDB table and a problematic query

I have a biggish InnoDB table which at this moment contains about 20 million rows with ~20000 new rows inserted every day. They contain messages for different topics.
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`TopicID` bigint(20) unsigned NOT NULL,
`DATESTAMP` int(11) DEFAULT NULL,
`TIMESTAMP` int(10) unsigned NOT NULL,
`Message` mediumtext NOT NULL,
`Checksum` varchar(50) DEFAULT NULL,
`Nickname` varchar(80) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `TopicID` (`TopicID`,`Checksum`),
KEY `DATESTAMP` (`DATESTAMP`),
KEY `Nickname` (`Nickname`),
KEY `TIMESTAMP` (`TIMESTAMP`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=25195126 ;
NOTE: The Checksum column stores an MD5 checksum which prevents the same message from being inserted twice into the same topic. (nickname + timestamp + topicid + last 20 chars of message)
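For illustration, such a checksum could be computed in SQL roughly like this (the exact concatenation format used by the application is an assumption based on the description above):
SELECT MD5(CONCAT(Nickname, `TIMESTAMP`, TopicID, RIGHT(Message, 20))) FROM Messages;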
The site I'm building has a newsfeed in which users can select to view newest messages from different Nicknames from different forums. The query is as follows:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Topics.SubforumID = Subforums.ID
WHERE
Messages.Nickname = FollowedNicknames.Nickname AND
Messages.TopicID = Topics.ID AND Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
The TIMESTAMP column contains a unix timestamp and DATESTAMP is simply a date generated from the unix timestamp, for faster access via the '=' operator instead of range scans on unix timestamps.
The problem is, this query takes about 13 seconds (or more) unbuffered. That is of course unacceptable for the intended usage. Adding the DATESTAMP condition seemed to speed things up, but not by much.
At this point, I don't really know what I should do. I've read about composite primary keys, but I am still unsure whether they would do any good and how to correctly implement one in this particular case.
I know that using BIGINTs may be a little overkill, but do they matter that much?
EXPLAIN:
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | FollowedNicknames | ALL | UserID,ForumID,Nickname | NULL | NULL | NULL | 8 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | Forums | eq_ref | PRIMARY | PRIMARY | 8 | database.FollowedNicknames.ForumiID | 1 | NULL |
| 1 | SIMPLE | Messages | ref | TopicID,DATETIME,Nickname | Nickname | 242 | database.FollowedNicknames.Nickname | 15 | Using where |
| 1 | SIMPLE | Topics | eq_ref | PRIMARY,SubforumID | PRIMARY | 8 | database.Messages.TopicID | 1 | NULL |
| 1 | SIMPLE | Subforums | eq_ref | PRIMARY,ForumID | PRIMARY | 8 | database.Topics.SubforumID | 1 | Using where |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
You shouldn't be JOINing on a VARCHAR column (Nickname); you should use the user ID to join those tables. That is definitely slowing the query down and is probably the biggest issue. It would also be easier to follow if you wrote all of the JOINs explicitly instead of at the end in the WHERE clause like this:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON Messages.Nickname = FollowedNicknames.Nickname
AND FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Messages.TopicID = Topics.ID
AND Topics.SubforumID = Subforums.ID
WHERE Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
Instead of INT as the data type for the DATESTAMP column, I would use DATE. The Checksum column should probably use latin1_general_ci as the collation. I would use INT for the ID columns as long as their values are less than 2,000,000,000 since INT UNSIGNED can store values up to roughly 4,000,000,000. InnoDB is affected by the primary key much more than MyISAM and it could make a noticeable difference.
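A sketch of the column changes suggested above (verify names, sizes and defaults against the real schema before running):
ALTER TABLE Messages
MODIFY `DATESTAMP` DATE DEFAULT NULL,
MODIFY `Checksum` varchar(50) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL;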

GeoIP table join with table of IP's in MySQL

I am having an issue finding a fast way of joining tables that look like this:
mysql> explain geo_ip;
+--------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+-------+
| ip_start | varchar(32) | NO | | "" | |
| ip_end | varchar(32) | NO | | "" | |
| ip_num_start | int(64) unsigned | NO | PRI | 0 | |
| ip_num_end | int(64) unsigned | NO | | 0 | |
| country_code | varchar(3) | NO | | "" | |
| country_name | varchar(64) | NO | | "" | |
| ip_poly | geometry | NO | MUL | NULL | |
+--------------+------------------+------+-----+---------+-------+
mysql> explain entity_ip;
+------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+-------+
| entity_id | int(64) unsigned | NO | PRI | NULL | |
| ip_1 | tinyint(3) unsigned | NO | | NULL | |
| ip_2 | tinyint(3) unsigned | NO | | NULL | |
| ip_3 | tinyint(3) unsigned | NO | | NULL | |
| ip_4 | tinyint(3) unsigned | NO | | NULL | |
| ip_num | int(64) unsigned | NO | | 0 | |
| ip_poly | geometry | NO | MUL | NULL | |
+------------+---------------------+------+-----+---------+-------+
Please note that I am not interested in finding the needed rows in geo_ip by only ONE IP address at a time; I need an entity_ip LEFT JOIN geo_ip (or a similar/analogous way).
This is what I have for now (using polygons as advised on http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/):
mysql> EXPLAIN SELECT li.*, gi.country_code FROM entity_ip AS li
-> LEFT JOIN geo_ip AS gi ON
-> MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`);
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | ip_poly_index | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity AS li LEFT JOIN geo_ip AS gi ON MBRCONTAINS(gi.`ip_poly`, li.`ip_poly`) limit 0, 20;
20 rows in set (2.22 sec)
No polygons
mysql> explain SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.`ip_num` >= gi.`ip_num_start` AND li.`ip_num` <= gi.`ip_num_end` LIMIT 0,20;
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
| 1 | SIMPLE | li | ALL | NULL | NULL | NULL | NULL | 2470 | |
| 1 | SIMPLE | gi | ALL | PRIMARY,geo_ip,geo_ip_end | NULL | NULL | NULL | 155183 | |
+----+-------------+-------+------+---------------------------+------+---------+------+--------+-------+
mysql> SELECT li.*, gi.country_code FROM entity_ip AS li LEFT JOIN geo_ip AS gi ON li.ip_num BETWEEN gi.ip_num_start AND gi.ip_num_end limit 0, 20;
20 rows in set (2.00 sec)
(On higher number of rows in the search - there is no difference)
Currently I cannot get any faster performance from these queries as 0.1 seconds per IP is way too slow for me.
Is there any way to make it faster?
This approach has some scalability issues (should you choose to move to, say, city-specific geoip data), but for the given size of data, it will provide considerable optimization.
The problem you are facing is effectively that MySQL does not optimize range-based queries very well. Ideally you want to do an exact ("=") look-up on an index rather than "greater than", so we'll need to build an index like that from the data you have available. This way MySQL will have much fewer rows to evaluate while looking for a match.
To do this, I suggest that you create a look-up table that indexes the geolocation table based on the first octet (=1 from 1.2.3.4) of the IP addresses. The idea is that for each look-up you have to do, you can ignore all geolocation IPs which do not begin with the same octet as the IP you are looking for.
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
KEY `first_octet` (`first_octet`,`ip_numeric_start`,`ip_numeric_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Next, we need to take the data available in your geolocation table and produce data that covers all (first) octets the geolocation row covers: If you have an entry with ip_start = '5.3.0.0' and ip_end = '8.16.0.0', the lookup table will need rows for octets 5, 6, 7, and 8. So...
ip_geolocation
|ip_start |ip_end |ip_numeric_start|ip_numeric_end|
|72.255.119.248 |74.3.127.255 |1224701944 |1241743359 |
Should convert to:
ip_geolocation_lookup
|first_octet|ip_numeric_start|ip_numeric_end|
|72 |1224701944 |1241743359 |
|73 |1224701944 |1241743359 |
|74 |1224701944 |1241743359 |
Since someone here asked for a native MySQL solution, here's a stored procedure that will generate that data for you:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DELETE FROM ip_geolocation_lookup;
WHILE i < 256 DO
INSERT INTO ip_geolocation_lookup (first_octet, ip_numeric_start, ip_numeric_end)
SELECT i, ip_numeric_start, ip_numeric_end FROM ip_geolocation WHERE
( ip_numeric_start & 0xFF000000 ) >> 24 <= i AND
( ip_numeric_end & 0xFF000000 ) >> 24 >= i;
SET i = i + 1;
END WHILE;
END;
And then you will need to populate the table by calling that stored procedure:
CALL recalculate_ip_geolocation_lookup();
At this point you may delete the procedure you just created -- it is no longer needed, unless you want to recalculate the look-up table.
After the look-up table is in place, all you have to do is integrate it into your queries and make sure you're querying by the first octet. Your query to the look-up table will satisfy two conditions:
Find all rows which match the first octet of your IP address
Of that subset: Find the row which has the range that matches your IP address
Because the step two is carried out on a subset of data, it is considerably faster than doing the range tests on the entire data. This is the key to this optimization strategy.
There are various ways for figuring out what the first octet of an IP address is; I used ( r.ip_numeric & 0xFF000000 ) >> 24 since my source IPs are in numeric form:
SELECT
r.*,
g.country_code
FROM
ip_geolocation g,
ip_geolocation_lookup l,
ip_random r
WHERE
l.first_octet = ( r.ip_numeric & 0xFF000000 ) >> 24 AND
l.ip_numeric_start <= r.ip_numeric AND
l.ip_numeric_end >= r.ip_numeric AND
g.ip_numeric_start = l.ip_numeric_start;
Now, admittedly I did get a little lazy in the end: You could easily get rid of the ip_geolocation table altogether if you made the ip_geolocation_lookup table also contain the country data. I'm guessing dropping one table from this query would make it a bit faster.
And, finally, here are the two other tables I used in this response for reference, since they differ from your tables. I'm certain they are compatible, though.
# This table contains the original geolocation data
CREATE TABLE `ip_geolocation` (
`ip_start` varchar(16) NOT NULL DEFAULT '',
`ip_end` varchar(16) NOT NULL DEFAULT '',
`ip_numeric_start` int(10) unsigned NOT NULL DEFAULT '0',
`ip_numeric_end` int(10) unsigned NOT NULL DEFAULT '0',
`country_code` varchar(3) NOT NULL DEFAULT '',
`country_name` varchar(64) NOT NULL DEFAULT '',
PRIMARY KEY (`ip_numeric_start`),
KEY `country_code` (`country_code`),
KEY `ip_start` (`ip_start`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# This table simply holds random IP data that can be used for testing
CREATE TABLE `ip_random` (
`ip` varchar(16) NOT NULL DEFAULT '',
`ip_numeric` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Just wanted to give back to the community:
Here's an even better and optimized way building on Aleksi's solution:
DROP PROCEDURE IF EXISTS recalculate_ip_geolocation_lookup;
DELIMITER ;;
CREATE PROCEDURE recalculate_ip_geolocation_lookup()
BEGIN
DECLARE i INT DEFAULT 0;
DROP TABLE `ip_geolocation_lookup`;
CREATE TABLE `ip_geolocation_lookup` (
`first_octet` smallint(5) unsigned NOT NULL DEFAULT '0',
`startIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`endIpNum` int(10) unsigned NOT NULL DEFAULT '0',
`locId` int(11) NOT NULL,
PRIMARY KEY (`first_octet`,`startIpNum`,`endIpNum`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT startIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
INSERT IGNORE INTO ip_geolocation_lookup
SELECT endIpNum DIV 1048576 as first_octet, startIpNum, endIpNum, locId
FROM ip_geolocation;
WHILE i < 1048576 DO
INSERT IGNORE INTO ip_geolocation_lookup
SELECT i, startIpNum, endIpNum, locId
FROM ip_geolocation_lookup
WHERE first_octet = i-1
AND endIpNum DIV 1048576 > i;
SET i = i + 1;
END WHILE;
END;;
DELIMITER ;
CALL recalculate_ip_geolocation_lookup();
It builds much faster than his solution and drills down more easily, because we're not just taking the first 8 bits but the first 12 (DIV 1048576 keeps the top 12 bits of a 32-bit address). Join performance: 100000 rows in 158ms. You might have to rename the table and field names to match your version.
Query by using
SELECT ip, kl.*
FROM random_ips ki
JOIN `ip_geolocation_lookup` kb ON (ki.`ip` DIV 1048576 = kb.`first_octet` AND ki.`ip` >= kb.`startIpNum` AND ki.`ip` <= kb.`endIpNum`)
JOIN ip_maxmind_locations kl ON kb.`locId` = kl.`locId`;
Can't comment yet, but user1281376's answer is wrong and doesn't work. The reason you only use the first octet is that you aren't going to match all IP ranges otherwise: there are plenty of ranges that span multiple second octets, which user1281376's changed query isn't going to match. And yes, this actually happens if you use the Maxmind GeoIP data.
With Aleksi's suggestion you can do a simple comparison on the first octet, thus reducing the matching set.
I found an easy way. I noticed that the first IP of every group satisfies ip % 256 = 0,
so we can add an ip_index table:
CREATE TABLE `t_map_geo_range` (
`_ip` int(10) unsigned NOT NULL,
`_ipStart` int(10) unsigned NOT NULL,
PRIMARY KEY (`_ip`)
) ENGINE=MyISAM
How to fill the index table
FOR_EACH(Every row of ip_geo)
{
FOR(Every ip FROM ipGroupStart/256 to ipGroupEnd/256)
{
INSERT INTO ip_geo_index(ip, ipGroupStart);
}
}
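A possible literal implementation of that pseudocode as a stored procedure, reusing the question's geo_ip columns (ip_num_start, ip_num_end) and the t_map_geo_range table above; the procedure name is illustrative:
DROP PROCEDURE IF EXISTS fill_t_map_geo_range;
DELIMITER ;;
CREATE PROCEDURE fill_t_map_geo_range()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_start, v_end, v_block INT UNSIGNED;
  DECLARE cur CURSOR FOR SELECT ip_num_start, ip_num_end FROM geo_ip;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  DELETE FROM t_map_geo_range;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_start, v_end;
    IF done THEN LEAVE read_loop; END IF;
    SET v_block = v_start DIV 256;
    WHILE v_block <= v_end DIV 256 DO
      -- one lookup row per 256-address block covered by this range
      INSERT IGNORE INTO t_map_geo_range (_ip, _ipStart) VALUES (v_block, v_start);
      SET v_block = v_block + 1;
    END WHILE;
  END LOOP;
  CLOSE cur;
END;;
DELIMITER ;
CALL fill_t_map_geo_range();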
How to use:
SELECT * FROM YOUR_TABLE AS A
LEFT JOIN ip_geo_index AS B ON B._ip = A._ip DIV 256
LEFT JOIN ip_geo AS C ON C.ipStart = B.ipStart;
More than 1000 times Faster.