mysql select query optimization [closed] - mysql

Closed 10 years ago.
I have two tables, testa and testb.
CREATE TABLE `testa` (
`id` INT(10) NOT NULL AUTO_INCREMENT,
`name` VARCHAR(50) DEFAULT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `testb` (
`id` INT(10) NOT NULL AUTO_INCREMENT,
`name` VARCHAR(50) DEFAULT NULL,
`aid1` INT(10) DEFAULT NULL,
`aid2` INT(10) DEFAULT NULL,
`aid3` INT(10) DEFAULT NULL,
PRIMARY KEY (`id`)
);
Currently I am running the query below to retrieve all rows where id in the testa table matches any of the columns aid1, aid2, aid3 in testb. The query returns accurate results, but it takes at least 30 seconds to execute, which is too much. I have also tried to optimise my query using UNION but failed to do so.
SELECT a.id, a.name, b.name, b.id
FROM testb b
INNER JOIN testa a ON b.aid1 = a.id OR b.aid2 = a.id OR b.aid3 = a.id ;
How do I optimize my query so its total execution time is within 2-3 seconds?
Thanks in advance...
Result of EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE b ALL idx_aid1,idx_aid2,idx_aid3 (NULL) (NULL) (NULL) 10940
1 SIMPLE a ALL PRIMARY (NULL) (NULL) (NULL) 7512 Using where; Using join buffer

Because you permit aid1, aid2 and aid3 to be NULL (and apparently they are mostly NULL, per your explanation), your join condition is effectively not indexable.
Why? The SQL expression b.aid1 = a.id OR b.aid2 = a.id OR b.aid3 = a.id
evaluates to NULL whenever none of the comparisons is true and at least one of aid1, aid2 or aid3 is NULL, and this is why the MySQL planner does not show an index being used.
Solution: do not use NULLs for aid1, aid2, aid3. Instead, pick a special id (say 0) which is guaranteed not to exist in testa.
Then make testb.aid1, testb.aid2 and testb.aid3 NOT NULL (assigning 0 wherever the value was NULL before).
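To make the NULL behaviour of the OR condition concrete: under SQL's three-valued logic, an OR is TRUE as soon as one operand is TRUE, and it is NULL only when no operand is TRUE and at least one is NULL. Below is a minimal sketch, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable; the helper name evaluate is hypothetical).

```python
import sqlite3

# SQLite is used only as a convenient stand-in; MySQL applies the same
# three-valued logic to OR.
conn = sqlite3.connect(":memory:")


def evaluate(expr):
    """Return the value of a scalar SQL expression (None means SQL NULL)."""
    return conn.execute(f"SELECT {expr}").fetchone()[0]


# aid1 and aid3 are NULL but aid2 matches: the OR is TRUE, so the row joins.
one_match = evaluate("(NULL = 7) OR (7 = 7) OR (NULL = 7)")

# Nothing matches and at least one operand is NULL: the OR is NULL.
no_match_with_null = evaluate("(NULL = 7) OR (3 = 7) OR (NULL = 7)")

# Nothing matches and nothing is NULL (0 as the placeholder id): the OR is FALSE.
no_match = evaluate("(0 = 7) OR (3 = 7) OR (0 = 7)")

print(one_match, no_match_with_null, no_match)  # prints: 1 None 0
```

Both NULL and FALSE cause a row to be dropped from an inner join, so replacing NULL with 0 does not change which rows are returned; whether it changes the plan is something to confirm with EXPLAIN on your own data.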
EDIT: Adding alternative approach to this problem.
You can also solve this problem, if you can afford to change your schema, by adding one more table. This new table will contain the list of aid values you currently store in testb, and testb will contain just one id linking to the new table. This should be similar to what is explained in this answer. An additional advantage is that you can permit an arbitrary number of aid values (not just 3 as you have now).
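One possible layout for such an extra table is sketched below. It is a simplified variant that keys the link rows directly by testb.id rather than by a separate list id; the table name testb_aid and the index name are hypothetical, and SQLite (via Python's sqlite3 module) stands in for MySQL so that the sketch is runnable.

```python
import sqlite3

# SQLite stands in for MySQL; testb_aid and idx_testb_aid_a are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE testa (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE testb (id INTEGER PRIMARY KEY, name TEXT,
                    aid1 INTEGER, aid2 INTEGER, aid3 INTEGER);

-- One row per (testb row, referenced testa id); no NULLs are stored.
CREATE TABLE testb_aid (
    b_id INTEGER NOT NULL,
    a_id INTEGER NOT NULL,
    PRIMARY KEY (b_id, a_id)
);
CREATE INDEX idx_testb_aid_a ON testb_aid (a_id);

INSERT INTO testa VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO testb VALUES (1, 'b1', 1, 2, NULL),
                         (2, 'b2', NULL, NULL, 3),
                         (3, 'b3', NULL, NULL, NULL);

-- Migrate the three nullable columns into link rows.
INSERT INTO testb_aid (b_id, a_id)
SELECT id, aid1 FROM testb WHERE aid1 IS NOT NULL
UNION
SELECT id, aid2 FROM testb WHERE aid2 IS NOT NULL
UNION
SELECT id, aid3 FROM testb WHERE aid3 IS NOT NULL;
""")

# The three-way OR condition becomes two plain equality joins.
link_rows = conn.execute("""
    SELECT a.id, a.name, b.name, b.id
    FROM testb_aid l
    JOIN testb b ON b.id = l.b_id
    JOIN testa a ON a.id = l.a_id
    ORDER BY a.id, b.id
""").fetchall()

for row in link_rows:
    print(row)
```

With this layout each join condition is a single-column equality, which is the shape an ordinary index can serve.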

In addition to the indexing that others have suggested, make sure you ANALYZE your tables (ANALYZE TABLE in MySQL) so that the statistics on the tables are up to date. If the statistics are wildly different from what's actually in the table, the query planner will make bad choices.

You should index the following columns to avoid a full table scan:
`aid1` INT(10) DEFAULT NULL,
`aid2` INT(10) DEFAULT NULL,
`aid3` INT(10) DEFAULT NULL,
To add the indexes to the existing table:
ALTER TABLE testb ADD INDEX (aid1);
ALTER TABLE testb ADD INDEX (aid2);
ALTER TABLE testb ADD INDEX (aid3);

Have you tried joining with IN instead of OR?
SELECT a.id, a.name, b.name, b.id
FROM testb b
INNER JOIN testa a ON a.id IN (b.aid1, b.aid2, b.aid3) ;
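The question mentions an attempted UNION rewrite. As a hedged sketch of what that could look like, the snippet below checks on a tiny made-up data set that the OR join, the IN join and a three-branch UNION return the same rows. SQLite (Python's sqlite3 module) stands in for MySQL, and the sample rows are invented. The sketch only demonstrates equivalence of results, not speed; in MySQL each UNION branch is a single-column equality join that can use the per-column indexes, which you would need to confirm with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE testa (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE testb (id INTEGER PRIMARY KEY, name TEXT,
                    aid1 INTEGER, aid2 INTEGER, aid3 INTEGER);
INSERT INTO testa VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO testb VALUES (1, 'b1', 1, 2, NULL),
                         (2, 'b2', NULL, NULL, 3),
                         (3, 'b3', NULL, NULL, NULL),
                         (4, 'b4', 2, 2, NULL);  -- same id in two columns
""")

COLUMNS = "a.id, a.name, b.name, b.id"

queries = {
    "or": f"SELECT {COLUMNS} FROM testb b JOIN testa a "
          "ON b.aid1 = a.id OR b.aid2 = a.id OR b.aid3 = a.id",
    "in": f"SELECT {COLUMNS} FROM testb b JOIN testa a "
          "ON a.id IN (b.aid1, b.aid2, b.aid3)",
    # UNION (not UNION ALL) removes the duplicate produced by row b4,
    # which matches a2 through both aid1 and aid2.
    "union": " UNION ".join(
        f"SELECT {COLUMNS} FROM testb b JOIN testa a ON b.{col} = a.id"
        for col in ("aid1", "aid2", "aid3")
    ),
}

results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)
```

The UNION form matches the OR join here because both a.id and b.id are in the select list, so removing duplicates cannot merge rows that come from different pairs of records.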

Related

2 Days ago mysql query was fine, today it's super slow

Okay, so as the title says, the queries were running fine 2 days ago, then all of a sudden yesterday the site was loading very slow. I tracked it down to a couple queries. One I was able to add an index which seems to have helped, but this one I just can't figure out. I tried running a repair and optimize on the tables, and that didn't help. I don't know what could have changed so much that would make it go from less than a second to query to 20+ seconds. Any help would be much appreciated.
SELECT city
FROM listings LEFT JOIN agencies
ON listings.agencyid_fk = agencies.agencyid
WHERE listingstatus IN (1,3) AND appid_fk = 101 AND active = 1
AND auction IS NULL AND agencies.multilist = 1
AND notagency IS NULL
GROUP BY city
ORDER BY city;
Here is the EXPLAIN result for the query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE listings ref appid_fk,listingstatus appid_fk 2 const 21699 Using where; Using temporary; Using filesort
1 SIMPLE agencies eq_ref PRIMARY,active PRIMARY 2 mls2.listings.agencyid_fk 1 Using where
And here are the tables...
listings table:
CREATE TABLE mls2.listings (
listingid INT(11) AUTO_INCREMENT NOT NULL,
appid_fk SMALLINT(3) NOT NULL DEFAULT '0',
agencyid_fk SMALLINT(3) NOT NULL DEFAULT '0',
listingstatus SMALLINT(3),
city VARCHAR(30) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
multilist TINYINT(1),
auction TINYINT(1),
PRIMARY KEY (listingid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
agencies table:
CREATE TABLE mls2.agenciesx (
agencyid SMALLINT(6) AUTO_INCREMENT NOT NULL,
multilist TINYINT(4) DEFAULT '0',
notagency TINYINT(1),
active TINYINT(1),
PRIMARY KEY (agencyid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
Once you've taken on board the comments above, try adding the following indexes to your tables...
INDEX(city,listingstatus,appid_fk,auction)
INDEX(active,multilist,notagency)
In both cases, the order in which columns are arranged in the index may make a difference, so play around with those, although there are so few rows in the agencies table, that that one won't really matter.
Next, get rid of the GROUP BY clause, and write your query as follows.
SELECT DISTINCT l.city
FROM listings l
JOIN agencies a
ON a.agencyid = l.agencyid_fk
WHERE l.listingstatus IN (1,3)
AND l.appid_fk = 101
AND a.active = 1
AND l.auction IS NULL
AND a.multilist = 1
AND a.notagency IS NULL
ORDER
BY city;
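The rewrite above makes two changes that are worth checking for equivalence: LEFT JOIN becomes JOIN, and GROUP BY becomes DISTINCT. Because the WHERE clause requires active = 1 and multilist = 1 on the agencies side, rows without a matching agency are discarded either way. A small sketch with invented sample rows, using SQLite via Python's sqlite3 module as a stand-in for MySQL (column lists are trimmed to what the query touches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listings (listingid INTEGER PRIMARY KEY, appid_fk INTEGER NOT NULL,
                       agencyid_fk INTEGER NOT NULL, listingstatus INTEGER,
                       city TEXT, auction INTEGER);
CREATE TABLE agencies (agencyid INTEGER PRIMARY KEY, multilist INTEGER,
                       notagency INTEGER, active INTEGER);

INSERT INTO agencies VALUES (1, 1, NULL, 1),   -- qualifies
                            (2, 0, NULL, 1),   -- multilist = 0
                            (3, 1, NULL, 0);   -- inactive
INSERT INTO listings VALUES
    (1, 101, 1, 1, 'Boise',  NULL),
    (2, 101, 1, 3, 'Austin', NULL),
    (3, 101, 1, 1, 'Boise',  NULL),   -- duplicate city
    (4, 101, 2, 1, 'Denver', NULL),   -- agency fails multilist
    (5, 101, 99, 1, 'Eugene', NULL),  -- no matching agency at all
    (6, 101, 1, 2, 'Fresno', NULL),   -- wrong listingstatus
    (7, 101, 1, 1, 'Gary',   1),      -- auction is not NULL
    (8, 102, 1, 1, 'Helena', NULL),   -- wrong appid_fk
    (9, 101, 3, 1, 'Irving', NULL);   -- agency inactive
""")

original = conn.execute("""
    SELECT city
    FROM listings LEFT JOIN agencies
      ON listings.agencyid_fk = agencies.agencyid
    WHERE listingstatus IN (1,3) AND appid_fk = 101 AND active = 1
      AND auction IS NULL AND agencies.multilist = 1
      AND notagency IS NULL
    GROUP BY city
    ORDER BY city
""").fetchall()

rewritten = conn.execute("""
    SELECT DISTINCT l.city
    FROM listings l
    JOIN agencies a ON a.agencyid = l.agencyid_fk
    WHERE l.listingstatus IN (1,3) AND l.appid_fk = 101 AND a.active = 1
      AND l.auction IS NULL AND a.multilist = 1
      AND a.notagency IS NULL
    ORDER BY city
""").fetchall()

print(original, rewritten)
```

Listing 5 is the interesting case: the LEFT JOIN keeps it with NULLs on the agencies side, but active = 1 then filters it out, so both forms agree.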
Note: Although irrelevant for this particular problem, your original question showed that this schema is desperately in need of normalisation.

SQL query evaluates COUNT(*) differently if tables are defined as MyISAM or InnoDB

I am running a MySQL database.
I have the following script:
DROP TABLE IF EXISTS `org_apiinteg_assets`;
DROP TABLE IF EXISTS `assessmentinstances`;
CREATE TABLE `org_apiinteg_assets` (
`id` varchar(20) NOT NULL default '0',
`instance_id` varchar(20) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
CREATE TABLE `assessmentinstances` (
`id` varchar(20) NOT NULL default '0',
`title` varchar(180) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
INSERT INTO assessmentinstances(id, title) VALUES ('14026lvplotw6','One radio question survey');
INSERT INTO org_apiinteg_assets(id, instance_id) VALUES ('8kp9wgx43jflrgjfe','14026lvplotw6');
Looks like this
assessmentinstances
+---------------+---------------------------+
| id            | title                     |
+---------------+---------------------------+
| 14026lvplotw6 | One radio question survey |
+---------------+---------------------------+
org_apiinteg_assets
+-------------------+---------------+
| id                | instance_id   |
+-------------------+---------------+
| 8kp9wgx43jflrgjfe | 14026lvplotw6 |
+-------------------+---------------+
And I then have the following query (I reduced it to the simplest failing query)
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND ai.id = '14026lvplotw6'
AND a.id != '8kp9wgx43jflrgjfe';
When I run the query I get this
null, 0
Until now, all is good. Now, here is my issue, when I recreate both tables with ENGINE=InnoDB instead of ENGINE=MyISAM and run the same query again, I get this:
'14026lvplotw6','0'
So 2 things are confusing me:
Why don't I get the same result?
How can the COUNT(*) return 0 in the second case when it actually returns values for the row, and should therefore be 1?
I am lost, I'd appreciate if anybody could explain this behaviour to me.
EDIT:
Interestingly, if I add GROUP BY ai.id at the end of the query, it works fine in both cases and returns no rows.
This happens because you are using an aggregate function without GROUP BY. In that case the result for a non-aggregated column is unpredictable (typically it shows the first value encountered during the query).
Try adding a GROUP BY:
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND a.id != '8kp9wgx43jflrgjfe'
AND ai.id = '14026lvplotw6'
GROUP BY ai.id;
Remember that selecting a non-aggregated column that is not named in the GROUP BY is not valid in standard SQL and is rejected by most databases. MySQL also generally rejects it by default from version 5.7.5 on, where the ONLY_FULL_GROUP_BY SQL mode is enabled by default.
EXPLAIN SELECT for MyISAM returns: Impossible WHERE noticed after reading const tables. So MyISAM isn't processing any data at all.
For InnoDB there are two rows of EXPLAIN results: one Using index and one Using where. So the InnoDB data is being scanned, and bits of it slip into the output because no aggregate function is specified for the first column; AFAIK it is not specified what should happen in such a situation. If you explicitly apply an aggregate function and there are no matching rows, it will return NULL. So, for example, SELECT min(ai.id), COUNT(*) ... would return NULL, 0.
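The three variants discussed in these answers can be compared side by side. The sketch below uses SQLite (Python's sqlite3 module) as a stand-in, so the value it shows for the bare ai.id column is SQLite's choice and says nothing about MyISAM or InnoDB; the point is that only the GROUP BY form and the explicit-aggregate form have a well-defined result. The alias cnt replaces `count` purely for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assessmentinstances (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE org_apiinteg_assets (id TEXT PRIMARY KEY, instance_id TEXT);
INSERT INTO assessmentinstances VALUES ('14026lvplotw6', 'One radio question survey');
INSERT INTO org_apiinteg_assets VALUES ('8kp9wgx43jflrgjfe', '14026lvplotw6');
""")

# Shared FROM/WHERE part: no row can satisfy the last condition.
FROM_WHERE = """
    FROM assessmentinstances ai, org_apiinteg_assets a
    WHERE a.instance_id = ai.id
      AND ai.id = '14026lvplotw6'
      AND a.id != '8kp9wgx43jflrgjfe'
"""

# 1. Bare column next to an aggregate: one row, first column engine-dependent.
bare = conn.execute("SELECT ai.id, COUNT(*) AS cnt" + FROM_WHERE).fetchall()

# 2. With GROUP BY: no groups exist, so no rows are returned.
grouped = conn.execute(
    "SELECT ai.id, COUNT(*) AS cnt" + FROM_WHERE + " GROUP BY ai.id"
).fetchall()

# 3. Every column aggregated: exactly one row, MIN over no rows is NULL.
explicit = conn.execute(
    "SELECT MIN(ai.id), COUNT(*) AS cnt" + FROM_WHERE
).fetchall()

print("bare:    ", bare)
print("grouped: ", grouped)
print("explicit:", explicit)
```

In all three cases the count itself is 0 or absent; only the companion column differs, which matches the observation in the question.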

Use index for ORDER BY in "SELECT .. FROM .. WHERE column IN (...) ORDER BY"

Is there any way to make the following query use an index and not use filesort:
SELECT c1 FROM table WHERE c2 IN (val_1, val_2, ..., val_n) ORDER BY c3
I guess chances are bad so if it is not possible is there any way to make the following problem use indexes (or be fast):
The table contains comments from users:
CREATE TABLE `comments` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`comment` varchar(180) CHARACTER SET utf8 NOT NULL,
`timestamp` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`))
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I want to output the comments of specific users (for example the ones who user_x is following) ordered by timestamp (compare query above).
The only way I can imagine making this query fast is to add a new flag column that is set to 1 for the last, let's say, 15 entries of each user. The first query would then fetch a maximum of 15 rows per user, so the maximum number of rows MySQL has to sort is 15*n, where n is the number of users the comments are selected from.
Edit: This is what EXPLAIN outputs:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE comments range idx_comments_user_id_timestamp idx_comments_user_id_timestamp 4 NULL 1113 Using where; Using index; Using filesort
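The 15-rows-per-user bound described above can also be obtained without a flag column, by giving each user their own ORDER BY ... LIMIT branch and combining the branches with UNION ALL. This is a different technique from the flag column, shown here only as a sketch: the function name latest_comments is hypothetical, newest-first ordering is assumed, and SQLite (Python's sqlite3 module) stands in for MySQL. In MySQL the branches would be written as parenthesised SELECTs, for example (SELECT ... ORDER BY timestamp DESC LIMIT 15) UNION ALL (...) ORDER BY timestamp DESC LIMIT 15.

```python
import sqlite3


def latest_comments(conn, user_ids, limit=15):
    """Newest `limit` comments across `user_ids`, at most `limit` rows per user."""
    if not user_ids:
        return []
    # One branch per user; each branch can walk an index on (user_id, timestamp).
    branch = (
        "SELECT * FROM (SELECT id, user_id, comment, timestamp FROM comments "
        "WHERE user_id = ? ORDER BY timestamp DESC LIMIT ?)"
    )
    sql = " UNION ALL ".join([branch] * len(user_ids))
    # The final sort sees at most limit * len(user_ids) rows.
    sql += " ORDER BY timestamp DESC LIMIT ?"
    params = []
    for user_id in user_ids:
        params.extend([user_id, limit])
    params.append(limit)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                       comment TEXT NOT NULL, timestamp INTEGER NOT NULL);
CREATE INDEX idx_comments_user_id_timestamp ON comments (user_id, timestamp);
""")
# 60 invented comments spread over users 1..3 with unique timestamps.
conn.executemany(
    "INSERT INTO comments (user_id, comment, timestamp) VALUES (?, ?, ?)",
    [(i % 3 + 1, f"comment {i}", 1000 + i) for i in range(60)],
)

followed = [1, 2]
merged = latest_comments(conn, followed)

# Reference result from the straightforward IN query.
baseline = conn.execute(
    "SELECT id, user_id, comment, timestamp FROM comments "
    "WHERE user_id IN (1, 2) ORDER BY timestamp DESC LIMIT 15"
).fetchall()

print(merged == baseline, len(merged))
```

The result is the same as the IN query because any comment in the overall newest 15 must also be among the newest 15 of its own author.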

Real-time aggregation on a table with millions of records

I'm dealing with an ever-growing table which contains about 5 million records at the moment. About 100,000 new records are added daily.
The table contains information about ad campaigns, and is joined on query with another table:
CREATE TABLE `statistics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range_id` int(11) DEFAULT NULL,
`campaign_id` int(11) DEFAULT NULL,
`payout` decimal(5,2) DEFAULT NULL,
`is_converted` tinyint(1) unsigned NOT NULL DEFAULT '0',
`converted` datetime DEFAULT NULL,
`created` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `created` (`created`),
KEY `converted` (`converted`),
KEY `campaign_id` (`campaign_id`),
KEY `ip_range_id` (`ip_range_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The other table contains IP ranges:
CREATE TABLE `ip_ranges` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range` varchar(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ip_range` (`ip_range`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The aggregation query is as follows:
SELECT
SUM(`payout`) AS `revenue`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id`) AS `clicks`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id` AND `is_converted` = 1) AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20
The query takes about 20 seconds to complete.
This is what EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY ip_range index PRIMARY PRIMARY 4 NULL 306552 Using index; Using temporary; Using filesort
1 PRIMARY statistic ref ip_range_id ip_range_id 5 db.ip_range.id 8 Using where
3 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where
2 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where; Using index
Caching the clicks and conversions in the ip_ranges table as extra columns is not an option, because I need to be able to also filter on the campaign_id column (and possibly other columns in the future). So these aggregations need to be somewhat real-time.
What is the best strategy to do aggregation on large tables on multiple dimensions and near real-time?
Note that I'm not necessarily looking to just make the query better, but I'm also interested in strategies that might involve other database systems (NoSQL) and/or distributing the data over different servers, etc
Your query looks overly complicated. There is no need to query the same table again and again:
select
sum(payout) as revenue,
count(*) as clicks,
sum(s.is_converted = 1) as conversions
from ip_ranges r
inner join statistics s on r.id = s.ip_range_id
group by r.id
order by clicks desc
limit 20;
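As a sanity check that the single-pass form returns the same numbers as the original query with its two correlated subqueries, here is a small sketch on invented data. SQLite (Python's sqlite3 module) stands in for MySQL; both engines evaluate a comparison such as is_converted = 1 to 1 or 0, which is what makes SUM(...) count the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ip_ranges (id INTEGER PRIMARY KEY, ip_range TEXT NOT NULL);
CREATE TABLE statistics (id INTEGER PRIMARY KEY, ip_range_id INTEGER,
                         payout REAL, is_converted INTEGER NOT NULL DEFAULT 0);
CREATE INDEX idx_statistics_ip_range ON statistics (ip_range_id);

INSERT INTO ip_ranges VALUES (1, '10.0.0'), (2, '10.0.1'), (3, '10.0.2');
INSERT INTO statistics (ip_range_id, payout, is_converted) VALUES
    (1, 1.5, 1), (1, 0.0, 0), (1, 0.25, 0),
    (2, 2.0, 1), (2, 1.0, 1);
""")

# Original shape: two correlated subqueries evaluated per group.
with_subqueries = conn.execute("""
    SELECT SUM(payout) AS revenue,
           (SELECT COUNT(*) FROM statistics
             WHERE ip_range_id = IpRange.id) AS clicks,
           (SELECT COUNT(*) FROM statistics
             WHERE ip_range_id = IpRange.id AND is_converted = 1) AS conversions
    FROM ip_ranges AS IpRange
    INNER JOIN statistics AS Statistic ON IpRange.id = Statistic.ip_range_id
    GROUP BY IpRange.id
    ORDER BY clicks DESC
    LIMIT 20
""").fetchall()

# Rewritten shape: everything is computed in one pass over the join.
single_pass = conn.execute("""
    SELECT SUM(payout) AS revenue,
           COUNT(*) AS clicks,
           SUM(s.is_converted = 1) AS conversions
    FROM ip_ranges r
    INNER JOIN statistics s ON r.id = s.ip_range_id
    GROUP BY r.id
    ORDER BY clicks DESC
    LIMIT 20
""").fetchall()

print(with_subqueries)
print(single_pass)
```

Range 3 has no statistics rows and is dropped by the inner join in both forms.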
EDIT (after acceptance): As to your actual question on how to deal with a task like this:
You want to look at all the data in your table and you want your result to be up to date. Then there is no other option than to read all the data (full table scans). If the tables are wide (i.e. have many columns) you may want to create covering indexes (i.e. indexes that contain all columns involved), so that the index is read instead of the table. What else? For full table scans it is advisable to use parallel access, which MySQL doesn't provide, as far as I know, so you might want to switch to another DBMS. Then see what else that DBMS offers; maybe parallel querying would benefit from partitioning the tables. The last thing that comes to mind is hardware, i.e. more CPUs, faster drives etc.
Another option might be to remove old data from your tables. Say you need the details of the current year, but only the aggregated data for previous years. Then have another table old_statistics holding only the sums and counts needed, e.g.
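-- Column types are illustrative; clicks is included because the query reports it.
CREATE TABLE old_statistics
(
ip_range_id INT NOT NULL,
revenue DECIMAL(12,2) NOT NULL,
clicks INT NOT NULL,
conversions INT NOT NULL,
PRIMARY KEY (ip_range_id)
);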
Then you'd aggregate the data from statistics, which would be much smaller then, because it would hold only data of the current year, and add old_statistics to get the results.
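A minimal sketch of that archive-and-combine flow follows, on invented rows. The assumptions are labelled in the code: a clicks column is kept in old_statistics next to revenue and conversions (the report needs it), the cutoff date is arbitrary, and SQLite (Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statistics (id INTEGER PRIMARY KEY, ip_range_id INTEGER,
                         payout REAL, is_converted INTEGER NOT NULL DEFAULT 0,
                         created TEXT);
-- Assumed layout: one pre-aggregated row per ip_range_id.
CREATE TABLE old_statistics (ip_range_id INTEGER PRIMARY KEY,
                             revenue REAL NOT NULL,
                             clicks INTEGER NOT NULL,
                             conversions INTEGER NOT NULL);
INSERT INTO statistics (ip_range_id, payout, is_converted, created) VALUES
    (1, 1.5,  1, '2023-03-01 10:00:00'),
    (1, 0.5,  0, '2023-07-15 09:30:00'),
    (1, 2.0,  1, '2024-02-01 12:00:00'),
    (2, 1.0,  0, '2023-11-20 08:00:00'),
    (2, 0.25, 1, '2024-01-05 17:45:00'),
    (3, 4.0,  1, '2024-03-09 11:15:00');
""")

TOTALS = """
    SELECT ip_range_id, SUM(payout), COUNT(*), SUM(is_converted = 1)
    FROM statistics GROUP BY ip_range_id ORDER BY ip_range_id
"""
before = conn.execute(TOTALS).fetchall()

CUTOFF = "2024-01-01 00:00:00"  # arbitrary boundary between old and current data
with conn:  # archive and delete in one transaction
    conn.execute("""
        INSERT INTO old_statistics (ip_range_id, revenue, clicks, conversions)
        SELECT ip_range_id, COALESCE(SUM(payout), 0), COUNT(*), SUM(is_converted = 1)
        FROM statistics WHERE created < ? GROUP BY ip_range_id
    """, (CUTOFF,))
    conn.execute("DELETE FROM statistics WHERE created < ?", (CUTOFF,))

# Live aggregate over the (now smaller) detail table plus the archived sums.
combined = conn.execute("""
    SELECT ip_range_id, SUM(revenue), SUM(clicks), SUM(conversions)
    FROM (
        SELECT ip_range_id, SUM(payout) AS revenue, COUNT(*) AS clicks,
               SUM(is_converted = 1) AS conversions
        FROM statistics GROUP BY ip_range_id
        UNION ALL
        SELECT ip_range_id, revenue, clicks, conversions FROM old_statistics
    ) AS t
    GROUP BY ip_range_id ORDER BY ip_range_id
""").fetchall()

print(before)
print(combined)
```

Note that a summary keyed only by ip_range_id cannot be filtered by campaign_id afterwards; any column you need to filter on would have to become part of the summary key.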
Try this
SELECT
SUM(`payout`) AS `revenue`,
SUM(case when `ip_range_id` = `IpRange`.`id` then 1 else 0 end) AS `clicks`,
SUM(case when `ip_range_id` = `IpRange`.`id` and `is_converted` = 1 then 1 else 0 end)
AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20

MySQL Query Optimization

I have a web application that uses a table schema similar to the one below. I simply want to optimize the selection of articles. Articles are selected based on a given tag: for example, if the tag is 'iphone', the query should output all open articles about 'iphone' from the last month.
CREATE TABLE `article` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(100) NOT NULL,
`body` varchar(200) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`author_id` int(11) NOT NULL,
`section` varchar(30) NOT NULL,
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
CREATE TABLE `tags` (
`name` varchar(30) NOT NULL,
`article_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`article_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `users` (
`id` int(11) NOT NULL auto_increment,
`username` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
The following is my MySQL query
explain select article.id,users.username,article.title
from article,users,tags
where article.id=tags.article_id and tags.name = 'iphone4'
and article.author_id=users.id and article.status = '1'
and article.section = 'mobile'
and article.date > '2010-02-07 13:25:46'
ORDER BY tags.article_id DESC
the output is
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tags ref PRIMARY PRIMARY 92 const 55 Using where; Using index
1 SIMPLE article eq_ref PRIMARY PRIMARY 4 test.tags.article_id 1 Using where
1 SIMPLE users eq_ref PRIMARY PRIMARY 4 test.article.author_id 1
Is it possible to optimize it more?
This query may be optimized, depending on which condition is more selective: tags.name = 'iphone4' or article.date > '2010-02-07 13:25:46'
If there are fewer articles tagged iphone than articles posted after Feb 7, then your original query is fine.
If there are many articles tagged iphone, but few posted after Feb 7, then this query will be more efficient:
SELECT article.id, users.username, article.title
FROM tags
JOIN article
ON article.id = tags.article_id
AND article.status = '1'
AND article.section = 'mobile'
AND article.date > '2010-02-07 13:25:46'
JOIN users
ON users.id = article.author_id
WHERE tags.name = 'iphone4'
ORDER BY
article.date DESC, article.id DESC
Note that the ORDER BY condition has changed. This may or may not be what you want, however, generally the orders of id and date correspond to each other.
If you really need your original ORDER BY condition you may leave it but it will add a filesort (or just revert to your original plan).
In either case, create an index on
article (status, section, date, id)
Quoting the question: "the query should output all open articles about 'iphone' from the last month."
So the only query you are going to run on this data uses the tag and the date. You've got an index for the tag in the tags table, but the date is stored in a different table (article; you're a bit inconsistent with your naming scheme). Adding an index on the article table using date alone would be of no benefit at all. Using id, date (in that order) would help a little, but really the date needs to be denormalised into the tags table to get the query running really fast.
Unless you're regularly moving bulk data sets around, just add a datetime column with a default of the current timestamp to the tags table.
I expect that you may want to interact with the data in lots of other ways. You should really set a low (no?) threshold for slow query logging, then analyse the resulting data to identify where your performance problems are (try looking at the queries with the highest values for duration^2*frequency first).
There's a script at the URL below which is useful for this analysis:
http://www.retards.org/projects/mysql/
You could index the additional fields in article that you are referencing in your select statement. In this case, I would suggest you create an index in article like this:
CREATE INDEX article_idx ON article (author_id, status, section, date);
Creating that index should speed up your query, depending on how many records you are dealing with overall. From my understanding, properly creating indexes involves looking at the queries you've written and indexing the columns that are part of your WHERE clause. This helps the query optimizer process the query better in general. That does not mean you should create an index on each individual column, however, as it's both inefficient and ineffective to do so. When possible, create multi-column indexes that match your select statement.