GROUP BY query optimization - MySQL

Dears,
I need your help to optimize the query below. I have two tables: one stores book data, and the second maps books to their related tags. I want to count how many books from a certain publisher are in each category. This query does the job, but I need to optimize it:
SELECT COUNT(book.id), publisher
FROM book, tag
WHERE book.id = tag.book_id
  AND publisher = 'Addison-Wesley Professional'
  AND tag.name = 'PHP'
GROUP BY category
The result of EXPLAIN is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tag ref PRIMARY PRIMARY 92 const 1 Using where; Using index; Using temporary; Using filesort
1 SIMPLE book eq_ref PRIMARY PRIMARY 4 test.tag.book_id 1 Using where
The tables are:
--
-- Table structure for table `book`
--
CREATE TABLE `book` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL,
`ISBN` varchar(10) NOT NULL,
`category` varchar(30) NOT NULL,
`publisher` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
--
-- Dumping data for table `book`
--
INSERT INTO `book` VALUES (1, 'PHP and MySQL Web Development', '9780672329', 'Web Development', 'Addison-Wesley Professional');
INSERT INTO `book` VALUES (2, 'JavaScript Patterns', '0596806752', 'Web Development', 'O''Reilly Media');
--
-- Table structure for table `tag`
--
CREATE TABLE `tag` (
`name` varchar(30) NOT NULL,
`book_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`book_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
--
-- Dumping data for table `tag`
--
INSERT INTO `tag` VALUES ('MySQL', 1);
INSERT INTO `tag` VALUES ('PHP', 1);

You need a composite index on (publisher, category) in your book table, in that exact order, so the query can first restrict rows quickly to the exact publisher and then use the second part of the index to group on category.
ALTER TABLE book ADD INDEX publ_cat (publisher, category);
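With that index in place, the same query written with explicit JOIN syntax (equivalent, just easier to read) should be able to use publ_cat for both the publisher filter and the grouping; a sketch:
SELECT COUNT(book.id), publisher
FROM book
JOIN tag ON book.id = tag.book_id
WHERE publisher = 'Addison-Wesley Professional'
  AND tag.name = 'PHP'
GROUP BY category;
If the optimizer drives the join from book, re-running EXPLAIN should show publ_cat in the key column, and the Using temporary / Using filesort notes should disappear.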

Related

Selecting from a table with two separate indexes is not using a key depending on value in where clause

I am tuning our database indexes and discovered some strange behavior in MySQL 5.7.32. Here is a script to replicate the issue.
I have a table employee with three columns: id, firstname and lastname. There are two indexes on the table, one for each of the varchar columns. For one of the SELECT statements below, the output unexpectedly does not use the key.
Why does one of those queries not use the index? Is it because Miller is the first value in the table? Or is this an inaccuracy of EXPLAIN?
DROP TABLE if EXISTS `employee`;
CREATE TABLE `employee` (
`id` INT(11) NOT NULL auto_increment,
`firstname` VARCHAR(50) NOT NULL,
`lastname` VARCHAR(50) NOT NULL,
PRIMARY KEY (`id`),
INDEX `index_firstname` (`firstname`),
INDEX `index_lastname` (`lastname`)
);
INSERT INTO `employee` (firstname,lastname) VALUES('alice','Miller');
INSERT INTO `employee` (firstname,lastname) VALUES('bob','Miller');
INSERT INTO `employee` (firstname,lastname) VALUES('charlie','Miller');
INSERT INTO `employee` (firstname,lastname) VALUES('doyle','Miller');
INSERT INTO `employee` (firstname,lastname) VALUES('evan','Smith');
INSERT INTO `employee` (firstname,lastname) VALUES('franz','Smith');
INSERT INTO `employee` (firstname,lastname) VALUES('gloria','Smith');
INSERT INTO `employee` (firstname,lastname) VALUES('helga','Unique');
EXPLAIN SELECT * FROM employee WHERE firstname='alice'; # uses the key 'index_firstname'
EXPLAIN SELECT * FROM employee WHERE lastname='Smith'; # uses the key 'index_lastname'
EXPLAIN SELECT * FROM employee WHERE lastname='Unique'; # uses the key 'index_lastname'
EXPLAIN SELECT * FROM employee WHERE lastname='Miller'; # does not use the key 'index_lastname'
When a sampling of the index shows that more than roughly 25% of the rows match the given value (the threshold is not exact; see below), the index isn't used.
There is a cost calculation that works out that scanning the full table is faster than using the secondary index (which would need a lookup back into the primary table for every match in order to retrieve *).
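You can watch this cost-based choice in action by forcing the secondary index and comparing the plans; a sketch using standard MySQL index hints:
-- the optimizer's choice (full scan for the common value):
EXPLAIN SELECT * FROM employee WHERE lastname = 'Miller';
-- override the choice and compare:
EXPLAIN SELECT * FROM employee FORCE INDEX (index_lastname) WHERE lastname = 'Miller';
On a table this small both run instantly; the point is only that the optimizer judged the scan cheaper, not that the index is unusable.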

Improving query performance with Join, Full Table Scan

I am trying to improve query performance on a stats reporting website for a Battlefield game, and am having a little trouble with one very specific query. The issue is that EXPLAIN states this query is doing a full table scan. This is troublesome because I expect this table to get very large (potentially 1 million rows or more). I am using MySQL 5.7 as my database of choice.
Here is my table and Query: http://pastebin.com/DsiGe2UB
--
-- Table structure for table `player_kit`
--
CREATE TABLE `player_kit` (
`id` TINYINT UNSIGNED NOT NULL,
`pid` INT UNSIGNED NOT NULL,
`time` INT UNSIGNED NOT NULL DEFAULT 0,
`kills` MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
`deaths` MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY(`pid`,`id`),
FOREIGN KEY(`pid`) REFERENCES player(`id`) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY(`id`) REFERENCES kit(`id`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `bf2stats`.`player_kit` ADD INDEX `reverse_ids` (`id`, `pid`);
--
-- My Full Scanning Query
-- SELECTS players, ordering them by kills and time in kit
--
SELECT p.name, p.rank, p.country, k.pid, k.kills, k.deaths, k.time
FROM player_kit AS k
INNER JOIN player AS p ON k.pid = p.id
WHERE k.id = 0 AND k.kills > 0
ORDER BY kills DESC, time DESC
LIMIT 0, 40
--
-- EXPLAIN results by MySQL
--
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE k NULL ref PRIMARY 1 const 75 32.11 Using index condition; Using where; Using filesort
1 SIMPLE p NULL eq_ref PRIMARY PRIMARY 4 bf2stats.k.pid 1 100.00 NULL
--
-- Additional Tables just in case, for reference
--
--
-- Table structure for table `kit`
--
CREATE TABLE `kit` (
`id` TINYINT UNSIGNED,
`name` VARCHAR(32) NOT NULL,
PRIMARY KEY(`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--
-- Table structure for table `player`
--
CREATE TABLE `player` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(32) UNIQUE NOT NULL,
`rank` TINYINT NOT NULL DEFAULT 0,
`country` CHAR(2) NOT NULL DEFAULT 'xx',
PRIMARY KEY(`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I am hoping that one of you can help me improve the performance of this query, since any kind of index I have put on it does not seem to help much.
For this query:
SELECT p.name, p.rank, p.country, k.pid, k.kills, k.deaths, k.time
FROM player_kit k INNER JOIN
player p
ON k.pid = p.id
WHERE k.id = 0 AND k.kills > 0
ORDER BY kills DESC, time DESC
LIMIT 0, 40;
The optimal indexes are:
player_kit(id, kills, pid)
player(id) -- if this is not already there
You can also add the other columns in the index to get a covering index for the query.
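Expressed as DDL, the suggestion might look like the sketch below (the index names are illustrative, not from the original answer):
-- matches the WHERE (id = const) and the leading ORDER BY column:
ALTER TABLE player_kit ADD INDEX idx_kit_kills (id, kills, pid);
-- covering variant: includes every player_kit column the query reads, so
-- MySQL can answer from the index alone; placing time right after kills
-- may also let a reverse index scan satisfy ORDER BY kills DESC, time DESC:
ALTER TABLE player_kit ADD INDEX idx_kit_covering (id, kills, time, deaths, pid);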

EXPLAIN SELECT ..., why TYPE = ALL?

Having these 3 tables:
users
CREATE TABLE `users` (
`user_id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`first_name` VARCHAR(64) NOT NULL,
`last_name` VARCHAR(64) NOT NULL,
PRIMARY KEY (`user_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=1;
posts
CREATE TABLE `posts` (
`post_id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`category_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`author_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`title` VARCHAR(128) NOT NULL,
`text` TEXT NOT NULL,
PRIMARY KEY (`post_id`),
INDEX `FK_posts__category_id` (`category_id`),
INDEX `FK_posts__author_id` (`author_id`),
CONSTRAINT `FK_posts__author_id` FOREIGN KEY (`author_id`) REFERENCES `users` (`user_id`) ON UPDATE CASCADE,
CONSTRAINT `FK_posts__category_id` FOREIGN KEY (`category_id`) REFERENCES `categories` (`category_id`) ON UPDATE CASCADE ON DELETE CASCADE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=1;
categories
CREATE TABLE `categories` (
`category_id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(64) NOT NULL,
PRIMARY KEY (`category_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=1;
And data in tables:
INSERT INTO `users` (`user_id`, `first_name`, `last_name`) VALUES
(1, 'John', 'Doe'),
(2, 'Pen', 'Poe'),
(3, 'Robert', 'Roe');
INSERT INTO `categories` (`category_id`, `name`) VALUES
(1, 'Category 1'),
(2, 'Category 2'),
(3, 'Category 3'),
(4, 'Category 4');
INSERT INTO `posts` (`post_id`, `category_id`, `author_id`, `title`, `text`) VALUES
(1, 1, 1, 'title 1', 'text 1'),
(2, 1, 2, 'title 2', 'text 2');
I want to make a simple select (and let MySQL EXPLAIN it):
EXPLAIN SELECT p.post_id, p.title, p.text, c.category_id, c.name, u.user_id, u.first_name, u.last_name
FROM posts AS p
JOIN categories AS c
ON c.category_id = p.category_id
JOIN users AS u
ON u.user_id = p.author_id
WHERE p.category_id = 1
I got this result, with type = ALL (a full table scan) on u (users).
What I don't understand is why MySQL has to do a full table scan on u (users). There will be only two users it has to retrieve data about (those with ids 1 and 2), and both can be found by the primary key user_id. Can somebody with more experience help me understand this? Is there a better way of creating indexes so MySQL doesn't have to do a full scan on the users table to retrieve data about the post authors?
Thank you!
With such a small amount of data, an index search is going to be slower than a sequential search, so MySQL chooses a simple table read.
It has to do with operational efficiency. Let's simplify the operations MySQL has to perform to read the entire table versus using an index.
Full read:
Open table
Read each row one at a time and match the criteria
Return result set
That is 3 operations.
Index read:
Open table
For the criteria, read the index entry for each matching row
Using the index pointer, locate the row on disk for each matching row
Return result set
In this case 4 operations, two of which repeat for every matching row.
This is very simplified, but unless you have enough data, your indexes can slow you down. As the table grows, MySQL might choose a different query path. That is why you don't force the use of indexes.
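If you want to see the numbers behind that decision, EXPLAIN FORMAT=JSON (available since MySQL 5.6; the cost_info fields appear in 5.7+) reports the optimizer's cost estimates directly; a sketch against the tables above:
EXPLAIN FORMAT=JSON
SELECT p.post_id, u.first_name, u.last_name
FROM posts AS p
JOIN users AS u ON u.user_id = p.author_id
WHERE p.category_id = 1;
The cost_info entries show the estimated cost of each table access, so you can compare the table scan against the index path as the data grows.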
You only have ~3 rows in your users table, according to your test data and your EXPLAIN report.
The optimizer can produce skewed results if you have too few rows in the tables. It may do a table-scan for a tiny table, even if it would use an index for the same query against the same tables with a few hundred or a few thousand rows.
So when doing development, it's important to have a non-trivial amount of test data in your tables if you want to get accurate optimizer reports.
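A quick way to get there is to double a table's rows repeatedly with INSERT ... SELECT; a sketch using the users table from the question (the duplicated names are just filler data):
INSERT INTO users (first_name, last_name)
SELECT first_name, last_name FROM users;
-- each pass doubles the row count; run it a dozen times,
-- then refresh index statistics and re-check the plan:
ANALYZE TABLE users;
Re-running the EXPLAIN from the question should then show an eq_ref lookup on the users primary key instead of ALL.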

MySQL merge scripts to create table

I've come across an issue and I can't think of a way to solve it.
I need to insert country names in several languages into a table in my MySQL db.
I found these links: link1 (en), link2 (de), etc., but I don't know how to proceed in order to finally have a table looking like this:
CREATE TABLE `country` (
`id` varchar(2) NOT NULL,
`en` varchar(64) NOT NULL,
`de` varchar(64) NOT NULL,
...
...
PRIMARY KEY (`id`)
) ENGINE=MYISAM DEFAULT CHARSET=utf8;
Well, I finally figured it out, so I'm posting this to maybe help others.
I created 2 tables (country_en and country_de) and then ran the following statement:
DROP TABLE IF EXISTS `countries`;
CREATE TABLE `countries` (
id varchar(2), en varchar(100), de varchar(100)
);
INSERT INTO `countries`
SELECT country_en.id, en, de
FROM country_en
JOIN country_de ON (country_en.id = country_de.id);
which creates the countries table and fills it by joining the other 2 tables on their common key id.
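As a side note, MySQL can also create and fill the table in a single statement with CREATE TABLE ... SELECT; a sketch, assuming the source tables expose en and de name columns:
CREATE TABLE countries AS
SELECT country_en.id, country_en.en, country_de.de
FROM country_en
JOIN country_de ON country_en.id = country_de.id;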
I can suggest another table design. Create a languages table and modify the country table a little: add a lang_id field and create a foreign key, FOREIGN KEY (lang_id) REFERENCES languages (id). Then populate the languages and country tables.
For example:
CREATE TABLE languages(
id VARCHAR(2) NOT NULL,
name VARCHAR(64) NOT NULL,
PRIMARY KEY (id)
) ENGINE = INNODB;
CREATE TABLE country(
id VARCHAR(2) NOT NULL,
lang_id VARCHAR(2) NOT NULL DEFAULT '',
name VARCHAR(64) NOT NULL,
PRIMARY KEY (id, lang_id),
CONSTRAINT FK_country_languages_id FOREIGN KEY (lang_id)
REFERENCES languages (id) ON DELETE RESTRICT ON UPDATE RESTRICT
)
ENGINE = INNODB;
-- Populate languages
INSERT INTO languages VALUES
('en', 'English'),
('de', 'German');
-- Populate names from 'en' table
INSERT INTO country SELECT id, 'en', name FROM country_en;
-- Populate names from 'de' table
INSERT INTO country SELECT id, 'de', name FROM country_de;
...where country_en and country_de are the tables from your links.
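With this design, looking up one country name in one language is a single primary-key read; for example (the country code 'us' here is hypothetical, the actual ids depend on your source data):
SELECT name
FROM country
WHERE id = 'us' AND lang_id = 'de';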

MySQL using IN/FIND_IN_SET to read multiple rows in sub query

I have two tables, locations and location groups
CREATE TABLE locations (
location_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(63) UNIQUE NOT NULL
);
INSERT INTO locations (name)
VALUES
('london'),
('bristol'),
('exeter');
CREATE TABLE location_groups (
location_group_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
location_ids VARCHAR(255) NOT NULL,
user_ids VARCHAR(255) NOT NULL,
name VARCHAR(63) NOT NULL
);
INSERT INTO location_groups (location_ids, user_ids, name)
VALUES
('1', '1,2,4', 'south east'),
('2,3', '2', 'south west');
What I am trying to do is return all location_ids for all of the location_groups where the given user_id exists. I'm using CSV to store the location_ids and user_ids in the location_groups table. I know this isn't normalised, but this is how the database is and it's out of my control.
My current query is:
SELECT location_id
FROM locations
WHERE FIND_IN_SET(location_id,
(SELECT location_ids
FROM location_groups
WHERE FIND_IN_SET(2,location_groups.user_ids)) )
Now this works fine if user_id = 1, for example (as only 1 location_group row is returned), but if I search for user_id = 2, I get an error saying the subquery returns more than 1 row. That is expected, as user 2 is in 2 location_groups; I understand why the error is being thrown, I'm trying to work out how to solve it.
To clarify: when searching for user_id 1 in location_groups.user_ids, location_id 1 should be returned. When searching for user_id 2, location_ids 1, 2 and 3 should be returned.
I know this is a complicated query, so if anything isn't clear just let me know. Any help would be appreciated! Thank you.
You could use GROUP_CONCAT to combine the location_ids in the subquery.
SELECT location_id
FROM locations
WHERE FIND_IN_SET(location_id,
(SELECT GROUP_CONCAT(location_ids)
FROM location_groups
WHERE FIND_IN_SET(2,location_groups.user_ids)) )
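One caveat with this approach: GROUP_CONCAT truncates its result at group_concat_max_len (1024 bytes by default), which would silently drop location ids once the combined list grows long. If the groups can get large, raise the limit for the session first:
SET SESSION group_concat_max_len = 1000000;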
Alternatively, use the problems with writing the query as an example of why normalization is good. Heck, even if you do use this query, it will run more slowly than a query on properly normalized tables; you could use that to show why the tables should be restructured.
For reference (and for other readers), here's what a normalized schema would look like (some additional alterations to the base tables are included).
The compound fields in the location_groups table could simply be separated into additional rows to achieve 1NF, but this wouldn't be in 2NF, as the name column would be dependent on only the location part of the (location, user) candidate key. (Another way of thinking of this is the name is an attribute of the regions, not the relations between regions/groups, locations and users.)
Instead, these columns will be split off into two additional tables for 1NF: one to connect locations and regions, and one to connect users and regions. It may be that the latter should be a relation between users and locations (rather than regions), but that's not the case with the current schema (which could be another problem of the current, non-normalized schema). The region-location relation is one-to-many (since each location is in one region). From the sample data, we see the region-user relation is many-many. The location_groups table then becomes the region table.
-- normalized from `location_groups`
CREATE TABLE regions (
`id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
`name` VARCHAR(63) UNIQUE NOT NULL
);
-- slightly altered from original
CREATE TABLE locations (
`id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
`name` VARCHAR(63) UNIQUE NOT NULL
);
-- missing from original sample
CREATE TABLE users (
`id` INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
`name` VARCHAR(63) UNIQUE NOT NULL
);
-- normalized from `location_groups`
CREATE TABLE location_regions (
`region` INT UNSIGNED,
`location` INT UNSIGNED UNIQUE NOT NULL,
PRIMARY KEY (`region`, `location`),
FOREIGN KEY (`region`)
REFERENCES regions (id)
ON DELETE restrict ON UPDATE cascade,
FOREIGN KEY (`location`)
REFERENCES locations (id)
ON DELETE cascade ON UPDATE cascade
);
-- normalized from `location_groups`
CREATE TABLE user_regions (
`region` INT UNSIGNED NOT NULL,
`user` INT UNSIGNED NOT NULL,
PRIMARY KEY (`region`, `user`),
FOREIGN KEY (`region`)
REFERENCES regions (id)
ON DELETE restrict ON UPDATE cascade,
FOREIGN KEY (`user`)
REFERENCES users (id)
ON DELETE cascade ON UPDATE cascade
);
Sample data:
INSERT INTO regions (`name`)
VALUES
('South East'),
('South West'),
('North East'),
('North West');
INSERT INTO locations (`name`)
VALUES
('London'),
('Bristol'),
('Exeter'),
('Hull');
INSERT INTO users (`name`)
VALUES
('Alice'),
('Bob'),
('Carol'),
('Dave'),
('Eve');
------ Location-Region relation ------
-- temporary table used to map natural keys to surrogate keys
CREATE TEMPORARY TABLE loc_rgns (
`location` VARCHAR(63) UNIQUE NOT NULL,
`region` VARCHAR(63) NOT NULL
);
-- Hull added to demonstrate correctness of desired query
INSERT INTO loc_rgns (region, location)
VALUES
('South East', 'London'),
('South West', 'Bristol'),
('South West', 'Exeter'),
('North East', 'Hull');
-- map natural keys to surrogate keys for final relationship
INSERT INTO location_regions (`location`, `region`)
SELECT loc.id, rgn.id
FROM locations AS loc
JOIN loc_rgns AS lr ON loc.name = lr.location
JOIN regions AS rgn ON rgn.name = lr.region;
------ User-Region relation ------
-- temporary table used to map natural keys to surrogate keys
CREATE TEMPORARY TABLE usr_rgns (
`user` INT UNSIGNED NOT NULL,
`region` VARCHAR(63) NOT NULL,
UNIQUE (`user`, `region`)
);
-- user 3 added in order to demonstrate correctness of desired query
INSERT INTO usr_rgns (`user`, `region`)
VALUES
(1, 'South East'),
(2, 'South East'),
(2, 'South West'),
(3, 'North West'),
(4, 'South East');
-- map natural keys to surrogate keys for final relationship
INSERT INTO user_regions (`user`, `region`)
SELECT ur.`user`, rgn.id
FROM usr_rgns AS ur
JOIN regions AS rgn ON rgn.name = ur.region;
Now, the desired query for the normalized schema, filtering on the given user id (here, user 2):
SELECT DISTINCT loc.id
FROM locations AS loc
JOIN location_regions AS lr ON loc.id = lr.location
JOIN user_regions AS ur ON lr.region = ur.region
WHERE ur.`user` = 2
;
Result:
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
+----+