Why would this query run so slow? - mysql

I have two MySQL tables, say A and B. A contains just one varchar column (let's call it A1) with about 23000 records. Table B (70000 records) has more columns, one of which corresponds to A1 from table A (let's call it B1). I want to know which values in A are not in the corresponding column in B, so I use:
SELECT A1
FROM A
LEFT JOIN B
ON A1 = B1
WHERE B1 IS NULL
Both columns A1 and B1 have indices defined on them. Still, this query runs very slowly. I've run EXPLAIN; this is the output:
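+----+-------------+-------+-------+---------------+---------+---------+-----+-------+-------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref | rows  | Extra                   |
+----+-------------+-------+-------+---------------+---------+---------+-----+-------+-------------------------+
| 1  | SIMPLE      | A     | index | \N            | PRIMARY | 767     | \N  | 23269 | Using index             |
| 1  | SIMPLE      | B     | ALL   | \N            | \N      | \N      | \N  | 70041 | Using where; Not exists |
+----+-------------+-------+-------+---------------+---------+---------+-----+-------+-------------------------+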
UPDATE: SHOW CREATE TABLE for both tables (changed the original names);
CREATE TABLE `A` (
`A1` varchar(255) NOT NULL,
PRIMARY KEY (`A1`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
CREATE TABLE `B` (
`col1` int(10) unsigned NOT NULL auto_increment,
`col2` datetime NOT NULL,
`col3` datetime default NULL,
`col4` datetime NOT NULL,
`col5` varchar(30) NOT NULL,
`col6` int(10) default NULL,
`col7` int(11) default NULL,
`col8` varchar(20) NOT NULL,
`B1` varchar(255) default NULL,
`col10` tinyint(1) NOT NULL,
`col11` varchar(255) default NULL,
PRIMARY KEY (`col1`),
KEY `NewIndex1` (`B1`)
) ENGINE=MyISAM AUTO_INCREMENT=70764 DEFAULT CHARSET=latin1
Another edit: data_length and index_length from SHOW TABLE STATUS
table data_length index_length
A 465380 435200
B 5177996 1344512

The character sets of the two columns that you are comparing in an OUTER JOIN differ. I am not sure if this is the cause so I tested and got these results:
SELECT A1
FROM A
LEFT JOIN B ON A1 = B1
WHERE B1 IS NULL
-- Table A..: 23258 rows, collation = utf8_general_ci
-- Table B..: 70041 rows, collation = latin1_swedish_ci
-- Time ....: I CANCELLED THE QUERY AFTER 20 MINUTES
-- Table A..: 23258 rows, collation = latin1_swedish_ci
-- Table B..: 70041 rows, collation = latin1_swedish_ci
-- Time ....: 0.187 sec
-- Table A..: 23258 rows, collation = utf8_general_ci
-- Table B..: 70041 rows, collation = utf8_general_ci
-- Time ....: 0.344 sec
Solution: make the character sets of the two tables (or at least of the two columns) the same.
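One way to align the two columns, sketched under the assumption that converting B1 to utf8 is acceptable for the data. The column definition is copied from the SHOW CREATE TABLE output in the question; only the character set is changed.

```sql
-- Convert only the joined column; the existing index NewIndex1 is rebuilt with it.
ALTER TABLE B MODIFY B1 varchar(255) CHARACTER SET utf8 DEFAULT NULL;

-- Check that both columns now report the same collation.
SHOW FULL COLUMNS FROM A LIKE 'A1';
SHOW FULL COLUMNS FROM B LIKE 'B1';
```

If the schema cannot be changed, converting the A side inside the query, for example `ON B1 = CONVERT(A1 USING latin1)`, may let the index on B1 be used, at the cost of mangling any A1 characters that latin1 cannot represent.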

This query will scan all rows of table A, but if you have an index on B1 then most likely it will not scan table B:
select A1
from A
where not exists (
select *
from B
where B.B1 = A.A1
)
Before running this or your original query you may try to run ANALYZE TABLE in order to update key distribution information for those tables:
ANALYZE TABLE A, B
If this doesn't help then you can try to play with indexes, for instance:
select A1
from A ignore index (PRIMARY)
where not exists (
select *
from B force index (NewIndex1)
where B.B1 = A.A1
)

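It seems A1 and B1 are large fields.
You say you created indices on both A1 and B1; make sure that they are actually indexed, then try:
SELECT A1
FROM A
WHERE A1 NOT IN (
-- B1 is nullable; a NULL in the list would make NOT IN match nothing
SELECT B1 AS A1 FROM B WHERE B1 IS NOT NULL
)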
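B1 is declared `default NULL` in the CREATE TABLE above, which matters for any NOT IN formulation. A small illustration of the three-valued logic involved, using literal values rather than the real tables:

```sql
SELECT 1 NOT IN (2, 3);     -- 1 (true): 1 is not in the list
SELECT 1 NOT IN (2, NULL);  -- NULL: the comparison with NULL is unknown
```

A WHERE clause keeps only rows whose condition is true, so a single NULL in the subquery result is enough to make `A1 NOT IN (SELECT B1 FROM B)` return no rows at all. The NOT EXISTS and LEFT JOIN ... IS NULL forms shown earlier are not affected by this.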

If I use your CREATE TABLES statements and run an EXPLAIN on your SELECT statement, I get this result:
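+----+-------------+-------+-------+---------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+--------------------------+
| 1  | SIMPLE      | A     | index | NULL          | PRIMARY   | 767     | NULL | 2    | Using index              |
| 1  | SIMPLE      | B     | index | NULL          | NewIndex1 | 258     | NULL | 4    | Using where; Using index |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+--------------------------+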
On my MySQL version (5.1.41) the index is used as expected, so I think this might be an already fixed bug in MySQL assuming your index is set like in your create table statement posted. What MySQL version do you use?

Try this query:
SELECT B1
FROM B
WHERE B1 NOT IN (
SELECT A1
FROM A
)

Related

mysql query optimization: select with counted subquery extremely slow

I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second refers to the "id" column of the former (there isn't a foreign key between the two because the relationship is polymorphic, i.e. the second table may have references to entities that aren't in the first but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but this was actually slower than the one I have above. But this is still taking a substantial length of time to complete. MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?
I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
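A sketch of what that could look like for the tables in the question. The index name `idx_published_id` is made up for illustration. The second statement rests on an assumption the thread does not confirm: the two tables declare different default character sets (utf8 for rsspodcastitems, latin1 for station_cache), and a comparison between a utf8 column and a latin1 column can keep MySQL from using the index on the latin1 side.

```sql
-- Hypothetical index supporting ORDER BY published plus the id lookup.
CREATE INDEX idx_published_id ON rsspodcastitems (published, id);

-- Assumption: align item_id with the utf8 id column it is compared against,
-- so that the existing KEY item_id (item_id, item_type) becomes usable.
ALTER TABLE station_cache
  MODIFY item_id char(20) CHARACTER SET utf8 NOT NULL;

-- Re-check the plan afterwards.
EXPLAIN
SELECT i.*
FROM rsspodcastitems i
WHERE NOT EXISTS (SELECT 1
                  FROM station_cache sc
                  WHERE sc.item_id = i.id)
ORDER BY i.published DESC;
```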
Your query could be faster if it returns a significant number of rows. Your phrasing of the query allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the order by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().
You might want to detect and remove redundant indexes from your tables. Reviewing the CREATE TABLE information for both tables will help you discover several, including (podcast_id, guid), (Station_id, published), (item_id, item_type) and (podcast_id, published); there may be more.
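Going by the SHOW CREATE TABLE output in the question, the duplicates could be dropped roughly as follows. Which copy of each pair to keep is a judgment call; this sketch keeps the short names and drops the generated-looking ones.

```sql
-- rsspodcastitems: three indexes cover (podcast_id, guid); keep UNIQUE KEY podcast_id.
ALTER TABLE rsspodcastitems
  DROP INDEX `UKfb6nlyxvxf3i2ibwd8jx6k025`,
  DROP INDEX `IDXfb6nlyxvxf3i2ibwd8jx6k025`;

-- station_cache: each column pair is indexed twice.
ALTER TABLE station_cache
  DROP INDEX `IDX12n81jv8irarbtp8h2hl6k4q3`,  -- duplicate of Station_id (Station_id, published)
  DROP INDEX `IDXqw9yqpavo9fcduereqqij4c80`,  -- duplicate of item_id (item_id, item_type)
  DROP INDEX `IDXkp2ehbpmu41u1vhwt7qdl2fuf`;  -- duplicate of podcast_id (podcast_id, published)
```

Fewer indexes mean less work on every insert and update, and a shorter list of candidates for the optimizer to consider.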
My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.

Use index for ORDER BY in "SELECT .. FROM .. WHERE column IN (...) ORDER BY"

Is there any way to make the following query use an index and not use filesort:
SELECT c1 FROM table WHERE c2 IN (val_1, val_2, ..., val_n) ORDER BY c3
I guess the chances are slim, so if it is not possible, is there any way to solve the following problem using indexes (or otherwise make it fast)?
The table contains comments from users:
CREATE TABLE `comments` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`comment` varchar(180) CHARACTER SET utf8 NOT NULL,
`timestamp` int(11) unsigned NOT NULL)
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I want to output the comments of specific users (for example, the ones user_x is following) ordered by timestamp (compare the query above).
The only way I can imagine making this query fast is to add a new flag column that is set to 1 for, let's say, the last 15 entries of each user. The first query would then get a maximum of 15 rows per user, so the maximum number of rows MySQL has to sort is 15*n, where n is the number of users whose comments are selected.
Edit: This is what EXPLAIN outputs:
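+----+-------------+----------+-------+--------------------------------+--------------------------------+---------+------+------+------------------------------------------+
| id | select_type | table    | type  | possible_keys                  | key                            | key_len | ref  | rows | Extra                                    |
+----+-------------+----------+-------+--------------------------------+--------------------------------+---------+------+------+------------------------------------------+
| 1  | SIMPLE      | comments | range | idx_comments_user_id_timestamp | idx_comments_user_id_timestamp | 4       | NULL | 1113 | Using where; Using index; Using filesort |
+----+-------------+----------+-------+--------------------------------+--------------------------------+---------+------+------+------------------------------------------+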
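The "at most 15 rows per user" idea from the question can also be expressed without an extra column, as one small query per followed user combined with UNION ALL. This is only a sketch: it assumes the index idx_comments_user_id_timestamp from the EXPLAIN output covers (user_id, timestamp), and the user ids 1 and 2 are placeholders.

```sql
-- Each branch can read its 15 newest rows straight from the (user_id, timestamp) index.
(SELECT id, user_id, comment, `timestamp`
 FROM comments
 WHERE user_id = 1
 ORDER BY `timestamp` DESC
 LIMIT 15)
UNION ALL
(SELECT id, user_id, comment, `timestamp`
 FROM comments
 WHERE user_id = 2
 ORDER BY `timestamp` DESC
 LIMIT 15)
-- The final sort only ever sees 15 * n rows.
ORDER BY `timestamp` DESC
LIMIT 15;
```

The outer sort is still a filesort, but it works on a small, bounded set instead of every comment the selected users ever wrote.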

Validate fields from one table to another in MySQL

The problem:
I have one table of approx. 5000 rows called imported_cities.
I have one table of approx. 800 000 rows called postalcodes, containing postal codes and their cities.
I need to validate each distinct city from imported_cities against the cities in the postalcodes table, based on city name and province. See the table structures below.
If they match exactly (yes, exactly; the remaining cities are validated manually), I have to update a column on imported_cities and enter both the city from imported_cities and the city from postalcodes (side by side) into a third table called imported_cities_equiv.
What I have tried:
Adding indexes to the tables and running the query below. It takes forever... :(
explain SELECT DISTINCT ic.destinationCity, pc.city FROM (imported_cities ic, postalcodes pc)
WHERE LOWER(ic.destinationCity) = LOWER(pc.city)
the result
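+----+-------------+-------+-------+---------------+--------------+---------+------+--------+-------------------------------------------------------+
| id | select_type | table | type  | possible_keys | key          | key_len | ref  | rows   | Extra                                                 |
+----+-------------+-------+-------+---------------+--------------+---------+------+--------+-------------------------------------------------------+
| 1  | SIMPLE      | ip    | index | NULL          | company_city | 478     | NULL | 4221   | Using index; Using temporary                          |
| 1  | SIMPLE      | pc    | index | NULL          | city_prov    | 160     | NULL | 765407 | Using where; Using index; Using join buffer (Block... |
+----+-------------+-------+-------+---------------+--------------+---------+------+--------+-------------------------------------------------------+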
--
-- Table structure for table postalcodes
CREATE TABLE IF NOT EXISTS `postalcodes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`code` varchar(11) NOT NULL,
`city` varchar(50) NOT NULL,
`province` varchar(50) NOT NULL,
`provinceISO` varchar(2) NOT NULL,
`latitude` decimal(17,13) NOT NULL,
`longitude` decimal(17,13) NOT NULL,
PRIMARY KEY (`id`),
KEY `code` (`code`),
KEY `city_prov` (`city`,`provinceISO`)
--
-- Table structure for table imported_cities
CREATE TABLE IF NOT EXISTS `imported_cities` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`companyName` varchar(30) CHARACTER SET utf8 NOT NULL,
`destinationCity` varchar(128) CHARACTER SET utf8 NOT NULL,
`destinationProvince` varchar(20) CHARACTER SET utf8 NOT NULL,
`equivCity` varchar(128) CHARACTER SET utf8 DEFAULT NULL,
`minAmount` decimal(6,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `company_city` (`companyName`,`destinationCity`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=7933 ;
--
-- Table structure for table imported_cities_equiv
CREATE TABLE IF NOT EXISTS `imported_cities_equiv` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`imported_city` varchar(128) CHARACTER SET utf8 NOT NULL,
`pc_city` varchar(128) CHARACTER SET utf8 NOT NULL,
`province` varchar(20) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=149 ;
Any help or suggestion is appreciated. Thank you.
The query you want to get your information is:
SELECT ip.*, (pc.city IS NOT NULL) AS exact_match
FROM imported_cities ip LEFT JOIN
postalcodes pc
ON LOWER(ip.destinationCity) = LOWER(pc.city) AND
LOWER(ip.destinationProvince) = LOWER(pc.province);
However, this will have really bad performance. Getting rid of the lower() would help:
SELECT ip.*, (pc.city IS NOT NULL) AS exact_match
FROM imported_cities ip LEFT JOIN
postalcodes pc
ON ip.destinationCity = pc.city AND
ip.destinationProvince = pc.province;
Because then you can add an index on postalcodes(city, province).
If you cannot remove lower(), then alter the table to add new columns and put the lower-case values in those columns. Then build an index on the new columns and use them in the join.
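The extra-column approach could be sketched like this. The names `city_lc`, `province_lc` and `idx_city_prov_lc` are made up for illustration, and the character set on the new columns is chosen to match imported_cities, on the assumption that matching character sets helps the index get used.

```sql
-- Add lower-cased copies of the join columns.
ALTER TABLE postalcodes
  ADD COLUMN city_lc varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  ADD COLUMN province_lc varchar(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL DEFAULT '';

-- Fill them once (and keep them up to date whenever rows change).
UPDATE postalcodes
SET city_lc = LOWER(city),
    province_lc = LOWER(province);

-- Index the new columns so the join can look rows up directly.
CREATE INDEX idx_city_prov_lc ON postalcodes (city_lc, province_lc);

-- LOWER() now only wraps the small driving table's columns.
SELECT ip.*, (pc.city_lc IS NOT NULL) AS exact_match
FROM imported_cities ip
LEFT JOIN postalcodes pc
  ON pc.city_lc = LOWER(ip.destinationCity)
 AND pc.province_lc = LOWER(ip.destinationProvince);
```

With a case-insensitive collation such as utf8_unicode_ci, plain equality already ignores case, so the extra columns may turn out to be unnecessary.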
Thank you all for pointing me in the right direction.
Some changes have been made following your advice:
added indexes on the imported_cities table on the destinationCity and destinationProvince columns
added indexes on the postalcodes table on the city and provinceISO columns
the JOIN clause has upper() on only one side, since the field ic.destinationCity is already in uppercase
limited the query by province in the WHERE clause for performance
The final SQL is:
SELECT DISTINCT pc.city, pc.provinceISO
FROM postalcodes pc
LEFT JOIN imported_cities ic
ON upper(pc.city) = ic.destinationCity AND
pc.provinceISO = ic.destinationProvince
WHERE ic.destinationProvince = 'QC';
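And the EXPLAIN:
+----+-------------+-------+-------+-------------------------------------+-----------------+---------+-------+--------+----------------------------------------------+
| id | select_type | table | type  | possible_keys                       | key             | key_len | ref   | rows   | Extra                                        |
+----+-------------+-------+-------+-------------------------------------+-----------------+---------+-------+--------+----------------------------------------------+
| 1  | SIMPLE      | pc    | ref   | province                            | province        | 8       | const | 278115 | Using index condition; Using temporary       |
| 1  | SIMPLE      | ip    | ref   | destinationCity,destinationProvince | destinationCity | 386     | func  | 1      | Using index condition; Using where; Distinct |
+----+-------------+-------+-------+-------------------------------------+-----------------+---------+-------+--------+----------------------------------------------+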
Going forward, I can now construct the INSERT query in PHP and run a single INSERT to put all equivalent cities into the third table. Thank you all.
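As an alternative to building the statement in PHP, the matching pairs could be written to the third table in one step with INSERT ... SELECT. This is a sketch based on the final query and the imported_cities_equiv definition above; an inner join is used because only matched pairs are wanted.

```sql
INSERT INTO imported_cities_equiv (imported_city, pc_city, province)
SELECT DISTINCT ic.destinationCity, pc.city, ic.destinationProvince
FROM postalcodes pc
JOIN imported_cities ic
  ON UPPER(pc.city) = ic.destinationCity
 AND pc.provinceISO = ic.destinationProvince
WHERE ic.destinationProvince = 'QC';
```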

MySQL query optimization with between or larger than > condition

Problem: slow query.
table1 has about 5 000 rows
table2 has about 50 000 rows
timestamp format is int(11)
MySQL - 20 seconds (with indexes)
PostgreSQL - 0,04 seconds (with indexes)
SELECT *
FROM table1
LEFT JOIN table2
ON table2_timestamp BETWEEN table1_timestamp - 500
AND table1_timestamp + 500 ;
Can anybody help me optimize this query for MySQL?
Explain:
1 SIMPLE a index a 9 2 Using index
1 SIMPLE b index b b 9 5 Using index
Tables:
CREATE TABLE `a` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`table1_timestamp` bigint(20) NULL DEFAULT NULL ,
PRIMARY KEY (`id`),
INDEX `a` (`table1_timestamp`) USING BTREE
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
AUTO_INCREMENT=3
ROW_FORMAT=COMPACT
;
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`table2_timestamp` bigint(20) NULL DEFAULT NULL ,
PRIMARY KEY (`id`),
INDEX `a` (`table2_timestamp`) USING BTREE
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
AUTO_INCREMENT=3
ROW_FORMAT=COMPACT
;
A few points spring to mind, but they all feel like long shots. Realistically, it looks as though there isn't much you can do to your query, assuming your example is an accurate representation.
1 : You are using BIGINT, which has a maximum value of about 9x10^18 (SIGNED). INT has a maximum value of about 4x10^9 (UNSIGNED), while today's timestamp is around 1.4x10^9 (all values approximate), so consider changing the data type of that column in both tables from BIGINT to INT UNSIGNED or DATETIME.
2 : The ROW_FORMAT is COMPACT, which may cause issues with BTREE indexes. You are dealing with INT data types, so a ROW_FORMAT of FIXED would suffice; try changing to ROW_FORMAT=FIXED on both tables. Note that FIXED is a MyISAM row format: InnoDB does not support it, so on these InnoDB tables the setting is either rejected or replaced by a default format, depending on innodb_strict_mode.
3 : If you always expect rows to be returned from table2 for table1 rows, then INNER JOIN would be more efficient than LEFT JOIN.
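Points 1 and 3 could be tried as follows, using the table and column names from the CREATE TABLE statements in the question. This is a sketch of the suggestions, not a confirmed fix for the 20-second run time.

```sql
-- Point 1: narrow the timestamp columns from BIGINT to INT UNSIGNED.
ALTER TABLE a MODIFY table1_timestamp INT UNSIGNED NULL DEFAULT NULL;
ALTER TABLE b MODIFY table2_timestamp INT UNSIGNED NULL DEFAULT NULL;

-- Point 3: inner join, for the case where every row of a has a match in b.
-- Caveat: with UNSIGNED columns, "table1_timestamp - 500" raises an out-of-range
-- error for values below 500 unless NO_UNSIGNED_SUBTRACTION is set in sql_mode.
SELECT *
FROM a
JOIN b
  ON b.table2_timestamp BETWEEN a.table1_timestamp - 500
                            AND a.table1_timestamp + 500;
```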

How to improve search performance in MySQL

I have a table that contains two bigint columns: beginNumber, endNumber, defined as UNIQUE. The ID is the Primary Key.
ID | beginNumber | endNumber | Name | Criteria
The second table contains a number. I want to retrieve the record from table1 when the Number from table2 falls between its two numbers. This is the query:
select distinct t1.Name, t1.Country
from t1
where t2.Number
BETWEEN t1.beginIpNum AND t1.endNumber
The query takes too much time because I have so many records. I don't have experience with databases, but I read that indexing the table will improve the search, so that MySQL does not have to pass through every row searching for my Number, and that this can be done by, for example, having UNIQUE values. I made beginNumber and endNumber in table1 UNIQUE. Is this all I can do? Is there any way to improve the time? Please provide detailed answers.
EDIT:
table1:
CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`beginNumber` bigint(20) DEFAULT NULL,
`endNumber` bigint(20) DEFAULT NULL,
`Name` varchar(255) DEFAULT NULL,
`Criteria` varchar(455) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `beginNumber_UNIQUE` (`beginNumber`),
UNIQUE KEY `endNumber_UNIQUE` (`endNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=327 DEFAULT CHARSET=utf8
table2:
CREATE TABLE `t2` (
`id2` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(255) DEFAULT NULL,
`Number` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id2`),
UNIQUE KEY `description_UNIQUE` (`description`)
) ENGINE=InnoDB AUTO_INCREMENT=433 DEFAULT CHARSET=utf8
This is a toy example of the tables but it shows the concerned part.
I'd suggest an index on t2.Number like this:
ALTER TABLE t2 ADD INDEX numindex(Number);
Your query won't work as written because it won't know which t2 to use. Try this:
SELECT DISTINCT t1.Name, t1.Criteria
FROM t1
WHERE EXISTS (SELECT * FROM t2 WHERE t2.Number BETWEEN t1.beginNumber AND t1.endNumber);
Without the t2.Number index EXPLAIN gives this query plan:
1 PRIMARY t1 ALL 1 Using where; Using temporary
2 DEPENDENT SUBQUERY t2 ALL 1 Using where
With an index on t2.Number, you get this plan:
PRIMARY t1 ALL 1 Using where; Using temporary
DEPENDENT SUBQUERY t2 index numindex numindex 9 1 Using where; Using index
The important part to understand is that type ALL means every row of the table is scanned, which is slower than an access that reads only the index.
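This is a good place to use a B-tree index. For InnoDB and MyISAM tables BTREE is already the default index type (HASH is the default only for MEMORY tables). B-tree indexes are best when you often sort on a column or use BETWEEN on it.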
CREATE INDEX index_name
ON table_name (column_name)
USING BTREE
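Applied to the tables in this question, the template might look like the following. The index name `idx_t2_number` is made up for illustration, and the statement does the same job as an `ALTER TABLE t2 ADD INDEX ...` on the Number column, so only one of the two is needed.

```sql
-- B-tree index on the column used in the BETWEEN comparison.
CREATE INDEX idx_t2_number ON t2 (Number) USING BTREE;

-- The Index_type column of the output shows which index type was actually created.
SHOW INDEX FROM t2;
```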