I'm joining two tables and counting the returned rows with a simple MySQL query:
SELECT SQL_NO_CACHE count(parc2.id)
FROM SHIP__shipments AS ship
JOIN SHIP__shipments_parcels AS parc2 ON ship.shipmentId = parc2.shipmentId
It takes approximately 2 seconds to return the result, which is a count of around 800k. The primary table has about 700k rows and the joined table about 800k.
Both tables have indexes. The join without counting is very fast, about 0.005s.
Counting just one table is also very fast, around 0.01s.
Once the count and the join are in the same query, the time climbs to 2s, with 99% of it spent in "sending data" according to the profiler.
Output from explain:
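id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | ship | index | PRIMARY | senderId | 4 | NULL | 738700 | Using index
1 | SIMPLE | parc2 | ref | shippmentId,shipmentId | shippmentId | 4 | ship.shipmentId | 1 | Using index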
I tried many variations during testing: combined keys, count(*), forcing a specific index, and more exotic approaches such as subqueries. Nothing really helps; it is always that slow.
Tables:
CREATE TABLE `SHIP__shipments` (
`shipmentId` int(11) NOT NULL COMMENT 'generated ID',
`externalId` varchar(255) DEFAULT NULL COMMENT 'spedition number',
`senderId` int(11) NOT NULL COMMENT 'FK - sender address',
`recipientId` int(11) DEFAULT NULL COMMENT 'Fk - recipient address',
`customerId` int(11) NOT NULL COMMENT 'FK - custromer',
`packageCount` int(11) NOT NULL COMMENT 'number of parcels',
`shipmentPickupDate` datetime NOT NULL COMMENT 'when to pickup shipent',
`shipmenmtDescription` varchar(255) DEFAULT NULL COMMENT 'free description',
`codAmount` double DEFAULT NULL COMMENT 'COD to take',
`codReference` varchar(255) DEFAULT NULL COMMENT 'customer''s COD refference',
`codCurrencyCode` varchar(50) DEFAULT NULL COMMENT 'FK - currency',
`codConfirmed` tinyint(1) NOT NULL COMMENT 'COD confirmed by spedition',
`codSent` tinyint(1) NOT NULL COMMENT 'COD paid to customer? 1/0',
`trackingCountryCode` varchar(50) NOT NULL COMMENT 'FK - country of shippment tracking',
`subscriptionDate` datetime NOT NULL COMMENT 'when to enter to the sped. system',
`speditionCode` varchar(50) NOT NULL COMMENT 'FK - spedition',
`shipmentType` enum('DIRECT','WAREHOUSE') NOT NULL DEFAULT 'WAREHOUSE' COMMENT 'internal OLZA flag',
`weight` decimal(10,3) NOT NULL COMMENT 'sum weight of parcells',
`billingPrice` decimal(10,2) NOT NULL COMMENT 'stored price of delivery',
`billingCurrencyCode` varchar(50) NOT NULL COMMENT 'storred currency of delivery price',
`invoiceCreated` tinyint(1) NOT NULL COMMENT 'invoicing has been done? 1/0',
`invoicingDate` datetime NOT NULL COMMENT 'date of creating invoice',
`pickupPlaceId` varchar(100) DEFAULT NULL COMMENT 'pickup place ID, if applicable for shipment',
`created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
`lastCheckDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last date of status check'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='shippment details';
ALTER TABLE `SHIP__shipments`
ADD PRIMARY KEY (`shipmentId`),
ADD UNIQUE KEY `senderId` (`senderId`) USING BTREE,
ADD UNIQUE KEY `externalId` (`externalId`,`trackingCountryCode`,`speditionCode`),
ADD UNIQUE KEY `recipientId_2` (`recipientId`),
ADD KEY `recipientId` (`recipientId`),
ADD KEY `customerId` (`customerId`),
ADD KEY `codCurrencyCode` (`codCurrencyCode`),
ADD KEY `trackingCountryCode` (`trackingCountryCode`),
ADD KEY `speditionCode` (`speditionCode`);
ALTER TABLE `SHIP__shipments`
MODIFY `shipmentId` int(11) NOT NULL AUTO_INCREMENT COMMENT 'generated ID';
ALTER TABLE `SHIP__shipments`
ADD CONSTRAINT `SHIP__shipments_ibfk_3` FOREIGN KEY (`recipientId`) REFERENCES `SHIP__recipient_list` (`recipientId`),
ADD CONSTRAINT `SHIP__shipments_ibfk_4` FOREIGN KEY (`customerId`) REFERENCES `CUST__customer_list` (`customerId`),
ADD CONSTRAINT `SHIP__shipments_ibfk_5` FOREIGN KEY (`codCurrencyCode`) REFERENCES `SYS__currencies` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_6` FOREIGN KEY (`trackingCountryCode`) REFERENCES `SYS__countries` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_7` FOREIGN KEY (`speditionCode`) REFERENCES `SYS__speditions` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_8` FOREIGN KEY (`senderId`) REFERENCES `SHIP__sender_list` (`senderId`);
CREATE TABLE `SHIP__shipments_parcels` (
`id` int(11) NOT NULL COMMENT 'generated ID',
`shipmentId` int(11) NOT NULL COMMENT 'FK - shippment',
`externalNumber` varchar(255) DEFAULT NULL COMMENT 'number from spedition',
`externalBarcode` varchar(255) DEFAULT NULL COMMENT 'Barcode ID - external reference',
`status` varchar(100) DEFAULT NULL COMMENT 'FK - current status',
`weigth` decimal(10,3) NOT NULL COMMENT 'weight of parcel',
`weightConfirmed` tinyint(1) NOT NULL COMMENT 'provided weight has been confirmed/updated by measuring',
`parcelType` varchar(255) NOT NULL COMMENT 'foreign key',
`created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='data and relations between shippment and it''s parcels';
ALTER TABLE `SHIP__shipments_parcels`
ADD PRIMARY KEY (`id`),
ADD KEY `shippmentId` (`shipmentId`,`status`),
ADD KEY `status` (`status`),
ADD KEY `parcelType` (`parcelType`),
ADD KEY `externalBarcode` (`externalBarcode`),
ADD KEY `weightConfirmed` (`weightConfirmed`),
ADD KEY `externalNumber` (`externalNumber`),
ADD KEY `shipmentId` (`shipmentId`);
ALTER TABLE `SHIP__shipments_parcels`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'generated ID';
ALTER TABLE `SHIP__shipments_parcels`
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_2` FOREIGN KEY (`status`) REFERENCES `SHIP__statuses` (`statusCode`),
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_3` FOREIGN KEY (`shipmentId`) REFERENCES `SHIP__shipments` (`shipmentId`),
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_4` FOREIGN KEY (`parcelType`) REFERENCES `SHIP__parcel_types` (`parcelType`);
The server is running on SSD disks and we are not talking about a lot of data here.
Am I missing something? Or is 2 seconds simply how long it takes to count these rows?
Can I get the count in a "normal" time like 0.01s?
We are running MariaDB 10.
Analysis
Let's dissect some columns and the EXPLAIN:
`shipmentId` int(11) (*3) NOT NULL COMMENT 'generated ID',
`senderId` int(11) (*3) NOT NULL COMMENT 'FK - sender address',
1 SIMPLE ship index PRIMARY
senderId (*2) 4 NULL 738700 Using index (*1)
1 SIMPLE parc2 ref shippmentId,shipmentId
shippmentId (*4) 4 ship.shipmentId 1 Using index (*1)
SELECT ... count(parc2.id) (*5) ... STRAIGHT_JOIN (*6) ...
Notes:
*1 -- Both are Using index; this is likely to help a lot.
*2 -- INDEX(senderId) is probably the "smallest" index. Note that you are using InnoDB. The PK is "clustered" with the data, so it is not "small". Every secondary index has the PK implicitly tacked on, so that is effectively (senderId, shipmentId). This explains why the Optimizer mysteriously picked INDEX(senderId).
*3 -- INT takes 4 bytes, allowing numbers up to +/- 2 billion. Do you expect to have that many senders and shipments? Shrinking the datatype (and making it UNSIGNED) will save some space and I/O, and therefore may speed things up a little.
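*4 -- INDEX(shipmentId) is actually like INDEX(shipmentId, id), again 2 INTs. (The key the EXPLAIN reports, shippmentId, is declared on (shipmentId, status), so the index actually used also carries the status column and is somewhat larger.)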
*5 -- COUNT(x) checks x for being NOT NULL. This is probably unnecessary in your application. Change to COUNT(*) unless you do need the null check. (The performance difference will be minor.)
*6 -- It probably does not matter which table it picks first, except perhaps for what indexes are available. Hence, STRAIGHT_JOIN did not help.
Now let's discuss how the JOIN works. Virtually all JOINs in MySQL are "NLJ" (Nested Loop Join). This is where the code walks through one of the tables (actually just an index for one table), then reaches into the other table (also, just into an index) for each row found.
To do a COUNT(*) it only needs to check for the existence of the row.
So, it walked through the 2-column INDEX(senderId, shipmentId) to find a list of all shipmentIds in the first table. It did not waste time sorting or de-duplicating that list. And, since shipmentId is the PK (hence UNIQUE), there won't be any duplicates.
For each shipmentId, it then looked up all the rows in the second table. That was efficient to do because of INDEX(shipmentId, id).
I/O (or not)
Let's digress into another issue. Was there any I/O? Were all those rows of those two indexes fully cached in RAM? What is the value of innodb_buffer_pool_size?
The way InnoDB fetches a row (from a table or from an index) is to first check to see if it is in the "buffer pool". If it is not there, then it must bump something out of the buffer pool and read the desired 16KB block into the buffer pool.
At one extreme, nothing is in the buffer pool and all the blocks must be read from disk. At the other extreme, all are cached, and no I/O is needed. Since you tried all sorts of things, I assume that all the relevant blocks (those two indexes) were in RAM.
2 INTs * (800K + 700K rows) + some overhead = maybe 50MB. Assuming innodb_buffer_pool_size is more than that, and no swapping occurred, then it is reasonable for there to be no I/O.
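To replace the back-of-the-envelope estimate with measured numbers, something like the following could be run. This is only a sketch: it assumes InnoDB persistent statistics are enabled, that the account can read the mysql schema, and that the current default database is the one holding the two tables.

```sql
-- Configured buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate size of every index on the two tables.
-- For stat_name = 'size', stat_value is a number of pages.
SELECT table_name,
       index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name IN ('SHIP__shipments', 'SHIP__shipments_parcels')
  AND stat_name = 'size'
ORDER BY table_name, index_name;
```

If the sizes reported for senderId and shippmentId together are well below the buffer pool size, the no-I/O assumption is plausible.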
So, how long should it take to touch 1.5M rows that are fully cached, in a JOIN? Alas, 2 seconds seems reasonable.
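One possible shortcut, offered as a sketch rather than a guaranteed equivalence: parc2.shipmentId is NOT NULL and has a foreign key to the primary key of SHIP__shipments, so if that constraint has always been enforced (for example, never loaded with foreign_key_checks disabled), every parcel row matches exactly one shipment row and the join does not change the count.

```sql
-- Sanity check: parcels whose shipment is missing.
-- Expected to be 0 if the foreign key has always been enforced.
SELECT COUNT(*) AS orphan_parcels
FROM SHIP__shipments_parcels AS parc2
LEFT JOIN SHIP__shipments AS ship
       ON ship.shipmentId = parc2.shipmentId
WHERE ship.shipmentId IS NULL;

-- If there are no orphans, this returns the same number as the
-- joined COUNT, while scanning only one index.
SELECT SQL_NO_CACHE COUNT(*)
FROM SHIP__shipments_parcels;
```

This only applies to the unfiltered query shown in the question; once a WHERE clause on SHIP__shipments is added, the join is needed again.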
User expectations
It is rare to need an exact, up-to-the-second count that is in the millions. Rethink the User requirement. Or we can discuss ways to pre-compute the value. Or dead-reckon it.
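As one illustration of pre-computing the value, a single-row counter table could be kept current by triggers. This is a minimal sketch; the table name SHIP__parcel_counter and the trigger names are hypothetical, and it assumes rows are only added or removed by ordinary INSERT and DELETE statements.

```sql
-- Hypothetical counter table holding one row.
CREATE TABLE SHIP__parcel_counter (
  id           TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  parcel_count INT UNSIGNED NOT NULL
) ENGINE=InnoDB;

-- Seed it once with the current count.
INSERT INTO SHIP__parcel_counter (id, parcel_count)
SELECT 1, COUNT(*) FROM SHIP__shipments_parcels;

-- Keep it in step with inserts and deletes.
CREATE TRIGGER SHIP__parcels_count_ai
AFTER INSERT ON SHIP__shipments_parcels
FOR EACH ROW
  UPDATE SHIP__parcel_counter SET parcel_count = parcel_count + 1 WHERE id = 1;

CREATE TRIGGER SHIP__parcels_count_ad
AFTER DELETE ON SHIP__shipments_parcels
FOR EACH ROW
  UPDATE SHIP__parcel_counter SET parcel_count = parcel_count - 1 WHERE id = 1;

-- Reading the count is then a primary-key lookup.
SELECT parcel_count FROM SHIP__parcel_counter WHERE id = 1;
```

The trade-off is that every insert and delete now updates the same row, which can become a contention point under heavy concurrent writes.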
Side notes
(These do not impact the question at hand.)
Don't blindly use 255 for all strings.
UNIQUE(x) is an INDEX, so don't also have INDEX(x).
Having more than 2 PRIMARY or UNIQUE indexes is usually a design error in the schema.
Some columns could (should?) be normalized. Example: parcelType?
Don't use FLOAT or DOUBLE for monetary values; use DECIMAL. (weight could be floating.)
Related
I hope you will allow me to pick your brains so I can gain some knowledge in the process.
We have 3 tables - data_product, data_issuer, data_accountbalance
CREATE TABLE `data_issuer` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`issuer_name` varchar(128) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
CREATE TABLE `data_product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
`issuer_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `data_product_name_issuer_id_260fec65_uniq` (`name`,`issuer_id`),
KEY `data_product_issuer_id_d07fa696_fk_data_issuer_id` (`issuer_id`),
CONSTRAINT `data_product_issuer_id_d07fa696_fk_data_issuer_id` FOREIGN KEY
(`issuer_id`) REFERENCES `data_issuer` (`id`)
) ENGINE=InnoDB
CREATE TABLE `data_accountbalance` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`date` date NOT NULL,
`nominee_name` varchar(128) NOT NULL,
`beneficiary_name` varchar(128) NOT NULL,
`nominee_id` varchar(128) NOT NULL,
`account_id` varchar(16) NOT NULL,
`product_id` int(11) NOT NULL,
`register_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `data_accountbalance_date_product_id_nominee__7b8d2c6a_uniq` (`date`,`product_id`,`nominee_id`,`beneficiary_name`),
KEY `data_accountbalance_product_id_nominee_id_date_8ef8754f_idx` (`product_id`,`nominee_id`,`date`),
KEY `data_accountbalance_register_id_4e78ec16_fk_data_register_id` (`register_id`),
KEY `data_accountbalance_product_id_date_nominee_i_c3a41e39_idx` (`product_id`,`date`,`nominee_id`,`beneficiary_name`,`balance_amount`),
CONSTRAINT `data_accountbalance_product_id_acfb18f6_fk_data_product_id` FOREIGN KEY (`product_id`) REFERENCES `data_product` (`id`),
CONSTRAINT `data_accountbalance_register_id_4e78ec16_fk_data_register_id` FOREIGN KEY (`register_id`) REFERENCES `data_register` (`id`)
) ENGINE=InnoDB
When running the query below, the system takes about an hour to respond -
SELECT SQL_NO_CACHE *
from data_product
INNER JOIN `data_issuer` ON (`data_issuer`.`id` = `data_product`.`issuer_id`)
INNER JOIN `data_accountbalance` ON (`data_accountbalance`.`product_id` = `data_product`.`id`)
LIMIT 100000000;
Both data_issuer and data_product have only a few hundred records each, but data_accountbalance is huge, with about 15,384,358 records.
The explain plan produced is below -
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | data_product | | ALL | PRIMARY,data_product_issuer_id_d07fa696_fk_data_issuer_id | | | | 459 | 100 |
1 | SIMPLE | data_issuer | | eq_ref | PRIMARY | PRIMARY | 4 | pnl.data_product.issuer_id | 1 | 100 |
1 | SIMPLE | data_accountbalance | | ref | data_accountbalance_product_id_nominee_id_date_8ef8754f_idx,data_accountbalance_product_id_date_nominee_i_c3a41e39_idx | data_accountbalance_product_id_date_nominee_i_c3a41e39_idx | 4 | pnl.data_product.id | 493 | 100 |
Can someone help tune the query so it does not take an hour to run please? Appreciate any pointers you might have for me.
If your query is literally what you are showing there, then that's the problem: it has no WHERE clause.
That query would return all 15,384,358 rows. As the two smaller tables are typical domain tables with NOT NULL relations all the way across, it will return exactly one result row for every row in data_accountbalance.
The actual time cost will probably be in creating a massive temp table (though I'm not sure about that). If you really do need to download the entire database, all 3 tables, you could look into tuning your MySQL temp table configuration to possibly speed this up, or preferably read the results as MySQL produces them once the query starts executing (which avoids a temp table). Alternatively, maybe the script that runs this query is trying to read the whole data set into memory, which takes a long time?
Is there a particular reason to download all the data? Usually you download only the data you mean to operate on, or have MySQL do the grouping, summing, etc. and return just the answer you want based on all the data.
How many rows did you expect the query to return? If you are thinking of something less than 15 million, then the answer is to add some kind of WHERE clause or an aggregate function. Whichever table and columns you use to reduce the result set, those columns will have to be indexed.
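As a rough illustration of both suggestions, the query below restricts the rows to a date range and lets MySQL aggregate them. The date range and the choice to count rows per product are assumptions made for the example, not something the question states.

```sql
-- Hypothetical example: one month of balances, summarised per product.
SELECT p.name,
       i.issuer_name,
       COUNT(*) AS balance_rows
FROM data_accountbalance AS ab
JOIN data_product AS p ON p.id = ab.product_id
JOIN data_issuer  AS i ON i.id = p.issuer_id
WHERE ab.`date` >= '2020-01-01'
  AND ab.`date` <  '2020-02-01'
GROUP BY p.id, p.name, i.issuer_name;
```

The existing unique key on data_accountbalance starts with date, so a range condition like this one has an index available; the result is at most one row per product instead of millions of detail rows.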
I hope this helps. :)
I have got this table:
CREATE TABLE `pertemba_client_raw_data` (
`line_id` int(11) NOT NULL AUTO_INCREMENT,
`feed_id` int(11) NOT NULL COMMENT 'References pertemba_client_feed_log.feed_id',
`data_line` int(11) NOT NULL COMMENT 'Eg. The CSV line number or JSON object index.',
`property_title` varchar(255) NOT NULL COMMENT 'Eg. The CSV header or JSON key.',
`property_value` varchar(255) NOT NULL COMMENT 'Eg. The CSV field value or JSON object value.',
`date_updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`line_id`),
UNIQUE KEY `pertemba_client_raw_data_line_id_pk` (`line_id`),
KEY `feed_id` (`feed_id`),
CONSTRAINT `pertemba_client_raw_data_ibfk_1` FOREIGN KEY (`feed_id`) REFERENCES `pertemba_client_feed_log` (`feed_id`)
) ENGINE=InnoDB AUTO_INCREMENT=113121 DEFAULT CHARSET=utf8
It currently contains about 110,000 records, but will become much larger.
I have a PHP process running against this table that is very slow: the run time is currently 10+ minutes. When I repeatedly run SHOW PROCESSLIST; this query from the process is always running:
SELECT COUNT(pcr.line_id) AS result FROM pertemba_client_raw_data AS pcr
WHERE pcr.feed_id = :feedId
AND pcr.property_title = :title
AND pcr.property_value = :optionLink
I would appreciate any optimisations that can be suggested for beating this problem.
First step is to identify the problem. Try
EXPLAIN SELECT COUNT(pcr.line_id) AS result
FROM pertemba_client_raw_data AS pcr
WHERE pcr.feed_id = :feedId
AND pcr.property_title = :title
AND pcr.property_value = :optionLink
For your query, as pointed out by juergen, I believe you can improve performance by extending the feed_id index into a composite index that also covers property_title and property_value, such as:
KEY `feed_id` (`feed_id`, `property_title`, `property_value`)
After that, run EXPLAIN again to confirm whether the performance issue is solved.
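Applied to the existing table, the change might look like the sketch below. The index name feed_title_value and the literal values in the EXPLAIN are placeholders chosen for illustration; substitute values that actually occur in the data.

```sql
-- Add the composite index alongside the existing single-column one.
ALTER TABLE pertemba_client_raw_data
  ADD KEY feed_title_value (feed_id, property_title, property_value);

-- Re-check the plan with literal values in place of the bound parameters.
EXPLAIN SELECT COUNT(*) AS result
FROM pertemba_client_raw_data AS pcr
WHERE pcr.feed_id = 1
  AND pcr.property_title = 'example title'
  AND pcr.property_value = 'example value';
```

If the key column of the EXPLAIN output shows the new index and Extra shows "Using index", the count is being answered from the index alone. Because the new index starts with feed_id, the old single-column feed_id key is probably redundant afterwards.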
I have this table:
CREATE TABLE `messenger_contacts` (
`number` varchar(15) NOT NULL,
`has_telegram` tinyint(1) NOT NULL DEFAULT '0',
`geo_state` int(11) NOT NULL DEFAULT '0',
`geo_city` int(11) NOT NULL DEFAULT '0',
`geo_postal` int(11) NOT NULL DEFAULT '0',
`operator` tinyint(1) NOT NULL DEFAULT '0',
`type` tinyint(1) NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `messenger_contacts`
ADD PRIMARY KEY (`number`),
ADD KEY `geo_city` (`geo_city`),
ADD KEY `geo_postal` (`geo_postal`),
ADD KEY `type` (`type`),
ADD KEY `type1` (`operator`),
ADD KEY `has_telegram` (`has_telegram`),
ADD KEY `geo_state` (`geo_state`);
with about 11 million records.
A simple count select on this table takes about 30 to 60 seconds to complete, which seems very high.
select count(number) from messenger_contacts where geo_state=1
I am not a database pro, so besides setting indexes I don't know what else I can do to make the query faster.
UPDATE:
OK, I made some changes to the column types and sizes:
CREATE TABLE IF NOT EXISTS `messenger_contacts` (
`number` bigint(13) unsigned NOT NULL,
`has_telegram` tinyint(1) NOT NULL DEFAULT '0' ,
`geo_state` int(2) NOT NULL DEFAULT '0',
`geo_city` int(4) NOT NULL DEFAULT '0',
`geo_postal` int(10) NOT NULL DEFAULT '0',
`operator` tinyint(1) NOT NULL DEFAULT '0' ,
`type` tinyint(1) NOT NULL DEFAULT '0' ,
PRIMARY KEY (`number`),
KEY `has_telegram` (`has_telegram`,`geo_state`),
KEY `geo_city` (`geo_city`),
KEY `geo_postal` (`geo_postal`),
KEY `type` (`type`),
KEY `type1` (`operator`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Now the query only takes 4 to 5 seconds with both count(*) and count(number).
Thanks everyone for your help, even the guy who gave me -1. This is good enough for now, considering that my server is low-end hardware and I will be caching the select count results.
Maybe
select count(geo_state) from messenger_contacts where geo_state=1
as it will give the same result but will not use the number column from the clustered index?
If this does not help, I would try changing the number column to an INT type, which should reduce the index size, or try increasing the amount of memory MySQL can use for caching indexes.
You did not change the datatypes. INT(11) == INT(2) == INT(100) -- each is a 4-byte signed integer. You probably want 1-byte unsigned TINYINT UNSIGNED or 2-byte SMALLINT UNSIGNED.
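A sketch of what actually shrinking the columns could look like, assuming geo_state values fit in 0..255 and geo_city values fit in 0..65535 (the question does not state the real ranges, so check them first):

```sql
-- Check the real value ranges before narrowing anything.
SELECT MIN(geo_state), MAX(geo_state),
       MIN(geo_city),  MAX(geo_city)
FROM messenger_contacts;

-- Only if the ranges fit: 1-byte and 2-byte unsigned integers.
ALTER TABLE messenger_contacts
  MODIFY geo_state TINYINT UNSIGNED  NOT NULL DEFAULT 0,
  MODIFY geo_city  SMALLINT UNSIGNED NOT NULL DEFAULT 0;
```

On an 11-million-row table the ALTER rebuilds the table and its indexes, so it is best run during a quiet period.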
It is a waste to index "flags", which I assume type and has_telegram are. The optimizer will never use them because it will be less efficient than simply doing a table scan.
The standard coding pattern is:
select count(*)
from messenger_contacts
where geo_state=1
unless you need to not count NULLs, which is what COUNT(geo_state) implies.
Once you have the index on geo_state (or an index starting with geo_state), the query will scan the index (which is a separate BTree structure) starting with the first occurrence of geo_state=1 until the last, counting as it goes. That is, it will touch 1.1 million index entries. So, a few seconds is to be expected. Counting a 'rare' geo_state will run much faster.
The reason for 30-60 seconds versus 4-5 seconds is very likely to be caching. The former had to read stuff from disk; the latter did not. Run the query twice.
Using the geo_state index will be faster for that query than using the PRIMARY KEY unless there are caching differences.
INDEX(number,geo_state) is virtually useless for any of the SELECTs mentioned -- geo_state should be first. With geo_state first, this is an example of a "covering" index for the select count(number)... case.
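For the updated table definition, which no longer has any index starting with geo_state, a minimal sketch of the fix would be the following (the index name is simply reused from the original schema):

```sql
-- Index that starts with the filtered column. In InnoDB the primary key
-- (number) is implicitly appended, so this also covers count(number).
ALTER TABLE messenger_contacts
  ADD INDEX geo_state (geo_state);

-- The plan should now name geo_state as the key, with "Using index" in Extra.
EXPLAIN SELECT COUNT(*)
FROM messenger_contacts
WHERE geo_state = 1;
```

Run the SELECT twice when timing it, so that the second run reflects cached index pages rather than disk reads.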
More on building indexes.
Sorry for the long post, but this is really strange and I am close to giving up. There are 2 tables:
CREATE TABLE `endu_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_name` varchar(200) NOT NULL,
`base_nr` int(11) DEFAULT NULL,
`base_yob` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `endu_results_206a6355` (`base_name`),
KEY `endu_results_63df4402` (`base_nr`),
KEY `base_yob` (`base_yob`)
) ENGINE=InnoDB AUTO_INCREMENT=3424028 DEFAULT CHARSET=utf8;
and 2nd:
CREATE TABLE `endu_resultinterest` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`result_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `endu_resultinterest_3b529087` (`result_id`),
CONSTRAINT `result_id_refs_id_19e24435` FOREIGN KEY (`result_id`) REFERENCES `endu_results` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=48590 DEFAULT CHARSET=utf8;
There are about 2 million records in the endu_results table and fewer than 100K in endu_resultinterest. I have a slow query:
explain select base_yob from endu_resultinterest
inner join endu_results
on (endu_results.id = endu_resultinterest.result_id)
order by endu_results.base_yob;
1 SIMPLE endu_resultinterest index endu_resultinterest_3b529087 endu_resultinterest_3b529087 4 NULL 47559 Using index; Using temporary; Using filesort
The question is: why is MySQL using the index endu_resultinterest_3b529087 when it should use base_yob, the column the sort is requested on?
To test it further I manually created 2 additional identical tables, endu_testresults and endu_testresultinterest, and filled them with some records:
CREATE TABLE `endu_testresults` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_yob` int(11) DEFAULT NULL,
`base_name` varchar(200) NOT NULL,
`base_nr` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `endu_testresults_a65b2616` (`base_yob`),
KEY `endu_testresults_ba0ab39c` (`base_name`),
KEY `endu_testresults_d75ba04d` (`base_nr`)
) ENGINE=InnoDB AUTO_INCREMENT=20 DEFAULT CHARSET=utf8;
So I go again for explain:
explain select base_yob from endu_testresultinterest
inner join endu_testresults
on (endu_testresults.id = endu_testresultinterest.result_id)
order by endu_testresults.base_yob;
and surprise, surprise:
1 SIMPLE endu_testresults index PRIMARY endu_testresults_a65b2616 5 NULL 19 Using index
The index on the sort column base_yob (endu_testresults_a65b2616) is now used.
Why is the index used in one case while in the other I get 'Using temporary; Using filesort'? Does size matter? I will try to copy records from one table to the other, but I do not understand what is happening with the indexes. MySQL is 5.6.16.
Short answer: Because it is faster.
Long answer...
Your EXPLAINs seem to be incomplete -- I would expect 2 lines in each.
The first table is 20 (70?) times as big as the second. The optimizer picked the smaller table to start with. Hence it is initially doing 1/20th the amount of work. The sort that comes later (ORDER BY ...) is much less work than if it had to do 20 times as much work to start with.
The output is only 48K rows, correct? And that is how many rows in the 2nd table, correct?
Your test tables did not have the same bigger/smaller ratio, did they? Hence the different EXPLAIN.
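To see the trade-off directly, one could force the plan the question expected and compare it with the optimizer's choice. This is only an experiment sketch using the table and index names from the question; it is not a recommendation to keep the hints.

```sql
-- Force the big table first and the base_yob index for ordering.
EXPLAIN SELECT r.base_yob
FROM endu_results AS r FORCE INDEX (base_yob)
STRAIGHT_JOIN endu_resultinterest AS ri
        ON ri.result_id = r.id
ORDER BY r.base_yob;
```

With this plan the filesort should disappear, but the rows estimate for endu_results is expected to be in the millions, since the whole base_yob index is walked and the smaller table is probed once per entry. Timing both versions shows which cost dominates on the real data.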
I have a simple categories table. A category can have a parent category (par_cat column), or null if it is a main category, and under the same parent category there shouldn't be 2 or more categories with the same name or url.
Code for this table:
CREATE TABLE IF NOT EXISTS `categories` (
`id` int(10) unsigned NOT NULL,
`par_cat` int(10) unsigned DEFAULT NULL,
`lang` varchar(2) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'pl',
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`url` varchar(120) COLLATE utf8_unicode_ci NOT NULL,
`active` tinyint(3) unsigned NOT NULL DEFAULT '1',
`accepted` tinyint(3) unsigned NOT NULL DEFAULT '1',
`priority` int(10) unsigned NOT NULL DEFAULT '1000',
`entries` int(10) unsigned NOT NULL DEFAULT '0',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=3 ;
ALTER TABLE `categories`
ADD PRIMARY KEY (`id`),
ADD UNIQUE KEY `categories_name_par_cat_unique` (`name`,`par_cat`),
ADD UNIQUE KEY `categories_url_par_cat_unique` (`url`,`par_cat`),
ADD KEY `categories_par_cat_foreign` (`par_cat`);
ALTER TABLE `categories`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=3;
ALTER TABLE `categories`ADD CONSTRAINT `categories_par_cat_foreign`
FOREIGN KEY (`par_cat`) REFERENCES `categories` (`id`);
The problem is that even though I have unique keys, it doesn't work. If I try to insert 2 categories that have par_cat set to null and the same name and url, both can be inserted without a problem (and they shouldn't be). However, if I choose some other par_cat for those categories (for example 1, assuming a category with id 1 exists), only the first record is inserted (and that's the desired behaviour).
Question: how to handle this case? I read that:
A UNIQUE index creates a constraint such that all values in the index
must be distinct. An error occurs if you try to add a new row with a
key value that matches an existing row. This constraint does not apply
to NULL values except for the BDB storage engine. For other engines, a
UNIQUE index permits multiple NULL values for columns that can contain
NULL. If you specify a prefix value for a column in a UNIQUE index,
the column values must be unique within the prefix.
However, with a unique key on multiple columns I expected this not to be the case (only par_cat can be null; name and url cannot). Because par_cat references the id of the same table and some categories don't have a parent category, it has to allow null values.
This works as defined by the SQL standard. NULL means unknown. If you have two records of par_cat = NULL and name = 'X', then the two NULLs are not regarded to hold the same value. Thus they don't violate the unique key constraint. (Well, one could argue that the NULLs still might mean the same value, but applying this rule would make working with unique indexes and nullable fields almost impossible, for NULL could as well mean 1, 2 or whatever other value. So they did well to define it such as they did in my opinion.)
As MySQL does not support functional indexes where you could have an index on IFNULL(par_cat,-1), name, your only option is to make par_cat a NOT NULL column with 0 or -1 or whatever for "no parent", if you want your constraints to work.
I see that this was asked in 2014.
However it is often requested from MySQL: https://bugs.mysql.com/bug.php?id=8173 and https://bugs.mysql.com/bug.php?id=17825 for example.
People can click on "Affects Me" to try and get attention from MySQL.
Since MySQL 5.7 we can now use the following workaround:
ALTER TABLE categories
ADD generated_par_cat INT UNSIGNED AS (ifNull(par_cat, 0)) NOT NULL,
ADD UNIQUE INDEX categories_name_generated_par_cat (name, generated_par_cat),
ADD UNIQUE INDEX categories_url_generated_par_cat (url, generated_par_cat);
The generated_par_cat is a virtual generated column, so it takes no storage space in the table. When a user inserts (or updates) a row, the unique indexes cause the value of generated_par_cat to be generated on the fly, which is a very quick operation.
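A quick way to check that the workaround behaves as intended is to repeat the failing case from the question. The category names and urls below are made-up test values, and the sketch assumes no real category uses id 0, since 0 is the stand-in for "no parent".

```sql
-- First top-level category: accepted.
INSERT INTO categories (par_cat, name, url, created_at, updated_at)
VALUES (NULL, 'Books', 'books', NOW(), NOW());

-- Same name, also no parent. generated_par_cat is 0 for both rows, so the
-- unique index on (name, generated_par_cat) should reject this insert
-- with a duplicate-key error.
INSERT INTO categories (par_cat, name, url, created_at, updated_at)
VALUES (NULL, 'Books', 'books-2', NOW(), NOW());
```

The original unique keys on (name, par_cat) and (url, par_cat) can stay in place; they still handle the non-NULL parents, while the new indexes close the NULL gap.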
Just in case you come from Laravel...
This is the Laravel migration version of the virtual column workaround for the UNIQUE issue when one of the columns is NULL:
$table->integer('generated_par_cat')->virtualAs('ifNull(par_cat, 0)');
$table->unique(['name', 'generated_par_cat'], 'name_par_cat_unique');