Search for duplicates of a one-to-many relationship in MySQL

I need to find the most efficient way in MySQL to compare two different instances of a one-to-many relationship. Take this table:
CREATE TABLE `Table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`ParentID` int(11) NOT NULL,
`ChildID` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `pach` (`ParentID`,`ChildID`),
KEY `ParentID` (`ParentID`),
KEY `ChildID` (`ChildID`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
insert into `Table` (`ID`,`ParentID`,`ChildID`) values
(1,1,1),
(2,1,2),
(3,1,3),
(4,1,4),
(5,2,1),
(6,2,3),
(7,3,1),
(8,3,3),
(9,3,4),
(10,4,1),
(11,4,4),
(12,4,3);
ParentID 3 has an identical set of children to ParentID 4, and that's what I need my query to identify: given ParentID = 4, return ParentID 3 because it has exactly the same children.
So far the only thing I can come up with is a very ugly GROUP_CONCAT query (see below). What would be a better approach to this problem?
select distinct b.ParentID
from `Table` a, `Table` b
where
(select group_concat(ChildID order by ChildID asc) from `Table` where ParentID = a.ParentID)
=
(select group_concat(ChildID order by ChildID asc) from `Table` where ParentID = b.ParentID)
and b.ParentID != a.ParentID
and a.ParentID = 4;
+----------+
| ParentID |
+----------+
| 3 |
+----------+
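A join-based alternative avoids string comparison entirely: match the candidate parents child-by-child and require the overlap to equal both parents' child counts. A sketch against the table above (it relies on the unique (ParentID, ChildID) key, so each shared child is counted exactly once):
SELECT b.ParentID
FROM `Table` a
JOIN `Table` b
ON b.ChildID = a.ChildID AND b.ParentID <> a.ParentID
WHERE a.ParentID = 4
GROUP BY b.ParentID
# a parent sharing ALL of parent 4's children, with the same total
# number of children, must have an identical child set
HAVING COUNT(*) = (SELECT COUNT(*) FROM `Table` WHERE ParentID = 4)
AND COUNT(*) = (SELECT COUNT(*) FROM `Table` t WHERE t.ParentID = b.ParentID);
With the sample data this returns ParentID 3, and the existing indexes on ParentID and ChildID can drive the join.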

Two WHERE conditions slower than separate queries

I'm using Google Cloud SQL (the micro server version) to run a couple of performance tests.
I want to do the following query:
select count(*) from table where A = valueA and B like "%input_string%";
+----------+
| count(*) |
+----------+
| 512997 |
+----------+
1 row in set (9.64 sec)
If I run them separately, I get:
select count(*) from table where A = valueA;
+----------+
| count(*) |
+----------+
| 512998 |
+----------+
1 row in set (0.18 sec)
select count(*) from table where B like "%input_string%";
+----------+
| count(*) |
+----------+
| 512997 |
+----------+
1 row in set (1.43 sec)
How is this difference in performance possible?
Both the A and B columns have indexes, as they are used to order tables in a web application.
Thanks!
EDIT:
Table schema:
CREATE TABLE `table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`A` varchar(9) DEFAULT NULL,
`B` varchar(50) DEFAULT NULL,
`C` varchar(10) DEFAULT NULL,
`D` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `A` (`A`),
KEY `B` (`B`)
) ENGINE=InnoDB AUTO_INCREMENT=512999 DEFAULT CHARSET=utf8
An option might be to add a FULLTEXT INDEX and use MATCH() against it.
CREATE TABLE `table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`A` varchar(9) DEFAULT NULL,
`B` varchar(50) DEFAULT NULL,
`C` varchar(10) DEFAULT NULL,
`D` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY(A),
FULLTEXT INDEX(B)
) ENGINE=InnoDB AUTO_INCREMENT=512999 DEFAULT CHARSET=utf8
And a query rewrite
SELECT
count(*)
FROM
`table`
WHERE
A = 'A'
AND
B IN (
SELECT
B
FROM
`table`
WHERE
MATCH(B) AGAINST('+input_string' IN BOOLEAN MODE)
)
The inner query filters down the possible results based on the FULLTEXT index, and the outer query does the remaining filtering.
You could also use a UNION ALL, now that I think about it.
It should work with this question's CREATE TABLE statement.
The general idea is to get two counts, one for each filter, and pick the lowest as the valid count.
Query
SELECT
MIN(counted) AS 'COUNT(*)' # Result 512997
FROM (
select count(*) AS counted from `table` where A = 'A' # Result 512998
UNION ALL
select count(*) from `table` where B like "%input_string%" # Result 512997
) AS counts
Did you run each timing twice? If not, there could be caching involved that confuses you.
where A = valueA and B like "%input_string%"; begs for INDEX(A, B). Note: That composite index is not equivalent to your two separate indexes.
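For concreteness, a sketch of that composite index (the index name is my own):
ALTER TABLE `table` ADD INDEX idx_a_b (A, B); # name is illustrative
Since the query only counts rows, INDEX(A, B) is also covering: MySQL can do a ref lookup on A = valueA and evaluate the LIKE against B from the index alone, without touching the table rows.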
If you go with a FULLTEXT index on B, then this would be simpler:
SELECT COUNT(*) FROM t
WHERE MATCH(B) AGAINST('+input_string' IN BOOLEAN MODE)
AND A = valueA
(The use of a subquery should be unnecessary and slower.)

MySQL GROUP BY optimization - avoid tmp table and/or filesort

I have a slow query; without the GROUP BY it is fast (0.1-0.3 seconds), but with the (required) GROUP BY it takes around 10-15 seconds.
The query joins two tables, events (near 50 million rows) and events_locations (5 million rows).
Query:
SELECT `e`.`id` AS `event_id`,`e`.`time_stamp` AS `time_stamp`,`el`.`latitude` AS `latitude`,`el`.`longitude` AS `longitude`,
`el`.`time_span` AS `extra`,`e`.`entity_id` AS `asset_name`, `el`.`other_id` AS `geozone_id`,
`el`.`group_alias` AS `group_alias`,`e`.`event_type_id` AS `event_type_id`,
`e`.`entity_type_id` AS `entity_type_id`, el.some_id
FROM events e
INNER JOIN events_locations el ON el.event_id = e.id
WHERE 1=1
AND el.other_id = '1'
AND time_stamp >= '2018-01-01'
AND time_stamp <= '2019-06-02'
GROUP BY `e`.`event_type_id` , `el`.`some_id` , `el`.`group_alias`;
Table events:
CREATE TABLE `events` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`event_type_id` int(11) NOT NULL,
`entity_type_id` int(11) NOT NULL,
`entity_id` varchar(64) NOT NULL,
`alias` varchar(64) NOT NULL,
`time_stamp` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `entity_id` (`entity_id`),
KEY `event_type_idx` (`event_type_id`),
KEY `idx_events_time_stamp` (`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Table events_locations
CREATE TABLE `events_locations` (
`event_id` bigint(20) NOT NULL,
`latitude` double NOT NULL,
`longitude` double NOT NULL,
`some_id` bigint(20) DEFAULT NULL,
`other_id` bigint(20) DEFAULT NULL,
`time_span` bigint(20) DEFAULT NULL,
`group_alias` varchar(64) NOT NULL,
KEY `some_id_idx` (`some_id`),
KEY `idx_events_group_alias` (`group_alias`),
KEY `idx_event_id` (`event_id`),
CONSTRAINT `fk_event_id` FOREIGN KEY (`event_id`) REFERENCES `events` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The explain:
+----+-------------+-------+--------+-------------------------------+---------+---------+------------------+---------+----------------------------------------------+
| id | select_type | table | type   | possible_keys                 | key     | key_len | ref              | rows    | Extra                                        |
+----+-------------+-------+--------+-------------------------------+---------+---------+------------------+---------+----------------------------------------------+
|  1 | SIMPLE      | ea    | ALL    | idx_event_id                  | NULL    | NULL    | NULL             | 5152834 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | e     | eq_ref | PRIMARY,idx_events_time_stamp | PRIMARY | 8       | name.ea.event_id |       1 |                                              |
+----+-------------+-------+--------+-------------------------------+---------+---------+------------------+---------+----------------------------------------------+
2 rows in set (0.08 sec)
From the doc:
Temporary tables can be created under conditions such as these:
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
DISTINCT combined with ORDER BY may require a temporary table.
If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
I already tried:
Creating an index on (el.some_id, el.group_alias)
Decreasing the varchar size to 20
Increasing sort_buffer_size and read_rnd_buffer_size
Any suggestions for performance tuning would be much appreciated!
In your case, the events table has an index on time_stamp. So before joining the two tables, first select the required records from the events table for the specific date range, then join events_locations through the tables' relation.
Use MySQL's EXPLAIN to see how your query approaches the table records; it tells you how many rows are scanned before the required records are selected.
The number of rows scanned contributes to the query execution time. Use the logic below to reduce the number of rows that are scanned.
SELECT
`e`.`id` AS `event_id`,
`e`.`time_stamp` AS `time_stamp`,
`el`.`latitude` AS `latitude`,
`el`.`longitude` AS `longitude`,
`el`.`time_span` AS `extra`,
`e`.`entity_id` AS `asset_name`,
`el`.`other_id` AS `geozone_id`,
`el`.`group_alias` AS `group_alias`,
`e`.`event_type_id` AS `event_type_id`,
`e`.`entity_type_id` AS `entity_type_id`,
`el`.`some_id` as `some_id`
FROM
(select
`id`,
`time_stamp`,
`entity_id`,
`event_type_id`,
`entity_type_id`
from
`events`
WHERE
time_stamp >= '2018-01-01'
AND time_stamp <= '2019-06-02'
) AS `e`
JOIN `events_locations` `el` ON `e`.`id` = `el`.`event_id`
WHERE
`el`.`other_id` = '1'
GROUP BY
`e`.`event_type_id` ,
`el`.`some_id` ,
`el`.`group_alias`;
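One further observation: events_locations has no index on other_id, yet the WHERE clause filters on it. A composite index is worth trying (a sketch; the index name is illustrative):
ALTER TABLE events_locations
ADD INDEX idx_other_id_event_id (other_id, event_id); # assumes other_id = 1 is selective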
The relationship between these tables is 1:1, so I asked myself why the GROUP BY was required, and I found some duplicated rows: 200 out of 50,000. Somehow my system is inserting duplicates, and someone added that GROUP BY (years ago) instead of tracking down the bug.
So I will mark this as solved, more or less...
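For reference, a sketch of how such duplicates can be located: since the relationship should be 1:1, any event_id with more than one location row is a candidate for the bug.
SELECT event_id, COUNT(*) AS copies
FROM events_locations
GROUP BY event_id
HAVING COUNT(*) > 1; # copies > 1 means the 1:1 assumption is violated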

Slow MySQL query with ORDER BY and LIMIT despite an index

I have a query generated by Entity Framework, that looks like this:
SELECT
`Extent1`.`Id`,
`Extent1`.`Name`,
`Extent1`.`ExpireAfterUTC`,
`Extent1`.`FileId`,
`Extent1`.`FileHash`,
`Extent1`.`PasswordHash`,
`Extent1`.`Size`,
`Extent1`.`TimeStamp`,
`Extent1`.`TimeStampOffset`
FROM `files` AS `Extent1` INNER JOIN `containers` AS `Extent2` ON `Extent1`.`ContainerId` = `Extent2`.`Id`
ORDER BY
`Extent1`.`Id` ASC LIMIT 0,10
It runs painfully slowly.
I have indexes on files.Id (PK), files.ContainerId (FK), and containers.Id (PK), and I don't understand why MySQL seems to be doing a full sort before returning the required records, even though there is already an index on the Id column.
Furthermore, this data is displayed in a grid that supports filtering, sorting, and pagination, so good use of the indexes is essential.
Here are the table definitions:
CREATE TABLE `files` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`FileId` varchar(100) NOT NULL,
`ContainerId` int(11) NOT NULL,
`ContainerGuid` binary(16) NOT NULL,
`Guid` binary(16) NOT NULL,
`Name` varchar(1000) NOT NULL,
`ExpireAfterUTC` datetime DEFAULT NULL,
`PasswordHash` binary(32) DEFAULT NULL,
`FileHash` tinyblob NOT NULL,
`Size` bigint(20) NOT NULL,
`TimeStamp` double NOT NULL,
`TimeStampOffset` double NOT NULL,
`FilePostId` int(11) NOT NULL,
`FilePostGuid` binary(16) NOT NULL,
`AttributeId` int(11) NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `FileId_UNIQUE` (`FileId`),
KEY `Files_ContainerId_FK` (`ContainerId`),
KEY `Files_AttributeId_FK` (`AttributeId`),
KEY `Files_FileId_index` (`FileId`),
KEY `Files_FilePostId_index` (`FilePostId`),
KEY `Files_Guid_index` (`Guid`),
CONSTRAINT `Files_AttributeId_FK` FOREIGN KEY (`AttributeId`) REFERENCES `attributes` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `Files_ContainerId_FK` FOREIGN KEY (`ContainerId`) REFERENCES `containers` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `Files_FilePostsId_FK` FOREIGN KEY (`FilePostId`) REFERENCES `fileposts` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=977942 DEFAULT CHARSET=utf8;
CREATE TABLE `containers` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Name` varchar(255) NOT NULL,
`Guid` binary(16) NOT NULL,
`AesKey` binary(32) NOT NULL,
`FileCount` int(10) unsigned NOT NULL DEFAULT '0',
`Size` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`Id`),
KEY `Containers_Guid_index` (`Guid`),
KEY `Containers_Name_index` (`Name`)
) ENGINE=InnoDB AUTO_INCREMENT=76 DEFAULT CHARSET=utf8;
You will notice there are some other relationships in the files table, which I have left out just to simplify the query without affecting the observed behavior.
Here is also an output from EXPLAIN EXTENDED:
+----+-------------+---------+-------+----------------------+-----------------------+---------+----------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+-------+----------------------+-----------------------+---------+----------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | Extent2 | index | PRIMARY | Containers_Guid_index | 16 | NULL | 9 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | Extent1 | ref | Files_ContainerId_FK | Files_ContainerId_FK | 4 | netachmentgeneraltest.Extent2.Id | 73850 | 100.00 | |
+----+-------------+---------+-------+----------------------+-----------------------+---------+----------------------------------+-------+----------+----------------------------------------------+
Files table has ~900000 records (and counting) and containers has 9.
This issue only occurs when ORDER BY is present.
Also, I can't do much in terms of modifying the query because it is generated by Entity Framework. I did as much as I could with the LINQ query in order to simplify it (at first it had some horrible sub queries which executed even slower).
Query hints (as in force index) are not a solution here either, because EF does not support such features.
I am mostly hoping to find some database level optimizations to do.
For those who didn't spot the tags, the database in question is MySql.
MySQL generally uses only one index per table in a query. Right now it prefers the foreign key index so that the join is efficient, but that means the sort cannot use an index.
Try creating a compound index on (ContainerId, FileId).
This is essentially your query:
SELECT e1.*
FROM `files` e1 INNER JOIN
`containers` e2
ON e1.`ContainerId` = e2.`Id`
ORDER BY e1.`Id` ASC
LIMIT 0, 10;
You can try an index on files(id, ContainerId). This might inspire MySQL to use the composite index, focused on the order by.
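For concreteness, that index could be created like this (the name is illustrative):
ALTER TABLE files ADD INDEX idx_id_container (Id, ContainerId); # hypothetical name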
It would probably be more likely if the query were phrased as:
SELECT e1.*
FROM `files` e1
WHERE EXISTS (SELECT 1 FROM containers e2 WHERE e1.`ContainerId` = e2.`Id`)
ORDER BY e1.`Id` ASC
LIMIT 0, 10;
There is one way that does work to use the indexes. However, it depends on something in MySQL that is not documented to work (although it does in practice). The following will read the data in order, but it incurs the overhead of materializing the subquery -- but not for a sort:
SELECT e1.*
FROM (SELECT e1.*
FROM files e1
ORDER BY e1.id ASC
) e1
WHERE EXISTS (SELECT 1 FROM containers e2 WHERE e1.`ContainerId` = e2.`Id`)
LIMIT 0, 10;

Slow query with multiple where and order by clauses

I'm trying to find a way to speed up a slow (filesort) MySQL query.
Tables:
categories (id, lft, rgt)
questions (id, category_id, created_at, votes_up, votes_down)
Example query:
SELECT * FROM questions q
INNER JOIN categories c ON (c.id = q.category_id)
WHERE c.lft > 1 AND c.rgt < 100
ORDER BY q.created_at DESC, q.votes_up DESC, q.votes_down ASC
LIMIT 4000, 20
If I remove the ORDER BY clause, it's fast. I know MySQL doesn't like both DESC and ASC orders in the same clause, so I tried adding a composite (created_at, votes_up) index to the questions table and removed q.votes_down ASC from the ORDER BY clause. That didn't help and it seems that the WHERE clause gets in the way here because it filters by columns from another (categories) table. However, even if it worked, it wouldn't be quite right since I do need the q.votes_down ASC condition.
What are good strategies to improve performance in this case? I'd rather avoid restructuring the tables, if possible.
EDIT:
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`lft` int(11) NOT NULL,
`rgt` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `lft_idx` (`lft`),
KEY `rgt_idx` (`rgt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `questions` (
`id` int(11) NOT NULL auto_increment,
`category_id` int(11) NOT NULL,
`votes_up` int(11) NOT NULL default '0',
`votes_down` int(11) NOT NULL default '0',
`created_at` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `questions_FI_1` (`category_id`),
KEY `votes_up_idx` (`votes_up`),
KEY `votes_down_idx` (`votes_down`),
KEY `created_at_idx` (`created_at`),
CONSTRAINT `questions_FK_1` FOREIGN KEY (`category_id`) REFERENCES `categories` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
+----+-------------+-------+--------+-------------------------+---------+---------+-------------------+-------+----------------+
| id | select_type | table | type   | possible_keys           | key     | key_len | ref               | rows  | Extra          |
+----+-------------+-------+--------+-------------------------+---------+---------+-------------------+-------+----------------+
|  1 | SIMPLE      | q     | ALL    | questions_FI_1          | NULL    | NULL    | NULL              | 31774 | Using filesort |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY,lft_idx,rgt_idx | PRIMARY | 4       | ttt.q.category_id |     1 | Using where    |
+----+-------------+-------+--------+-------------------------+---------+---------+-------------------+-------+----------------+
Try a subquery to get the desired categories:
SELECT * FROM questions
WHERE category_id IN ( SELECT id FROM categories WHERE lft > 1 AND rgt < 100 )
ORDER BY created_at DESC, votes_up DESC, votes_down ASC
LIMIT 4000, 20
Try selecting only what you need in your query instead of SELECT *.
See: Why not to use SELECT * (ALL) in MySQL
Try putting conditions that concern joined tables into the ON clause:
SELECT * FROM questions q
INNER JOIN categories c ON (c.id = q.category_id AND c.lft > 1 AND c.rgt < 100)
ORDER BY q.created_at DESC, q.votes_up DESC, q.votes_down ASC
LIMIT 4000, 20
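If an upgrade is an option: MySQL 8.0 adds true descending index columns, which can match a mixed-direction ORDER BY exactly. A sketch (the index name is illustrative, and the categories filter may still keep the optimizer from using the index for the sort):
ALTER TABLE questions
ADD INDEX idx_sort (created_at DESC, votes_up DESC, votes_down ASC); # DESC keys require MySQL 8.0+; older versions parse but ignore the direction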

How to optimise that query?

I've a vote system which is designed like this:
CREATE TABLE `vote` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`weight` int(11) NOT NULL,
`submited_date` datetime NOT NULL,
`resource_type` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2963832 DEFAULT CHARSET=latin1;
CREATE TABLE `article_preselection_vote` (
`id` int(11) NOT NULL,
`article_id` int(11) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDX_9B145DEA62922701` (`article_id`),
KEY `IDX_9B145DEAA76ED395` (`user_id`),
CONSTRAINT `article_preselection_vote_ibfk_4` FOREIGN KEY (`article_id`) REFERENCES `article` (`id`),
CONSTRAINT `article_preselection_vote_ibfk_5` FOREIGN KEY (`id`) REFERENCES `vote` (`id`) ON DELETE CASCADE,
CONSTRAINT `article_preselection_vote_ibfk_6` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
vote.weight can be +1 or -1. Given a bunch of article IDs, I need to get the sum of positive votes (+1) and the sum of negative votes (-1) per article ID.
Then my result should be:
article_id | vote_up | vote_down
-----------+---------+----------
         1 |      36 |        20
        68 |      12 |        56
        25 |      90 |        12
I can get that result with the following query, but it's quite heavy and slow on 2,000,000 votes.
SELECT apv.article_id, COALESCE(SUM(up),0) as up, COALESCE(SUM(down),0) as down
FROM article_preselection_vote apv
LEFT JOIN(
SELECT id, weight up FROM vote WHERE weight > 0 AND vote.resource_type = 'article') v1 ON apv.id = v1.id
LEFT JOIN(
SELECT id, weight down FROM vote WHERE weight < 0 AND vote.resource_type = 'article') v2 ON apv.id = v2.id
WHERE apv.article_id IN (11702,11703,11704,11632,11652,11658)
GROUP BY apv.article_id
Any ideas?
Thanks in advance.
Subselects, IN (...) and GROUP BY in one query are killers.
You should redesign to have a more traditional solution:
Have a table with the votes: article_id, votes_up, votes_down, vote_date, ...
Update (cron) the summary fields votes_up, votes_down, ... in your article table with one UPDATE, as sketched below.
That way, you can better handle the row/table locks and have fast queries.
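A sketch of that periodic update, assuming summary columns votes_up and votes_down on the article table (the names follow the suggestion above):
# hypothetical cron job: refresh the denormalized counters in one statement
UPDATE article a
JOIN (
    SELECT apv.article_id,
           SUM(v.weight > 0) AS ups,
           SUM(v.weight < 0) AS downs
    FROM article_preselection_vote apv
    JOIN vote v ON v.id = apv.id AND v.resource_type = 'article'
    GROUP BY apv.article_id
) s ON s.article_id = a.id
SET a.votes_up = s.ups, a.votes_down = s.downs;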
You can try a single join:
SELECT
apv.article_id,
SUM(COALESCE(weight, 0) > 0) AS up,
SUM(COALESCE(weight, 0) < 0) AS down
FROM article_preselection_vote apv
LEFT JOIN vote
ON apv.id = vote.id
AND vote.resource_type = 'article'
WHERE apv.article_id IN (11702, 11703, 11704, 11632, 11652, 11658)
GROUP BY apv.article_id
If you need to calculate this often it might be worthwhile to denormalize your database and store a cached copy of the results.
Instead of weighting the votes, why don't you just create two tables, one for up votes and one for down votes? The only thing it will complicate is vote combination, which will still be a simple sum of the counts of two different queries.
In a nutshell, do something like this:
select * from article where article_id in (1,2,3);
+------------+-----------+---------------+-----------------+
| article_id | title | up_vote_count | down_vote_count |
+------------+-----------+---------------+-----------------+
| 1 | article 1 | 2 | 3 |
| 2 | article 2 | 2 | 1 |
| 3 | article 3 | 1 | 1 |
+------------+-----------+---------------+-----------------+
3 rows in set (0.00 sec)
drop table if exists article;
create table article
(
article_id int unsigned not null auto_increment primary key,
title varchar(255) not null,
up_vote_count int unsigned not null default 0,
down_vote_count int unsigned not null default 0
)
engine = innodb;
drop table if exists article_vote;
create table article_vote
(
article_id int unsigned not null,
user_id int unsigned not null,
score tinyint not null default 0,
primary key (article_id, user_id)
)
engine=innodb;
delimiter #
create trigger article_vote_after_ins_trig after insert on article_vote
for each row
begin
if new.score < 0 then
update article set down_vote_count = down_vote_count + 1 where article_id = new.article_id;
else
update article set up_vote_count = up_vote_count + 1 where article_id = new.article_id;
end if;
end#
delimiter ;
insert into article (title) values ('article 1'),('article 2'), ('article 3');
insert into article_vote (article_id, user_id, score) values
(1,1,-1),(1,2,-1),(1,3,-1),(1,4,1),(1,5,1),
(2,1,1),(2,2,1),(2,3,-1),
(3,1,1),(3,5,-1);
select * from article where article_id in (1,2,3);