Improve self-JOIN SQL Query performance - mysql

I am trying to improve the performance of a SQL query on MariaDB 10.1.18 (Debian Jessie).
The server has a large amount of RAM (192GB) and SSD disks.
The real table has hundreds of millions of rows but I can reproduce my performance issue on a subset of the data and a simplified layout.
Here is the (simplified) table definition:
CREATE TABLE `data` (
`uri` varchar(255) NOT NULL,
`category` tinyint(4) NOT NULL,
`value` varchar(255) NOT NULL,
PRIMARY KEY (`uri`,`category`),
KEY `cvu` (`category`,`value`,`uri`),
KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):
#!/bin/bash
for i in `seq 1 100000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done
for i in `seq 99981 200000`;
do
mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done
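For reference, an equivalent set-based load is much faster than the row-by-row loop above; a sketch assuming MariaDB's built-in Sequence engine tables (seq_1_to_100000 and friends) are available:
-- category 1: 100'000 rows with the constant value 'foo'
INSERT INTO data (uri, category, value)
SELECT CONCAT('uri', seq), 1, 'foo' FROM seq_1_to_100000;
-- category 2: ~100'000 rows with values 0..4, overlapping the last 20 uris
INSERT INTO data (uri, category, value)
SELECT CONCAT('uri', seq), 2, seq % 5 FROM seq_99981_to_200000;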
So, we insert about:
100'000 rows in category 1 with a static string ("foo") as value
100'000 rows in category 2 with a number between 0 and 4 as the value
20 rows share a common "uri" between the two datasets (category 1 / 2)
I always run an ANALYZE TABLE before querying.
Here is the explain output of the query I run:
MariaDB [mydb]> EXPLAIN EXTENDED
-> SELECT d2.uri, d2.value
-> FROM data as d1
-> INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2
-> WHERE d1.category = 1 and d1.value = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| 1 | SIMPLE | d1 | ref | PRIMARY,cvu,cu | cu | 1 | const | 92964 | 100.00 | Using where |
| 1 | SIMPLE | d2 | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768 | mydb.d1.uri,const | 1 | 100.00 | |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value = 'foo';
+-----------+-------+
| uri | value |
+-----------+-------+
| uri100000 | 0 |
| uri99981 | 1 |
| uri99982 | 2 |
| uri99983 | 3 |
| uri99984 | 4 |
| uri99985 | 0 |
| uri99986 | 1 |
| uri99987 | 2 |
| uri99988 | 3 |
| uri99989 | 4 |
| uri99990 | 0 |
| uri99991 | 1 |
| uri99992 | 2 |
| uri99993 | 3 |
| uri99994 | 4 |
| uri99995 | 0 |
| uri99996 | 1 |
| uri99997 | 2 |
| uri99998 | 3 |
| uri99999 | 4 |
+-----------+-------+
20 rows in set (0.35 sec)
This query returns 20 rows in ~350ms.
It seems quite slow to me.
Is there a way to improve the performance of such a query? Any advice?

Can you try the following query?
SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
FROM data as dd
GROUP by 1
having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;
I cannot repeat your test at the moment, but my hope is that having to scan the table just once compensates for the use of the aggregate functions.
Edited
Created a test environment and tested some hypotheses.
As of today, the best performance (for 1 million rows) has been obtained by:
1 - Adding an index on the uri column (see the ALTER TABLE sketch after the query)
2 - Using the following query
select d2.uri, d2.value
FROM data as d2
where exists (select 1
from data d1
where d1.uri = d2.uri
AND d1.category = 1
and d1.value='foo')
and d2.category=2
and d2.uri in (select uri from data group by 1 having count(*) > 1);
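For reference, point 1 corresponds to something like this (the index name is just an example; the data2 table further below uses the same KEY u (uri)):
ALTER TABLE data ADD KEY u (uri);  -- plain secondary index on uri alone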
The ironic thing is that in my first proposal I tried to minimize accesses to the table, and now I'm proposing three of them.
Edited: 30/10
Ok, so I've done some other experiments and I would like to summarize the outcomes.
First, I'd like to expand a bit on Aruna's answer:
What I found interesting in the OP's question is that it is an exception to a classic "rule of thumb" in database optimization: if the number of desired results is very small compared to the size of the tables involved, it should be possible, with the correct indexes, to get very good performance.
Why can't we simply add a "magic index" to get our 20 rows? Because we don't have any clear "attack vector": there is no selective criterion we can apply to a record to significantly reduce the number of target rows.
Think about it: the fact that the value must be "foo" only removes 50% of the table from the equation. The category is not selective at all either: the only interesting thing is that, for 20 uris, records exist with both category 1 and category 2.
But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there is no way an index (not even Oracle's function-based indexes) can speed up a condition that depends on information from multiple rows.
The conclusion might be: if this kind of query is what you need, you should revise your data model. For example, if you have a small, finite number of categories (let's say three), your table might be rewritten as:
uri, value_category1, value_category2, value_category3
The query would then be (against that pivoted table, sketched below):
select uri, value_category2
from data_pivoted
where value_category1 = 'foo' and value_category2 is not null;
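A sketch of that revised model (the table name data_pivoted, the index and the population query are just examples):
CREATE TABLE data_pivoted (
  uri             varchar(255) NOT NULL,
  value_category1 varchar(255) DEFAULT NULL,
  value_category2 varchar(255) DEFAULT NULL,
  value_category3 varchar(255) DEFAULT NULL,
  PRIMARY KEY (uri),
  KEY vc12 (value_category1, value_category2)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- pivot the original layout into one row per uri
INSERT INTO data_pivoted
SELECT uri,
       MAX(CASE WHEN category = 1 THEN value END),
       MAX(CASE WHEN category = 2 THEN value END),
       MAX(CASE WHEN category = 3 THEN value END)
FROM data
GROUP BY uri;
The point is that the comparison now happens within a single row, so an ordinary index on (value_category1, value_category2) can serve it.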
Anyway, let's go back to the original question.
I've created a slightly more efficient test data generator (http://pastebin.com/DP8Uaj2t).
I've used this table:
use mydb;
DROP TABLE IF EXISTS data2;
CREATE TABLE data2
(
uri varchar(255) NOT NULL,
category tinyint(4) NOT NULL,
value varchar(255) NOT NULL,
PRIMARY KEY (uri,category),
KEY cvu (category,value,uri),
KEY ucv (uri,category,value),
KEY u (uri),
KEY cu (category,uri)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The outcome is:
+--------------------------+----------+----------+----------+
| query_descr | num_rows | num | num_test |
+--------------------------+----------+----------+----------+
| exists_plus_perimeter | 10000 | 0.0000 | 5 |
| exists_plus_perimeter | 50000 | 0.0000 | 5 |
| exists_plus_perimeter | 100000 | 0.0000 | 5 |
| exists_plus_perimeter | 500000 | 2.0000 | 5 |
| exists_plus_perimeter | 1000000 | 4.8000 | 5 |
| exists_plus_perimeter | 5000000 | 26.7500 | 8 |
| max_based | 10000 | 0.0000 | 5 |
| max_based | 50000 | 0.0000 | 5 |
| max_based | 100000 | 0.0000 | 5 |
| max_based | 500000 | 3.2000 | 5 |
| max_based | 1000000 | 7.0000 | 5 |
| max_based | 5000000 | 49.5000 | 8 |
| max_based_with_ucv | 10000 | 0.0000 | 5 |
| max_based_with_ucv | 50000 | 0.0000 | 5 |
| max_based_with_ucv | 100000 | 0.0000 | 5 |
| max_based_with_ucv | 500000 | 2.6000 | 5 |
| max_based_with_ucv | 1000000 | 7.0000 | 5 |
| max_based_with_ucv | 5000000 | 36.3750 | 8 |
| standard_join | 10000 | 0.0000 | 5 |
| standard_join | 50000 | 0.4000 | 5 |
| standard_join | 100000 | 2.4000 | 5 |
| standard_join | 500000 | 13.4000 | 5 |
| standard_join | 1000000 | 33.2000 | 5 |
| standard_join | 5000000 | 205.2500 | 8 |
| standard_join_plus_perim | 5000000 | 155.0000 | 2 |
+--------------------------+----------+----------+----------+
The queries used are:
- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join.sql
- query_standard_join_plus_perim.sql
The best query is still the "query_exists_plus_perimeter" that I proposed after creating the first test environment.

It is mainly due to the number of rows analysed. Even though your tables are indexed, the main decision-making condition "WHERE d1.category = 1 and d1.value = 'foo'" matches a huge number of rows:
+------+-------------+-------+-.....-+-------+----------+-------------+
| id | select_type | table | | rows | filtered | Extra |
+------+-------------+-------+-.....-+-------+----------+-------------+
| 1 | SIMPLE | d1 | ..... | 92964 | 100.00 | Using where |
For each and every matching row it has to read the table again, this time looking for category 2. Since it reads by primary key it can fetch the matching row directly.
On your original table, check the cardinality of the combination of category and value. If it is closer to unique, you can add an index on (category, value) and that should improve performance. If it is like the example given, you may not get any performance improvement.
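A quick way to inspect that cardinality (a sketch against the simplified table from the question):
-- rows per (category, value) combination; many rows per combination
-- means an index on (category, value) will not be selective
SELECT category, value, COUNT(*) AS cnt
FROM data
GROUP BY category, value
ORDER BY cnt DESC;
SHOW INDEX FROM data;  -- the Cardinality column gives the optimizer's estimate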

Related

How to perform a sum for all previous records

I've been trying to implement the solution here, with the added flavour of updating existing records. As an MRE I'm looking to populate the sum_date_diff column in a table with the sum of all the differences between the current row's date and the date of every previous row where the current row's p1_id matches the previous row's p1_id or p2_id. I have already filled out the expected result below:
+-----+------------+-------+-------+---------------+
| id_ | date_time | p1_id | p2_id | sum_date_diff |
+-----+------------+-------+-------+---------------+
| 1 | 2000-01-01 | 1 | 2 | Null |
| 2 | 2000-01-02 | 2 | 4 | 1 |
| 3 | 2000-01-04 | 1 | 3 | 3 |
| 4 | 2000-01-07 | 2 | 5 | 11 |
| 5 | 2000-01-15 | 2 | 3 | 35 |
| 6 | 2000-01-20 | 1 | 3 | 35 |
| 7 | 2000-01-31 | 1 | 3 | 68 |
+-----+------------+-------+-------+---------------+
My query so far looks like:
UPDATE test.sum_date_diff AS sdd0
JOIN
(SELECT
id_,
SUM(DATEDIFF(sdd1.date_time, sq.date_time)) AS sum_date_diff
FROM
test.sum_date_diff AS sdd1
LEFT OUTER JOIN (SELECT
sdd2.date_time AS date_time, sdd2.p1_id AS player_id
FROM
test.sum_date_diff AS sdd2 UNION ALL SELECT
sdd3.date_time AS date_time, sdd3.p2_id AS player_id
FROM
test.sum_date_diff AS sdd3) AS sq ON sq.date_time < sdd1.date_time
AND sq.player_id = sdd1.p1_id
GROUP BY sdd1.id_) AS master_sq ON master_sq.id_ = sdd0.id_
SET
sdd0.sum_date_diff = master_sq.sum_date_diff
This works as shown here.
However, on a table of 1.5m records the query has been hanging for the last hour. Even when I add a WHERE clause at the bottom to restrict the update to a single record, it hangs for 5+ minutes.
Here is the EXPLAIN statement for the query on the full table:
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
| 1 | UPDATE | sum_date_diff | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 4 | const | 10 | 100 | NULL |
| 2 | DERIVED | sum_date_diff | NULL | index | PRIMARY,ix__match_oc_history__date_time,ix__match_oc_history__p1_id,ix__match_oc_history__p2_id,ix__match_oc_history__date_time_players | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index; Using temporary |
| 2 | DERIVED | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 2968576 | 100 | Using where; Using join buffer (hash join) |
| 3 | DERIVED | sum_date_diff | NULL | index | NULL | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index |
| 4 | UNION | sum_date_diff | NULL | index | NULL | ix__match_oc_history__date_time_players | 14 | NULL | 1484288 | 100 | Using index |
+----+-------------+---------------+------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+-------+---------+----------+--------------------------------------------+
Here is the CREATE TABLE statement:
CREATE TABLE `sum_date_diff` (
`id_` int NOT NULL AUTO_INCREMENT,
`date_time` datetime DEFAULT NULL,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`sum_date_diff` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__sum_date_diff__date_time` (`date_time`),
KEY `ix__sum_date_diff__p1_id` (`p1_id`),
KEY `ix__sum_date_diff__p2_id` (`p2_id`),
KEY `ix__sum_date_diff__date_time_players` (`date_time`,`p1_id`,`p2_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1822120 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
MySQL version is 8.0.26, running on a 2016 MacBook Pro with macOS Monterey and 16 GB RAM.
After reading around about boosting the RAM available to MySQL I've added the following to the standard my.cnf file:
innodb_buffer_pool_size = 8G
tmp_table_size=2G
max_heap_table_size=2G
I'm wondering if:
I've done something wrong
This is just a very slow task no matter what I do
There is a faster method
I'm hoping someone could enlighten me!
While it is possible to do calculations like this in SQL, it is messy. If the number of rows is not in the millions, I would fetch the necessary columns into my application and do the arithmetic there. (Loops are easier and faster in PHP/Java/etc than in SQL.)
LEAD() and LAG() are possible, but they are not optimized well (or so my experience suggests). In an app language, it is easy and efficient to look up things in arrays.
The SELECT can (easily and efficiently) do any filtering and sorting so that the app only receives the necessary data.
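If the computation must stay in SQL, a window-function rewrite is also possible on MySQL 8.0. This is only a sketch, not the app-side approach recommended above, and it assumes date_time values are unique and that p1_id and p2_id differ within a row:
-- sum(DATEDIFF(d, d_i)) over previous appearances equals
-- (number of previous appearances) * TO_DAYS(d) - sum of TO_DAYS(d_i),
-- so one cumulative window per player is enough.
WITH appearances AS (
  SELECT date_time, p1_id AS player_id FROM sum_date_diff
  UNION ALL
  SELECT date_time, p2_id AS player_id FROM sum_date_diff
),
running AS (
  SELECT date_time, player_id,
         COUNT(*) OVER w                AS cnt_upto,   -- includes the current appearance
         SUM(TO_DAYS(date_time)) OVER w AS days_upto   -- includes the current appearance
  FROM appearances
  WINDOW w AS (PARTITION BY player_id ORDER BY date_time)
)
SELECT s.id_,
       CASE WHEN r.cnt_upto = 1 THEN NULL              -- no previous appearance
            ELSE (r.cnt_upto - 1) * TO_DAYS(s.date_time)
                 - (r.days_upto - TO_DAYS(s.date_time))
       END AS sum_date_diff
FROM sum_date_diff AS s
JOIN running AS r
  ON r.player_id = s.p1_id AND r.date_time = s.date_time;
The result can then be joined back into an UPDATE in the same way as the original derived table.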

Strange performance degradation when using a subquery

I have a nano AWS server running MySQL 5.5 for testing purposes. So, keep in mind that the server has limited resources (RAM, CPU, ...).
I have a table called "gpslocations". There is a primary index on its primary key "GPSLocationID". There is another secondary index on one of its fields "userID". The table has 6583 records.
When I run this query:
select * from gpslocations where GPSLocationID in (select max(GPSLocationID) from gpslocations where userID in (1,9) group by userID);
I get two rows and it takes a lot of time:
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
| GPSLocationID | lastUpdate | latitude | longitude | phoneNumber | userID | sessionID | speed | direction | distance | gpsTime | locationMethod | accuracy | extraInfo | eventType |
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
| 4107 | 2018-09-25 16:38:44 | 58.7641435 | 7.4868510 | e5d6fdff-9afe-44bb-a53a-3b454b12c9c6 | 9 | 77385f89-6b72-4b9e-b937-d2927959e0bd | 0 | 0 | 2.9 | 2018-09-25 18:38:43 | fused | 455 | 0 | android |
| 9822 | 2018-10-22 10:29:43 | 58.7794353 | 7.1952995 | 5240853e-2c36-4563-9dc3-238039de411e | 1 | 1fcad5af-c6ef-4bda-8fb2-d6e5688cf08a | 0 | 0 | 185.6 | 2018-10-22 12:29:41 | fused | 129 | 0 | android |
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
2 rows in set (14.96 sec)
When I just execute the inner select:
select max(GPSLocationID) from gpslocations where userID in (1,9) group by userID;
I get two values very fast:
+--------------------+
| max(GPSLocationID) |
+--------------------+
| 9822 |
| 4107 |
+--------------------+
2 rows in set (0.00 sec)
When I take these two values and write them manually in the outer select:
select * from gpslocations where GPSLocationID in (9822,4107);
I get exactly the same result as the first query but in no time!
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
| GPSLocationID | lastUpdate | latitude | longitude | phoneNumber | userID | sessionID | speed | direction | distance | gpsTime | locationMethod | accuracy | extraInfo | eventType |
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
| 4107 | 2018-09-25 16:38:44 | 58.7641435 | 7.4868510 | e5d6fdff-9afe-44bb-a53a-3b454b12c9c6 | 9 | 77385f89-6b72-4b9e-b937-d2927959e0bd | 0 | 0 | 2.9 | 2018-09-25 18:38:43 | fused | 455 | 0 | android |
| 9822 | 2018-10-22 10:29:43 | 58.7794353 | 7.1952995 | 5240853e-2c36-4563-9dc3-238039de411e | 1 | 1fcad5af-c6ef-4bda-8fb2-d6e5688cf08a | 0 | 0 | 185.6 | 2018-10-22 12:29:41 | fused | 129 | 0 | android |
+---------------+---------------------+------------+-----------+--------------------------------------+--------+--------------------------------------+-------+-----------+----------+---------------------+----------------+----------+-----------+-----------+
2 rows in set (0.00 sec)
Can anybody explain this huge performance degradation when the two simple and fast queries are combined in one?
EDIT
Here is the output of explain:
+----+--------------------+--------------+-------+----------------------+--------+---------+------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-------+----------------------+--------+---------+------+------+---------------------------------------+
| 1 | PRIMARY | gpslocations | ALL | NULL | NULL | NULL | NULL | 6648 | Using where |
| 2 | DEPENDENT SUBQUERY | gpslocations | range | userNameIndex,userID | userID | 5 | NULL | 11 | Using where; Using index for group-by |
+----+--------------------+--------------+-------+----------------------+--------+---------+------+------+---------------------------------------+
2 rows in set (0.00 sec)
IN (subquery) can have really bad optimization characteristics. In your version of MySQL, the subquery is probably being re-run once for every row in gpslocations. I think this performance problem was fixed in later versions.
I recommend using a correlated subquery instead:
select l.*
from gpslocations l
where l.GPSLocationID = (select max(l2.GPSLocationID)
from gpslocations l2
where l2.userID = l.userId
) and
l.userID in (1, 9);
And for this, you want an index on gpslocations(userID, GPSLocationID).
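For example (the index name is just a placeholder):
ALTER TABLE gpslocations ADD INDEX ix_user_gpsid (userID, GPSLocationID);  -- lets MAX(GPSLocationID) per user be read from the index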
Another alternative is the join approach:
select l.*
from gpslocations l join
     (select l2.userID, max(l2.GPSLocationID) as maxGPSLocationID
      from gpslocations l2
      where l2.userID in (1, 9)
      group by l2.userID
     ) lmax
  on lmax.userID = l.userID
 and lmax.maxGPSLocationID = l.GPSLocationID;

query extremely slow after migration to mysql 5.7

I have a MySQL database with InnoDB tables summing up to over 10 GB of data that I want to migrate from MySQL 5.5 to MySQL 5.7. And I have a query that looks a bit like:
SELECT dates.date, count(mySub2.myColumn1), sum(mySub2.myColumn2)
FROM (
SELECT date
FROM dates -- just a table containing all possible dates next 5 years
WHERE date BETWEEN '2016-06-01' AND '2016-09-03'
) AS dates
LEFT JOIN (
SELECT o.id, time_start, time_end
FROM order AS o
INNER JOIN order_items AS oi on oi.order_id = o.id
WHERE time_start BETWEEN '2016-06-01' AND '2016-09-03'
) AS mySub1 ON dates.date >= mySub1.time_start AND dates.date < mySub1.time_end
LEFT JOIN (
SELECT o.id, time_start, time_end
FROM order AS o
INNER JOIN order_items AS oi on oi.order_id = o.id
WHERE o.shop_id = 50 AND time_start BETWEEN '2016-06-01' AND '2016-09-03'
) AS mySub2 ON dates.date >= mySub2.time_start AND dates.date < mySub2.time_end
GROUP BY dates.date;
My problem is that this query is performing fast in MySQL 5.5 but extremely slow in MySQL 5.7.
In MySQL 5.5 it is taking over 1 second at first and < 0.001 seconds every recurring execution without restarting MySQL.
In MySQL 5.7 it is taking over 11.5 seconds at first and 1.4 seconds every recurring execution without restarting MySQL.
And the more LEFT JOINs I add to the query, the slower the query becomes in MySQL 5.7.
Both instances now run on the same machine, on the same hard drive and with the same my.ini settings. So it isn't hardware.
The execution plans do differ, though, and I don't know what to make of it.
This is the EXPLAIN EXTENDED on MySQL 5.5:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|------------|-------|---------------|-------------|---------|-----------|-------|----------|---------------------------------|
| 1 | PRIMARY | dates | ALL | | | | | 95 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived2> | ALL | | | | | 281 | 100.00 | '' |
| 1 | PRIMARY | <derived3> | ALL | | | | | 100 | 100.00 | '' |
| 3 | DERIVED | o | ref | xxxxxx | shop_id_fk | 4 | '' | 1736 | 100.00 | '' |
| 3 | DERIVED | oc | ref | xxxxx | order_id_fk | 4 | myDb.o.id | 1 | 100.00 | Using index |
| 2 | DERIVED | o | range | xxxx | date_start | 3 | | 17938 | 100.00 | Using where |
| 2 | DERIVED | oc | ref | xxx | order_id_fk | 4 | myDb.o.id | 1 | 100.00 | Using where |
This is the EXPLAIN EXTENDED on MySQL 5.7:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|-------|--------|---------------|-------------|---------|------------------|------|----------|----------------|
| 1 | SIMPLE | dates | ALL | | | | | 95 | 100.00 | Using filesort |
| 1 | SIMPLE | oi | ref | xxxxxx | order_id_fk | 4 | const | 228 | 100.00 | |
| 1 | SIMPLE | o | eq_ref | xxxxx | PRIMARY | 4 | myDb.oi.order_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | o | ref | xxxx | shop_id_fk | 4 | const | 65 | 100.00 | Using where |
| 1 | SIMPLE | oi | ref | xxx | order_id_fk | 4 | myDb.o.id | 1 | 100.00 | Using where |
I want to understand why the two MySQL versions treat the same query so differently, and how I can tweak MySQL 5.7 to be faster.
I'm not looking for help on rewriting the query to be faster, as that is something I am already doing on my own.
As can be read in the comments, wchiquito suggested looking at the optimizer_switch. There I found that the derived_merge switch could be set to off, to disable this new, and in this specific case undesired, behaviour.
set session optimizer_switch='derived_merge=off'; fixes the problem.
(This can also be done with set global ... or be put in the my.cnf / my.ini)
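For example, a quick way to check and change it (sketch):
SELECT @@optimizer_switch LIKE '%derived_merge=on%';   -- 1 if the flag is currently on
SET GLOBAL optimizer_switch = 'derived_merge=off';     -- affects new connections
-- or persist it in my.cnf / my.ini under [mysqld]:
-- optimizer_switch=derived_merge=off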
Building and maintaining a "Summary Table" would make this query run much faster than even 1 second.
Such a table would probably include shop_id, date, and some count.
More on summary tables.
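A possible shape for such a table; only a sketch, with names and the aggregate assumed from the query above:
CREATE TABLE order_daily_summary (
  shop_id     INT  NOT NULL,
  `date`      DATE NOT NULL,
  order_count INT  NOT NULL,
  PRIMARY KEY (shop_id, `date`)
) ENGINE=InnoDB;
-- refreshed periodically, e.g. nightly or after each batch of new orders
REPLACE INTO order_daily_summary (shop_id, `date`, order_count)
SELECT o.shop_id, d.date, COUNT(*)
FROM dates AS d
JOIN `order` AS o ON d.date >= o.time_start AND d.date < o.time_end
GROUP BY o.shop_id, d.date;
The reporting query then reads from order_daily_summary instead of re-joining order and order_items for every date.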
I too faced a slow query execution issue after migrating to MySQL 5.7, and in my case even setting the session optimizer_switch to 'derived_merge=off' didn't help.
Then, I followed this link: https://www.saotn.org/mysql-innodb-performance-improvement/ and the query's speed became normal.
To be specific, my change was just setting these four parameters in my.ini as described in the link:
innodb_buffer_pool_size
innodb_buffer_pool_instances
innodb_write_io_threads
innodb_read_io_threads

MySQL::Eliminating redundant elements from a table?

I have a table like this:
+-------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| v1 | int(11) | YES | MUL | NULL | |
| v2 | int(11) | YES | MUL | NULL | |
+-------+---------+------+-----+---------+-------+
There is a tremendous amount of duplication in this table. For instance, elements like the following:
+------+------+
| v1 | v2 |
+------+------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
| 1 | 6 |
| 1 | 7 |
| 1 | 8 |
| 1 | 9 |
| 2 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
+------+------+
The table is large, with 1,540,000 entries. To remove the redundant entries (i.e. to get a table having only (1,9) and no (9,1) entries), I was thinking of doing it with a subquery, but is there a better way of doing this?
Actually, Mark's approach will work too. I just figured out another way of doing it and was wondering if I could get some feedback on this as well. I tested it and it seems to work fast.
SELECT v1,v2 FROM table WHERE v1<v2 UNION SELECT v2,v1 FROM table WHERE v1>v2;
In the case where this is right, you can always create a new table:
CREATE TABLE newtable AS SELECT v1,v2 FROM edges WHERE v1<v2 UNION SELECT v2,v1 FROM edges WHERE v1>v2;
Warning: these commands modify your database. Make sure you have a backup copy so that you can restore the data again if necessary.
You can add the requirement that v1 must be less than v2, which will cut your storage requirement roughly in half. You can make sure all rows in the database satisfy this condition by reordering those that don't, and deleting one of the rows when you have both.
This query will insert any missing rows where you have for example (5, 1) but not (1, 5):
INSERT INTO table1
SELECT T1.v2, T1.v1
FROM table1 T1
LEFT JOIN table1 T2
ON T1.v1 = T2.v2 AND T1.v2 = T2.v1
WHERE T1.v1 > T1.v2 AND T2.v1 IS NULL
Then this query deletes the rows you don't want, like (5, 1):
DELETE FROM table1 WHERE v1 > v2;
You might need to change other places in your code that were programmed before this constraint was added.
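In modern MySQL (8.0.16+, where CHECK constraints are enforced) the requirement can be declared directly; a sketch, to be run after the clean-up above (the constraint name is just an example):
ALTER TABLE table1
  ADD CONSTRAINT chk_v1_lt_v2 CHECK (v1 < v2);  -- rejects rows where v1 >= v2
On older versions a BEFORE INSERT trigger would be needed to get the same guarantee.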

indexes and speeding up 'derived' queries

I've recently noticed that a query I have is running quite slowly, at almost 1 second per query.
The query looks like this
SELECT eventdate.id,
eventdate.eid,
eventdate.date,
eventdate.time,
eventdate.title,
eventdate.address,
eventdate.rank,
eventdate.city,
eventdate.state,
eventdate.name,
source.link,
type,
eventdate.img
FROM source
RIGHT OUTER JOIN
(
SELECT event.id,
event.date,
users.name,
users.rank,
users.eid,
event.address,
event.city,
event.state,
event.lat,
event.`long`,
GROUP_CONCAT(types.type SEPARATOR ' | ') AS type
FROM event FORCE INDEX (latlong_idx)
JOIN users ON event.uid = users.id
JOIN types ON users.tid=types.id
WHERE `long` BETWEEN -74.36829174058 AND -73.64365405942
AND lat BETWEEN 40.35195025942 AND 41.07658794058
AND event.date >= '2009-10-15'
GROUP BY event.id, event.date
ORDER BY event.date, users.rank DESC
LIMIT 0, 20
)eventdate
ON eventdate.uid = source.uid
AND eventdate.date = source.date;
and the explain is
+----+-------------+------------+--------+---------------+-------------+---------+------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+-------------+---------+------------------------------+-------+---------------------------------+
| 1 | PRIMARY | | ALL | NULL | NULL | NULL | NULL | 20 | |
| 1 | PRIMARY | source | ref | iddate_idx | iddate_idx | 7 | eventdate.id,eventdate.date | 156 | |
| 2 | DERIVED | event | ALL | latlong_idx | NULL | NULL | NULL | 19500 | Using temporary; Using filesort |
| 2 | DERIVED | types | ref | eid_idx | eid_idx | 4 | active.event.id | 10674 | Using index |
| 2 | DERIVED | users | eq_ref | id_idx | id_idx | 4 | active.types.id | 1 | Using where |
+----+-------------+------------+--------+---------------+-------------+---------+------------------------------+-------+---------------------------------+
I've tried using 'force index' on latlong, but that doesn't seem to speed things up at all.
Is it the derived table that is causing the slow responses? If so, is there a way to improve the performance of this?
--------EDIT-------------
I've attempted to improve the formatting to make it more readable as well.
I ran the same query, changing only the WHERE statement to:
WHERE users.id = (
SELECT users.id
FROM users
WHERE uidname = 'frankt1'
ORDER BY users.approved DESC , users.rank DESC
LIMIT 1 )
AND date >= '2009-10-15'
GROUP BY date
ORDER BY date)
That query runs in 0.006 seconds
The EXPLAIN looks like:
+----+-------------+------------+-------+---------------+---------------+---------+------------------------------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+---------------+---------+------------------------------+------+----------------+
| 1 | PRIMARY | | ALL | NULL | NULL | NULL | NULL | 42 | |
| 1 | PRIMARY | source | ref | iddate_idx | iddate_idx | 7 | eventdate.id,eventdate.date | 156 | |
| 2 | DERIVED | users | const | id_idx | id_idx | 4 | | 1 | |
| 2 | DERIVED | event | range | eiddate_idx | eiddate_idx | 7 | NULL | 24 | Using where |
| 2 | DERIVED | types | ref | eid_idx | eid_idx | 4 | active.event.bid | 3 | Using index |
| 3 | SUBQUERY | users | ALL | idname_idx | idname_idx | 767 | | 5 | Using filesort |
+----+-------------+------------+-------+---------------+---------------+---------+------------------------------+------+----------------+
The only way to clean up that mammoth SQL statement is to go back to the drawing board and carefully work through your database design and requirements. As soon as you start joining 6 tables and using an inner select you should expect incredible execution times.
As a start, ensure that all your id fields are indexed, but better to ensure that your design is valid. I don't know where to START looking at your SQL - even after I reformatted it for you.
Note that 'using indexes' means you need to issue the correct instructions when you CREATE or ALTER the tables you are using. See for instance MySql 5.0 create indexes
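For instance, you can list what is already indexed and add single-column indexes on the join keys used in the query (the index names below are just examples):
SHOW INDEX FROM event;
SHOW INDEX FROM users;
SHOW INDEX FROM types;
-- join columns from the derived query that typically deserve an index
ALTER TABLE event ADD INDEX uid_idx (uid);   -- joined to users.id
ALTER TABLE users ADD INDEX tid_idx (tid);   -- joined to types.id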