Lock wait timeout exceeded with one query running - mysql

I run the following query and it is the only query running on my large (2 vCPU, 7.5 GB RAM, 100GB SSD) RDS hosted database.
DELETE
FROM books
WHERE book_type = '/type/edition'
AND json LIKE '%"languages":%'
AND json NOT LIKE '%/eng%';
But I get the following error.
Error Code: 1205. Lock wait timeout exceeded; try restarting transaction
I increased the timeout to 1200 seconds using SET innodb_lock_wait_timeout = 1200;.
However, I get that same error. There are no other queries running on the database, it's newly created and not in production. Here is the result of show processlist:
+----+----------+-----------------------------------------------------------+-------------+---------+------+----------+------------------------------------------------------------------------------------------------------+
| Id | User     | Host                                                      | db          | Command | Time | State    | Info                                                                                                   |
+----+----------+-----------------------------------------------------------+-------------+---------+------+----------+------------------------------------------------------------------------------------------------------+
|  1 | rdsadmin | localhost:37959                                           |             | Sleep   |   10 |          |                                                                                                        |
|  5 | website  | host109-156-119-150.range109-156.btcentralplus.com:57923  | openlibrary | Sleep   |  606 |          |                                                                                                        |
|  6 | website  | host109-156-119-150.range109-156.btcentralplus.com:57924  | openlibrary | Query   |  599 | updating | DELETE FROM books WHERE book_type = '/type/edition' AND json LIKE '%"languages":%' AND json NOT LIKE |
|  8 | website  | host109-156-119-150.range109-156.btcentralplus.com:58021  | openlibrary | Sleep   |  145 |          |                                                                                                        |
|  9 | website  | host109-156-119-150.range109-156.btcentralplus.com:58022  | openlibrary | Query   |    0 | init     | show processlist                                                                                       |
+----+----------+-----------------------------------------------------------+-------------+---------+------+----------+------------------------------------------------------------------------------------------------------+
Here is the schema for this table.
CREATE TABLE `books` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`book_type` varchar(50) DEFAULT NULL,
`book_key` varchar(50) DEFAULT NULL,
`revision` tinyint(4) DEFAULT NULL,
`last_modified` varchar(50) DEFAULT NULL,
`json` text,
`date` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `book_type` (`book_type`),
KEY `book_key` (`book_key`),
KEY `revision` (`revision`)
) ENGINE=InnoDB AUTO_INCREMENT=97545025 DEFAULT CHARSET=utf8;
Please note, this table has about 100 million rows and contains 51GB of data.
Why am I getting a lock wait timeout? I thought this error could occur only when you are running multiple queries.

Well, as you have tried most ways, maybe you could try:
creating an index on book_type and json? Maybe this can help some.
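For example (a sketch, not from the original answer): since json is a TEXT column, MySQL can only index a prefix of it, so such an index could look like this, the prefix length of 100 being an arbitrary choice:
ALTER TABLE books ADD INDEX book_type_json (book_type, json(100));
Keep in mind that the leading-wildcard LIKE '%...%' predicates cannot use the json part of this index; only the book_type equality benefits from it.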
Otherwise try to split the delete operation,
e.g.
DELETE
FROM books
WHERE
id BETWEEN 1 AND 1000000
AND book_type = '/type/edition'
AND json LIKE '%"languages":%'
AND json NOT LIKE '%/eng%';
and then run it again for id BETWEEN 1000001 AND 2000000,
and so on.
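If you would rather automate the batches, here is a minimal sketch of a stored procedure that walks the id range (the procedure name and the batch size of 1,000,000 are illustrative):
DELIMITER $$
CREATE PROCEDURE delete_non_eng_editions()
BEGIN
  DECLARE max_id INT DEFAULT 0;
  DECLARE cur_id INT DEFAULT 0;
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM books;
  WHILE cur_id < max_id DO
    -- Each batch is its own transaction under autocommit,
    -- so row locks are held only for one batch at a time.
    DELETE FROM books
    WHERE id > cur_id AND id <= cur_id + 1000000
      AND book_type = '/type/edition'
      AND json LIKE '%"languages":%'
      AND json NOT LIKE '%/eng%';
    SET cur_id = cur_id + 1000000;
  END WHILE;
END$$
DELIMITER ;

CALL delete_non_eng_editions();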

I have done the same thing; see this very simple example:
DELETE FROM `database`.`table` WHERE ((action="limit") AND (info='login') AND (creation < DATE_SUB(NOW(), INTERVAL 10 MINUTE)))
When I also faced this error, the following fixed it.
You can set it to higher value in /etc/my.cnf permanently with this line
[mysqld]
innodb_lock_wait_timeout=120
and restart mysql. If you cannot restart mysql at this time, run this:
SET GLOBAL innodb_lock_wait_timeout = 120;
You could also just set it for the duration of your session
SET innodb_lock_wait_timeout = 120;
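Either way, you can verify which value your session will actually use before rerunning the statement:
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
Note that SET GLOBAL only affects connections opened after the change; an already-open session keeps its old session value.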

Related

mysql bulk table script execution

I have 120 tables in my project.
Now I have to migrate from MSSQL to MySQL.
I have already written the CREATE TABLE queries for all those tables, and they work.
My problem is that when I execute this script in MSSQL it completes within a second, but MySQL takes around 4 minutes to complete the same execution.
I want to improve the performance in MySQL, but I don't know how to do that. If anyone knows, please help me.
Thank you
Here is my sample table Script
MySQL
CREATE TABLE `rb_tbl_bak` (
`BakPathId` int NOT NULL AUTO_INCREMENT,
`BakPath` varchar(500) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`BakDate` datetime(3) DEFAULT NULL,
PRIMARY KEY (`BakPathId`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
MSSQL
--Create table and its columns
CREATE TABLE [dbo].[RB_Tbl_Bak] (
[BakPathId] [int] NOT NULL IDENTITY (1, 1),
[BakPath] [nvarchar](500) NULL,
[BakDate] [datetime] NULL);
GO
In this way, I have to do this for 120+ tables.
Oh well, in this case MySQL simply takes more time.
You can turn on profiling to get an idea of what takes so long. Here is an example using MySQL's CLI:
SET profiling = 1;
CREATE TABLE rb_tbl_back (id BIGINT UNSIGNED NOT NULL PRIMARY KEY);
SET profiling = 0;
You should get a response like this:
mysql> SHOW PROFILES;
+----------+------------+--------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                              |
+----------+------------+--------------------------------------------------------------------+
|        1 | 0.00913800 | CREATE TABLE rb_tbl_back (id BIGINT UNSIGNED NOT NULL PRIMARY KEY) |
+----------+------------+--------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW PROFILE FOR QUERY 1;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000071 |
| checking permissions | 0.000007 |
| Opening tables | 0.001698 |
| System lock | 0.000043 |
| creating table | 0.007260 |
| After create | 0.000004 |
| query end | 0.000004 |
| closing tables | 0.000015 |
| freeing items | 0.000031 |
| logging slow query | 0.000002 |
| cleaning up | 0.000003 |
+----------------------+----------+
11 rows in set (0.00 sec)
If you read the profiling documentation, there are other flags for SHOW PROFILE (CPU, BLOCK IO, etc.) that might help you understand the 'creating table' stage.
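For example, to see where CPU time and block IO go for the same query:
SHOW PROFILE CPU, BLOCK IO FOR QUERY 1;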
I got this answer from here

MySQL - Why is phpMyAdmin extremely slow with this query that is super fast in php/mysqli?

Edit: see also my answer below; the main difference is the LIMIT that phpMyAdmin adds, but I still don't understand why, and phpMyAdmin is still slower than mysqli.
On our database (+web) server we have a huge difference in performance when doing a query in phpMyAdmin vs doing it from php (mysqli) or directly on the MariaDB server: 60 seconds vs < 0.01 seconds!
This query functions quite well:
SELECT * FROM `TitelDaggegevens`
WHERE `datum` > '2020-03-31' AND datum < '2020-05-02' AND `fondskosten` IS NULL
ORDER BY isbn;
But, only in phpMyAdmin, the query becomes extremely slow when we change 2020-05-02 to 2020-05-01.
SHOW PROCESSLIST shows that the query is mainly in the Sending data state while running.
Following mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts I did the following query-series:
FLUSH STATUS;
SELECT-query above with one of the two dates;
SHOW SESSION STATUS LIKE 'Handler%';
The differences are fascinating. (I left out all the values equal to 0 in all cases). And consistent over time.
| how: | server/MySqli | phpMyAdmin
| date used in query: | 2020-05-02 | 2020-05-01 | 2020-05-02 | 2020-05-01
| records returned: | 6912 | 1 | 6912 | 1
| avg speed: | 0.27s | 0.00s | 0.52s | 60s (!)
| Variable_name | Value | Value | Value | Value
| Handler_icp_attempts | 213197 | 206286 | 213197 | 0
| Handler_icp_match | 6912 | 1 | 6912 | 0
| Handler_read_next | 6912 | 1 | 26651 | 11728896 (!)
| Handler_read_key | 1 | 1 | 151 | 4
| Handler_commit | 1 | 1 | 152 | 5
| Handler_read_first | 0 | 0 | 1 | 1
| Handler_read_rnd_next | 0 | 0 | 82 | 83
| Handler_read_rnd | 0 | 0 | 0 | 1
| Handler_tmp_write | 0 | 0 | 67 | 67
The EXPLAIN results are the same in all cases (phpmyadmin/mysqli/putty+mariadb).
[select_type] => SIMPLE
[table] => TitelDaggegevens
[type] => range
[possible_keys] => fondskosten,Datum+isbn+fondskosten
[key] => Datum+isbn+fondskosten
[key_len] => 3
[ref] =>
[Extra] => Using index condition; Using filesort
The only difference is in rows:
[rows] => 422796 for 2020-05-01
[rows] => 450432 for 2020-05-02
The question
Can you give us any directions on where we could look to solve this problem? We've worked for a week to optimize the MariaDB server (now optimal, except in phpMyAdmin) and narrowed some of our problems down to the example described here. We use phpMyAdmin a lot but have little to no experience with what is under the surface (like how it connects to the db).
About the indexing/ordering
In the slow query, if we change the ORDER BY from the indexed isbn field to a non-indexed field or leave out the ORDER BY altogether, everything has its normal lightning speed again. Changing the ORDER BY to the primary key id makes it slow too, but still 10x as fast as with the indexed isbn field.
We *know* we can solve this particular query by better indexing, which we already have ready to implement. However, we want to know what causes the different times within phpmyadmin vs mysqli/directly.
The details:
TitelDaggegevens contains < 11 million records, not even 3 GB, and has been OPTIMIZEd (rebuilt)
The table structure:
CREATE TABLE `TitelDaggegevens` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isbn` decimal(13,0) NOT NULL,
`datum` date NOT NULL,
`volgendeDatum` date DEFAULT NULL,
`prijs` decimal(8,2) DEFAULT NULL,
`prijsExclLaag` decimal(8,2) DEFAULT NULL,
`prijsExclHoog` decimal(8,2) DEFAULT NULL,
`stadiumDienstverlening` char(2) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`stadiumLevenscyclus` char(1) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`gewicht` double(7,3) DEFAULT NULL,
`volume` double(7,3) DEFAULT NULL,
`24uurs` tinyint(1) DEFAULT NULL,
`UitgeverCode` varchar(4) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`imprintId` int(11) DEFAULT NULL,
`distributievormId` tinyint(4) DEFAULT NULL,
`boeksoort` char(1) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`publishingStatus` tinyint(4) DEFAULT NULL,
`productAvailability` tinyint(4) DEFAULT NULL,
`voorraadAlles` mediumint(8) unsigned DEFAULT NULL,
`voorraadBeschikbaar` mediumint(8) unsigned DEFAULT NULL,
`voorraadGeblokkeerdEigenaar` smallint(5) unsigned DEFAULT NULL,
`voorraadGeblokkeerdCB` smallint(5) unsigned DEFAULT NULL,
`voorraadGereserveerd` smallint(5) unsigned DEFAULT NULL,
`fondskosten` enum('depot leverbaar','depot onleverbaar','POD','BOV','eBoek','geen') COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ISBN+datum` (`isbn`,`datum`) USING BTREE,
KEY `UitgeverCode` (`UitgeverCode`),
KEY `Imprint` (`imprintId`),
KEY `VolgendeDatum` (`volgendeDatum`),
KEY `Index op voorraad om maxima snel te vinden` (`isbn`,`voorraadAlles`) USING BTREE,
KEY `fondskosten` (`fondskosten`),
KEY `Datum+isbn+fondskosten` (`datum`,`isbn`,`fondskosten`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=16519430 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
Configuration of our virtual web+database+mail server:
MariaDB 10.4
InnoDB
CentOs7
phpMyAdmin 4.9.5
php 5.6
Apache
Some important mariadb configuration parameters that we changed from what our virtual webserver had as default:
[mysqld]
innodb_buffer_pool_size=2G
innodb_buffer_pool_instances=4
innodb_flush_log_at_trx_commit=2
tmp_table_size=64M
max_heap_table_size=64M
join_buffer_size=4M
sort_buffer_size=8M
optimizer_search_depth=5
The biggest difference is, of course, that phpMyAdmin adds a LIMIT to the query. That is the main explanation. I can't believe that wasn't the first thing we tried; I am very embarrassed.
However, the speed difference between phpMyAdmin and mysqli is still big, and the results are still different (2020-05-01 on server or mysqli):
+----------------------------+----------+
| Variable_name | Value |
+----------------------------+----------+
| Handler_commit | 1 |
| Handler_read_first | 1 |
| Handler_read_next | 11733306 |
| rest | 0 |
+----------------------------+----------+
Speed with LIMIT and 2020-05-02: all around 0.17-0.2s
Speed with LIMIT and 2020-05-01:
php/mysqli: claims 3.5 secs, but the page loads for about 30 secs
putty/mariadb: also claims 3.5 secs, but shows results after about 30 secs
phpmyadmin: claimed and real time both about 60 secs
Also the EXPLAIN does change considerably with a LIMIT:
(with rows 1268 with datum<20200501 and 1351 with datum<20200502)
+------+-------------+------------------+-------+------------------------------------+------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------------+-------+------------------------------------+------------+---------+------+------+-------------+
| 1 | SIMPLE | TitelDaggegevens | index | fondskosten,Datum+isbn+fondskosten | ISBN+datum | 9 | NULL | 1351 | Using where |
+------+-------------+------------------+-------+------------------------------------+------------+---------+------+------+-------------+
Consider making optimizer_search_depth=16 rather than 5,
and rewriting the query as:
SELECT * FROM TitelDaggegevens
WHERE datum BETWEEN '2020-03-31' AND '2020-05-02' AND fondskosten IS NULL
ORDER BY isbn;
(Note that BETWEEN is inclusive on both ends, so this also matches the boundary dates, which the original > and < comparisons excluded.)
We've had a specialist look at it, in addition to all your tips.
It turned out after MANY tests that the LIMIT 0,25 that phpMyAdmin added was the ONLY thing that caused the extreme delay. The expert could find NO differences between mysqli/phpMyAdmin and executing it directly on the MariaDB server.
Sometimes a VERY small difference in a query (like adding a LIMIT to a query that returns only one record anyway) can make it take 100,000 times as long, because the engine sees another strategy as fit for that query and scans a whole index. That is standard behaviour.
We had already found an index that eliminated this specific problem, but now we are also assured that there is nothing wrong with our DB, something we were not sure of because it seemed such extreme behaviour. So: much ado about nothing.
HOWEVER, I learned a great deal from this experience, both from our expert and from this community. I learned about MySQL diagnostics, logging, how MariaDB handles queries... For every diagnosis that turned out not to be the problem, I learned things to avoid or to strive for in tables, indexes or queries.
THANK YOU ALL, especially @Rick James, @Wilson Hauck and @ExploitFate
(I'm rather late weighing in. Glad to see that you have "resolved" it.)
You found a strange one, and did a good job of investigating.
Is there a way to get EXPLAIN from phpmyadmin? If so, that might give another clue.
The Handler numbers strongly imply a different EXPLAIN was used.
Clearly phpmyadmin modifies the query (at least by adding the LIMIT). I wonder if it messed with the query accidentally. Did you have the Slowlog or the General log turned on at that time? Either should have the SQL as run.
Replacing the index on just (fondskosten) with INDEX(fondskosten, datum) should improve performance.
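A sketch of that suggestion, using the names from the CREATE TABLE above (one ALTER so the table is rebuilt only once):
ALTER TABLE TitelDaggegevens
  DROP INDEX fondskosten,
  ADD INDEX fondskosten (fondskosten, datum);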
("Sending data", as always, is useless information provided by the engine.)
Suggest filing a bug with mariadb.com.

MySQL optimize query for counting scheduled items over time periods

In a scheduling application I am working on, I am dealing with a fairly complex database schema in order to describe a series of kids assigned to groups in timeslots on certain dates. Given this schema, I want to query the database for the number of scheduled kids on a certain group, for a certain timeslot, over a certain range of dates.
DB Schema
Timeslot: A timeslot has a certain start and end time (e.g. 13:00 - 18:00). Time can vary in 15-minute steps. In our application we want to schedule a kid on a group for the duration of this timeslot.
Time slice: For every 15 minutes in a 24-hour period there is a time slice record (96 in total). 15 minutes is the smallest possible planning unit. A timeslot is assigned to each slice covered between its start and end time (for example, timeslot 13:00-18:00 will have a record pointing to time slices [13:00, 13:15, 13:30 ... 17:45]). This makes it possible to count how many kids are 'occupying' the same time slice at any given time and date.
Kid: A kid is simply the entity being scheduled
Group: A group is a representation of a physical location with a specific capacity
GroupAssignment: A group assignment is bound in time. Between date 1 and 2 it could be group A, between date 2 and 3 it could be group B.
Occupancy: The main scheduling record. This has a timeslot_id, kid_id, start and end date. Note: a kid is scheduled on the start day and every 7 days thereafter, up to the end date.
DB Schema SQL
The number of records can be roughly derived from the AUTO_INCREMENT value; where that is not present, I note the count manually.
CREATE TABLE `group_assignment_caches` (
`group_id` int(11) DEFAULT NULL,
`occupancy_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
KEY `index_group_assignment_caches_on_occupancy_id` (`occupancy_id`),
KEY `index_group_assignment_caches_on_group_id` (`group_id`),
KEY `index_group_assignment_caches_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `kids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`archived` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
CREATE TABLE `occupancies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kid_id` int(11) DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` date DEFAULT NULL,
`end` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_occupancies_on_kid_id` (`kid_id`),
KEY `index_occupancies_on_timeslot_id` (`timeslot_id`),
KEY `index_occupancies_on_start_and_end` (`start`,`end`)
) ENGINE=InnoDB AUTO_INCREMENT=2675 DEFAULT CHARSET=utf8;
CREATE TABLE `time_slices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_time_slices_on_start` (`start`)
) ENGINE=InnoDB AUTO_INCREMENT=97 DEFAULT CHARSET=latin1;
CREATE TABLE `timeslot_slices` (
`timeslot_id` int(11) DEFAULT NULL,
`time_slice_id` int(11) DEFAULT NULL,
KEY `index_timeslot_slices_on_timeslot_id` (`timeslot_id`),
KEY `index_timeslot_slices_on_time_slice_id` (`time_slice_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/* (~1500 records) */
CREATE TABLE `timeslots` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` time DEFAULT NULL,
`end` time DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=utf8;
Current solution
So far, I have designed the following query to tie it all together. While it does work, it scales very poorly. Running the query with 1 date, 1 timeslot and 1 group takes about 50ms, but with 100 dates this becomes 1000ms, and when you start adding groups and timeslots it quickly rises into multiple seconds. I've noticed that the runtime is highly dependent on the size of the timeslot: when a timeslot covers more time slices, the runtime escalates rapidly!
SELECT subq.date, subq.group_id, subq.timeslot_id, MAX(subq.spots) AS max_spots
FROM (
SELECT di.date,
ts.start,
gac.group_id AS group_id,
tss2.timeslot_id AS timeslot_id,
COUNT(*) AS spots
FROM date_intervals di,
timeslot_slices tss2,
occupancies o
JOIN timeslots t ON o.timeslot_id = t.id
JOIN group_assignment_caches gac ON o.id = gac.occupancy_id
JOIN timeslot_slices tss1 ON t.id = tss1.timeslot_id
JOIN time_slices ts ON tss1.time_slice_id = ts.id
JOIN kids k ON o.kid_id = k.id
WHERE di.date BETWEEN gac.start AND gac.end
AND di.date BETWEEN o.start AND o.end
AND MOD(DATEDIFF(di.date, o.start),7)=0
AND k.archived = 0
AND tss1.time_slice_id = tss2.time_slice_id
AND gac.group_id IN (3) AND tss2.timeslot_id IN (5)
GROUP BY ts.start, di.date, group_id, timeslot_id
) subq
GROUP BY subq.date, subq.group_id, subq.timeslot_id
Note that running the derived subquery separately takes the same amount of time. It yields one record with the number of occupancies for each time slice (15 min) for the given group in the given timeslot, which is great for debugging. Obviously, I am only interested in the maximum number of occupancies for the entire timeslot.
date_intervals is not described in the schema. It is a temporary table I fill using a REPEAT statement at the beginning of this procedure call. Its only column is date, and it holds 10-300 dates in most situations. The query should be able to handle this.
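For reference, a minimal sketch of how date_intervals might be created and filled inside a procedure (the procedure name and date bounds are illustrative; the original REPEAT statement was not shown):
DELIMITER $$
CREATE PROCEDURE fill_date_intervals(IN from_date DATE, IN to_date DATE)
BEGIN
  DECLARE d DATE DEFAULT from_date;
  DROP TEMPORARY TABLE IF EXISTS date_intervals;
  -- PRIMARY KEY on date matches the PRIMARY key the EXPLAIN below shows for di
  CREATE TEMPORARY TABLE date_intervals (`date` DATE NOT NULL, PRIMARY KEY (`date`));
  REPEAT
    INSERT INTO date_intervals (`date`) VALUES (d);
    SET d = DATE_ADD(d, INTERVAL 1 DAY);
  UNTIL d > to_date END REPEAT;
END$$
DELIMITER ;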
If I EXPLAIN this query, I get the following results. I am not really sure how to proceed from here. The first row, about the derived table, can be ignored, since executing the subquery on its own takes the same amount of time. The only other table not using an index is date_intervals di, which is a small temporary table with 122 records.
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5124 | Using temporary; Using filesort |
| 2 | DERIVED | tss2 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_timeslot_id | 5 | | 42 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | ts | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.tss2.time_slice_id | 1 | |
| 2 | DERIVED | tss1 | ref | index_timeslot_slices_on_timeslot_id,index_timeslot_slices_on_time_slice_id | index_timeslot_slices_on_time_slice_id | 5 | ookidoo.tss2.time_slice_id | 6 | Using where |
| 2 | DERIVED | o | ref | PRIMARY,index_occupancies_on_timeslot_id,index_occupancies_on_kid_id,index_occupancies_on_start_and_end | index_occupancies_on_timeslot_id | 5 | ookidoo.tss1.timeslot_id | 6 | Using where |
| 2 | DERIVED | k | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.kid_id | 1 | Using where |
| 2 | DERIVED | gac | ref | index_group_assignment_caches_on_occupancy_id,index_group_assignment_caches_on_start_and_end,index_group_assignment_caches_on_group_id | index_group_assignment_caches_on_occupancy_id | 5 | ookidoo.o.id | 1 | Using where |
| 2 | DERIVED | di | range | PRIMARY | PRIMARY | 3 | NULL | 1 | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | t | eq_ref | PRIMARY | PRIMARY | 4 | ookidoo.o.timeslot_id | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+---------+----------------------------+------+------------------------------------------------+
Current results
The above query yields the following results (122 records, abbreviated)
date group_id timeslot_id max_spots
+------------+----------+-------------+-----------+
| date | group_id | timeslot_id | max_spots |
+------------+----------+-------------+-----------+
| 2012-08-20 | 3 | 5 | 12 |
| 2012-08-27 | 3 | 5 | 12 |
| 2012-09-03 | 3 | 5 | 12 |
| 2012-09-10 | 3 | 5 | 12 |
+------------+----------+-------------+-----------+
| 2014-11-24 | 3 | 5 | 15 |
| 2014-12-01 | 3 | 5 | 15 |
| 2014-12-08 | 3 | 5 | 15 |
| 2014-12-15 | 3 | 5 | 15 |
+------------+----------+-------------+-----------+
Wrapping up
I would like to know a way to restructure either my query or my database schema in order to make querying this information less time-consuming. I can't imagine this being impossible, considering there are relatively few records in this database (10-1000s for most tables).
Any sufficiently complex problem can bring a computer to its knees. Actually, it's easy to create a complex problem, and difficult to make a complex problem easy.
Your single query is very complex. It goes over the entire database. Is that necessary? What happens if, for instance, you restrict it to one date? Does it scale better?
Using just a single query for a complex task is often very efficient, but not always, as you've found out. I often find that the only way to break the exponential time needed to execute the task is to split it up into multiple steps. One date at a time, for instance. Perhaps you don't always need them all?
In some of those cases I use an intermediate SQLite database that resides in memory. Operations on a small (!) temporary database in memory are very fast. It works like this:
$SQLiteDB = new PDO("sqlite::memory:");                              // open an in-memory SQLite database
$SQLiteDB->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);  // throw exceptions on errors
$SQL = "<any valid sqlite query>";                                   // e.g. CREATE TABLE / INSERT / SELECT
$SQLiteDB->query($SQL);
First check that you have the sqlite PHP module installed. Read the manual:
http://www.sqlite.org
When using this you first create tables in your new database and then you populate them with the needed data. You can use prepared statements if you have to copy multiple rows.
The tricky bit is taking apart your single complex query. How you would do that depends on the exact question you want to answer. The art is to limit the amount of data you have to work with. Don't copy the whole database, but make an informed selection.
A big advantage of taking multiple smaller steps is that your code may become much more readable, and understandable. I wouldn't want to be the guy who has to change your SQL query ten years from now because you went on to other things.
I have found a solution which is acceptable for my particular use case.
I have created an intermediate or 'cache' table with the following structure:
CREATE TABLE `occupancy_caches` (
`occupancy_id` int(11) DEFAULT NULL,
`kid_id` int(11) DEFAULT NULL,
`group_id` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`date` date DEFAULT NULL,
`timeslot_id` int(11) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
KEY `index_occupancy_caches_on_date_and_client_id` (`date`,`client_id`),
KEY `index_occupancy_caches_on_date_and_group_id` (`date`,`group_id`),
KEY `index_occupancy_caches_on_occupancy_id` (`occupancy_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This allowed me to completely eliminate the group_assignment_caches table, and I no longer have to search for dates using calculated columns (MOD(DATEDIFF...)). Also, I now need only a single join on the time slices instead of two.
The downside, however, is that I now have to create an occupancy_caches record for every week covered by the original occupancies record. In most cases these occupancies describe a 4-year period, which means that for every occupancies record I now have to create 400 (!) records... Since the number of records will only grow linearly, correct usage of indexes should keep this from spinning out of control as the system grows.
Time will tell, though...
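For reference, a rough sketch (hypothetical, not from the original post) of the shape of query the cache table enables; the real query still joins the time slices once, as noted above:
SELECT oc.date, oc.group_id, oc.timeslot_id, COUNT(*) AS spots
FROM occupancy_caches oc
JOIN kids k ON k.id = oc.kid_id AND k.archived = 0
WHERE oc.group_id IN (3)
  AND oc.timeslot_id IN (5)
  AND oc.date BETWEEN '2012-08-20' AND '2014-12-15'
GROUP BY oc.date, oc.group_id, oc.timeslot_id;
With the (date, group_id) index above, this reads only the matching cache rows instead of expanding every occupancy against every date.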

delete duplicate rows from large dB

This is a second post to my original question posted here.
My setup:
amazon RDS using MySQL Workbench with connection timeout set to max
I am trying to DELETE duplicate rows from my DB, which has close to 1 million rows.
The table looks like this; mytext is a MEDIUMTEXT blob, and id is AUTO_INCREMENT:
+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc | 123 |
| 2 | joe | min | abc | 123 |
| 3 | mar | kam | def | 789 |
| 4 | kel | smi | ghi | 456 |
+------------------------------+
I would like to end up with a table like this
+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc | 123 |
| 3 | mar | kam | def | 789 |
| 4 | kel | smi | ghi | 456 |
+------------------------------+
This solution started working, but after about 10,000 rows the process takes longer and eventually hangs.
I let this run for over 20 hours, set to 10 thousand rows per chunk with a WHERE condition (I thought deleting in chunks would be safer).
But even with the WHERE clause the system hangs, and then I have to reboot RDS to access the DB.
DELETE
FROM yourTable
WHERE id>40000
AND id<=50000
AND id NOT IN
(
SELECT MAXID FROM
(
SELECT MAX(id) as MAXID
FROM yourTable
GROUP BY mytext
) as temp_table
)
heres the create statement
CREATE TABLE `yourTable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fname` varchar(45) DEFAULT NULL,
`lname` varchar(45) DEFAULT NULL,
`mytext` mediumtext,
`morevar` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1$$
Question
Is this SQL command OK for handling large amounts of rows and for what I am trying to achieve? Or is there a better solution?
How long would it normally take to process 1 million rows?
Is there a setting like in php.ini inside amazon for large data set manipulation?
Or would it make more sense to create a new table and insert all rows excluding duplicates?
I really wouldn't use NOT IN.
I would ensure that there is an index on myText and id (since mytext is a MEDIUMTEXT, it can only be a prefix index; see the sketch below) and then try this...
DELETE
FROM
yourTable
WHERE
id > 40000
AND id <= 50000
AND EXISTS (SELECT *
FROM yourTable AS lookup
WHERE lookup.myText = yourTable.myText
AND lookup.id > yourTable.id
)
This way you only check the myText values that you are potentially deleting,
whereas your sub-query will return ids for myText values that don't even appear in the range you are checking.
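For reference, a sketch of the index this answer assumes; because mytext is MEDIUMTEXT, only a prefix can be indexed, and 255 here is an arbitrary choice:
ALTER TABLE yourTable ADD INDEX mytext_id_idx (mytext(255), id);
Rows whose first 255 characters collide are still compared on the full column by the EXISTS predicate; the prefix index only narrows the candidates.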

SELECTing non-indexed column increases 'sending data' 25x - why and how to improve?

Given this table on local MySQL instance 5.1 with query caching off:
show create table product_views\G
*************************** 1. row ***************************
Table: product_views
Create Table: CREATE TABLE `product_views` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`dateCreated` datetime NOT NULL,
`dateModified` datetime DEFAULT NULL,
`hibernateVersion` bigint(20) DEFAULT NULL,
`brandName` varchar(255) DEFAULT NULL,
`mfrModel` varchar(255) DEFAULT NULL,
`origin` varchar(255) NOT NULL,
`price` float DEFAULT NULL,
`productType` varchar(255) DEFAULT NULL,
`rebateDetailsViewed` tinyint(1) NOT NULL,
`rebateSearchZipCode` int(11) DEFAULT NULL,
`rebatesFoundAmount` float DEFAULT NULL,
`rebatesFoundCount` int(11) DEFAULT NULL,
`siteSKU` varchar(255) DEFAULT NULL,
`timestamp` datetime NOT NULL,
`uiContext` varchar(255) DEFAULT NULL,
`siteVisitId` bigint(20) NOT NULL,
`efficiencyLevel` varchar(255) DEFAULT NULL,
`siteName` varchar(255) DEFAULT NULL,
`clicks` varchar(1024) DEFAULT NULL,
`rebateFormDownloaded` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `siteVisitId` (`siteVisitId`,`siteSKU`),
KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
KEY `FIND_UNPROCESSED_IDX` (`siteSKU`,`siteVisitId`,`timestamp`),
CONSTRAINT `FK52C29B1E3CAB9CC4` FOREIGN KEY (`siteVisitId`) REFERENCES `site_visits` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=32909504 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
This query takes ~3s:
SELECT pv.id, pv.siteSKU
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
But this one (non-indexed column added to SELECT) takes ~65s:
SELECT pv.id, pv.siteSKU, pv.hibernateVersion
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
Nothing in 'where' or 'from' clauses is different. All the extra time is spent in 'sending data':
mysql> show profile for query 1;
+--------------------+-----------+
| Status | Duration |
+--------------------+-----------+
| starting | 0.000155 |
| Opening tables | 0.000029 |
| System lock | 0.000007 |
| Table lock | 0.000019 |
| init | 0.000072 |
| optimizing | 0.000032 |
| statistics | 0.000316 |
| preparing | 0.000034 |
| executing | 0.000002 |
| Sending data | 63.530402 |
| end | 0.000044 |
| query end | 0.000005 |
| freeing items | 0.000091 |
| logging slow query | 0.000002 |
| logging slow query | 0.000109 |
| cleaning up | 0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)
I understand that using a non-indexed column in the WHERE clause would slow things down, but why here? What can be done to improve the latter case, given that I will actually want to SELECT * from product_views?
EXPLAIN Output
explain extended select pv.id, pv.siteSKU from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra                    |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where              |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> explain extended select pv.id, pv.siteSKU, pv.hibernateVersion from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra       |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
UPDATE1: Splitting into 2 queries brings total time down to ~30s range
Not sure why, but splitting the latter query into the following two reduces latency from 65s to ~30s:
1) SELECT pv.id .... //from, where clauses same as above
2) SELECT * FROM product_views where id in (idList); //idList
UPDATE2: TABLE SIZE
table has on the order of 10M rows
query returns about 3k rows
When you select only indexed columns, MySQL reads only the index and does not need to read the table data. This, as far as I remember, is called an index-covered query. However, when there are columns that are not present in the used index, MySQL needs to open the table and read the data from it. This is the reason index-covered queries are much faster.
See Using Covering Indexes to Improve Query Performance.
As for improving it: how many rows are in the table, how much does the query return, what is your buffer pool size, how much RAM is available, etc.?
From what I have read about SHOW PROFILE, 'Sending data' is a portion of the execution process and has almost nothing to do with sending actual data to the client. You can take a look at this thread.
Also, the MySQL docs say about "Sending data":
The thread is reading and processing rows for a SELECT statement, and sending data to the client. Because operations occurring during this state tend to perform large amounts of disk access (reads), it is often the longest-running state over the lifetime of a given query.
In my opinion, MySQL would do better not to mix "reading and processing rows for a SELECT statement" and "sending data" into one state, especially a state called "Sending data", which causes a lot of confusion.
I don't know MySQL internals at all, but Darhazer's explanation looks like the winner to me. When the non-indexed field is added, the entire row must be retrieved, and your rows are very wide. I can't quite tell from the names how (if at all) the table is denormalized, but I suspect it is: siteName and siteSKU smell like they belong in a site lookup table with an FK, and rebatesFoundAmount and rebatesFoundCount sound like statistics that should come from a join to a separate product rebate table, etc.
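Building on that: since a covering index for SELECT * is impractical with rows this wide, one option is to express the OP's two-query split (UPDATE1) as a single statement, often called a deferred join. A sketch against the schema above (not tested on this data):
SELECT pv2.*
FROM product_views pv2
JOIN (
    -- inner query stays index-covered by FIND_UNPROCESSED_IDX and returns only ids
    SELECT pv.id
    FROM product_views pv
    JOIN site_visits sv ON pv.siteVisitId = sv.id
    WHERE pv.siteSKU = 'foo'
      AND sv.siteId = 'bar'
      AND sv.postProcessed = 1
      AND pv.timestamp >= '2011-05-19 00:00:00'
      AND pv.timestamp <  '2011-06-18 00:00:00'
) ids ON pv2.id = ids.id;
The wide rows are then fetched only for the ~3k matching ids, by primary key, instead of during the index scan.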