I have 12 fixed lookup tables (group, local, element, sub_element, service, ...), each with a different number of rows.
In every table the 'id_' column is the primary key (int); the other columns are varchar(20). The maximum number of rows in any of these tables is 300.
Each table was created in this way:
CREATE TABLE `group`
(
id_G int NOT NULL,
name_group varchar(20) NOT NULL,
PRIMARY KEY (id_G)
);
|........GROUP......| |.......LOCAL.......| |.......SERVICE.......|
| id_G | name_group | | id_L | name_local | | id_S | name_service |
+------+------------+ +------+------------+ +------+--------------+
| 1 | group1 | | 1 | local1 | | 1 | service1 |
| 2 | group2 | | 2 | local2 | | 2 | service2 |
And I have one table that combines all these tables depending on what the user selects.
The 'id_' values chosen by the user from the fixed tables are recorded into this table.
This table was created in this way:
CREATE TABLE event
(
id_E int NOT NULL,
event_name varchar(20) NOT NULL,
id_G int NOT NULL,
id_L int NOT NULL,
...
PRIMARY KEY (id_E)
);
The event table looks like this:
|....................EVENT.....................|
| id_E | event_name | id_G | id_L | ... |id_S |
+------+-------------+------+------+-----+-----+
| 1 | mater1 | 1 | 1 | ... | 3 |
| 2 | master2 | 2 | 2 | ... | 6 |
This table grows every day, and now it has thousands of rows.
Column id_E is the primary key (int), event_name is varchar(20).
This table has, in addition to the id_E and event_name columns, 12 other columns that came from the fixed tables.
Every time I need to retrieve information from the event table in a readable form, I have to do about 12 joins.
My query looks like this, where I need to retrieve all columns from the event table:
SELECT event_name, name_group, name_local ..., name_service
FROM event
INNER JOIN `group` ON event.id_G = `group`.id_G
INNER JOIN local ON event.id_L = local.id_L
...
INNER JOIN service ON event.id_S = service.id_S
WHERE event.id_S = 7 -- for example
This slows down my system's performance. Is there a way to reduce the number of joins? I've heard about using natural keys, but I think that is not a good idea in my case, with future maintenance in mind.
My queries are taking about 7 seconds and I need to reduce this time.
I changed the WHERE clause and it had no effect, so I am sure the problem is that the query has so many joins.
Could someone give some help? Thanks a lot.
MySQL has a great keyword, STRAIGHT_JOIN, which might be what you are looking for. First, I have to assume each of your lookup tables (id/description) already has an index on the ID column, since that is the primary key.
Your event table is the one you are querying as the primary basis of the details, joining to the lookups by their respective IDs. As long as the WHERE clause applicable to the EVENT table is optimized, such as on the ID you are looking for, it SHOULD be virtually instantaneous.
If it is not, it might be that MySQL is trying to think for you and is taking one of the secondary lookup tables as the primary basis of the query for whatever reason, such as a much lower record count. In that case, add the keyword and try it:
SELECT STRAIGHT_JOIN ... rest of your query
This tells MySQL to do the query in the order you gave it, thus the event table first and its WHERE clause on the ID. It should find that one row, then grab all the corresponding lookup descriptions from the other tables.
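Applied to the query from the question, only the keyword changes:
SELECT STRAIGHT_JOIN event_name, name_group, name_local ..., name_service
FROM event
INNER JOIN `group` ON event.id_G = `group`.id_G
INNER JOIN local ON event.id_L = local.id_L
...
INNER JOIN service ON event.id_S = service.id_S
WHERE event.id_S = 7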
Create indexes; specifically, use compound indexes. For instance, start by creating a compound index for event and group:
on the event table, create one on (event id, group id);
then, on the group table, create another one for the next relation (group id, local id);
on local do the same with service, and so on...
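A sketch of just the first suggestion, using the column names from the question (the index name is illustrative); the remaining relations would follow the same pattern:
ALTER TABLE event ADD INDEX idx_event_group (id_E, id_G);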
I have a query that takes about 18 seconds to finish:
THE QUERY:
SELECT YEAR(c.date), MONTH(c.date), p.district_id, COUNT(p.owner_id)
FROM commission c
INNER JOIN partner p ON c.customer_id = p.id
WHERE (c.date BETWEEN '2018-01-01' AND '2018-12-31')
AND (c.company_id = 90)
AND (c.source = 'ACTUAL')
AND (p.id IN (3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071,
3072, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3081, 3082, 3083, 3084,
3085, 3086, 3087, 3088, 3089, 3090, 3091, 3092, 3093, 3094, 3095, 3096,
3097, 3098, 3099, 3448, 3449, 3450, 3451, 3452, 3453, 3454, 3455, 3456,
3457, 3458, 3459, 3460, 3461, 3471, 3490, 3491, 6307, 6368, 6421))
GROUP BY YEAR(c.date), MONTH(c.date), p.district_id
The commission table has around 2.8 million records, of which 860,000+ belong to the current year, 2018. The partner table currently has 8,600+ records.
RESULT
| `YEAR(c.date)` | `MONTH(c.date)` | district_id | `COUNT(c.id)` |
|----------------|-----------------|-------------|---------------|
| 2018 | 1 | 1 | 19154 |
| 2018 | 1 | 5 | 9184 |
| 2018 | 1 | 6 | 2706 |
| 2018 | 1 | 12 | 36296 |
| 2018 | 1 | 15 | 13085 |
| 2018 | 2 | 1 | 21231 |
| 2018 | 2 | 5 | 10242 |
| ... | ... | ... | ... |
55 rows retrieved starting from 1 in 18 s 374 ms
(execution: 18 s 368 ms, fetching: 6 ms)
EXPLAIN:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|-------|------------|-------|------------------------------------------------------------------------------------------------------|----------------------|---------|-----------------|------|----------|----------------------------------------------|
| 1 | SIMPLE | p | null | range | PRIMARY | PRIMARY | 4 | | 57 | 100 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c | null | ref | UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73,IDX_6F7146F09395C3F3,IDX_6F7146F0979B1AD6,IDX_6F7146F0AA9E377A | IDX_6F7146F09395C3F3 | 5 | p.id | 6716 | 8.33 | Using where |
DDL:
create table if not exists commission (
id int auto_increment
primary key,
date date not null,
source enum('ACTUAL', 'EXPECTED') not null,
customer_id int null,
transaction_id varchar(255) not null,
company_id int null,
constraint UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73 unique (company_id, transaction_id, source),
constraint FK_6F7146F09395C3F3 foreign key (customer_id) references partner (id),
constraint FK_6F7146F0979B1AD6 foreign key (company_id) references companies (id)
) collate=utf8_unicode_ci;
create index IDX_6F7146F09395C3F3 on commission (customer_id);
create index IDX_6F7146F0979B1AD6 on commission (company_id);
create index IDX_6F7146F0AA9E377A on commission (date);
I noted that by removing the partner IN condition, MySQL takes only 3 s. I tried to replace it with something crazy like this:
AND (',3062,3063,3064,3065,3066,3067,3068,3069,3070,3071,3072,3073,3074,3075,3076,3077,3078,3079,3081,3082,3083,3084,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3448,3449,3450,3451,3452,3453,3454,3455,3456,3457,3458,3459,3460,3461,3471,3490,3491,6307,6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
and the result was about 5 s... great! But it's a hack.
WHY does this query take such a long time to execute when I use the IN clause? Workarounds, tips, links, etc. are welcome. Thanks!
MySQL can generally use only one index per table in a query. For this query you need a compound index covering the aspects of the search. Constant aspects of the WHERE clause should come before range aspects, like:
ALTER TABLE commission
DROP INDEX IDX_6F7146F0979B1AD6,
ADD INDEX IDX_6F7146F0979B1AD6 (company_id, source, date)
Here's what the Optimizer sees in your query.
Checking whether to use an index for the GROUP BY:
Functions (YEAR()) in the GROUP BY, so no.
Multiple tables (c and p) mentioned, so no.
For a JOIN, Optimizer will (almost always) start with one, then reach into the other. So, let's look at the two options:
If starting with p:
Assuming you have PRIMARY KEY(id), there is not much to think about. It will simply use that index.
For each row selected from p, it will then look into c, and any variation of this INDEX would be optimal.
c: INDEX(company_id, source, customer_id, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
If starting with c:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
-- slightly better:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date, -- last, since it is tested as a range
customer_id) -- really last -- added only to make it "covering".
The Optimizer will look at "statistics" to crudely decide which table to start with. So, add all the indexes I suggested.
A "covering" index is one that contains all the columns needed anywhere in the query. It is sometimes wise to extend a 'good' index with more columns to make it "covering".
But there is a monkey wrench in here. c.customer_id = p.id means that customer_id IN (...) effectively exists. But now there are two "range-like" constraints -- one is an IN, the other is a 'range'. In some newer versions, the Optimizer will happily jump around due to the IN and still be able to do "range" scans. So, I recommend this ordering:
Test(s) of column = constant
Test(s) with IN
One 'range' test (BETWEEN, >=, LIKE with trailing wildcard, etc)
Perhaps add more columns to make it "covering" -- but don't do this step if you end up with more than, say, 5 columns in the index.
Hence, for c, the following is optimal for the WHERE, and happens to be "covering".
INDEX(company_id, source, -- first, but in any order (all "=")
customer_id, -- "IN"
date) -- last, since it is tested as a range
p: (same as above)
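In concrete DDL, that recommendation for c could be applied like this (the index name is illustrative):
ALTER TABLE commission
    ADD INDEX idx_company_source_customer_date (company_id, source, customer_id, date);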
Since there was an IN or "range", there is no use seeing if the index can also handle the GROUP BY.
A note on COUNT(x) -- it checks that x is NOT NULL. It is usually just as correct to say COUNT(*), which counts the number of rows without any extra checking.
This is a non-starter since it hides the indexed column (id) in a function:
AND (',3062,3063,3064,3065,3066,...6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
With your LIKE hack you are tricking the optimizer into using a different plan (most probably using the IDX_6F7146F0AA9E377A index in the first place).
You should be able to see this in EXPLAIN.
I think the real issue in your case is the second line of the EXPLAIN: the server executes multiple functions (MONTH, YEAR) for 6716 rows and then tries to group all of these rows. During this time all 6716 rows have to be stored (in memory or on disk, depending on your server configuration).
SELECT COUNT(*) FROM commission WHERE (date BETWEEN '2018-01-01' AND '2018-12-31') AND company_id = 90 AND source = 'ACTUAL';
=> How many rows are we talking about?
If the number from the above query is much lower than 6716, I'd try to add a covering index on the columns customer_id, company_id, source and date. I'm not sure about the best order, as it depends on the data you have (check the cardinality of these columns). I'd start with the index (date, company_id, source, customer_id). Also, I'd add a unique index (id, district_id, owner_id) on partner.
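A sketch of those two suggestions (the index names are illustrative):
ALTER TABLE commission ADD INDEX idx_date_company_source_customer (date, company_id, source, customer_id);
ALTER TABLE partner ADD UNIQUE INDEX uq_id_district_owner (id, district_id, owner_id);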
It is also possible to add extra generated stored columns _year and _month (if your server is a bit old you can add normal columns and fill them in with a trigger) to get rid of the repeated function executions.
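A sketch of the generated-column variant, assuming MySQL 5.7+ (the _year/_month names are taken from the suggestion above):
ALTER TABLE commission
    ADD COLUMN _year SMALLINT GENERATED ALWAYS AS (YEAR(date)) STORED,
    ADD COLUMN _month TINYINT GENERATED ALWAYS AS (MONTH(date)) STORED;
-- The query can then GROUP BY _year, _month instead of YEAR(c.date), MONTH(c.date).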
I'm running into some trouble with a query. I'm trying to retrieve some data from a big database where 3 tables are involved.
These tables contain data about ads ("adds" in the schema): in a backend website, the administrator can manage which local ads he wants displayed, their position, etc. The data is organized in 3 tables: one of them contains all the data relevant to the ad itself (name, date of availability, date of expiration, etc.), and the other 2 tables contain extra info, just about views and clicks.
So I have only 15 ads, each with multiple clicks and multiple views.
The click and view tables each register a new row per event. So, when a click is registered, a new row is added where addid_click identifies the click and addid is the addid from the ads table. For instance, ad (1) has 2 views and 2 clicks while ad (2) has 1 view and 1 click.
My idea is to get, for each ad, how many clicks and views it had in total.
I have 3 tables like these:
adds_table                adds_clicks_table        adds_views_table
+-------+-----------+     +-------------+-------+  +-------------+-------+
| addid | name      |     | addid_click | addid |  | addid_views | addid |
+-------+-----------+     +-------------+-------+  +-------------+-------+
| 1     | add_name1 |     | 1           | 1     |  | 1           | 1     |
| 2     | add_name2 |     | 2           | 2     |  | 2           | 1     |
| 3     | add_name3 |     | 3           | 1     |  | 3           | 2     |
+-------+-----------+     +-------------+-------+  +-------------+-------+
CREATE TABLE `bwm_adds` (
`addid` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
...
PRIMARY KEY (`addid`)
) ENGINE=InnoDB AUTO_INCREMENT=16 DEFAULT CHARSET=utf8
CREATE TABLE `bwm_adds_clicks` (
`add_clickid` int(19) NOT NULL AUTO_INCREMENT,
`addid` int(11) NOT NULL,
...
PRIMARY KEY (`add_clickid`)
) ENGINE=InnoDB AUTO_INCREMENT=3374 DEFAULT CHARSET=utf8
CREATE TABLE `bwm_adds_views` (
`add_viewsid` int(19) NOT NULL AUTO_INCREMENT,
`addid` int(11) NOT NULL,
...
PRIMARY KEY (`add_viewsid`)
) ENGINE=InnoDB AUTO_INCREMENT=2078738 DEFAULT CHARSET=utf8
The result would be a single table where I retrieve, for each ad (addid), how many clicks and how many views it had.
I need a query that gets me something like this:
+-------+---------+-----------+
| addid | clicks | views |
+-------+---------+-----------+
| 1 | 123123 | 235457568 |
+-------+---------+-----------+
| 2 | 5124123 | 435345234 |
+-------+---------+-----------+
| 3 | 123541 | 453563623 |
+-------+---------+-----------+
I tried to execute a query, but it gets stuck, loading indefinitely... I'm pretty sure my query is at fault, because if I remove one of the counts it displays the data very fast.
SELECT a.addid, COUNT(ac.addid_clicks) as 'clicks', COUNT(av.addid_views) as 'views'
FROM `adds_table` a
LEFT JOIN `adds_clicks_table` ac ON a.addid = ac.addid_click
LEFT JOIN `adds_views_table` av ON ac.addid_click = av.addid_views
GROUP BY a.addid
MySQL keeps loading forever; any idea what I'm missing?
By the way, I found this post, which deals with almost the same problem I have; you can see my query is very similar to the one in the first answer, but I get the loading message all the time. No errors, just loading.
Edit: I misplaced the numbers and got confused. The tables are fixed now and I added some explanation about them.
Edit 2: I updated the post with the SHOW CREATE TABLE definitions.
Edit 3: Is there any way to optimise this query? It seems to retrieve the result I want, but the MySQL server cancels the query because it takes more than 30 seconds to execute.
SELECT a.addid,
(SELECT COUNT(addid) FROM add_clicks where addid = a.addid) as clicks,
(SELECT COUNT(addid) FROM add_views where addid = a.addid) as views
FROM adds a ORDER BY a.addid;
If those are really your tables (one column, plus an auto_inc), then there is no meaningful information justifying having 3 tables instead of 1:
CREATE TABLE `bwm_adds` (
`addid` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
clicks INT UNSIGNED NOT NULL,
views INT UNSIGNED NOT NULL,
PRIMARY KEY (`addid`)
) ENGINE=InnoDB AUTO_INCREMENT=16 DEFAULT CHARSET=utf8
and then UPDATE ... SET views = views + 1 (etc) rather than inserting into the other tables.
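A minimal sketch of that counter approach (assuming the clicks/views columns added above):
UPDATE bwm_adds SET clicks = clicks + 1 WHERE addid = 1;  -- on each click
UPDATE bwm_adds SET views  = views  + 1 WHERE addid = 1;  -- on each view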
If you have an old version,
SELECT a.addid,
( SELECT COUNT(*)
FROM `adds_clicks_table`
WHERE addid = a.addid
) AS 'clicks',
( SELECT COUNT(*)
FROM `adds_views_table`
WHERE addid = a.addid
) AS 'views'
FROM adds_table AS a
For 5.6 and later, this might be faster:
SELECT a.addid, c.clicks, v.views
FROM `adds_table` a
LEFT JOIN ( SELECT addid, COUNT(*) AS clicks FROM adds_clicks_table GROUP BY addid ) AS c USING(addid)
LEFT JOIN ( SELECT addid, COUNT(*) AS views  FROM adds_views_table  GROUP BY addid ) AS v USING(addid)
If you get NULLs but prefer 0s, then wrap the value in IFNULL(..., 0).
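For example, applied to the join form above:
SELECT a.addid,
       IFNULL(c.clicks, 0) AS clicks,
       IFNULL(v.views, 0)  AS views
FROM adds_table AS a
LEFT JOIN ( SELECT addid, COUNT(*) AS clicks FROM adds_clicks_table GROUP BY addid ) AS c USING(addid)
LEFT JOIN ( SELECT addid, COUNT(*) AS views  FROM adds_views_table  GROUP BY addid ) AS v USING(addid);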
If you need to discuss further, please provide SHOW CREATE TABLE and EXPLAIN SELECT ...
I ended up with a solution to my problem. The table I was trying to query was too big because of the badly engineered database: in adds_views_table a new row is added for each view, ending up with almost 3 million rows and a table that weighs almost 35% of the entire database (326 MB).
When phpMyAdmin tried to execute the query, it loaded forever and never showed a result because of a timeout limit applied to MySQL. Changing this value would help, but it wasn't viable for retrieving that data and displaying it on a website (it implies the website or data wouldn't load until the query has executed).
That problem was fixed by creating an index on addid in adds_table. Also, the query is faster if subqueries are used, for some reason. The query ended up like this:
SELECT a.addid,
(SELECT COUNT(addid) FROM adds_clicks_table WHERE addid = a.addid) AS 'clicks',
(SELECT COUNT(addid) FROM adds_views_table WHERE addid = a.addid) AS 'views'
FROM adds_table a
ORDER BY a.addid;
Thanks to @Rick James, who posted a similar query; I ended up modifying it to get the data I needed.
Forgive my horrible English.
If I compare
explain select * from Foo where find_in_set(id,'2,3');
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | User | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
with this one
explain select * from Foo where id in (2,3);
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | User | range | PRIMARY | PRIMARY | 8 | NULL | 2 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
It is apparent that FIND_IN_SET does not exploit the primary key.
I want to put a query such as the above into a stored procedure, with the comma-separated string as an argument.
Is there any way to make the query behave like the second version, in which the index is used, but without knowing the content of the id set at the time the query is written?
In reference to your comment:
@MarcB the database is normalized, the CSV string comes from the UI.
"Get me data for the following people: 101,202,303"
This answer has a narrow focus on just those numbers separated by a comma. Because, as it turns out, you were not even talking about FIND_IN_SET after all.
Yes, you can achieve what you want. You create a prepared statement that accepts a string as a parameter, as in this recent answer of mine. In that answer, look at the second block, which shows the CREATE PROCEDURE and its 2nd parameter that accepts a string like (1,2,3). I will get back to this point in a moment.
Not that you need to see it, @spraff, but others might. The mission is to get the type != ALL, and the possible_keys and key columns of EXPLAIN to not show null, as you showed in your second block. For general reading on the topic, see the article Understanding EXPLAIN's Output and the MySQL Manual page entitled EXPLAIN Extra Information.
Now, back to the (1,2,3) reference above. We know from your comment, and your second Explain output in your question that it hits the following desired conditions:
type = range (and in particular not ALL). See the docs above on this.
key is not null
These are precisely the conditions you have in your second Explain output, and the output that can be seen with the following query:
explain
select * from ratings where id in (2331425, 430364, 4557546, 2696638, 4510549, 362832, 2382514, 1424071, 4672814, 291859, 1540849, 2128670, 1320803, 218006, 1827619, 3784075, 4037520, 4135373, ... use your imagination ..., ..., 4369522, 3312835);
where I have 999 values in that IN clause list. That is a sample from this answer of mine, in Appendix D, which generates such a random CSV string, surrounded by open and close parentheses.
And note the following EXPLAIN output for that 999-element IN clause below:
Objective achieved. You achieve this with a stored proc similar to the one I mentioned before in this link using a PREPARED STATEMENT (and those things use concat() followed by an EXECUTE).
The index is used, and a table scan (meaning: bad) is not experienced. Further readings are The range Join Type, any reference you can find on MySQL's cost-based optimizer (CBO), and this answer from vladr (though dated), with an eye on the ANALYZE TABLE part, in particular after significant data changes. Note that ANALYZE can take a significant amount of time to run on ultra-huge datasets, sometimes many hours.
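A minimal sketch of such a procedure (the procedure and parameter names are illustrative, not from the linked answer); the caller passes a string like '(2,3)', and the statement is assembled with CONCAT and run via EXECUTE, so the optimizer sees an ordinary IN (...) list and can use the primary key:
DELIMITER //
CREATE PROCEDURE select_foo_by_ids(IN p_ids VARCHAR(2000))
BEGIN
    -- p_ids is expected to arrive as '(2,3,5,...)', parentheses included
    SET @sql = CONCAT('SELECT * FROM Foo WHERE id IN ', p_ids);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END //
DELIMITER ;

CALL select_foo_by_ids('(2,3)');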
SQL Injection Attacks:
Strings passed to stored procedures are an attack vector for SQL injection attacks. Precautions must be in place to prevent them when using user-supplied data. If your routine is applied against your own ids generated by your system, then you are safe. Note, however, that 2nd-level SQL injection attacks occur when data was put in place by routines that did not sanitize it in a prior insert or update: attacks put in place earlier via data and used later (a sort of time bomb).
So this answer is finished, for the most part.
Below is a view of the same table, with a minor modification, to show what a dreaded table scan would look like in the prior query (but against a non-indexed column called thing).
Take a look at our current table definition:
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
select min(id), max(id),count(*) as theCount from ratings;
+---------+---------+----------+
| min(id) | max(id) | theCount |
+---------+---------+----------+
| 1 | 5046213 | 4718592 |
+---------+---------+----------+
Note that the column thing was a nullable int column before.
update ratings set thing=id where id<1000000;
update ratings set thing=id where id>=1000000 and id<2000000;
update ratings set thing=id where id>=2000000 and id<3000000;
update ratings set thing=id where id>=3000000 and id<4000000;
update ratings set thing=id where id>=4000000 and id<5100000;
select count(*) from ratings where thing!=id;
-- 0 rows
ALTER TABLE ratings MODIFY COLUMN thing int not null;
-- current table definition (after above ALTER):
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
And then the Explain that is a Tablescan (against column thing):
You can use the following technique to make use of the primary key index.
Prerequisites:
You know the maximum number of items in the comma-separated string, and it is not large.
Description:
we convert the comma-separated string into a derived table of positions
then inner join to that derived table
select @ids:='1,2,3,5,11,4', @maxCnt:=15;
SELECT *
FROM foo
INNER JOIN (
SELECT * FROM (SELECT @n:=@n+1 AS n FROM foo INNER JOIN (SELECT @n:=0) AS _a) AS _a WHERE _a.n <= @maxCnt
) AS k ON k.n <= LENGTH(@ids) - LENGTH(replace(@ids, ',','')) + 1
AND id = SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
This is the trick that extracts the nth value in a comma-separated list:
SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
Notes: @ids can be anything, including a column from another table or from the same table.
The schema
I have a MySQL database with one large table (5 million rows say). This table has several fields for actual data, an optional comment field, and fields to record when the row was first added and when the data is deleted. To simplify to one "data" column, it looks a bit like this:
+----+------+---------+---------+----------+
| id | data | comment | created | deleted |
+----+------+---------+---------+----------+
| 1 | val1 | NULL | 1 | 2 |
| 2 | val2 | nice | 1 | NULL |
| 3 | val3 | NULL | 2 | NULL |
| 4 | val4 | NULL | 2 | 3 |
| 5 | val5 | NULL | 3 | NULL |
This schema allows us to look at any past version of the data thanks to the created and deleted fields e.g.
SET @version=1;
SELECT data, comment FROM MyTable
WHERE created <= @version AND
(deleted IS NULL OR deleted > @version);
+------+---------+
| data | comment |
+------+---------+
| val1 | NULL |
| val2 | nice |
The current version of the data can be fetched more simply:
SELECT data, comment FROM MyTable WHERE deleted IS NULL;
+------+---------+
| data | comment |
+------+---------+
| val2 | nice |
| val3 | NULL |
| val5 | NULL |
DDL:
CREATE TABLE `MyTable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(32) NOT NULL,
`comment` varchar(32) DEFAULT NULL,
`created` int(11) NOT NULL,
`deleted` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `data` (`data`,`comment`)
) ENGINE=InnoDB;
Updating
Periodically a new set of data and comments arrives. This can be fairly large, half a million rows say. I need to update MyTable so that this new data set is stored in it. This means:
"Deleting" old rows. Note the "scare quotes" - we don't actually delete rows from MyTable. We have to set the deleted field to the new version N. This has to be done for all rows in MyTable that are in the previous version N-1, but are not in the new set.
Inserting new rows. All rows that are in the new set and are not in version N-1 in MyTable must be added as new rows with the created field set to the new version N, and deleted as NULL.
Some rows in the new set may match existing rows in MyTable at version N-1 in which case there is nothing to do.
My current solution
Given that we have to "diff" two sets of data to work out the deletions, we can't just read over the new data and do insertions as appropriate. I can't think of a way to do the diff operation without dumping all the new data into a temporary table first. So my strategy goes like this:
-- temp table uses MyISAM for speed.
CREATE TEMPORARY TABLE tempUpdate (
`data` char(32) NOT NULL,
`comment` char(32) DEFAULT NULL,
PRIMARY KEY (`data`),
KEY (`data`, `comment`)
) ENGINE=MyISAM;
-- Bulk insert thousands of rows
INSERT INTO tempUpdate VALUES
('some new', NULL),
('other', 'comment'),
...
-- Start transaction for the update
BEGIN;
SET @newVersion = 5; -- Worked out out-of-band
-- Do the "deletions". The join selects all non-deleted rows in MyTable for
-- which the matching row in tempUpdate does not exist (tempUpdate.data is NULL)
UPDATE MyTable
LEFT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
SET MyTable.deleted = @newVersion
WHERE tempUpdate.data IS NULL AND
MyTable.deleted IS NULL;
-- Delete all rows from the tempUpdate table that match rows in the current
-- version (deleted is null) to leave just new rows.
DELETE tempUpdate.*
FROM MyTable RIGHT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
WHERE MyTable.id IS NOT NULL AND
MyTable.deleted IS NULL;
-- All rows left in tempUpdate are new so add them.
INSERT INTO MyTable (data, comment, created)
SELECT DISTINCT tempUpdate.data, tempUpdate.comment, @newVersion
FROM tempUpdate;
COMMIT;
DROP TEMPORARY TABLE IF EXISTS tempUpdate;
The question (at last)
I need to find the fastest way to do this update operation. I can't change the schema for MyTable, so any solution must work with that constraint. Can you think of a faster way to do the update operation, or suggest speed-ups to my existing method?
I have a Python script for testing the timings of different update strategies and checking their correctness over several versions. It's fairly long but I can edit into the question if people think it would be useful.
One speed-up is for the loading step: LOAD DATA INFILE.
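A sketch of that, replacing the big multi-row INSERT into tempUpdate (the file path and format are illustrative):
LOAD DATA LOCAL INFILE '/tmp/update_set.tsv'
INTO TABLE tempUpdate
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(data, comment);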
Insofar as I've experienced audit logging, you'll be better off with two tables, e.g.:
yourtable (id, col1, col2, version) -- pkey on id
yourtable_logs (id, col1, col2, version) -- pkey on (id, version)
Then add an update trigger on yourtable, which inserts the previous version into yourtable_logs.
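A minimal sketch of that trigger, assuming the two-table layout above (col1/col2 stand in for the real data columns):
DELIMITER //
CREATE TRIGGER yourtable_log_update
BEFORE UPDATE ON yourtable
FOR EACH ROW
BEGIN
    -- keep the previous version of the row before it is overwritten
    INSERT INTO yourtable_logs (id, col1, col2, version)
    VALUES (OLD.id, OLD.col1, OLD.col2, OLD.version);
END //
DELIMITER ;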