MySQL: Strange behavior of UPDATE query (ERROR 1062 Duplicate entry)

I have a MySQL database that stores news articles along with their publication date (day granularity only), source, and category. Based on these, I want to generate a table that holds the article counts with respect to these 3 parameters.
Since some combinations of these 3 parameters might have no articles, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters, and a default article_count of 0 -- like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source     | category | article_count |
+--------------+------------+----------+---------------+
| 2016-08-05   | 1826089206 | 0        | 0             |
| 2016-08-05   | 1826089206 | 1        | 0             |
| 2016-08-05   | 1826089206 | 2        | 0             |
| 2016-08-05   | 1826089206 | 3        | 0             |
| 2016-08-05   | 1826089206 | 4        | 0             |
| ...          | ...        | ...      | ...           |
+--------------+------------+----------+---------------+
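(For reference, one way to build such a combinations table is a triple cross join over the distinct values of the 3 parameters. This is just a sketch; news_articles stands in for the original article table, whose name isn't given above:)
INSERT INTO news_article_counts (published_at, source, category, article_count)
SELECT d.published_at, s.source, c.category, 0
FROM (SELECT DISTINCT published_at FROM news_articles) d
CROSS JOIN (SELECT DISTINCT source FROM news_articles) s
CROSS JOIN (SELECT DISTINCT category FROM news_articles) c;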
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source     | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05   | 1826089206 | 3        | 1   |
| 2003-09-19   | 1826089206 | 4        | 1   |
| 2005-08-08   | 1826089206 | 3        | 1   |
| 2008-07-22   | 1826089206 | 4        | 1   |
| 2008-11-26   | 1826089206 | 8        | 1   |
| ...          | ...        | ...      | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
news_article_counts c
INNER JOIN
tmp t
ON
c.published_at = t.published_at AND
c.source = t.source AND
c.category = t.category
SET
c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, category of table news_article_counts. But this shouldn't be a problem since I do not -- as far as I can tell -- update any of those 3 values, only article_count.
What confuses me most is that the error mentions the timestamp at which I executed the query (here: 2018-04-07 14:46:17). I have absolutely no idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as the value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
`published_at` TIMESTAMP NOT NULL,
`source` INT UNSIGNED NOT NULL,
`category` INT UNSIGNED NOT NULL,
`article_count` INT UNSIGNED NOT NULL DEFAULT 0,
UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database. And there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field         | Type             | Null | Key | Default           | Extra                       |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at  | timestamp        | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source        | int(10) unsigned | NO   |     | NULL              |                             |
| category      | int(10) unsigned | NO   |     | NULL              |                             |
| article_count | int(10) unsigned | NO   |     | 0                 |                             |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set? I double- and triple-checked my CREATE TABLE statement. I removed the joint index, I added an artificial primary key (auto_increment). Nothing helped. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.

It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't want to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table; then it won't automatically add ON UPDATE.
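For example, a sketch of both options (the placeholder default value is an assumption; pick whatever makes sense for your data):
-- Check whether the variable is disabled (OFF means the legacy behavior applies):
SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';

-- Redefine the column with an explicit DEFAULT, which suppresses the automatic
-- ON UPDATE CURRENT_TIMESTAMP ('2000-01-01 00:00:00' is just a placeholder):
ALTER TABLE `news_article_counts`
  MODIFY `published_at` TIMESTAMP NOT NULL DEFAULT '2000-01-01 00:00:00';

-- Verify that Extra no longer shows "on update CURRENT_TIMESTAMP":
SHOW COLUMNS FROM `news_article_counts` LIKE 'published_at';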

Related

Selecting the oldest updated set of entries

I have the following table my_entry:
Id int(11) AI PK
InternalId varchar(30)
UpdatedDate datetime
IsDeleted bit(1)
And I have the following query:
SELECT
`Id`, `InternalId`
FROM
`my_entry`
WHERE
(`IsDeleted` = FALSE)
AND ((`UpdatedDate` IS NULL
OR DATE(`UpdatedDate`) != DATE(STR_TO_DATE('17/10/2019', '%d/%m/%Y'))))
ORDER BY `UpdatedDate`
LIMIT 200;
The table has around 3M records. I have a program running that executes the above query and returns 200 entries from the table that weren't updated today; the program then changes those 200 entries and updates them again, setting the UpdatedDate to today's date. On the next execution those 200 entries are ignored and 200 new entries get selected. This keeps running until all the entries in the table have been selected and updated for today.
This way I can ensure that all the entries are updated at least once every day.
This works perfectly fine for the first few thousand entries: the select query executes in a couple of milliseconds. But as soon as more entries have been updated and carry today's date in UpdatedDate, the query keeps slowing down, reaching execution times of up to 20 seconds.
I'm wondering if I can do something to optimize the query, or if there is a better approach to take without using the UpdatedDate.
I was thinking of using the Id and paginating the entries, but I'm afraid this way I might miss some of them.
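(For reference, that Id-based idea would amount to keyset pagination -- a sketch, where @last_seen_id is a placeholder the program would track between batches:)
SELECT `Id`, `InternalId`
FROM `my_entry`
WHERE `IsDeleted` = FALSE
  AND `Id` > @last_seen_id   -- largest Id from the previous batch; start at 0
ORDER BY `Id`
LIMIT 200;
-- Once a pass reaches the end of the table, reset @last_seen_id to 0;
-- rows inserted mid-pass get higher auto-increment Ids, so they are not skipped.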
What I already tried:
Adding indexes to both the UpdatedDate and IsDeleted.
Changing the UpdatedDate type from datetime to date.
Edit:
MySQL version: 5.6.45
The table at hand:
CREATE TABLE `my_entry` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`InternalId` varchar(30) NOT NULL,
`UpdatedDate` date DEFAULT NULL,
`IsDeleted` bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (`Id`),
UNIQUE KEY `InternalId` (`InternalId`),
KEY `UpdatedDate` (`UpdatedDate`),
KEY `entry_isdeleted_index` (`IsDeleted`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=8204626 DEFAULT CHARSET=utf8mb4
The output of the EXPLAIN query:
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys                     | key         | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
|  1 | SIMPLE      | x     | index | UpdatedDate,entry_isdeleted_index | UpdatedDate | 4       | NULL | 400  | Using where |
+----+-------------+-------+-------+-----------------------------------+-------------+---------+------+------+-------------+
Example of data in the table:
+------------+--------+---------------------+-----------+
| InternalId | Id     | UpdatedDate         | IsDeleted |
+------------+--------+---------------------+-----------+
| 328044773  | 552990 | 2019-10-17 10:11:29 | 0         |
| 330082707  | 552989 | 2019-10-17 10:11:29 | 0         |
| 329701688  | 552988 | 2019-10-17 10:11:29 | 0         |
| 329954358  | 552987 | 2019-10-16 10:11:29 | 0         |
| 964227577  | 552986 | 2019-10-16 12:33:29 | 0         |
| 329794593  | 552985 | 2019-10-16 12:33:29 | 0         |
| 400015773  | 552984 | 2019-10-16 12:33:29 | 0         |
| 330674329  | 552983 | 2019-10-16 12:33:29 | 0         |
+------------+--------+---------------------+-----------+
Example expected output of the query:
+------------+--------+
| InternalId | Id     |
+------------+--------+
| 329954358  | 552987 |
| 964227577  | 552986 |
| 329794593  | 552985 |
| 400015773  | 552984 |
| 330674329  | 552983 |
+------------+--------+
First, simplify the date arithmetic. Then take the following approach:
Take NULL values in one subquery
Take rows on the date in another
Then order and select the results
Start by writing the query as:
SELECT Id, InternalId
FROM ((SELECT Id, InternalId, 2 as priority
FROM my_entry
WHERE NOT IsDeleted AND UpdatedDate IS NULL
LIMIT 200
) UNION ALL
(SELECT Id, InternalId, 1 as priority
FROM my_entry
WHERE NOT IsDeleted AND UpdatedDate <> '2019-10-17'
LIMIT 200
)
) t
ORDER BY priority
LIMIT 200;
The index that you want is either (updateddate, isdeleted) or (isdeleted, updateddate). You can add id and internalid to the end to make the index covering.
The idea is to select at most 200 rows from the two subqueries without sorting. Then the outer query is sorting at most 400 rows -- and that should not take multiple seconds.
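A sketch of one of those index options (the index name is arbitrary; adding Id and InternalId makes the index covering for the two subqueries):
ALTER TABLE `my_entry`
  ADD INDEX `isdeleted_updateddate` (`IsDeleted`, `UpdatedDate`, `Id`, `InternalId`);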

MySQL doesn't use my index even though EXPLAIN says it will

I recently encountered a problem involving the MySQL DBMS.
The table is like this:
CREATE TABLE `orders` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(60) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`sex` enum('男','女') DEFAULT NULL,
`amount` float(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_i` (`name`),
KEY `sex` (`sex`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8
As shown above, I created a single-column index on the name column.
I want to perform a range query on name, and the EXPLAIN statement is:
mysql> explain select * from orders where name like '王%';
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| id | select_type | table  | partitions | type  | possible_keys | key    | key_len | ref  | rows  | filtered | Extra                            |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
|  1 | SIMPLE      | orders | NULL       | range | name_i        | name_i | 183     | NULL | 20630 | 100.00   | Using index condition; Using MRR |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
1 row in set, 1 warning (0.10 sec)
So it should use the index name_i and finish the query in a flash (my classmate's run took 0.07 sec).
However, this is how it turned out:
| 4998119 | 王缝 | 27 | 男 | 159.21 |
| 4998232 | 王求葬 | 19 | 男 | 335.65 |
| 4998397 | 王倘予 | 49 | 女 | 103.39 |
| 4998482 | 王厚 | 77 | 男 | 960.69 |
| 4998703 | 王啄淋 | 73 | 女 | 458.85 |
| 4999106 | 王般埋 | 70 | 女 | 700.98 |
| 4999359 | 王胆具 | 31 | 女 | 362.83 |
| 4999510 | 王铁脾 | 31 | 女 | 973.09 |
| 4999880 | 王战万 | 59 | 女 | 127.28 |
| 4999928 | 王忆 | 42 | 女 | 72.47 |
+---------+--------+------+------+--------+
11160 rows in set (3.43 sec)
And it seems not to use the index at all, because the data is sorted by the primary key id rather than by the name column (besides, it is much too slow compared to 0.07 sec).
Has anyone encountered the problem too?
What percentage of the table is "Kings" (王)? If it is more than about 20%, it will choose to do a table scan instead of using the index. (And this may actually be faster.) (Based on comments, 0.22% of the table is Kings.)
EXPLAIN and the execution of the query are separate things. Although I don't remember proving it, it is possible for EXPLAIN to say one thing while the query is actually executed another way.
Do you have 5 million rows in the table? Was the cache 'cold' when you first ran it? And it had to fetch 11,160 rows from disk? Then the second time, all was in cache, so much faster?
Was the table loaded in "alphabetical" (or whatever the Chinese word for that is) order? If so, there is a good chance the ids and the names are in the same order?
Apparently you are using utf8_general_ci COLLATION? Maybe it does not sort Chinese well. (Provide a test case; I'll do some tests.)
I do not understand why it mentioned MRR.
I, too, am baffled by "1 min 32.24sec". The ORDER BY name should have further encouraged the Optimizer to use INDEX(name). Can you turn on the "Optimizer trace"?
To really see whether it used the index, do this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
If the big number(s) look like the number of rows in the table, then it did a table scan. If they look more like 11160, then it used the index.
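Concretely, for the query in question (the Handler counters are real status variables; the numbers in the comments are hypothetical outcomes, not measurements):
FLUSH STATUS;
SELECT * FROM orders WHERE name LIKE '王%';
SHOW SESSION STATUS LIKE 'Handler%';
-- Hypothetical readings:
--   Handler_read_next     ~ 11160   -> the range scan on name_i was used
--   Handler_read_rnd_next ~ 5000000 -> a full table scan happened instead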

MySQL query performance very slow

The query below was taking more than 8 minutes with 900,000 rows processed. It is very slow and affects my product. I can't identify why the query is getting slow; all indexes are set up fine.
explain SELECT
COUNT(DISTINCT (cinfo.CONTACT_ID))
FROM
cinfo
INNER JOIN
LTocMapping ON cinfo.CONTACT_ID = LTocMapping.CONTACT_ID
WHERE
(((((((((cinfo.COUNTRY LIKE '%Panama%')
OR (cinfo.COUNTRY LIKE '%PANAMA%'))
AND (((cinfo.CONTACT_EMAIL NOT LIKE '%test%')
AND (cinfo.CONTACT_EMAIL NOT LIKE '%engine%'))
OR (cinfo.CONTACT_EMAIL IS NULL)))
AND ((SELECT
(GROUP_CONCAT(Temp.LIST_ID
ORDER BY Temp.LIST_ID) REGEXP ('.*,*221715000514445053,*.*$'))
FROM
LTocMapping Temp
WHERE
((LTocMapping.CONTACT_ID = Temp.CONTACT_ID)
AND (((Temp.MAPPING_ID >= 221715000000000000)
AND (Temp.MAPPING_ID <= 221715999999999999))
OR ((Temp.MAPPING_ID >= 0)
AND (Temp.MAPPING_ID <= 999999999999))))
GROUP BY Temp.CONTACT_ID) = '0'))
AND ((SELECT
(GROUP_CONCAT(Temp.LIST_ID
ORDER BY Temp.LIST_ID) REGEXP ('.*,*221715000520574130,*.*$'))
FROM
LTocMapping Temp
WHERE
((LTocMapping.CONTACT_ID = Temp.CONTACT_ID)
AND (((Temp.MAPPING_ID >= 221715000000000000)
AND (Temp.MAPPING_ID <= 221715999999999999))
OR ((Temp.MAPPING_ID >= 0)
AND (Temp.MAPPING_ID <= 999999999999))))
GROUP BY Temp.CONTACT_ID) = '0'))
AND (LTocMapping.LIST_ID IN (221715000520574130 , 221715000201569885)))
AND (LTocMapping.STATUS = BINARY 'subscribed'))
AND (((cinfo.CONTACT_STATUS = BINARY 'active')
OR (cinfo.CONTACT_STATUS = BINARY 'softbounce'))
AND (LTocMapping.STATUS = BINARY 'subscribed')))
AND (((cinfo.CONTACT_ID >= 221715000000000000)
AND (cinfo.CONTACT_ID <= 221715999999999999))
OR ((cinfo.CONTACT_ID >= 0)
AND (cinfo.CONTACT_ID <= 999999999999))))
(EXPLAIN output omitted.)
Below are the tables for your reference:
Table 1:
mysql> desc cinfo;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| CONTACT_ID        | bigint(19)   | NO   | PRI | NULL    |       |
| CONTACT_EMAIL     | varchar(100) | NO   | MUL | NULL    |       |
| TITLE             | varchar(20)  | YES  |     | NULL    |       |
| FIRSTNAME         | varchar(100) | YES  |     | NULL    |       |
| LASTNAME          | varchar(50)  | YES  |     | NULL    |       |
| ADDED_BY          | varchar(20)  | YES  |     | NULL    |       |
| ADDED_TIME        | bigint(19)   | NO   |     | NULL    |       |
| LAST_UPDATED_TIME | bigint(19)   | NO   |     | NULL    |       |
+-------------------+--------------+------+-----+---------+-------+
Table 2:
mysql> desc LTocMapping;
+----------------+--------------+------+-----+------------+-------+
| Field          | Type         | Null | Key | Default    | Extra |
+----------------+--------------+------+-----+------------+-------+
| MAPPING_ID     | bigint(19)   | NO   | PRI | NULL       |       |
| CONTACT_ID     | bigint(19)   | NO   | MUL | NULL       |       |
| LIST_ID        | bigint(19)   | NO   | MUL | NULL       |       |
| STATUS         | varchar(100) | YES  |     | subscribed |       |
| MAPPING_STATUS | varchar(20)  | YES  |     | connected  |       |
| MAPPING_TIME   | bigint(19)   | YES  |     | NULL       |       |
+----------------+--------------+------+-----+------------+-------+
As far as I can tell, your subqueries are the bottleneck:
For the first subquery, you are using LTocMapping.CONTACT_ID.
For the second subquery, you are using LTocMapping.CONTACT_ID as well.
These references (to values of the outer query) cause these inner queries to become correlated subqueries (also called dependent subqueries). And that means: for every row you fetch from the outer tables (~970,000), you are firing 2 additional queries against another table.
So that's 1.8 million queries you are executing -- and, as it seems, not trivial ones either.
Most of the time, a correlated subquery can be replaced by a proper join, but this depends on the use case. You can also join the same table twice by using a different alias.
But to outline some join options, you would need to explain why the subqueries resulting in the condition group_concat(....) = '0' are important -- or better, what you want to achieve.
(ps.: You can also see that EXPLAIN flags them as dependent subqueries.)
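Purely as an illustration of the join-twice idea (this sketch assumes the two GROUP_CONCAT/REGEXP conditions simply mean "the contact is not mapped to list X"; the MAPPING_ID range filters and the remaining WHERE conditions from the original query would still need to be carried over):
SELECT COUNT(DISTINCT cinfo.CONTACT_ID)
FROM cinfo
INNER JOIN LTocMapping
        ON cinfo.CONTACT_ID = LTocMapping.CONTACT_ID
LEFT JOIN LTocMapping ex1   -- anti-join: contact must not be in this list
       ON ex1.CONTACT_ID = cinfo.CONTACT_ID
      AND ex1.LIST_ID = 221715000514445053
LEFT JOIN LTocMapping ex2   -- anti-join: nor in this one
       ON ex2.CONTACT_ID = cinfo.CONTACT_ID
      AND ex2.LIST_ID = 221715000520574130
WHERE ex1.CONTACT_ID IS NULL
  AND ex2.CONTACT_ID IS NULL;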
OR is inefficient, see if you can avoid it.
Leading wildcards in LIKE are inefficient. See if a FULLTEXT index would work for you (a sketch follows after these points).
With a proper COLLATION, you don't need to test both upper and lower case. Also you can avoid use of BINARY. In both cases, you might be able to use an index. (What indexes do you have?)
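A sketch of the FULLTEXT idea from above (assumes MySQL 5.6+ for InnoDB full-text search; word-based matching is not exactly equivalent to '%Panama%'):
-- One-time index build:
ALTER TABLE cinfo ADD FULLTEXT INDEX ft_country (COUNTRY);

-- Case-insensitive, index-backed search replacing the two LIKE variants:
SELECT CONTACT_ID FROM cinfo
WHERE MATCH(COUNTRY) AGAINST ('panama' IN NATURAL LANGUAGE MODE);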
Try to change from
WHERE ( ( SELECT ... ) = '0' )
to
WHERE ( NOT EXISTS ( SELECT ... ) )
(The SELECT will need some modification.)
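A sketch of that rewrite for the first subquery (assuming the REGEXP only tests whether the contact is mapped to list 221715000514445053; the MAPPING_ID range filter is kept as in the original):
AND NOT EXISTS (SELECT 1
                FROM LTocMapping Temp
                WHERE Temp.CONTACT_ID = LTocMapping.CONTACT_ID
                  AND Temp.LIST_ID = 221715000514445053
                  AND ((Temp.MAPPING_ID BETWEEN 221715000000000000 AND 221715999999999999)
                    OR (Temp.MAPPING_ID BETWEEN 0 AND 999999999999)))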
(Please get rid of some of the redundant parens; it is hard to read.)
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)

on duplicate key update result affecting all the rows of the table

I have a table of this structure:
mysql> desc securities;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| sym   | varchar(19) | NO   | PRI |         |       |
| bqn   | int(11)     | YES  |     | NULL    |       |
| sqn   | int(11)     | YES  |     | NULL    |       |
| tqn   | int(11)     | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
4 rows in set (0.01 sec)
I am trying to do a select and an update within the same query, which is the reason I have chosen:
insert into securities (sym, bqn, sqn , tqn) values('ANK', 50,0,1577798)
on duplicate key update bqn=bqn+50 , sqn=sqn+0 , tqn=tqn+1577798;
When I ran the above, I observed that it was in fact changing the values of all the other rows as well.
Is this behaviour expected? I am using a MySQL database.
Your fiddle is missing the key, and the INSERT statement in the right panel (where it does not belong in the first place) is using different column names … *sigh*
Define the symbol column as PRIMARY KEY – and use the VALUES() syntax to get the values to add in the ON UPDATE part, so that you don’t have to repeat them every single time:
insert into securities
(symbol, buyerquan, sellerquan , totaltradedquan)
values('BANKBARODA', 73, 0, 4290270)
on duplicate key update
buyerquan=buyerquan+VALUES(buyerquan),
sellerquan=sellerquan+VALUES(sellerquan),
totaltradedquan=totaltradedquan+VALUES(totaltradedquan);
Works perfectly fine; the result values are as to be expected from the input: http://sqlfiddle.com/#!2/21638f/1
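Applied to the column names from the question itself (assuming sym really is the PRIMARY KEY, as the DESC output shows):
INSERT INTO securities (sym, bqn, sqn, tqn)
VALUES ('ANK', 50, 0, 1577798)
ON DUPLICATE KEY UPDATE
  bqn = bqn + VALUES(bqn),
  sqn = sqn + VALUES(sqn),
  tqn = tqn + VALUES(tqn);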

Difficult MySQL Update Query with Self-Join

Our website has listings. We use a connections table with the following structure to connect our members with these various listings:
CREATE TABLE `connections` (
`cid1` int(9) unsigned NOT NULL DEFAULT '0',
`cid2` int(9) unsigned NOT NULL DEFAULT '0',
`type` char(2) NOT NULL,
`created` datetime DEFAULT NULL,
`updated` datetime DEFAULT NULL,
PRIMARY KEY (`cid1`,`cid2`,`type`,`cid3`),
KEY `cid1` (`cid1`,`type`),
KEY `cid2` (`cid2`,`type`)
);
The problem we've run into is that when we combine duplicate listings from time to time, we also need to update our member connections. We have been using the following query, which breaks if a member is connected to both listings:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001;
What I can't figure out is how to do the following, which would solve this issue:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (
select cid1 from connections
where type IN ('MC','MT','MW') AND cid2=100000
);
When I try to run that query I get the following error:
ERROR 1093 (HY000): You can't specify target table 'connections' for update in FROM clause
Here is some sample data. Notice the update conflict for cid1 = 10025925
+----------+--------+------+---------------------+---------------------+
| cid1 | cid2 | type | created | updated |
+----------+--------+------+---------------------+---------------------+
| 10010388 | 100000 | MC | 2010-08-05 18:04:51 | 2011-06-16 16:26:17 |
| 10025925 | 100000 | MC | 2010-10-31 09:21:25 | 2010-10-31 16:21:25 |
| 10027662 | 100000 | MC | 2011-06-13 16:31:12 | NULL |
| 10038375 | 100000 | MW | 2011-02-05 05:32:35 | 2011-02-05 19:51:58 |
| 10065771 | 100000 | MW | 2011-04-24 17:06:35 | NULL |
| 10025925 | 100001 | MC | 2010-10-31 09:21:45 | 2010-10-31 16:21:45 |
| 10034884 | 100001 | MC | 2011-01-20 18:54:51 | NULL |
| 10038375 | 100001 | MC | 2011-02-04 05:00:35 | NULL |
| 10041989 | 100001 | MC | 2011-02-26 09:33:18 | NULL |
| 10038259 | 100001 | MC | 2011-05-07 13:34:20 | NULL |
| 10027662 | 100001 | MC | 2011-06-13 16:33:54 | NULL |
| 10030855 | 100001 | MT | 2010-12-31 20:40:18 | NULL |
| 10038375 | 100001 | MT | 2011-02-04 05:00:36 | NULL |
+----------+--------+------+---------------------+---------------------+
I'm hoping that someone can suggest the right way to run the above query. Thanks in advance!
The reason for the error in your query is that in MySQL you cannot SELECT from the table you are trying to UPDATE in the same query.
Use UPDATE IGNORE to avoid duplicate conflicts.
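For example (note the caveat: IGNORE silently skips the rows that would collide, leaving them attached to the old listing, so they may still need cleanup afterwards):
UPDATE IGNORE connections SET cid2=100000
WHERE type IN ('MC','MT','MW') AND cid2=100001;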
I think you should try reading about INSERT ... ON DUPLICATE KEY UPDATE. The idea is that you frame an INSERT query that always creates a duplicate-key conflict, and then the UPDATE part does its work.
One possible way would be to use a temporary table for your subquery, then select from the temp table. Efficiency could break down fast if you need to execute a lot of these queries, though.
create temporary table subq as select cid1 from connections where type IN ('MC','MT','MW') AND cid2=100000
And
update connections set cid2=100000 where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (select cid1 from subq);
UPDATE connections cn1
LEFT JOIN connections cn2 ON cn1.cid1 != cn2.cid1
AND cn2.type IN ('MC','MT','MW')
AND cn2.cid2=100000
SET cn1.cid2=100000
WHERE cn1.TYPE IN ('MC','MT','MW')
AND cn1.cid2=100001
AND cn2.cid1 IS NULL -- i.e. there is no matching record
I was thinking of something like the following, but I'm not 100% sure how your data looks before and after, so I can't guarantee accuracy. The idea is to join the table to itself on your subquery's where clause plus the exclusion where cid1 must not match.
update connections c1 left outer join connections c2
on (c2.cid2 = 100000 and c2.type in ('MC','MT','MW') and c1.cid1 != c2.cid1)
set c1.cid2 = 100000
where c1.type in ('MC', 'MT', 'MW') and c1.cid2=100001 and c2.cid1 is null;
As near as I can tell, it'll work. I used your create table (minus the cid3 in the primary key) and made sure I had 2 rows with the same cid1 and different cid2 (one 100000 and the other 100001), and the statement only affected 1 row.