RANK conversion of MS SQL to MySQL

I am converting our project database from SQL Server to MySQL; the DB conversion itself is already done.
We have code as below to identify duplicate records based on hashcode and mark them as duplicates.
The Rank function in MySQL examples I found (e.g. Rank function in MySQL) rank based on age, starting at 1 and incrementing by 1 for each record. But in my case the rank for each hashcode should start at 1 and increment by 1 within the same hashcode; when a new hashcode comes, the rank should start from 1 again.
update table set Duplicate = 1
WHERE id IN
( SELECT id FROM (
    select RANK() OVER (PARTITION BY Hashcode ORDER BY Date asc) R, *
    from table ) A
  where R != 1 )
Below is the table structure:
CREATE TABLE TBL (
  id int(11) NOT NULL AUTO_INCREMENT,
  FileName varchar(100) DEFAULT NULL,
  date datetime DEFAULT NULL,
  hashcode varchar(255) DEFAULT NULL,
  FileSize varchar(25) DEFAULT NULL,
  IsDuplicate bit(1) DEFAULT NULL,
  IsActive bit(1) DEFAULT NULL,
  PRIMARY KEY (`id`)
)
Please help me migrate this code to MySQL.

You don't need to use enumeration for this logic. You just want to set the duplicate flag on everything that is not the minimum date for the hashcode:
update table t join
       (select hashcode, min(date) as mindate
        from table
        group by hashcode
       ) tt
       on t.hashcode = tt.hashcode and t.date > tt.mindate
    set t.IsDuplicate = 1;
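For what it's worth, MySQL 8.0 added window functions, so on 8.0+ the original RANK() logic can be kept almost verbatim by joining the update target to a derived table (a sketch against the TBL definition from the question, where the flag column is IsDuplicate):

UPDATE TBL t
JOIN (
    SELECT id,
           RANK() OVER (PARTITION BY hashcode ORDER BY date ASC) AS r
    FROM TBL
) ranked ON ranked.id = t.id
SET t.IsDuplicate = 1
WHERE ranked.r <> 1;

Routing the ranking through a derived table also avoids the error 1093 restriction on referencing the update target in a subquery.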

MySQL features a rather unique way to delete duplicates:
alter ignore table YourTable
add unique index ux_yourtable_hashcode (hashcode);
The trick here is in the ignore option:
If IGNORE is specified, only one row is used of rows with duplicates
on a unique key. The other conflicting rows are deleted.
(Note that ALTER IGNORE TABLE was removed in MySQL 5.7, so this trick only applies to 5.6 and earlier.)
But there are also other ways. Based on your comment, there is an auto_increment column called id. Since this column is unique and not null, you can use it to distinguish duplicates. You'd need a temporary table to work around the "You can't specify target table 'TBL' for update in FROM clause" error:
create temporary table tmp_originals (id int);

insert into tmp_originals (id)
select min(id)
from YourTable
group by hashcode;

update YourTable
set IsDuplicate = 1
where id not in (select id from tmp_originals);
The group by query selects the lowest id per group of rows with the same hashcode.
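If you'd rather not create a temporary table at all, wrapping the aggregate in a derived table sidesteps the same error, because MySQL materializes the derived table before the update runs (a sketch using the same names):

update YourTable
set IsDuplicate = 1
where id not in (
    select id from (
        select min(id) as id
        from YourTable
        group by hashcode
    ) as originals
);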


Get AVG value for each selected row from MySQL table with 500m rows?

I have one table with 500 million records in MySQL 8.x. My regular query to get a certain result set takes 200 ms, but if I try to also get an AVG value, performance drops to 30 s+.
Structure:
KW_ID | DATE | SERP | MERCHANT_ID | ARTICLE_ID
-- auto-generated definition
create table merchants_keyword_serps
(
KW_ID mediumint unsigned null,
MERCHANT_ID tinyint unsigned null,
ARTICLE_ID char(10) null,
SERP tinyint unsigned null,
DATE date null,
constraint `unique`
unique (MERCHANT_ID, ARTICLE_ID, KW_ID, DATE),
constraint fk_serps_kwd_t
foreign key (MERCHANT_ID, ARTICLE_ID) references merchants_product_catalog (MERCHANT_ID, ARTICLE_ID)
on delete cascade,
constraint keywords
foreign key (KW_ID) references merchants_keywords (ID)
on delete cascade
);
create index merchants_keyword_serps_SERP_index
on merchants_keyword_serps (SERP);
create index mks_date
on merchants_keyword_serps (DATE);
Goal: get SERP for 20220120 and MERCHANT_ID = 2:
select
mcs.SERP
FROM merchants_keyword_serps mcs
WHERE date = 20220120
AND mcs.MERCHANT_ID = 2;
Now also get the AVG SERP for all shops in addition:
select
mcs.SERP,
(
SELECT AVG(SERP)
FROM merchants_keyword_serps mcs2
WHERE mcs2.date = 20220120
AND mcs2.KW_ID = mcs.KW_ID
AND mcs2.ARTICLE_ID = mcs.ARTICLE_ID) AS SERP_AVG
from merchants_keyword_serps mcs
WHERE
date = 20220120
AND mcs.MERCHANT_ID = 2;
The expected result would be an additional column with the average SERP value for all shops with the same KW_ID, DATE, ARTICLE_ID.
Is there a way to speed that up with a different approach? The indexes are all set up correctly, I believe, since the standard query runs perfectly fast in under 200 ms.
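One approach worth trying is to compute the averages once in a derived table and join to it, instead of running the correlated subquery per output row (a sketch, untested against this data set; it keeps the question's numeric date literal):

SELECT
    mcs.SERP,
    avgs.SERP_AVG
FROM merchants_keyword_serps mcs
JOIN (
    SELECT KW_ID, ARTICLE_ID, AVG(SERP) AS SERP_AVG
    FROM merchants_keyword_serps
    WHERE DATE = 20220120
    GROUP BY KW_ID, ARTICLE_ID
) avgs ON avgs.KW_ID = mcs.KW_ID AND avgs.ARTICLE_ID = mcs.ARTICLE_ID
WHERE mcs.DATE = 20220120
  AND mcs.MERCHANT_ID = 2;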
Where does KW_ID come from? Please provide SHOW CREATE TABLE merchants_keyword_serps.
Using 20220120 for a date is asking for trouble. (I don't see any problem yet.)
Add these:
INDEX(merchant_id, date)
INDEX(kw_id, article_id, date, serp)
and drop these, since they will then be redundant:
INDEX(merchant_id)
INDEX(kw_id)
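A sketch of the corresponding DDL (the ADD INDEX names are made up; the DROP INDEX assumes the foreign-key index on KW_ID is the one named keywords in the CREATE TABLE above):

ALTER TABLE merchants_keyword_serps
    ADD INDEX mks_merchant_date (MERCHANT_ID, DATE),
    ADD INDEX mks_kw_article_date_serp (KW_ID, ARTICLE_ID, DATE, SERP);

-- KW_ID is the leftmost column of the new composite index, so the
-- foreign key can use it and the standalone index becomes droppable
ALTER TABLE merchants_keyword_serps
    DROP INDEX keywords;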

MySQL SELECT filtered on one column, use index of other column

I have a table looking something like:
CREATE TABLE myTable (
id INT NOT NULL AUTO_INCREMENT,
atTime TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
stuff varchar(100) NULL,
CONSTRAINT myTable_PK PRIMARY KEY (id)
)
ENGINE=InnoDB
DEFAULT CHARSET=latin1
COLLATE=latin1_swedish_ci ;
The atTime column is not indexed.
The id column is always in chronological order.
I'd like to select all rows where id > x and atTime < y.
Is there any way of doing this without doing a full table scan for each select and without adding an index to atTime?
Essentially what I want is to tell MySQL to rely on id being chronological to optimize the query.
Edit:
Figured out one way of doing it using a subquery, but it only made it about 30% faster:
SELECT * FROM myTable
WHERE id < ( SELECT id FROM myTable
             WHERE atTime > 'y'
             ORDER BY id LIMIT 1 )
  AND id > x
By themselves these two queries are near-instantaneous, but together they take quite some time. Why could that be?
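One way to test whether the combined plan is the problem is to materialize the boundary id first, so the outer query sees a constant range on the primary key (a sketch; x and 'y' are the placeholders from the question):

SET @boundary := ( SELECT id FROM myTable
                   WHERE atTime > 'y'
                   ORDER BY id LIMIT 1 );

SELECT * FROM myTable
WHERE id > x AND id < @boundary;

With constants on both ends, the optimizer can do a simple range scan on the primary key instead of re-evaluating the subquery as part of the outer plan.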

MySQL — ON DUPLICATE KEY UPDATE only when both unique fields match, else INSERT

I have the following query:
"INSERT INTO `occ_apps` (`occ_date`, `service_tag`, `counter`) VALUES (?, ?, '1') ON DUPLICATE KEY UPDATE `counter` = (`counter`+1)"
Currently it increments the counter when either occ_date or service_tag matches an existing row.
Here occ_date and service_tag are unique fields, and I can't set a primary key on both of them together.
I ran the following:
ALTER TABLE occ_apps
DROP PRIMARY KEY,
ADD PRIMARY KEY (occ_date, service_tag);
And I get the error:
`#1075 - Incorrect table definition; there can be only one auto column and it must be defined as a key`
I want it to update (increment) the counter only when occ_date and service_tag both match (i.e. already exist) in a single row; otherwise it should insert a new row.
Software version: 5.5.53-MariaDB-1~wheezy - mariadb.org binary distribution
When I run DESC occ_apps I get:

Field        Type          Null  Key  Default  Extra
serial_no    int(255)      NO    PRI  NULL     auto_increment
occ_date     varchar(255)  NO    UNI  NULL
counter      int(255)      NO         NULL
service_tag  varchar(255)  YES   UNI  NULL
I don't think you even need a counter field in your table. It looks like your counter merely holds how many times a given value occurs, and that's something that can be generated easily using a GROUP BY:
SELECT occ_date, COUNT(*) FROM occ_apps GROUP BY `occ_date`;
So you want to filter the query so that you get only items with more than 5 counts?
SELECT occ_date, COUNT(*) FROM occ_apps WHERE service_tag = 'service-1'
GROUP BY `occ_date` HAVING COUNT(*) > 5
These sorts of problems have been solved millions of times using GROUP BY. This is just the tip of the iceberg as far as what SQL query aggregation can do. Please take a moment to read up on it.
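If you do want to keep the counter approach, the #1075 error can be avoided by leaving serial_no as the primary key and making the pair unique instead (a sketch; the DROP INDEX names assume MySQL named the single-column unique indexes after their columns, as DESC suggests):

ALTER TABLE occ_apps
    DROP INDEX occ_date,
    DROP INDEX service_tag,
    ADD UNIQUE KEY uq_occ_date_service_tag (occ_date, service_tag);

With only the composite unique key in place, the original INSERT ... ON DUPLICATE KEY UPDATE increments the counter only when both occ_date and service_tag match an existing row.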

MySQL GROUP BY implementation details - which row does MySQL choose in a GROUP BY query without aggregate functions?

I have a table with multiple rows per "website_id"
CREATE TABLE `WebsiteStatus` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `tagCheckResult` int(11) DEFAULT NULL,
  `website_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `IX_website_id` (`website_id`)
) ENGINE=InnoDB;
I am trying to select the latest entry per website_id
-- This creates a temporary table with the last entry per website_id, and joins it
-- to get the entire row
SELECT *
FROM `WebsiteStatus` ws1
JOIN (
SELECT MAX(id) max_id, website_id FROM `WebsiteStatus`
GROUP BY website_id) ws2
ON ws1.id = ws2.max_id
Now, I know the correct way to get the last row per website_id is as above. My question is: I also tried the following simpler query, and it seemed to return the exact same results as above:
SELECT * FROM `WebsiteStatus`
GROUP BY website_id
ORDER BY website_id DESC
I know that in principle GROUP BY without aggregate functions (e.g. MAX), as in my 2nd query, can return any of the relevant rows ... but in practice it returns the last one. Is there an implementation detail in MySQL that guarantees this is always the case?
(Just asking for academic curiosity, I know the 1st query is "more correct").

How to optimize this simple JOIN+ORDER BY query?

I have two MySQL tables:
/* Table users */
CREATE TABLE IF NOT EXISTS `users` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`DateRegistered` datetime NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/* Table statistics_user */
CREATE TABLE IF NOT EXISTS `statistics_user` (
`UserId` int(10) unsigned NOT NULL AUTO_INCREMENT,
`Sent_Views` int(10) unsigned NOT NULL DEFAULT '0',
`Sent_Winks` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`UserId`),
CONSTRAINT `statistics_user_ibfk_1` FOREIGN KEY (`UserId`) REFERENCES `users` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Both tables are populated with 10,000 random rows for testing using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillUsersStatistics`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE dt DATE;
DECLARE Winks INT DEFAULT 1;
DECLARE Views INT DEFAULT 1;
WHILE (i<=cnt) DO
SET dt = str_to_date(concat(floor(1 + rand() * (9-1)),'-',floor(1 + rand() * (28 -1)),'-','2011'),'%m-%d-%Y');
INSERT INTO users (Id, DateRegistered) VALUES(i, dt);
SET Winks = floor(1 + rand() * (30-1));
SET Views = floor(1 + rand() * (30-1));
INSERT INTO statistics_user (UserId, Sent_Winks, Sent_Views) VALUES (i, Winks, Views);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillUsersStatistics`(10000);
The problem:
When I run the EXPLAIN for this query:
SELECT
t1.Id, (Sent_Views + Sent_Winks) / DATEDIFF(NOW(), t1.DateRegistered) as Score
FROM users t1
JOIN statistics_user t2 ON t2.UserId = t1.Id
ORDER BY Score DESC
.. I get this EXPLAIN:

id  select_type  table  type    possible_keys  key      key_len  ref              rows   Extra
1   SIMPLE       t1     ALL     PRIMARY        (NULL)   (NULL)   (NULL)           10037  Using temporary; Using filesort
1   SIMPLE       t2     eq_ref  PRIMARY        PRIMARY  4        test2.t2.UserId  1
The above query gets very slow when both tables have more than 500K rows. I guess it's because of the 'Using temporary; Using filesort' in the explain of the query.
How can the above query be optimized so that it runs faster?
I'm fairly sure that the ORDER BY is what's killing you, since it cannot be properly indexed. Here is a workable, if not particularly pretty, solution.
First, let's say you have a column named Score for storing a user's current score. Every time a user's Sent_Views or Sent_Winks changes, modify the Score column to match. This could probably be done with a trigger (my experience with triggers is limited), or definitely done in the same code that updates the Sent_Views and Sent_Winks fields. This change wouldn't need to know the DATEDIFF portion, because it could just divide by the old sum of Sent_Views + Sent_Winks and multiply by the new one.
Now you just need to change the Score column once per day (if you're not picky about the precise number of hours a user has been registered). This could be done with a script run by a cron job.
Then, just index the Score column and SELECT away!
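For reference, the trigger route mentioned above could look roughly like this (a sketch, not from the original answer; it assumes a Score column has already been added to statistics_user):

DELIMITER //
CREATE TRIGGER trg_statistics_user_score
BEFORE UPDATE ON statistics_user
FOR EACH ROW
BEGIN
    DECLARE regdate DATETIME;
    -- look up the user's registration date to recompute the score
    SELECT DateRegistered INTO regdate FROM users WHERE Id = NEW.UserId;
    -- GREATEST guards against DATEDIFF being 0 on the registration day
    SET NEW.Score = (NEW.Sent_Views + NEW.Sent_Winks)
                    / GREATEST(DATEDIFF(NOW(), regdate), 1);
END//
DELIMITER ;

A matching BEFORE INSERT trigger would be needed to score newly inserted rows.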
I'm offering my comment as answer:
Establish a future date, far enough out to not interfere with your application, say the year 5000. Replace the current date with this future date in your score calculation. The score computation is now for all intents and purposes absolute, and can be computed when updating winks and views (through a stored procedure or a trigger; MySQL has supported triggers since 5.0).
Add a score column to your statistics_user table to store the computed score and define an index on it.
Your SQL can be rewritten as:
SELECT UserId, score
FROM statistics_user
ORDER BY score DESC
If you need the real score, it is easily computed with just a constant multiplication, which could be done afterwards if it interferes with MySQL index selection.
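A sketch of the one-time backfill this answer implies (the year-5000 date is from the answer; the column and index DDL are assumptions):

ALTER TABLE statistics_user
    ADD COLUMN score DOUBLE NOT NULL DEFAULT 0,
    ADD INDEX idx_score (score);

UPDATE statistics_user su
JOIN users u ON u.Id = su.UserId
SET su.score = (su.Sent_Views + su.Sent_Winks)
               / DATEDIFF('5000-01-01', u.DateRegistered);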
Shouldn't you have indexed DateRegistered in Users?
You should try an inner join rather than a Cartesian product; the next thing you can do is partition according to DateRegistered.