How to identify and delete duplicate rows, except for most recent - mysql

I'm working in HeidiSQL and I'm trying to figure out how to delete all duplicate rows except for the most recent. There are some slight differences among the "duplicates," but whenever four specific values are identical (i.e. UserID, ContactID, SMSID, and EventID) the row is considered a duplicate. Within each set of duplicates I need to keep only the most recent row (identified by CreatedDate) and remove the rest.
The following query identifies these rows:
SELECT a.UserID, a.ContactID, a.SMSID, a.EventID, CreatedDate
FROM WhenToText a
JOIN (SELECT UserID, ContactID, SMSID, EventID
FROM WhenToText
GROUP BY UserID, ContactID, SMSID, EventID
HAVING COUNT(*) > 1 ) b
ON a.UserID = b.UserID
AND a.ContactID = b.ContactID
AND a.SMSID = b.SMSID
AND a.EventID = b.EventID
ORDER BY UserID, ContactID, SMSID, EventID, CreatedDate DESC
However, I'm not sure how to delete these duplicates after I've identified them.
Here is some sample data:

Here is one approach:
DELETE w1
FROM WhenToText w1
INNER JOIN
(
SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS MaxDate
FROM WhenToText
GROUP BY UserID, ContactID, SMSID, EventID
) w2
ON w1.UserID = w2.UserID AND w1.ContactID = w2.ContactID AND w1.SMSID = w2.SMSID
AND w1.EventID = w2.EventID
AND w1.CreatedDate != w2.MaxDate
This will delete any record for a given (UserID, ContactID, SMSID, EventID) group whose CreatedDate is not the most recent. Keep in mind this may leave behind more than one record for each group in the event that the latest CreatedDate is shared.
If you want to test this query first to see which records will be targeted for deletion, you can replace DELETE w1 with SELECT w1.*.
Here is a link to a SQL Fiddle which demonstrates how the query will identify records for deletion:
SQLFiddle
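If ties on the latest CreatedDate matter, one option is to rank the rows inside each group and delete everything except rank 1. The following is only a sketch under two assumptions that are not in the question: the server is MySQL 8.0 or later (for ROW_NUMBER()), and the table has a unique `id` column, which is a hypothetical name standing in for the real primary key.

```sql
-- Assumes MySQL 8.0+ and a unique `id` column on WhenToText (hypothetical;
-- substitute the table's real primary key).
DELETE w
FROM WhenToText AS w
INNER JOIN (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY UserID, ContactID, SMSID, EventID
               ORDER BY CreatedDate DESC, id DESC   -- id breaks ties on CreatedDate
           ) AS rn
    FROM WhenToText
) AS ranked ON ranked.id = w.id
WHERE ranked.rn > 1;   -- rn = 1 is the single row kept for each group
```

As with the join version, swapping DELETE w for SELECT w.* gives a preview of the rows that would be removed.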

Here is a solution using DELETE ... FROM ... JOIN, with a full demo using your data.
SQL:
-- Data preparation
create table WhenToText(UserID int, ContactID int, SMSID int, EventID int, CreatedDate datetime);
insert into WhenToText values
(4, 25, 7934, 7407, '2016-02-10 00:00:11'),
(4, 25, 7934, 7407, '2016-02-09 00:00:12'),
(4, 29, 5132, 7407, '2016-02-10 00:00:11'),
(4, 29, 5132, 7407, '2016-02-09 00:00:12'),
(4, 31, 12944, 7405, '2016-02-10 07:03:02'),
(4, 31, 12944, 7405, '2016-02-10 05:03:02'),
(4, 146, 12908, 7405, '2016-02-10 06:52:02'),
(4, 146, 12908, 7405, '2016-02-10 04:52:02'),
(15, 63, 12964, 7401, '2016-02-10 03:42:04'),
(15, 63, 12964, 7401, '2016-02-10 03:41:04'),
(15, 64, 12326, 7401, '2016-02-07 03:01:03'),
(15, 64, 12326, 7401, '2016-02-07 03:00:03');
SELECT * FROM WhenToText;
-- SQL needed
DELETE a FROM
WhenToText a INNER JOIN
(
SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate
FROM WhenToText
GROUP BY UserID, ContactID, SMSID, EventID
) b
USING(UserID, ContactID, SMSID, EventID)
WHERE
a.CreatedDate != b.CreatedDate;
SELECT * FROM WhenToText;
Output:
mysql> SELECT * FROM WhenToText;
+--------+-----------+-------+---------+---------------------+
| UserID | ContactID | SMSID | EventID | CreatedDate |
+--------+-----------+-------+---------+---------------------+
| 4 | 25 | 7934 | 7407 | 2016-02-10 00:00:11 |
| 4 | 25 | 7934 | 7407 | 2016-02-09 00:00:12 |
| 4 | 29 | 5132 | 7407 | 2016-02-10 00:00:11 |
| 4 | 29 | 5132 | 7407 | 2016-02-09 00:00:12 |
| 4 | 31 | 12944 | 7405 | 2016-02-10 07:03:02 |
| 4 | 31 | 12944 | 7405 | 2016-02-10 05:03:02 |
| 4 | 146 | 12908 | 7405 | 2016-02-10 06:52:02 |
| 4 | 146 | 12908 | 7405 | 2016-02-10 04:52:02 |
| 15 | 63 | 12964 | 7401 | 2016-02-10 03:42:04 |
| 15 | 63 | 12964 | 7401 | 2016-02-10 03:41:04 |
| 15 | 64 | 12326 | 7401 | 2016-02-07 03:01:03 |
| 15 | 64 | 12326 | 7401 | 2016-02-07 03:00:03 |
+--------+-----------+-------+---------+---------------------+
12 rows in set (0.00 sec)
mysql>
mysql> -- SQL needed
mysql> DELETE a FROM
-> WhenToText a INNER JOIN
-> (
-> SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) CreatedDate
-> FROM WhenToText
-> GROUP BY UserID, ContactID, SMSID, EventID
-> ) b
-> USING(UserID, ContactID, SMSID, EventID)
-> WHERE
-> a.CreatedDate != b.CreatedDate;
Query OK, 6 rows affected (0.00 sec)
mysql>
mysql> SELECT * FROM WhenToText;
+--------+-----------+-------+---------+---------------------+
| UserID | ContactID | SMSID | EventID | CreatedDate |
+--------+-----------+-------+---------+---------------------+
| 4 | 25 | 7934 | 7407 | 2016-02-10 00:00:11 |
| 4 | 29 | 5132 | 7407 | 2016-02-10 00:00:11 |
| 4 | 31 | 12944 | 7405 | 2016-02-10 07:03:02 |
| 4 | 146 | 12908 | 7405 | 2016-02-10 06:52:02 |
| 15 | 63 | 12964 | 7401 | 2016-02-10 03:42:04 |
| 15 | 64 | 12326 | 7401 | 2016-02-07 03:01:03 |
+--------+-----------+-------+---------+---------------------+
6 rows in set (0.00 sec)

This should provide the solution you're looking for, provided CreatedDate is a date or datetime column. It also assumes that the most recent row is the one with the latest CreatedDate.
SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS CreatedDate
FROM WhenToText
GROUP BY 1, 2, 3, 4;
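With these values you could just overwrite the WhenToText table. Note that this carries over only the five columns listed, so it fits only if the table has no other columns. It would look something like this: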
CREATE TABLE tmp_table LIKE WhenToText;
INSERT INTO tmp_table (SELECT UserID, ContactID, SMSID, EventID, MAX(CreatedDate) AS CreatedDate
FROM WhenToText
GROUP BY 1, 2, 3, 4);
TRUNCATE WhenToText;
INSERT INTO WhenToText (SELECT * FROM tmp_table);
DROP TABLE tmp_table;

Related

how to track score gains in mysql

I would like to display a player's current score as well as how many points they have gained within a selected time frame.
I have 2 tables
skills table
+----+---------+---------------------+
| id | name | created_at |
+----+---------+---------------------+
| 1 | skill 1 | 2020-06-05 00:00:00 |
| 2 | skill 2 | 2020-06-05 00:00:00 |
| 3 | skill 3 | 2020-06-05 00:00:00 |
+----+---------+---------------------+
scores table
+----+-----------+----------+-------+---------------------+
| id | player_id | skill_id | score | created_at |
+----+-----------+----------+-------+---------------------+
| 1 | 1 | 1 | 5 | 2020-06-06 00:00:00 |
| 2 | 1 | 1 | 10 | 2020-07-06 00:00:00 |
| 3 | 1 | 2 | 1 | 2020-07-06 00:00:00 |
| 4 | 2 | 1 | 11 | 2020-07-06 00:00:00 |
| 5 | 1 | 1 | 13 | 2020-07-07 00:00:00 |
| 6 | 1 | 2 | 10 | 2020-07-07 00:00:00 |
| 7 | 2 | 1 | 12 | 2020-07-07 00:00:00 |
| 8 | 1 | 1 | 20 | 2020-07-08 00:00:00 |
| 9 | 1 | 2 | 15 | 2020-07-08 00:00:00 |
| 10 | 2 | 1 | 17 | 2020-07-08 00:00:00 |
+----+-----------+----------+-------+---------------------+
my expected results are:-
24 hour query
+-----------+---------+-------+------+
| player_id | name | score | gain |
+-----------+---------+-------+------+
| 1 | skill 1 | 20 | 7 |
| 1 | skill 2 | 15 | 5 |
+-----------+---------+-------+------+
7 day query
+-----------+---------+-------+------+
| player_id | name | score | gain |
+-----------+---------+-------+------+
| 1 | skill 1 | 20 | 10 |
| 1 | skill 2 | 15 | 14 |
+-----------+---------+-------+------+
31 day query
+-----------+---------+-------+------+
| player_id | name | score | gain |
+-----------+---------+-------+------+
| 1 | skill 1 | 20 | 15 |
| 1 | skill 2 | 15 | 14 |
+-----------+---------+-------+------+
So far I have the following, but all it does is return the last 2 records for each skill. I am struggling to calculate the gains for the different time frames.
SELECT player_id, skill_id, name, score
FROM (SELECT player_id, skill_id, name, score,
@skill_count := IF(@current_skill = skill_id, @skill_count + 1, 1) AS skill_count,
@current_skill := skill_id
FROM skill_scores
INNER JOIN skills
ON skill_id = skills.id
WHERE player_id = 1
ORDER BY skill_id, score DESC
) counted
WHERE skill_count <= 2
I would like some help figuring out the query I need to build to get the desired results. Or is it best to do this with PHP instead of in the database?
EDIT:
MySQL 8.0.20 dummy data. The ids are auto-increment primary keys, but I left that out for simplicity:
CREATE TABLE skills
(
id bigint,
name VARCHAR(80)
);
CREATE TABLE skill_scores
(
id bigint,
player_id bigint,
skill_id bigint,
score bigint,
created_at timestamp
);
INSERT INTO skills VALUES (1, 'skill 1');
INSERT INTO skills VALUES (2, 'skill 2');
INSERT INTO skills VALUES (3, 'skill 3');
INSERT INTO skill_scores VALUES (1, 1, 1 , 5, '2020-06-06 00:00:00');
INSERT INTO skill_scores VALUES (2, 1, 1 , 10, '2020-07-06 00:00:00');
INSERT INTO skill_scores VALUES (3, 1, 2 , 1, '2020-07-06 00:00:00');
INSERT INTO skill_scores VALUES (4, 2, 1 , 11, '2020-07-06 00:00:00');
INSERT INTO skill_scores VALUES (5, 1, 1 , 13, '2020-07-07 00:00:00');
INSERT INTO skill_scores VALUES (6, 1, 2 , 10, '2020-07-07 00:00:00');
INSERT INTO skill_scores VALUES (7, 2, 1 , 12, '2020-07-07 00:00:00');
INSERT INTO skill_scores VALUES (8, 1, 1 , 20, '2020-07-08 00:00:00');
INSERT INTO skill_scores VALUES (9, 1, 2 , 15, '2020-07-08 00:00:00');
INSERT INTO skill_scores VALUES (10, 2, 1 , 17, '2020-07-08 00:00:00');
WITH cte AS (
SELECT id, player_id, skill_id,
FIRST_VALUE(score) OVER (PARTITION BY player_id, skill_id ORDER BY created_at DESC) score,
FIRST_VALUE(score) OVER (PARTITION BY player_id, skill_id ORDER BY created_at DESC) - FIRST_VALUE(score) OVER (PARTITION BY player_id, skill_id ORDER BY created_at ASC) gain,
ROW_NUMBER() OVER (PARTITION BY player_id, skill_id ORDER BY created_at DESC) rn
FROM skill_scores
WHERE created_at BETWEEN @current_date - INTERVAL @interval DAY AND @current_date
)
SELECT cte.player_id, skills.name, cte.score, cte.gain
FROM cte
JOIN skills ON skills.id = cte.skill_id
WHERE rn = 1
ORDER BY player_id, name;
fiddle
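The WHERE clause of the CTE takes a reference timestamp and a window length in days. A minimal sketch of supplying them as MySQL user variables named @current_date and @interval before running the query; the concrete values are illustrative and taken from the sample data, not from the answer:

```sql
-- Reference point for "now"; the latest created_at in the sample data.
SET @current_date = '2020-07-08 00:00:00';

-- Window length in days: 1 for the 24-hour query, 7 for 7 days, 31 for 31 days.
SET @interval = 1;
```

With these two values the window covers 2020-07-07 through 2020-07-08, which for player 1 should give gains of 7 and 5, as in the asker's 24-hour table. The query returns every player, so adding AND player_id = 1 inside the CTE narrows it to one. With @interval = 31 the window starts at 2020-06-07, so the 2020-06-06 row falls outside it.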
P.S. I don't understand where gain = 15 comes from for the 31-day period: the difference between '2020-07-08 00:00:00' and '2020-06-06 00:00:00' is 32 days.
Well, I think you need a (temporary) table for this. I will call it "player_skill_gains". It's basically the player's scores ordered by created_at, with an auto-incremented id:
CREATE TABLE player_skill_gains
(`id` int PRIMARY KEY AUTO_INCREMENT NOT NULL
, `player_id` int
, skill_id int
, score int
, created_at date)
;
INSERT INTO player_skill_gains(player_id, skill_id, score, created_at)
SELECT player_skills.player_id AS player_id
, player_skills.skill_id
, SUM(player_skills.score) AS score
, player_skills.created_at
FROM player_skills
GROUP BY player_skills.player_id, player_skills.skill_id, player_skills.created_at
ORDER BY player_skills.player_id, player_skills.skill_id, player_skills.created_at ASC;
Using this table we can relatively easily select the previous score for each row (the row with id - 1) and use it to calculate the gains:
SELECT player_skill_gains.player_id, skills.name, player_skill_gains.score
, player_skill_gains.score - IFNULL(bef.score,0) AS gain
, player_skill_gains.created_at
FROM player_skill_gains
INNER JOIN skills ON player_skill_gains.skill_id = skills.id
LEFT JOIN player_skill_gains AS bef ON (player_skill_gains.id - 1) = bef.id
AND player_skill_gains.player_id = bef.player_id
AND player_skill_gains.skill_id = bef.skill_id
For the different queries you want to have (24 hours, 7 days, etc.) you just have to specify the needed where-part for the query.
You can see all this in action here: http://sqlfiddle.com/#!9/1571a8/11/0
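Since the asker mentions MySQL 8.0.20, the "previous row" lookup can also be sketched without a helper table by using LAG(). This assumes the skills and skill_scores tables from the question's DDL, and it treats a missing previous score as 0, as the IFNULL in the query above does:

```sql
-- Assumes MySQL 8.0+ and the skills / skill_scores tables from the question.
SELECT ss.player_id,
       s.name,
       ss.score,
       ss.score - COALESCE(
           LAG(ss.score) OVER (
               PARTITION BY ss.player_id, ss.skill_id
               ORDER BY ss.created_at
           ), 0) AS gain,   -- the first row per player/skill has no previous score
       ss.created_at
FROM skill_scores AS ss
INNER JOIN skills AS s ON s.id = ss.skill_id
ORDER BY ss.player_id, ss.skill_id, ss.created_at;
```

One caveat: a WHERE filter on created_at is applied before LAG() is evaluated, so the first row inside a filtered window would report its whole score as gain.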

SQL request excluding periods of time

I need to get all DISTINCT users excluding those who are not available according to unavailability periods of time.
The user table:
+------+-----------+--------------------------------------+
| id | firstname | content |
+------+-----------+--------------------------------------+
| 13 | John | ... |
| 44 | Marc | ... |
| 55 | Elise | ... |
+------+-----------+--------------------------------------+
The unavailability periods table:
+------+-----------+--------------+--------------+
| id | user_id | start | end |
+------+-----------+--------------+--------------+
| 1 | 13 | 2019-07-01 | 2019-07-10 |
| 2 | 13 | 2019-07-20 | 2019-07-30 |
| 3 | 13 | 2019-09-01 | 2019-09-30 |
| 4 | 44 | 2019-08-01 | 2019-08-15 |
+------+-----------+--------------+--------------+
For example, we want the users who are available from 2019-06-20 to 2019-07-05: Marc and Elise are available.
Do I have to use a LEFT JOIN? This request is not working:
SELECT DISTINCT user.*, unavailability.start, unavailability.end,
FROM user
LEFT JOIN unavailability ON unavailability.user_id = user.id
WHERE
unavailability.start < "2019-06-20" AND unavailability.end > "2019-06-20"
AND unavailability.start < "2019-07-05" AND unavailability.end > "2019-07-05"
And I need as result:
+------+-----------+--------------------------------------+
| id | firstname | content |
+------+-----------+--------------------------------------+
| 44 | Marc | ... |
| 55 | Elise | ... |
+------+-----------+--------------------------------------+
With this request I don't get Elise who has no unavailability periods of time.
DROP TABLE IF EXISTS user;
CREATE TABLE user
(id SERIAL PRIMARY KEY
,firstname VARCHAR(12) NOT NULL UNIQUE
);
INSERT INTO user VALUES
(13,'John'),
(44,'Marc'),
(55,'Elise');
DROP TABLE IF EXISTS unavailability ;
CREATE TABLE unavailability
(id SERIAL PRIMARY KEY
,user_id INT NOT NULL
,start DATE NOT NULL
,end DATE NOT NULL
);
INSERT INTO unavailability VALUES
(1,13,'2019-07-01','2019-07-10'),
(2,13,'2019-07-20','2019-07-30'),
(3,13,'2019-09-01','2019-09-30'),
(4,44,'2019-08-01','2019-08-15');
SELECT x.*
FROM user x
LEFT
JOIN unavailability y
ON y.user_id = x.id
AND y.start <= '2019-07-05'
AND y.end >= '2019-06-20'
WHERE y.id IS NULL;
+----+-----------+
| id | firstname |
+----+-----------+
| 44 | Marc |
| 55 | Elise |
+----+-----------+
2 rows in set (0.01 sec)
This approach can be used:
select * from user k
where not exists (
select 1 from user
join unavailability u on u.user_id = user.id
and u.start <= '2019-07-05' and u.end >= '2019-06-20'
where user.id = k.id)
You can select the ids of the unavailable users and use that result in a subquery:
Schema (MySQL v5.7)
CREATE TABLE user (
`id` INTEGER,
`firstname` VARCHAR(5),
`content` VARCHAR(3)
);
INSERT INTO user
(`id`, `firstname`, `content`)
VALUES
(13, 'John', '...'),
(44, 'Marc', '...'),
(55, 'Elise', '...');
CREATE TABLE unavailability (
`id` INTEGER,
`user_id` INTEGER,
`start` DATETIME,
`end` DATETIME
);
INSERT INTO unavailability
(`id`, `user_id`, `start`, `end`)
VALUES
(1, 13, '2019-07-01', '2019-07-10'),
(2, 13, '2019-07-20', '2019-07-30'),
(3, 13, '2019-09-01', '2019-09-30'),
(4, 44, '2019-08-01', '2019-08-15');
Query #1
SELECT *
FROM user us
WHERE us.id NOT IN (
SELECT u.user_id
FROM unavailability u
WHERE u.start <= '2019-07-05' AND u.end >= '2019-06-20'
);
| id | firstname | content |
| --- | --------- | ------- |
| 44 | Marc | ... |
| 55 | Elise | ... |
View on DB Fiddle
Note
This condition:
unavailability.start < '2019-06-20' AND unavailability.end > '2019-06-20'
AND unavailability.start < '2019-07-05' AND unavailability.end > '2019-07-05'
is equivalent to:
unavailability.start < '2019-06-20' AND unavailability.end > '2019-07-05'
Because the parts are combined with AND, unavailability.start < '2019-06-20' AND unavailability.start < '2019-07-05' reduces to the stricter bound, unavailability.start < '2019-06-20'. The same reasoning reduces the two unavailability.end comparisons to unavailability.end > '2019-07-05'. The query therefore only matches periods that completely contain the requested range.
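For comparison, the usual test for "two date ranges overlap" is that each range starts on or before the other one ends. A small sketch that flags every unavailability row against the requested range of 2019-06-20 to 2019-07-05, using the unavailability table from the schema above (the column alias overlaps_request is made up for illustration):

```sql
-- 1 = the period overlaps the requested range, 0 = it does not.
SELECT u.id,
       u.user_id,
       u.start,
       u.end,
       (u.start <= '2019-07-05' AND u.end >= '2019-06-20') AS overlaps_request
FROM unavailability AS u
ORDER BY u.id;
```

With the sample rows only id 1 (John, 2019-07-01 to 2019-07-10) is flagged, which is why John is the only user excluded.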

Right join / inner join / multiselect [MYSQL] TABLE RESULTS

I'm having trouble finding the correct way to select a column from another table and show a single result that combines both tables.
First table:
id | times | project_id |
12 | 12.24 | 40 |
13 | 13.22 | 40 |
14 | 13.22 | 20 |
15 | 12.22 | 20 |
16 | 13.30 | 40 |
Second table:
id | times | project_id |
32 | 22.24 | 40 |
33 | 23.22 | 40 |
34 | 23.22 | 70 |
35 | 22.22 | 70 |
36 | 23.30 | 40 |
I want to select all the times from the first table for project_id = 40, and join to them the times from the second table for the same project_id = 40.
The result should look like this:
id | time | time | project_id |
12 | 12.24 | 22.24 | 40 |
13 | 13.22 | 23.22 | 40 |
16 | 13.30 | 23.30 | 40 |
You need to use UNION ALL between those 2 tables otherwise you will get incorrect results. Once you have all the rows together then you can use variables to carry over "previous values" such as shown below and demonstrated at this SQL Fiddle
MySQL 5.6 Schema Setup:
CREATE TABLE Table1
(`id` int, `times` decimal(6,2), `project_id` int)
;
INSERT INTO Table1
(`id`, `times`, `project_id`)
VALUES
(12, 12.24, 40),
(13, 13.22, 40),
(14, 13.22, 20),
(15, 12.22, 20),
(16, 13.30, 40)
;
CREATE TABLE Table2
(`id` int, `times` decimal(6,2), `project_id` int)
;
INSERT INTO Table2
(`id`, `times`, `project_id`)
VALUES
(32, 22.24, 40),
(33, 23.22, 40),
(34, 23.22, 70),
(35, 22.22, 70),
(36, 23.30, 40)
;
Query 1:
select
project_id, id, prev_time, times
from (
select
@row_num :=IF(@prev_value=d.project_id,@row_num+1,1) AS RowNumber
, d.*
, IF(@row_num %2 = 0, @prev_time, '') prev_time
, @prev_value := d.project_id
, @prev_time := times
from (
select `id`, `times`, `project_id` from Table1
union all
select `id`, `times`, `project_id` from Table2
) d
cross join (select @prev_value := 0, @row_num := 0) vars
order by d.project_id, d.times
) d2
where prev_time <> ''
Results:
| project_id | id | prev_time | times |
|------------|----|-----------|-------|
| 20 | 14 | 12.22 | 13.22 |
| 40 | 13 | 12.24 | 13.22 |
| 40 | 32 | 13.30 | 22.24 |
| 40 | 36 | 23.22 | 23.3 |
| 70 | 34 | 22.22 | 23.22 |
Note: MySQL did not support the LEAD() and LAG() functions when this answer was prepared. In versions that do support them (window functions were added in MySQL 8.0), this approach is simpler and probably more efficient:
select
d.*
from (
select
d1.*
, LEAD(times,1) OVER(partition by project_id order by times ASC) next_time
from (
select id, times, project_id from Table1
union all
select id, times, project_id from Table2
) d1
) d
where next_time is not null
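The expected output in the question pairs the rows by position (12 with 32, 13 with 33, 16 with 36). One way to sketch that pairing, assuming MySQL 8.0+ and assuming that "position" means the order of id within each project in each table, is to number the rows on both sides and join on that number:

```sql
-- Assumes MySQL 8.0+; pairing by id order within a project is an assumption.
SELECT t1.id,
       t1.times AS time_1,
       t2.times AS time_2,
       t1.project_id
FROM (
    SELECT id, times, project_id,
           ROW_NUMBER() OVER (PARTITION BY project_id ORDER BY id) AS rn
    FROM Table1
) AS t1
INNER JOIN (
    SELECT id, times, project_id,
           ROW_NUMBER() OVER (PARTITION BY project_id ORDER BY id) AS rn
    FROM Table2
) AS t2
  ON t2.project_id = t1.project_id
 AND t2.rn = t1.rn            -- n-th row of Table1 meets n-th row of Table2
WHERE t1.project_id = 40
ORDER BY t1.id;
```

With an INNER JOIN, a row that has no counterpart at the same position in the other table is dropped; a LEFT JOIN would keep it with a NULL second time.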

Select the most recent Date, but smaller than today's date in a record list

grade_id
grade_name
price
update_date.
There are several records for a given grade, with different dates and prices:
grade_id grade_name price update_date (y-m-d)
1 A 8$ 2011-02-01
1 A 10$ 2011-03-01
1 A 20$ 2011-04-01
2 B 10$ 2011-02-01
2 B 20$ 2011-03-01
2 B 30$ 2011-04-01
How can I get the last updated price (but in the past) with a select query,
to get:
1 A 10$ 2011-03-01
2 B 20$ 2011-03-01
as a result (because that is the most recent price with a past date)?
thx
david
SELECT t1.grade_id, t1.grade_name, t1.price, t1.update_date
FROM my_tbl t1
LEFT JOIN my_tbl t2 on t2.grade_id = t1.grade_id
AND t2.update_date > t1.update_date
AND t2.update_date < CURRENT_DATE
WHERE t1.update_date < CURRENT_DATE
AND t2.grade_id IS NULL
ORDER BY t1.grade_name
Not the fastest solution for a large number of records, but a readable one.
select *
from table t1
where update_date =
(select max(update_date)
from table t2
where t2.grade_id = t1.grade_id
and t2.update_date < current_date);
A primary key on (grade_id, update_date) on an InnoDB table helps.
root@natasha:test> CREATE TABLE t (grade_id INT UNSIGNED NOT NULL, grade_name CHAR(1) NOT NULL, price CHAR(3) NOT NULL, update_date DATE);
Query OK, 0 rows affected (0.10 sec)
root@natasha:test> INSERT INTO t VALUES (1, 'A', '8$', '2011-02-01'), (1, 'A', '10$', '2011-03-01'), (1, 'A', '20$', '2011-04-01'), (2, 'B', '10$', '2011-02-01'), (2, 'B', '20$', '2011-03-01'), (2, 'B', '30$', '2011-04-01');
Query OK, 6 rows affected (0.13 sec)
Records: 6 Duplicates: 0 Warnings: 0
root@natasha:test> SELECT * FROM t;
+----------+------------+-------+-------------+
| grade_id | grade_name | price | update_date |
+----------+------------+-------+-------------+
| 1 | A | 8$ | 2011-02-01 |
| 1 | A | 10$ | 2011-03-01 |
| 1 | A | 20$ | 2011-04-01 |
| 2 | B | 10$ | 2011-02-01 |
| 2 | B | 20$ | 2011-03-01 |
| 2 | B | 30$ | 2011-04-01 |
+----------+------------+-------+-------------+
6 rows in set (0.00 sec)
root@natasha:test> SELECT * FROM (SELECT * FROM t WHERE update_date < DATE(NOW()) ORDER BY update_date DESC) AS `t` GROUP BY grade_id;
+----------+------------+-------+-------------+
| grade_id | grade_name | price | update_date |
+----------+------------+-------+-------------+
| 1 | A | 10$ | 2011-03-01 |
| 2 | B | 20$ | 2011-03-01 |
+----------+------------+-------+-------------+
2 rows in set (0.00 sec)
select t1.* from table as t1
inner join (
select grade_id,max(update_date) as update_date
from table where update_date < curdate() group by grade_id ) as t2
on t1.grade_id = t2.grade_id and t1.update_date = t2.update_date
Add an index to the table on (grade_id,update_date)
SELECT grade_id, MAX(update_date) FROM table WHERE update_date < CURDATE() GROUP BY grade_id;
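On MySQL 8.0 or later the same "latest row per grade before today" can be sketched with ROW_NUMBER(). This uses the demo table t created above and does not rely on the row order of a derived table, which MySQL does not guarantee to preserve through a GROUP BY:

```sql
-- Assumes MySQL 8.0+ and the demo table t (grade_id, grade_name, price, update_date).
SELECT grade_id, grade_name, price, update_date
FROM (
    SELECT grade_id, grade_name, price, update_date,
           ROW_NUMBER() OVER (
               PARTITION BY grade_id
               ORDER BY update_date DESC      -- newest past date first
           ) AS rn
    FROM t
    WHERE update_date < CURRENT_DATE          -- ignore today and future dates
) AS ranked
WHERE rn = 1
ORDER BY grade_id;
```

If two rows of a grade could share the same update_date, a further ORDER BY column would be needed to make the choice deterministic.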

How do I create a period date range from a mysql table grouping every common sequence of value in a column

My goal is to return a start and end date having same value in a column. Here is my table. The (*) have been marked to give you the idea of how I want to get "EndDate" for every similar sequence value of A & B columns
ID | DayDate | A | B
-----------------------------------------------
1 | 2010/07/1 | 200 | 300
2 | 2010/07/2 | 200 | 300 *
3 | 2010/07/3 | 150 | 250
4 | 2010/07/4 | 150 | 250 *
8 | 2010/07/5 | 150 | 350 *
9 | 2010/07/6 | 200 | 300
10 | 2010/07/7 | 200 | 300 *
11 | 2010/07/8 | 100 | 200
12 | 2010/07/9 | 100 | 200 *
and I want to get the following result table from the above table
| DayDate |EndDate | A | B
-----------------------------------------------
| 2010/07/1 |2010/07/2 | 200 | 300
| 2010/07/3 |2010/07/4 | 150 | 250
| 2010/07/5 |2010/07/5 | 150 | 350
| 2010/07/6 |2010/07/7 | 200 | 300
| 2010/07/8 |2010/07/9 | 100 | 200
UPDATE:
Thanks Mike. Your approach seems to work if the following row is treated as a mistake:
8 | 2010/07/5 | 150 | 350 *
However, it is not a mistake. The data is like a log of market price changes by date. What I actually need is to select each run of consecutive rows in which both A and B match, with its beginning and ending date, then the run that follows it, and so on, so that no row in the table is left out.
Here is a real-world scenario. A hotel with rooms A and B has the room rates for each day entered into the table, as shown in my question. The hotel now needs a report that shows the price calendar in a shorter form, using start and end dates instead of listing every date entered. For example, from 2010/07/01 to 2010/07/02 the price of A is 200 and B is 300. The price changes for the 3rd to the 4th, and on the 5th there is a different price for that day only, where the price of room B changes to 350. This counts as a single-day period, which is why its start and end dates are the same.
I hope this explains the problem. Also note that the hotel may be closed for specific periods, which is an additional problem on top of my first question: what if no rate is entered on certain dates? For example, on Sundays the hotel does not sell these two rooms, so no price is entered and the row will not exist in the table.
Creating related tables gives you much greater freedom to query and extract relevant information. Here are a few links that you might find useful.
You could start with these tutorials:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://net.tutsplus.com/tutorials/databases/sql-for-beginners/
There are also a couple of questions here on stackoverflow that might be useful:
Normalization in plain English
What exactly does database normalization do?
Anyway, on to a possible solution. The following examples use your hotel rooms analogy.
First, create a table to hold information about the hotel rooms. This table just contains the room ID and its name, but you could store other information in here, such as the room type (single, double, twin), its view (ocean front, ocean view, city view, pool view), and so on:
CREATE TABLE `room` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `name_UNIQUE` (`name` ASC) )
ENGINE = InnoDB;
Now create a table to hold the changing room rates. This table links to the room table through the room_id column. The foreign key constraint prevents records being inserted into the rate table which refer to rooms that do not exist:
CREATE TABLE `rate` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT ,
`room_id` INT UNSIGNED NOT NULL,
`date` DATE NOT NULL,
`rate` DECIMAL(6,2) UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
INDEX `fk_room_rate` (`room_id` ASC),
CONSTRAINT `fk_room_rate`
FOREIGN KEY (`room_id` )
REFERENCES `room` (`id` )
ON DELETE CASCADE
ON UPDATE CASCADE)
ENGINE = InnoDB;
Create two rooms, and add some daily rate information about each room:
INSERT INTO `room` (`id`, `name`) VALUES (1, 'A'), (2, 'B');
INSERT INTO `rate` (`id`, `room_id`, `date`, `rate`) VALUES
( 1, 1, '2010-07-01', 200),
( 2, 1, '2010-07-02', 200),
( 3, 1, '2010-07-03', 150),
( 4, 1, '2010-07-04', 150),
( 5, 1, '2010-07-05', 150),
( 6, 1, '2010-07-06', 200),
( 7, 1, '2010-07-07', 200),
( 8, 1, '2010-07-08', 100),
( 9, 1, '2010-07-09', 100),
(10, 2, '2010-07-01', 300),
(11, 2, '2010-07-02', 300),
(12, 2, '2010-07-03', 250),
(13, 2, '2010-07-04', 250),
(14, 2, '2010-07-05', 350),
(15, 2, '2010-07-06', 300),
(16, 2, '2010-07-07', 300),
(17, 2, '2010-07-08', 200),
(18, 2, '2010-07-09', 200);
With that information stored, a simple SELECT query with a JOIN will show you all the daily room rates:
SELECT
room.name,
rate.date,
rate.rate
FROM room
JOIN rate
ON rate.room_id = room.id;
+------+------------+--------+
| name | date | rate |
+------+------------+--------+
| A | 2010-07-01 | 200.00 |
| A | 2010-07-02 | 200.00 |
| A | 2010-07-03 | 150.00 |
| A | 2010-07-04 | 150.00 |
| A | 2010-07-05 | 150.00 |
| A | 2010-07-06 | 200.00 |
| A | 2010-07-07 | 200.00 |
| A | 2010-07-08 | 100.00 |
| A | 2010-07-09 | 100.00 |
| B | 2010-07-01 | 300.00 |
| B | 2010-07-02 | 300.00 |
| B | 2010-07-03 | 250.00 |
| B | 2010-07-04 | 250.00 |
| B | 2010-07-05 | 350.00 |
| B | 2010-07-06 | 300.00 |
| B | 2010-07-07 | 300.00 |
| B | 2010-07-08 | 200.00 |
| B | 2010-07-09 | 200.00 |
+------+------------+--------+
To find the start and end dates for each room rate, you need a more complex query:
SELECT
MIN(id) AS id,
room_id,
MIN(date) AS start_date,
MAX(date) AS end_date,
COUNT(*) AS days,
rate
FROM (
SELECT
id,
room_id,
date,
rate,
(
SELECT COUNT(*)
FROM rate AS b
WHERE b.rate <> a.rate
AND b.date <= a.date
AND b.room_id = a.room_id
) AS grp -- GROUPING is a reserved word in MySQL 8.0, so the alias is grp
FROM rate AS a
ORDER BY a.room_id, a.date
) c
GROUP BY room_id, rate, grp -- room_id keeps runs from different rooms apart
ORDER BY room_id, MIN(date);
+----+---------+------------+------------+------+--------+
| id | room_id | start_date | end_date | days | rate |
+----+---------+------------+------------+------+--------+
| 1 | 1 | 2010-07-01 | 2010-07-02 | 2 | 200.00 |
| 3 | 1 | 2010-07-03 | 2010-07-05 | 3 | 150.00 |
| 6 | 1 | 2010-07-06 | 2010-07-07 | 2 | 200.00 |
| 8 | 1 | 2010-07-08 | 2010-07-09 | 2 | 100.00 |
| 10 | 2 | 2010-07-01 | 2010-07-02 | 2 | 300.00 |
| 12 | 2 | 2010-07-03 | 2010-07-04 | 2 | 250.00 |
| 14 | 2 | 2010-07-05 | 2010-07-05 | 1 | 350.00 |
| 15 | 2 | 2010-07-06 | 2010-07-07 | 2 | 300.00 |
| 17 | 2 | 2010-07-08 | 2010-07-09 | 2 | 200.00 |
+----+---------+------------+------------+------+--------+
You can find a good explanation of the technique used in the above query here:
http://www.sqlteam.com/article/detecting-runs-or-streaks-in-your-data
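On MySQL 8.0 or later, the same run detection can be sketched with window functions instead of a correlated subquery. This is an alternative formulation, not the technique from the linked article: within a room, the difference between a row's overall position and its position among rows with the same rate stays constant for as long as the rate does not change, so it can serve as the group key. The names runs and grp are made up for illustration.

```sql
-- Assumes MySQL 8.0+ and the room / rate tables defined above.
SELECT room_id,
       MIN(`date`) AS start_date,
       MAX(`date`) AS end_date,
       COUNT(*)    AS days,
       rate
FROM (
    SELECT room_id, `date`, rate,
           ROW_NUMBER() OVER (PARTITION BY room_id ORDER BY `date`)
         - ROW_NUMBER() OVER (PARTITION BY room_id, rate ORDER BY `date`) AS grp
    FROM rate
) AS runs
GROUP BY room_id, rate, grp      -- one group per unbroken run of a rate
ORDER BY room_id, start_date;
```

Because the positions come from row order rather than from date arithmetic, a day with no row at all (such as a closed Sunday) does not split a run.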
My general approach is to join the table onto itself based on DayDate = DayDate + 1 and the A or B values not being equal.
This finds the end date of each period (where a value is going to be different on the following day).
The only problem is that this won't find an end date for the final period. To get around this, I select the max date from the table and union that into my list of end dates.
Once you have the list of end dates, you can join it to the original table on the end date being greater than or equal to the original date.
From this final list, select the minimum DayDate grouped by the other fields:
select
min(DayDate) as DayDate,EndDate,A,B from
(SELECT DayDate, A, B, min(ends.EndDate) as EndDate
FROM yourtable
LEFT JOIN
(SELECT max(DayDate) as EndDate FROM yourtable UNION
SELECT t1.DayDate as EndDate
FROM yourtable t1
JOIN yourtable t2
ON date_add(t1.DayDate, INTERVAL 1 DAY) = t2.DayDate
AND (t1.A<>t2.A OR t1.B<>t2.B)) ends
ON ends.EndDate>=DayDate
GROUP BY DayDate, A, B) x
GROUP BY EndDate,A,B
I think I have found a solution which does produce the table desired.
SELECT
a.DayDate AS StartDate,
( SELECT b.DayDate
FROM Dates AS b
WHERE b.DayDate > a.DayDate AND (b.B = a.B OR b.B IS NULL)
ORDER BY b.DayDate ASC LIMIT 1
) AS StopDate,
a.A as A,
a.B AS B
FROM Dates AS a
WHERE Coalesce(
(SELECT c.B
FROM Dates AS c
WHERE c.DayDate <= a.DayDate
ORDER BY c.DayDate DESC LIMIT 1,1
), -99999
) <> a.B
AND a.B IS NOT NULL
ORDER BY a.DayDate ASC;
This generates the following result:
StartDate StopDate A B
2010-07-01 2010-07-02 200 300
2010-07-03 2010-07-04 150 250
2010-07-05 NULL 150 350
2010-07-06 2010-07-07 200 300
2010-07-08 2010-07-09 100 200
But I need a way to replace the NULL with the same date of the start date.
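One way to sketch that replacement is to wrap the StopDate subquery in COALESCE(), so that a.DayDate is used whenever the subquery finds no later row. The rest of the query is unchanged and still assumes the Dates table used above:

```sql
SELECT
    a.DayDate AS StartDate,
    COALESCE(
        ( SELECT b.DayDate
          FROM Dates AS b
          WHERE b.DayDate > a.DayDate AND (b.B = a.B OR b.B IS NULL)
          ORDER BY b.DayDate ASC LIMIT 1
        ),
        a.DayDate              -- fall back to the start date for one-day runs
    ) AS StopDate,
    a.A AS A,
    a.B AS B
FROM Dates AS a
WHERE COALESCE(
        ( SELECT c.B
          FROM Dates AS c
          WHERE c.DayDate <= a.DayDate
          ORDER BY c.DayDate DESC LIMIT 1,1
        ), -99999
      ) <> a.B
  AND a.B IS NOT NULL
ORDER BY a.DayDate ASC;
```

For the sample data this turns the NULL on the 2010-07-05 row into 2010-07-05; the other rows are unaffected because their subquery already returns a date.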