Cross join, on multiple columns, without duplicates - mysql

We have two tables with a mostly unique email, and a date where a transaction was sent (from one system) and received (in another system):
CREATE TABLE `alpha` (
`id` int(11) NOT NULL,
`email` varchar(255) NOT NULL,
`date_sent` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `alpha`
VALUES
(12344,'loremipsum@example.com','2013-01-01 02:26:04'),
(12345,'foobar@example.com','2013-01-01 04:39:16'),
(12346,'foobar@example.com','2013-01-01 04:43:18');
CREATE TABLE `bravo` (
`id` int(11) NOT NULL,
`email` varchar(60) DEFAULT NULL,
`date_recvd` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `bravo`
VALUES
(98764,'loremipsum@example.com','2013-01-01 03:29:12'),
(98765,'foobar@example.com','2013-01-01 05:42:08'),
(98766,'foobar@example.com','2013-01-01 05:46:08');
With a simple join on email and m/d/y of the date:
select a.id, a.date_sent, b.id, b.date_recvd
from alpha a inner join bravo b
on a.email = b.email and date_format(a.date_sent,'%m/%d/%Y') = date_format(b.date_recvd,'%m/%d/%Y')
We get every permutation of email+date:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98765 | 2013-01-01 05:42:08 |
| 12345 | 2013-01-01 04:39:16 | 98766 | 2013-01-01 05:46:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
What we want is something more like this, where we join first on the email, and then pair up the dates in order so that they roughly line up:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
But I'm not even certain how to approach this?
Clarification: What we'd like to do is, emails being equal, eliminate the duplicates so that the date gaps are smallest.

Under certain conditions the following query will provide the results you want:
SELECT an.*, bn.*
FROM
(SELECT a.*,
(CASE a.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=a.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM alpha ORDER BY email, date_sent) a) an
JOIN
(SELECT b.*,
(CASE b.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=b.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM bravo ORDER BY email, date_recvd) b) bn
ON an.email=bn.email AND an.rn=bn.rn;
With the limited data you provided, this works. You can see it here: SQLFiddle
What this is doing is:
Adding an rn column to alpha... this is some sort of row numbering within all rows with the same email, sorted by date_sent
Adding an rn column to bravo... same as above
JOINing the two result sets on email and rn
This will work ONLY if alpha and bravo contain good data that matches well.
The conditions are quite strict, especially on the bravo table. In particular, bravo should not contain any early rows, that is, rows that match an email in alpha but have a date_recvd earlier than the first alpha date_sent for that email.
You could elaborate on this and work out a more complex version that works on email, date (day only) and rownumber... as you suggested in your question. But I don't think this is a good solution. I see you have significant gaps between date_sent and date_recvd. If the gaps roll over midnight you will not be able to match rows correctly.
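On MySQL 8.0 or later, the same row-numbering idea can be written with window functions instead of user variables. The following is a minimal sketch, assuming the alpha and bravo tables from the question; the CTE names an and bn and the output aliases are only illustrative, and the id columns are added to each ORDER BY as an assumed tie-breaker for rows with identical timestamps:

```sql
-- Sketch only: requires MySQL 8.0+ (CTEs and window functions).
WITH an AS (
    -- Number alpha rows within each email, oldest date_sent first
    SELECT a.*,
           ROW_NUMBER() OVER (PARTITION BY a.email
                              ORDER BY a.date_sent, a.id) AS rn
    FROM alpha a
),
bn AS (
    -- Number bravo rows within each email, oldest date_recvd first
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY b.email
                              ORDER BY b.date_recvd, b.id) AS rn
    FROM bravo b
)
-- Pair the n-th sent row with the n-th received row for the same email
SELECT an.id AS alpha_id, an.date_sent, bn.id AS bravo_id, bn.date_recvd
FROM an
JOIN bn ON an.email = bn.email AND an.rn = bn.rn
ORDER BY an.id;
```

On the sample data this should pair 12344 with 98764, 12345 with 98765 and 12346 with 98766. The same caveat applies: rows are matched purely by position within each email, so unmatched or early rows in either table will shift the pairing.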

Related

Get specific values from same column within grouped rows

This is a problem for which I have a working query, but it feels horribly inefficient to me and I'd like some help constructing a better one. This is going into a live production environment, and the number of queries the db handles each day is incredibly high, so the more efficient this can be, the better. I have a table structured something like this (stripped to just the relevant parts):
id | type | datecolumn
1 | A | 2014-01-01
1 | B | 0000-00-00
2 | A | 2014-01-02
2 | B | 2014-01-10
3 | A | 2014-01-01
3 | B | 0000-00-00
There will always be two rows for each id, one of type A and one of type B. A will always have a valid date, and B will either have a date >= that of A, or all 0s. What I want is a query that will produce output similar to this:
id | date for A | date for B
1 | 2014-01-01 | None
2 | 2014-01-02 | 2014-01-10
3 | 2014-01-01 | None
The way I'm doing this now is as follows:
SELECT
id,
IF(MIN(datecolumn) > 0, MIN(datecolumn), MAX(datecolumn)) AS 'date for A',
IF(MIN(datecolumn) > 0, MAX(datecolumn), 'None') AS 'date for B'
FROM tbl -- placeholder name; the question does not name the table
GROUP BY id
But it really feels like I should be able to pluck the datecolumn value on a by-type basis somehow. I know the simplest solution would be to change the table structure so that each id only uses one row, but I'm afraid that is not possible in this case; there have to be two rows. Is there a way to leverage the type column properly in this query?
Edit: Also, this is on a table that will have upwards of 10,000,000 rows. So again, efficiency is key.
I'd stick with what you've got, but maybe write it this way...
CREATE TABLE my_table
(id INT NOT NULL
,type CHAR(1) NOT NULL
,datecolumn DATE NOT NULL DEFAULT '0000-00-00'
,PRIMARY KEY(id,type)
);
INSERT INTO my_table VALUES
(1 ,'A','2014-01-01'),
(1 ,'B','0000-00-00'),
(2 ,'A','2014-01-02'),
(2 ,'B','2014-01-10'),
(3 ,'A','2014-01-01'),
(3 ,'B','0000-00-00');
SELECT id
, MAX(CASE WHEN type = 'A' THEN datecolumn END) a
, MAX(REPLACE(CASE WHEN type='B' THEN datecolumn END,'0000-00-00','none')) b
FROM my_table
GROUP
BY id;
+----+------------+------------+
| id | a | b |
+----+------------+------------+
| 1 | 2014-01-01 | none |
| 2 | 2014-01-02 | 2014-01-10 |
| 3 | 2014-01-01 | none |
+----+------------+------------+
Make sure you have an index that covers both the id and type columns (e.g. ALTER TABLE tbl ADD INDEX (type,id)), then do:
SELECT
table_a.id,
table_a.datecolumn AS 'date for A',
IF(table_b.datecolumn > 0, table_b.datecolumn, 'None') AS 'date for B'
FROM tbl AS table_a
JOIN tbl AS table_b ON table_a.id = table_b.id AND table_b.type = 'B'
WHERE table_a.type = 'A';

MySQL GROUP BY with sorting

I'm having some trouble writing succinct code to generate the desired result efficiently (on a DB with several million records).
items will be grouped by time
items will be selected by provider, with B taking precedence over A (and C over B)
value must match the value of the selected provider
Table vs wanted result:
// given this table
id | provider | time | value
---+----------+------------+-----------
1 | A | 2013-07-01 | 0.1
2 | A | 2013-07-02 | 0.2
3 | B | 2013-07-02 | 0.3
4 | A | 2013-07-03 | 0.4
// extrapolate this result
---+----------+------------+-----------
1 | A | 2013-07-01 | 0.1
3 | B | 2013-07-02 | 0.3
4 | A | 2013-07-03 | 0.4
The queries to generate table and populate data:
CREATE TABLE `data_teste` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`provider` varchar(12) NOT NULL,
`time` date NOT NULL,
`value` double NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index` (`provider`,`time`),
KEY `provider` (`provider`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO data_teste(`provider`, `time`, `value`) VALUES('A', '2013-07-01', 0.1),('A', '2013-07-02', 0.2),('B', '2013-07-02', 0.3),('A', '2013-07-03', 0.4);
This is the classic windowed GROUP BY/sort problem.
Thank you very much.
select d.*
from data_teste d
inner join
(
select `time`, max(provider) mp
from data_teste
group by `time`
) x on x.mp = d.provider
and x.`time` = d.`time`
order by `time` asc,
provider desc
How well does this perform?
SELECT
*
FROM
`data_teste` dt1
LEFT JOIN `data_teste` dt2 ON ( dt2.time = dt1.time
AND dt2.provider > dt1.provider )
WHERE
dt2.ID IS NULL
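If you are on MySQL 8.0 or later, a window function is another option. This is only a sketch, assuming the data_teste table above and that the alphabetically highest provider should win, as in the sample; the alias ranked and the column rn are made up for illustration:

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
SELECT id, provider, `time`, `value`
FROM (
    -- Within each day, rank providers so the highest letter comes first
    SELECT d.*,
           ROW_NUMBER() OVER (PARTITION BY d.`time`
                              ORDER BY d.provider DESC) AS rn
    FROM data_teste d
) ranked
WHERE rn = 1          -- keep only the preferred provider per day
ORDER BY `time`;
```

For the four sample rows this should keep ids 1, 3 and 4. If precedence is not alphabetical, the ORDER BY inside OVER() would need a CASE expression or a lookup table that ranks the providers.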

Join two tables where table A has a date value and needs to find the next date in B below the date in A

I got this table "A":
| id | date |
===================
| 1 | 2010-01-13 |
| 2 | 2011-04-19 |
| 3 | 2011-05-07 |
| .. | ... |
and this table "B":
| date | value |
======================
| 2009-03-29 | 0.5 |
| 2010-01-30 | 0.55 |
| 2011-08-12 | 0.67 |
Now I am looking for a way to JOIN those two tables so that the "value" column in "B" is mapped to the dates in "A". The tricky part for me is that table "B" only stores the date of each change and the new value. So for each date in "A", the SQL needs to look back and find the closest date in "B" that is earlier than that date.
So in the end the JOIN of those tables should look like this:
| id | date | value |
===========================
| 1 | 2010-01-13 | 0.5 |
| 2 | 2011-04-19 | 0.55 |
| 3 | 2011-05-07 | 0.55 |
| .. | ... | ... |
How can I do this?
-- Create and fill first table
CREATE TABLE `id_date` (
`id` int(11) NOT NULL auto_increment,
`iddate` date NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `id_date` VALUES(1, '2010-01-13');
INSERT INTO `id_date` VALUES(2, '2011-04-19');
INSERT INTO `id_date` VALUES(3, '2011-05-07');
-- Create and fill second table
CREATE TABLE `date_val` (
`mydate` date NOT NULL,
`myval` varchar(4) collate utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `date_val` VALUES('2009-03-29', '0.5');
INSERT INTO `date_val` VALUES('2010-01-30', '0.55');
INSERT INTO `date_val` VALUES('2011-08-12', '0.67');
-- Get the result table as asked in question
SELECT iddate, t2.mydate, t2.myval
FROM `id_date` t1
JOIN date_val t2 ON t2.mydate <= t1.iddate
AND t2.mydate = (
SELECT MAX( t3.mydate )
FROM `date_val` t3
WHERE t3.mydate <= t1.iddate )
What we're doing:
for each date in the id_date table (your table A),
we find the date in the date_val table (your table B)
which is the highest date in the date_val table (but still no later than id_date.iddate)
You could use a subquery with limit 1 to look up the latest value in table B:
select id
, date
, (
select value
from B
where B.date < A.date
order by
B.date desc
limit 1
) as value
from A
I have been inspired by the other answers but ended with my own solution using common table expressions:
WITH datecombination (id, adate, bdate) AS
(
SELECT id, A.date, MAX(B.Date) as Bdate
FROM tableA A
LEFT JOIN tableB B
ON B.date <= A.date
GROUP BY A.id, A.date
)
SELECT DC.id, DC.adate, B.value FROM datecombination DC
LEFT JOIN tableB B
ON DC.bdate = B.date
An INNER JOIN returns rows only when there is a match in both tables. Try this:
Select A.id,A.date,b.value
from A inner join B
on A.date=b.date

MySQL grouping by date range with multiple joins

I currently have quite a messy query, which joins data from multiple tables involving two subqueries. I now have a requirement to group this data by DAY(), WEEK(), MONTH(), and QUARTER().
I have three tables: days, qos and employees. An employee is self-explanatory, a day is a summary of an employee's performance on a given day, and qos is a random quality inspection, which can be performed many times a day.
At the moment, I am selecting all employees, and LEFT JOINing day and qos, which works well. However, I now need to group the data in order to break down a team's or individual's performance over a date range.
Taking this data:
Employee
id | name
------------------
1 | Bob Smith
Day
id | employee_id | day_date | calls_taken
---------------------------------------------
1 | 1 | 2011-03-01 | 41
2 | 1 | 2011-03-02 | 24
3 | 1 | 2011-04-01 | 35
Qos
id | employee_id | qos_date | score
----------------------------------------
1 | 1 | 2011-03-03 | 85
2 | 1 | 2011-03-03 | 95
3 | 1 | 2011-04-01 | 91
If I were to start by grouping by DAY(), I would need to see the following results:
Day__date | Day__Employee__id | Day__calls | Day__qos_score
------------------------------------------------------------
2011-03-01 | 1 | 41 | NULL
2011-03-02 | 1 | 24 | NULL
2011-03-03 | 1 | NULL | 90
2011-04-01 | 1 | 35 | 91
As you see, Day__calls should be SUM(calls_taken) and Day__qos_score is AVG(score). I've tried using a similar method as above, but as the date isn't known until one of the tables has been joined, it's only displaying a record where there's a day saved.
Is there any way of doing this, or am I going about things the wrong way?
Edit: As requested, here's what I've come up with so far. However, it only shows dates where there's a day.
SELECT COALESCE(`day`.day_date, qos.qos_date) AS Day__date,
employee.id AS Day__Employee__id,
`day`.calls_taken AS Day__Day__calls,
qos.score AS Day__Qos__score
FROM faults_employees `employee`
LEFT JOIN (SELECT `day`.employee_id AS employee_id,
SUM(`day`.calls_taken) AS `calls_in`,
FROM faults_days AS `day`
WHERE employee.id = 7
GROUP BY (`day`.day_date)
) AS `day`
ON `day`.employee_id = `employee`.id
LEFT JOIN (SELECT `qos`.employee_id AS employee_id,
AVG(qos.score) AS `score`
FROM faults_qos qos
WHERE employee.id = 7
GROUP BY (qos.qos_date)
) AS `qos`
ON `qos`.employee_id = `employee`.id AND `qos`.qos_date = `day`.day_date
WHERE employee.id = 7
GROUP BY Day__date
ORDER BY `day`.day_date ASC
The solution I'm coming up with looks like:
SELECT
`date`,
`employee_id`,
SUM(`union`.`calls_taken`) AS `calls_taken`,
AVG(`union`.`score`) AS `score`
FROM ( -- select from union table
(SELECT -- first select all calls taken, leaving qos_score null
`day`.`day_date` AS `date`,
`day`.`employee_id`,
`day`.`calls_taken`,
NULL AS `score`
FROM `employee`
LEFT JOIN
`day`
ON `day`.`employee_id` = `employee`.`id`
)
UNION ALL -- union both tables; ALL keeps identical rows so SUM() and AVG() are not skewed
(
SELECT -- now select qos score, leaving calls taken null
`qos`.`qos_date` AS `date`,
`qos`.`employee_id`,
NULL AS `calls_taken`,
`qos`.`score`
FROM `employee`
LEFT JOIN
`qos`
ON `qos`.`employee_id` = `employee`.`id`
)
) `union`
GROUP BY `union`.`date`, `union`.`employee_id` -- group union table by date (and employee, so employee_id is well-defined)
For the UNION to work, both SELECTs must return the same columns, so the score column is selected as NULL from the day table and the calls_taken column is selected as NULL from the qos table. If we didn't do this, calls_taken and score would end up in the same column of the UNION result.
After this, I selected the required fields with the aggregation functions SUM() and AVG() from the union'd table, grouping by the date field in the union table.
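To roll the same union up by a coarser period, the grouping expression can be swapped. Here is a rough sketch for a monthly breakdown, assuming the `day` and qos tables from the question (the join to employee is left out for brevity); the aliases u and period_label are made up here, and UNION ALL is used so that identical rows are not collapsed before aggregating:

```sql
-- Sketch only: monthly rollup of calls and QoS scores per employee.
SELECT
    DATE_FORMAT(u.`date`, '%Y-%m') AS period_label,  -- e.g. '2011-03'
    u.employee_id,
    SUM(u.calls_taken) AS calls_taken,
    AVG(u.score)       AS score
FROM (
    -- calls per day, with no score
    SELECT day_date AS `date`, employee_id, calls_taken, NULL AS score
    FROM `day`
    UNION ALL
    -- QoS inspections, with no calls
    SELECT qos_date AS `date`, employee_id, NULL AS calls_taken, score
    FROM qos
) u
GROUP BY period_label, u.employee_id
ORDER BY period_label, u.employee_id;
```

With the sample data this should give 65 calls and an average score of 90 for 2011-03, and 35 calls and a score of 91 for 2011-04. For weekly or quarterly grouping, YEARWEEK(u.`date`) or CONCAT(YEAR(u.`date`), '-Q', QUARTER(u.`date`)) could replace the DATE_FORMAT() call.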

How to get smallest column value without triggering "Mixing of GROUP columns [...] with no GROUP columns is illegal if there is no GROUP BY clause"?

I have a table 'foo' with a timestamp field 'bar'. How do I get only the oldest timestamp for a query like: SELECT foo.bar from foo? I tried doing something like: SELECT MIN(foo.bar) from foo but it failed with this error
ERROR 1140 (42000) at line 1: Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no GROUP columns is illegal if there is no GROUP BY clause
OK, so my query is much more complicated than that and that's why I am having a hard time with it. This is the query with the MIN(a.timestamp):
select distinct a.user_id as 'User ID',
a.project_id as 'Remix Project Id',
prjs.based_on_pid as 'Original Project ID',
(case when f.reasons is NULL then 'N' else 'Y' end)
as 'Flagged Y or N',
f.reasons, f.timestamp, MIN(a.timestamp)
from view_stats a
join (select id, based_on_pid, user_id
from projects p) prjs on
(a.project_id = prjs.id)
left outer join flaggers f on
( f.project_id = a.project_id
and f.user_id = a.user_id)
where a.project_id in
(select distinct b.id
from projects b
where b.based_on_pid in
( select distinct c.id
from projects c
where c.user_id = a.user_id
)
)
order by f.reasons desc, a.user_id, a.project_id;
Any help would be greatly appreciated.
The view_stats table:
+------------+------------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+-------------------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| user_id | int(10) unsigned | NO | MUL | 0 | |
| project_id | int(10) unsigned | NO | MUL | 0 | |
| ipaddress | bigint(20) | YES | MUL | NULL | |
| timestamp | timestamp | NO | | CURRENT_TIMESTAMP | |
+------------+------------------+------+-----+-------------------+----------------+
If you are going to use aggregate functions (like min(), max(), avg(), etc.) alongside non-aggregated columns, you need to tell the database which groups of rows to aggregate over, using a GROUP BY clause. For example, given this data:
transaction date
one 8/4/09
one 8/5/09
one 8/6/09
two 8/1/09
two 8/3/09
three 8/4/09
I assume you want the following.
transaction date
one 8/4/09
two 8/1/09
three 8/4/09
Then to get that you can use the following query...note the group by clause which tells the database how to group the data and get the min() of something.
select
transaction,
min(date)
from
table
group by
transaction
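Applied to the view_stats table from the question, a minimal sketch might look like this. It assumes you want the oldest view per user and project; first_view is a made-up alias, and the joins to projects and flaggers from the original query are left out:

```sql
-- Sketch only: oldest view timestamp for each user/project pair.
SELECT a.user_id,
       a.project_id,
       MIN(a.`timestamp`) AS first_view
FROM view_stats a
GROUP BY a.user_id, a.project_id;
```

Every non-aggregated column in the SELECT list also appears in the GROUP BY, which is what avoids error 1140.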