I'm having trouble writing succinct code that generates the desired result efficiently (on a table with several million records).
items are grouped by time
within each group, one item is selected by provider, where B takes precedence over A (and C over B)
the returned value must be the value of the selected provider
Table vs. wanted result:
// given this table
id | provider | time | value
---+----------+------------+-----------
1 | A | 2013-07-01 | 0.1
2 | A | 2013-07-02 | 0.2
3 | B | 2013-07-02 | 0.3
4 | A | 2013-07-03 | 0.4
// extrapolate this result
---+----------+------------+-----------
1 | A | 2013-07-01 | 0.1
3 | B | 2013-07-02 | 0.3
4 | A | 2013-07-03 | 0.4
The queries to generate table and populate data:
CREATE TABLE `data_teste` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`provider` varchar(12) NOT NULL,
`time` date NOT NULL,
`value` double NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index` (`provider`,`time`),
KEY `provider` (`provider`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO data_teste(`provider`, `time`, `value`) VALUES('A', '2013-07-01', 0.1),('A', '2013-07-02', 0.2),('B', '2013-07-02', 0.3),('A', '2013-07-03', 0.4);
This is the classic group_by/sort problem windowed.
Thank you very much.
select d.*
from data_teste d
inner join
(
select `time`, max(provider) mp
from data_teste
group by `time`
) x on x.mp = d.provider
and x.`time` = d.`time`
order by `time` asc,
provider desc
How well does this perform?
SELECT
*
FROM
`data_teste` dt1
LEFT JOIN `data_teste` dt2 ON ( dt2.time = dt1.time
AND dt2.provider > dt1.provider )
WHERE
dt2.ID IS NULL
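To check that both answers pick the same rows, here is a minimal sketch that runs them against the sample data. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the snippet is self-contained; column types are simplified), so it says nothing about performance on millions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_teste (
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,
    time TEXT NOT NULL,
    value REAL NOT NULL,
    UNIQUE (provider, time)
);
INSERT INTO data_teste (provider, time, value) VALUES
    ('A', '2013-07-01', 0.1),
    ('A', '2013-07-02', 0.2),
    ('B', '2013-07-02', 0.3),
    ('A', '2013-07-03', 0.4);
""")

# Anti-join: keep a row only if no row with the same time has a higher provider.
anti_join = """
SELECT dt1.id, dt1.provider, dt1.time, dt1.value
FROM data_teste dt1
LEFT JOIN data_teste dt2
       ON dt2.time = dt1.time AND dt2.provider > dt1.provider
WHERE dt2.id IS NULL
ORDER BY dt1.time
"""

# Group-wise maximum: find the highest provider per time, then join back.
group_max = """
SELECT d.id, d.provider, d.time, d.value
FROM data_teste d
JOIN (SELECT time, MAX(provider) AS mp FROM data_teste GROUP BY time) x
  ON x.mp = d.provider AND x.time = d.time
ORDER BY d.time
"""

anti_rows = conn.execute(anti_join).fetchall()
max_rows = conn.execute(group_max).fetchall()
print(anti_rows)
```

Both queries rely on the provider names sorting in precedence order (A < B < C), which is what the question describes.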
How can I calculate ((89-95)/95)*100 or ((95-100)/100)*100, i.e. the percentage change from the previous row?
CREATE TABLE `priceindex` (
`priceIndexId` int(11) NOT NULL,
`Date` date DEFAULT NULL,
`Price` decimal(31,9) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
priceIndexId | Date | Price | ((current value - previous value) / previous value) * 100
1 | 2017-11-30 | 100 |
2 | 2017-12-06 | 95 | answer should be ((95-100)/100)*100 = -5.00
3 | 2017-12-13 | 89 | answer should be ((89-95)/95)*100 = -6.32
Thanks in advance
You need to get the previous value. One method uses a correlated subquery. I would suggest using a subquery for the calculation:
select pi.priceIndexId, pi.Date, pi.Price,
(pi.Price - pi.prev_price) / pi.prev_price * 100 as pct_change
from (select pi.*,
(select pi2.price
from priceindex pi2
where pi2.date < pi.date
order by pi2.date desc
limit 1
) as prev_price
from priceindex pi
) pi;
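As a quick sanity check of the correlated-subquery approach, here is a small sketch against the three sample rows. It uses Python's sqlite3 module instead of MySQL (an assumption for the sake of a self-contained example), and the aliases cur, prev and pct_change are names chosen here, not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE priceindex (priceIndexId INTEGER NOT NULL, Date TEXT, Price REAL);
INSERT INTO priceindex VALUES
    (1, '2017-11-30', 100),
    (2, '2017-12-06', 95),
    (3, '2017-12-13', 89);
""")

# For each row, look up the price of the closest earlier date,
# then express the difference as a percentage of that previous price.
query = """
SELECT p.priceIndexId, p.Date, p.Price,
       (p.Price - p.prev_price) / p.prev_price * 100 AS pct_change
FROM (SELECT cur.*,
             (SELECT prev.Price
              FROM priceindex prev
              WHERE prev.Date < cur.Date
              ORDER BY prev.Date DESC
              LIMIT 1) AS prev_price
      FROM priceindex cur) p
ORDER BY p.Date
"""

# The first row has no previous price, so its change is NULL (None in Python).
pct_changes = [
    (row[0], None if row[3] is None else round(row[3], 2))
    for row in conn.execute(query)
]
print(pct_changes)
```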
We have two tables with a mostly unique email, and a date where a transaction was sent (from one system) and received (in another system):
CREATE TABLE `alpha` (
`id` int(11) NOT NULL,
`email` varchar(255) NOT NULL,
`date_sent` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `alpha`
VALUES
(12344,'loremipsum@example.com','2013-01-01 02:26:04'),
(12345,'foobar@example.com','2013-01-01 04:39:16'),
(12346,'foobar@example.com','2013-01-01 04:43:18');
CREATE TABLE `bravo` (
`id` int(11) NOT NULL,
`email` varchar(60) DEFAULT NULL,
`date_recvd` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `bravo`
VALUES
(98764,'loremipsum@example.com','2013-01-01 03:29:12'),
(98765,'foobar@example.com','2013-01-01 05:42:08'),
(98766,'foobar@example.com','2013-01-01 05:46:08');
With a simple join on email and m/d/y of the date:
select a.id, a.date_sent, b.id, b.date_recvd
from alpha a inner join bravo b
on a.email = b.email and date_format(a.date_sent,'%m/%d/%Y') = date_format(b.date_recvd,'%m/%d/%Y')
We get every permutation of email+date:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98765 | 2013-01-01 05:42:08 |
| 12345 | 2013-01-01 04:39:16 | 98766 | 2013-01-01 05:46:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
What we want is something more like this, where we join first on the email and then pair up the dates in order, so that they line up:
| a.id | a.date_sent | b.id | b.date_recvd |
+-------+---------------------+-------+---------------------+
| 12344 | 2013-01-01 02:26:04 | 98764 | 2013-01-01 03:29:12 |
| 12345 | 2013-01-01 04:39:16 | 98765 | 2013-01-01 05:42:08 |
| 12346 | 2013-01-01 04:43:18 | 98766 | 2013-01-01 05:46:08 |
But I'm not even certain how to approach this?
Clarification: What we'd like to do is, emails being equal, eliminate the duplicates so that the date gaps are smallest.
Under certain conditions the following query will provide the results you want:
SELECT an.*, bn.*
FROM
(SELECT a.*,
(CASE a.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=a.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM alpha ORDER BY email, date_sent) a) an
JOIN
(SELECT b.*,
(CASE b.email
WHEN @curEmail THEN @i:=@i+1
ELSE @i:=1 AND @curEmail:=b.email
END) AS rn
FROM (SELECT @i:=0, @curEmail:='') foo, (SELECT * FROM bravo ORDER BY email, date_recvd) b) bn
ON an.email=bn.email AND an.rn=bn.rn;
With the limited data you provided, this works. You can see it here: SQLFiddle
What this is doing is:
Adding an rn column to alpha: a row number within all rows with the same email, sorted by date_sent
Adding an rn column to bravo: same as above, sorted by date_recvd
JOINing the two result sets on email and rn
This will work ONLY if alpha and bravo contain good data that matches well.
The conditions are quite strict, especially on the bravo table. In particular, bravo should not contain any early rows, i.e. rows that match an email in alpha but have a date_recvd earlier than the first alpha date_sent (with the same email).
You could elaborate on this and work out a more complex version that works on email, date (day only) and row number, as you suggested in your question. But I don't think this is a good solution. I see you have significant gaps between date_sent and date_recvd. If a gap rolls over midnight, you will not be able to match rows correctly.
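The same pairing idea can be sketched with the ROW_NUMBER() window function instead of user variables. This is a swapped-in technique, not what the answer above uses: it assumes an engine with window functions (MySQL 8.0 or later; the demo below runs on SQLite 3.25 or later through Python's sqlite3 module). The same caveats about well-matched data apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alpha (id INTEGER PRIMARY KEY, email TEXT NOT NULL, date_sent TEXT NOT NULL);
CREATE TABLE bravo (id INTEGER PRIMARY KEY, email TEXT, date_recvd TEXT);
INSERT INTO alpha VALUES
    (12344, 'loremipsum@example.com', '2013-01-01 02:26:04'),
    (12345, 'foobar@example.com',     '2013-01-01 04:39:16'),
    (12346, 'foobar@example.com',     '2013-01-01 04:43:18');
INSERT INTO bravo VALUES
    (98764, 'loremipsum@example.com', '2013-01-01 03:29:12'),
    (98765, 'foobar@example.com',     '2013-01-01 05:42:08'),
    (98766, 'foobar@example.com',     '2013-01-01 05:46:08');
""")

# Number the rows per email in date order on both sides,
# then pair the n-th sent row with the n-th received row.
query = """
SELECT an.id, an.date_sent, bn.id, bn.date_recvd
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY email ORDER BY date_sent) AS rn
      FROM alpha a) an
JOIN (SELECT b.*,
             ROW_NUMBER() OVER (PARTITION BY email ORDER BY date_recvd) AS rn
      FROM bravo b) bn
  ON an.email = bn.email AND an.rn = bn.rn
ORDER BY an.id
"""

pairs = [(row[0], row[2]) for row in conn.execute(query)]
print(pairs)
```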
Let's assume that the following tables in MySQL describe documents contained in folders.
mysql> select * from folder;
+----+----------------+
| ID | PATH |
+----+----------------+
| 1 | matches/1 |
| 2 | matches/2 |
| 3 | shared/3 |
| 4 | no/match/4 |
| 5 | unreferenced/5 |
+----+----------------+
mysql> select * from DOC;
+----+------+------------+
| ID | F_ID | DATE |
+----+------+------------+
| 1 | 1 | 2000-01-01 |
| 2 | 2 | 2000-01-02 |
| 3 | 2 | 2000-01-03 |
| 4 | 3 | 2000-01-04 |
| 5 | 3 | 2000-01-05 |
| 6 | 3 | 2000-01-06 |
| 7 | 4 | 2000-01-07 |
| 8 | 4 | 2000-01-08 |
| 9 | 4 | 2000-01-09 |
| 10 | 4 | 2000-01-10 |
+----+------+------------+
The columns ID are the primary keys and the column F_ID of table DOC is a not-null foreign key that references the primary key of table FOLDER. By using the 'DATE' of documents in the where clause, I would like to find which folders contain only the selected documents. For documents earlier than 2000-01-05, this could be written as:
SELECT DISTINCT d1.F_ID
FROM DOC d1
WHERE d1.DATE < '2000-01-05'
AND d1.F_ID NOT IN (
SELECT d2.F_ID
FROM DOC d2 WHERE NOT (d2.DATE < '2000-01-05')
);
and it correctly returns '1' and '2'. According to
http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html
performance on big tables could be improved by replacing the subquery with a join. I have already found questions related to NOT IN and JOINs, but not exactly what I was looking for. So, any ideas on how this could be written with joins?
The general answer is:
select t.*
from t
where t.id not in (select id from s)
Can be rewritten as:
select t.*
from t left outer join
(select distinct id from s) s
on t.id = s.id
where s.id is null
I think you can apply this to your situation.
select distinct d1.F_ID
from DOC d1
left outer join (
select F_ID
from DOC
where date >= '2000-01-05'
) d2 on d1.F_ID = d2.F_ID
where d1.date < '2000-01-05'
and d2.F_ID is null
If I understand your question correctly, you want to find the F_IDs of folders that contain only documents from before '2000-01-05'. In that case, simply:
SELECT F_ID
FROM DOC
GROUP BY F_ID
HAVING MAX(DATE) < '2000-01-05'
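To compare the three formulations (NOT IN, LEFT JOIN ... IS NULL, and GROUP BY ... HAVING), here is a small sketch that runs all of them on the sample DOC rows. It uses Python's sqlite3 module as a stand-in for MySQL, so it only demonstrates that the results agree, not which one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC (ID INTEGER PRIMARY KEY, F_ID INTEGER NOT NULL, DATE TEXT NOT NULL);
INSERT INTO DOC VALUES
    (1, 1, '2000-01-01'), (2, 2, '2000-01-02'), (3, 2, '2000-01-03'),
    (4, 3, '2000-01-04'), (5, 3, '2000-01-05'), (6, 3, '2000-01-06'),
    (7, 4, '2000-01-07'), (8, 4, '2000-01-08'), (9, 4, '2000-01-09'),
    (10, 4, '2000-01-10');
""")

queries = {
    # Original formulation with a NOT IN subquery.
    "not_in": """
        SELECT DISTINCT d1.F_ID FROM DOC d1
        WHERE d1.DATE < :cutoff
          AND d1.F_ID NOT IN (SELECT d2.F_ID FROM DOC d2 WHERE d2.DATE >= :cutoff)
        ORDER BY d1.F_ID""",
    # Anti-join: keep folders with no matching "late" document.
    "left_join": """
        SELECT DISTINCT d1.F_ID FROM DOC d1
        LEFT JOIN (SELECT DISTINCT F_ID FROM DOC WHERE DATE >= :cutoff) d2
               ON d1.F_ID = d2.F_ID
        WHERE d1.DATE < :cutoff AND d2.F_ID IS NULL
        ORDER BY d1.F_ID""",
    # Aggregate: a folder qualifies when its newest document is before the cutoff.
    "having": """
        SELECT F_ID FROM DOC
        GROUP BY F_ID
        HAVING MAX(DATE) < :cutoff
        ORDER BY F_ID""",
}

results = {
    name: [row[0] for row in conn.execute(sql, {"cutoff": "2000-01-05"})]
    for name, sql in queries.items()
}
print(results)
```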
Sample Table and Insert Statements
CREATE TABLE `tleft` (
`id` int(2) NOT NULL,
`name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `tright` (
`id` int(2) NOT NULL,
`t_left_id` int(2) DEFAULT NULL,
`description` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `tleft` (`id`, `name`)
VALUES
(1, 'henry'),
(2, 'steve'),
(3, 'jeff'),
(4, 'richards'),
(5, 'elon');
INSERT INTO `tright` (`id`, `t_left_id`, `description`)
VALUES
(1, 1, 'sample'),
(2, 2, 'sample');
Left Join : SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id ;
Returns Id : 1, 2, 3, 4, 5
Right Join : SELECT l.id,l.name FROM tleft l RIGHT JOIN tright r ON l.id = r.t_left_id ;
Returns Id : 1,2
Subquery Not in tright : select id from tleft where id not in ( select t_left_id from tright);
Returns Id : 3,4,5
Equivalent Join For above subquery :
SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id WHERE r.t_left_id IS NULL;
A condition added with AND in the ON clause is applied during the JOIN, whereas a condition in the WHERE clause is applied after the JOIN.
Example : SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id AND r.description ='hello' WHERE r.t_left_id IS NULL ;
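The difference between filtering in the ON clause and in the WHERE clause is easiest to see by running both variants on the sample tables. A minimal sketch, using Python's sqlite3 module as a stand-in for MySQL (the helper name ids is made up for this example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tleft (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tright (id INTEGER PRIMARY KEY, t_left_id INTEGER, description TEXT);
INSERT INTO tleft VALUES (1, 'henry'), (2, 'steve'), (3, 'jeff'), (4, 'richards'), (5, 'elon');
INSERT INTO tright VALUES (1, 1, 'sample'), (2, 2, 'sample');
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# Plain anti-join: left rows without any match in tright.
anti_join = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id
    WHERE r.t_left_id IS NULL ORDER BY l.id""")

# Filter inside ON: no right row has description 'hello', so every left row
# stays unmatched and survives the IS NULL test.
filter_in_on = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id AND r.description = 'hello'
    WHERE r.t_left_id IS NULL ORDER BY l.id""")

# Filter inside WHERE: applied after the join, it discards every row.
filter_in_where = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id
    WHERE r.description = 'hello' ORDER BY l.id""")

print(anti_join, filter_in_on, filter_in_where)
```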
Hope this helps
I got this table "A":
| id | date |
===================
| 1 | 2010-01-13 |
| 2 | 2011-04-19 |
| 3 | 2011-05-07 |
| .. | ... |
and this table "B":
| date | value |
======================
| 2009-03-29 | 0.5 |
| 2010-01-30 | 0.55 |
| 2011-08-12 | 0.67 |
Now I am looking for a way to JOIN those two tables so that the "value" column in "B" is mapped to the dates in "A". The tricky part for me is that table "B" only stores the change date and the new value. So when I need the value for a date in table "A", the SQL has to look back and find the closest date in "B" that is not after the date it is asking the value for.
So in the end the JOIN of those tables should look like this:
| id | date | value |
===========================
| 1 | 2010-01-13 | 0.5 |
| 2 | 2011-04-19 | 0.55 |
| 3 | 2011-05-07 | 0.55 |
| .. | ... | ... |
How can I do this?
-- Create and fill first table
CREATE TABLE `id_date` (
`id` int(11) NOT NULL auto_increment,
`iddate` date NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `id_date` VALUES(1, '2010-01-13');
INSERT INTO `id_date` VALUES(2, '2011-04-19');
INSERT INTO `id_date` VALUES(3, '2011-05-07');
-- Create and fill second table
CREATE TABLE `date_val` (
`mydate` date NOT NULL,
`myval` varchar(4) collate utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `date_val` VALUES('2009-03-29', '0.5');
INSERT INTO `date_val` VALUES('2010-01-30', '0.55');
INSERT INTO `date_val` VALUES('2011-08-12', '0.67');
-- Get the result table as asked in question
SELECT iddate, t2.mydate, t2.myval
FROM `id_date` t1
JOIN date_val t2 ON t2.mydate <= t1.iddate
AND t2.mydate = (
SELECT MAX( t3.mydate )
FROM `date_val` t3
WHERE t3.mydate <= t1.iddate )
What we're doing:
for each date in the id_date table (your table A),
we find the date in the date_val table (your table B)
which is the highest date in the date_val table that is still no later than id_date.iddate
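The lookup described above can be tried out with a small sketch on the same sample rows. It runs on Python's sqlite3 module rather than MySQL (an assumption for a self-contained demo), drops the redundant first join condition, and adds the id column to the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE id_date (id INTEGER PRIMARY KEY, iddate TEXT NOT NULL);
CREATE TABLE date_val (mydate TEXT NOT NULL, myval TEXT NOT NULL);
INSERT INTO id_date VALUES (1, '2010-01-13'), (2, '2011-04-19'), (3, '2011-05-07');
INSERT INTO date_val VALUES
    ('2009-03-29', '0.5'), ('2010-01-30', '0.55'), ('2011-08-12', '0.67');
""")

# For each id_date row, join the date_val row whose mydate is the latest
# one that is not after iddate.
query = """
SELECT t1.id, t1.iddate, t2.myval
FROM id_date t1
JOIN date_val t2
  ON t2.mydate = (SELECT MAX(t3.mydate)
                  FROM date_val t3
                  WHERE t3.mydate <= t1.iddate)
ORDER BY t1.id
"""

mapped = conn.execute(query).fetchall()
print(mapped)
```

Because this is an inner join, a date in id_date that is earlier than every date in date_val would simply drop out of the result.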
You could use a subquery with limit 1 to look up the latest value in table B:
select id
, date
, (
select value
from B
where B.date < A.date
order by
B.date desc
limit 1
) as value
from A
I have been inspired by the other answers but ended with my own solution using common table expressions:
WITH datecombination (id, adate, bdate) AS
(
SELECT id, A.date, MAX(B.Date) as Bdate
FROM tableA A
LEFT JOIN tableB B
ON B.date <= A.date
GROUP BY A.id, A.date
)
SELECT DC.id, DC.adate, B.value FROM datecombination DC
LEFT JOIN tableB B
ON DC.bdate = B.date
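An INNER JOIN returns rows only when there is a match in both tables. Be aware that the following joins on exactly equal dates, so with the sample data (where no date in A appears in B) it returns no rows: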
Select A.id,A.date,b.value
from A inner join B
on A.date=b.date
I have 3 tables with the following schema:
CREATE TABLE `devices` (
`device_id` int(11) NOT NULL auto_increment,
`name` varchar(20) default NULL,
`appliance_id` int(11) default '0',
`sensor_type` int(11) default '0',
`display_name` VARCHAR(100),
PRIMARY KEY USING BTREE (`device_id`)
);
CREATE TABLE `channels` (
`channel_id` int(11) NOT NULL AUTO_INCREMENT,
`device_id` int(11) NOT NULL,
`channel` varchar(10) NOT NULL,
PRIMARY KEY (`channel_id`),
KEY `device_id_idx` (`device_id`)
);
CREATE TABLE `historical_data` (
`date_time` datetime NOT NULL,
`channel_id` int(11) NOT NULL,
`data` float DEFAULT NULL,
`unit` varchar(10) DEFAULT NULL,
KEY `devices_datetime_idx` (`date_time`) USING BTREE,
KEY `channel_id_idx` (`channel_id`)
);
The setup is that a device can have one or more channels and each channel has many (historical) data.
I use the following query to get the last historical data for one device and all its related channels:
SELECT c.channel_id, c.channel, max(h.date_time), h.data
FROM devices d
INNER JOIN channels c ON c.device_id = d.device_id
INNER JOIN historical_data h ON h.channel_id = c.channel_id
WHERE d.name = 'livingroom' AND d.appliance_id = '0'
AND d.sensor_type = 1 AND ( c.channel = 'ch1')
GROUP BY c.channel
ORDER BY h.date_time, channel
The query plan looks as follows:
+----+-------------+-------+--------+-----------------------+----------------+---------+---------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------+----------------+---------+---------------------------+--------+-------------+
| 1 | SIMPLE | c | ALL | PRIMARY,device_id_idx | NULL | NULL | NULL | 34 | Using where |
| 1 | SIMPLE | d | eq_ref | PRIMARY | PRIMARY | 4 | c.device_id | 1 | Using where |
| 1 | SIMPLE | h | ref | channel_id_idx | channel_id_idx | 4 | c.channel_id | 322019 | |
+----+-------------+-------+--------+-----------------------+----------------+---------+---------------------------+--------+-------------+
3 rows in set (0.00 sec)
The above query is currently taking approximately 15 secs and I wanted to know if there are any tips or way to improve the query?
Edit:
Example data from historical_data
+---------------------+------------+------+------+
| date_time | channel_id | data | unit |
+---------------------+------------+------+------+
| 2011-11-20 21:30:57 | 34 | 23.5 | C |
| 2011-11-20 21:30:57 | 9 | 68 | W |
| 2011-11-20 21:30:54 | 34 | 23.5 | C |
| 2011-11-20 21:30:54 | 5 | 316 | W |
| 2011-11-20 21:30:53 | 34 | 23.5 | C |
| 2011-11-20 21:30:53 | 2 | 34 | W |
| 2011-11-20 21:30:51 | 34 | 23.4 | C |
| 2011-11-20 21:30:51 | 9 | 68 | W |
| 2011-11-20 21:30:49 | 34 | 23.4 | C |
| 2011-11-20 21:30:49 | 4 | 193 | W |
+---------------------+------------+------+------+
10 rows in set (0.00 sec)
Edit 2:
Multiple channel SELECT example:
SELECT c.channel_id, c.channel, max(h.date_time), h.data
FROM devices d
INNER JOIN channels c ON c.device_id = d.device_id
INNER JOIN historical_data h ON h.channel_id = c.channel_id
WHERE d.name = 'livingroom' AND d.appliance_id = '0'
AND d.sensor_type = 1 AND ( c.channel = 'ch1' OR c.channel = 'ch2' OR c.channel = 'ch3')
GROUP BY c.channel
ORDER BY h.date_time, channel
I've used OR in the c.channel where clause because it was easier to generate programmatically, but it can be changed to use IN if necessary.
Edit 3:
Example result of what I'm trying to achieve:
+-----------+------------+---------+---------------------+-------+
| device_id | channel_id | channel | max(h.date_time) | data |
+-----------+------------+---------+---------------------+-------+
| 28 | 9 | ch1 | 2011-11-21 20:39:36 | 0 |
| 28 | 35 | ch2 | 2011-11-21 20:30:55 | 32767 |
+-----------+------------+---------+---------------------+-------+
I have added the device_id to the example but my select will only need to return channel_id, channel, last date_time i.e max and the data. The results should be the last record from the historical_data table for each channel for one device.
It seems that dropping and re-creating the index on date_time sped up my original SQL to around 2 secs.
I haven't been able to test this, so I'd like to ask you to run it and let us know what happens: whether it gives you the desired result and whether it runs faster than your current query:
CREATE DEFINER=`root`@`localhost` PROCEDURE `GetLatestHistoricalData_EXAMPLE`
(
IN param_device_name VARCHAR(20)
, IN param_appliance_id INT
, IN param_sensor_type INT
, IN param_channel VARCHAR(10)
)
BEGIN
SELECT
h.date_time, h.data
FROM
historical_data h
INNER JOIN
(
SELECT c.channel_id
FROM devices d
INNER JOIN channels c ON c.device_id = d.device_id
WHERE
d.name = param_device_name
AND d.appliance_id = param_appliance_id
AND d.sensor_type = param_sensor_type
AND c.channel = param_channel
)
c ON h.channel_id = c.channel_id
ORDER BY h.date_time DESC
LIMIT 1;
END
Then to run a test:
CALL GetLatestHistoricalData_EXAMPLE ('livingroom', 0, 1, 'ch1');
I tried working it into a stored procedure so that even if you get the desired results using this for one device, you can try it with another device and see the results... Thanks!
[edit]: In response to Danny's comment, here's an updated test version:
CREATE DEFINER=`root`@`localhost` PROCEDURE `GetLatestHistoricalData_EXAMPLE_3Channel`
(
IN param_device_name VARCHAR(20)
, IN param_appliance_id INT
, IN param_sensor_type INT
, IN param_channel_1 VARCHAR(10)
, IN param_channel_2 VARCHAR(10)
, IN param_channel_3 VARCHAR(10)
)
BEGIN
SELECT
h.date_time, h.data
FROM
historical_data h
INNER JOIN
(
SELECT c.channel_id
FROM devices d
INNER JOIN channels c ON c.device_id = d.device_id
WHERE
d.name = param_device_name
AND d.appliance_id = param_appliance_id
AND d.sensor_type = param_sensor_type
AND c.channel IN (param_channel_1
,param_channel_2
,param_channel_3
)
)
c ON h.channel_id = c.channel_id
ORDER BY h.date_time DESC
LIMIT 1;
END
Then to run a test:
CALL GetLatestHistoricalData_EXAMPLE_3Channel ('livingroom', 0, 1, 'ch1', 'ch2' , 'ch3');
Again, this is just for testing, so you'll be able to see if it meets your needs..
I would first add an index on the devices table on ( appliance_id, sensor_type, name ) to match your query. I don't know how many entries are in this table, but if it is large, with many elements per device, this lets the engine get straight to the matching rows.
Second, on your channels table, add an index on ( device_id, channel ).
Third, on your historical_data table, add an index on ( channel_id, date_time ).
Then:
SELECT STRAIGHT_JOIN
PreQuery.MostRecent,
PreQuery.Channel_ID,
PreQuery.Channel,
H2.Data,
H2.Unit
from
( select
c.channel_id,
c.channel,
max( h.date_time ) as MostRecent
from
devices d
join channels c
on d.device_id = c.device_id
and c.channel in ( 'ch1', 'ch2', 'ch3' )
join historical_data h
on h.channel_id = c.Channel_id
where
d.appliance_id = 0
and d.sensor_type = 1
and d.name = 'livingroom'
group by
c.channel_id ) PreQuery
JOIN Historical_Data H2
on PreQuery.Channel_ID = H2.Channel_ID
AND PreQuery.MostRecent = H2.Date_Time
order by
PreQuery.MostRecent,
PreQuery.Channel
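Here is a minimal sketch of the indexes plus the "latest row per channel" query on a tiny data set. It runs on Python's sqlite3 module rather than MySQL, so STRAIGHT_JOIN is left out; the index names and the sample rows are made up for illustration (the two latest readings mirror the example result in the question), and it shows correctness only, not the timing on the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE devices (device_id INTEGER PRIMARY KEY, name TEXT,
                      appliance_id INTEGER DEFAULT 0, sensor_type INTEGER DEFAULT 0);
CREATE TABLE channels (channel_id INTEGER PRIMARY KEY, device_id INTEGER NOT NULL,
                       channel TEXT NOT NULL);
CREATE TABLE historical_data (date_time TEXT NOT NULL, channel_id INTEGER NOT NULL,
                              data REAL, unit TEXT);

-- Composite indexes as suggested above (index names are hypothetical).
CREATE INDEX devices_lookup_idx ON devices (appliance_id, sensor_type, name);
CREATE INDEX channels_device_channel_idx ON channels (device_id, channel);
CREATE INDEX history_channel_time_idx ON historical_data (channel_id, date_time);

INSERT INTO devices VALUES (28, 'livingroom', 0, 1), (29, 'kitchen', 0, 1);
INSERT INTO channels VALUES (9, 28, 'ch1'), (35, 28, 'ch2'), (40, 29, 'ch1');
INSERT INTO historical_data VALUES
    ('2011-11-21 20:39:30', 9, 5, 'W'),
    ('2011-11-21 20:39:36', 9, 0, 'W'),
    ('2011-11-21 20:30:50', 35, 100, 'W'),
    ('2011-11-21 20:30:55', 35, 32767, 'W'),
    ('2011-11-21 20:45:00', 40, 7, 'W');
""")

# Inner query: latest date_time per channel of the selected device.
# Outer join: fetch the data recorded at exactly that moment.
query = """
SELECT pre.channel_id, pre.channel, pre.most_recent, h2.data
FROM (SELECT c.channel_id, c.channel, MAX(h.date_time) AS most_recent
      FROM devices d
      JOIN channels c ON c.device_id = d.device_id
                     AND c.channel IN ('ch1', 'ch2', 'ch3')
      JOIN historical_data h ON h.channel_id = c.channel_id
      WHERE d.appliance_id = 0 AND d.sensor_type = 1 AND d.name = 'livingroom'
      GROUP BY c.channel_id, c.channel) pre
JOIN historical_data h2
  ON h2.channel_id = pre.channel_id AND h2.date_time = pre.most_recent
ORDER BY pre.channel
"""

latest = conn.execute(query).fetchall()
print(latest)
```

If a channel can have two rows with the same date_time, the outer join returns both, so a tie-breaker would be needed in that case.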