Using JOIN with DISTINCT and prioritize one table - mysql

I am trying to combine data from 2 tables.
Both tables contain data from the same sensor (let's say a sensor that measures CO2, with 1 entry per 10 minutes).
The first table contains validated data. Let's call it station1_validated. The 2nd table contains raw data. Let's call this one station1_nrt.
While the raw-data table contains live data, the validated table contains only data points that are at least 1 month old. (Validating the data and checking it manually afterwards takes some time, and this happens only once a month.)
What I am trying to do now is combine the data of those 2 tables to display live data on a website. However, when validated data is available, the validated data point should take priority over the raw data point.
The relevant columns for this are:
timed [bigint(20)]: Contains the datetime as a unix timestamp in milliseconds from 1.1.1970
CO2 [double]: Contains the measured concentration of CO2 in ppm (parts per million)
I wrote this basic SQL:
SELECT
*
FROM
(SELECT
timed, CO2, '2' tab
FROM
station1_nrt
WHERE
TIMED >= 1386932400000
AND TIMED <= 1386939600000
AND TIMED NOT IN (SELECT
timed
FROM
station1_nrt
WHERE
CO2 IS NOT NULL
AND TIMED >= 1386932400000
AND TIMED <= 1386939600000) UNION SELECT
timed, CO2, '1' tab
FROM
station1_validated
WHERE
CO2 IS NOT NULL
AND TIMED >= 1386932400000
AND TIMED <= 1386939600000) a
ORDER BY timed
This does not work correctly as it selects only those data points where both tables have an entry.
I would now like to do this with a JOIN, as it should be much faster. However, I don't know how to do a JOIN with a DISTINCT (or something similar) that prioritizes one table. Could someone help me out with this (or explain it)?

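You haven't mentioned whether station1_validated contains records that don't exist in station1_nrt, so I use a FULL JOIN. If all rows from station1_validated also exist in station1_nrt, you can use a LEFT JOIN instead. Note that MySQL has no FULL JOIN syntax of its own, so there it has to be emulated, for example with a UNION of a LEFT JOIN and a RIGHT JOIN.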
Something like this
SELECT IFNULL(n.timed,v.timed) as timed,
CASE WHEN v.timed IS NOT NULL THEN v.CO2 ELSE n.CO2 END as CO2,
CASE WHEN v.timed IS NOT NULL THEN '1' ELSE '2' END as tab
FROM station1_nrt as n
FULL JOIN station1_validated as v ON n.timed=v.timed AND v.CO2 IS NOT NULL
WHERE
( n.TIMED between 1386932400000 AND 1386939600000
or
v.TIMED between 1386932400000 AND 1386939600000
)
AND
(n.CO2 IS NOT NULL OR v.CO2 IS NOT NULL)

MySQL has an IF function that would probably work for you. You would have to select specific columns, though, but you could build the query programmatically.
SELECT
-- TIMED is in milliseconds, FROM_UNIXTIME expects seconds
-- Rows older than one month use the validated value, newer rows the raw value
IF(FROM_UNIXTIME(nrt.TIMED / 1000) < DATE_SUB(NOW(), INTERVAL 1 MONTH),
val.value,
nrt.value
) AS value
-- Similar for other values
FROM
station1_nrt AS nrt
-- LEFT JOIN keeps recent raw rows that have no validated counterpart yet
LEFT JOIN station1_validated AS val USING(id)
ORDER BY nrt.TIMED
Note that the USING(id) is a placeholder. Presumably there is some indexed column you can join the two tables on.

You can join and then use IFNULL in the select list to choose the validated values if they exist. Something like:
SELECT
IFNULL(s1val.timed, s1.timed) AS timed,
IFNULL(s1val.CO2, s1.CO2) AS CO2,
-- '1' = value came from the validated table, '2' = raw value
IF(s1val.CO2 IS NULL, '2', '1') AS tab
FROM
station1_nrt s1
LEFT JOIN station1_validated s1val ON (s1.TIMED = s1val.TIMED)
WHERE
-- Any necessary where clauses

@Jim, @valex, @ExplosionPills
I managed to write a SQL select that emulates a FULL OUTER JOIN (as there is no FULL JOIN in MySQL) and returns the value of the validated data if it exists. If no validated data is available, it returns the raw value.
So this is the SQL I am using now:
SET @StartTime = 1356998400000;
SET @EndTime = 1386546000000;
SELECT
timed,
IFNULL(mergedData.validatedValue, mergedData.rawValue) as value
FROM
((SELECT
from_unixtime(timed / 1000) as timed,
rawData.NOX as rawValue,
validatedData.NOX as validatedValue
FROM
nabelnrt_bas as rawData
LEFT JOIN nabelvalidated_bas as validatedData using(timed)
WHERE
(rawData.timed > @StartTime
AND rawData.timed < @EndTime)
OR (validatedData.timed > @StartTime
AND validatedData.timed < @EndTime)
) UNION (
SELECT
from_unixtime(timed / 1000) as timed,
rawData.NOX as rawValue,
validatedData.NOX as validatedValue
FROM
nabelnrt_bas as rawData
RIGHT JOIN nabelvalidated_bas as validatedData using(timed)
WHERE
(rawData.timed > @StartTime
AND rawData.timed < @EndTime)
OR (validatedData.timed > @StartTime
AND validatedData.timed < @EndTime)
)
ORDER BY timed DESC) as mergedData
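The "prefer validated, fall back to raw" merge can be tried out on a few rows. The following is only a minimal sketch, not the poster's query: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the sample rows are invented, and the full outer join is emulated with a LEFT JOIN plus an anti-join (validated rows that have no raw counterpart) instead of a RIGHT JOIN. The SQL itself should carry over to MySQL largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station1_nrt       (timed INTEGER PRIMARY KEY, CO2 REAL);
    CREATE TABLE station1_validated (timed INTEGER PRIMARY KEY, CO2 REAL);
    -- Invented sample rows: 1000 is in both tables, 2000 only raw, 3000 only validated.
    INSERT INTO station1_nrt       VALUES (1000, 400.0), (2000, 410.0);
    INSERT INTO station1_validated VALUES (1000, 401.5), (3000, 420.0);
""")

query = """
    -- Every raw row, with the validated value taking priority when present.
    SELECT n.timed AS timed,
           COALESCE(v.CO2, n.CO2) AS CO2,
           CASE WHEN v.CO2 IS NOT NULL THEN '1' ELSE '2' END AS tab
    FROM station1_nrt AS n
    LEFT JOIN station1_validated AS v ON v.timed = n.timed
    WHERE n.timed BETWEEN :t_start AND :t_end
    UNION ALL
    -- Validated rows that have no raw counterpart at all.
    SELECT v.timed, v.CO2, '1'
    FROM station1_validated AS v
    LEFT JOIN station1_nrt AS n ON n.timed = v.timed
    WHERE n.timed IS NULL
      AND v.timed BETWEEN :t_start AND :t_end
    ORDER BY timed
"""
rows = conn.execute(query, {"t_start": 0, "t_end": 5000}).fetchall()
for row in rows:
    print(row)
```

Because the two branches cannot return the same timestamp, UNION ALL is enough here and avoids the duplicate-elimination step of a plain UNION.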

Related

Generating complex sql tables

I currently have an employee logging sql table that has 3 columns
fromState: String,
toState: String,
timestamp: DateTime
fromState is either In or Out. In means employee came in and Out means employee went out. Each row can only transition from In to Out or Out to In.
I'd like to generate a temporary table in SQL that tracks, hour by hour, how many employees are in the company. That is, the resulting table has the columns HourBucket and NumEmployees.
In non-SQL code I can do this by initializing numEmployees to 0, going through the table row by row (sorted by timestamp), and adding (employee came in) or subtracting (employee went out) from numEmployees (bucketed by timestamp hour).
I'm clueless as to how to do this in SQL. Any clues?
Use a COUNT ... GROUP BY query. I can't tell from your description what you're using toState for, though. I'm also assuming you have an employeeID field.
E.g.
SELECT fromState AS 'Status', COUNT(*) AS 'Number'
FROM StaffinBuildingTable
INNER JOIN (SELECT employeeID AS 'empID', MAX(timestamp) AS 'latest' FROM StaffinBuildingTable GROUP BY employeeID) AS LastEntry
ON StaffinBuildingTable.employeeID = LastEntry.empID
AND StaffinBuildingTable.timestamp = LastEntry.latest
GROUP BY fromState
The LastEntry subquery will produce a list of employeeIDs limited to the last timestamp for each employee.
The INNER JOIN will limit the main table to just the employeeIDs that match both sides.
The outer GROUP BY produces the count.
SELECT HOUR(SBT.timestamp) AS 'Hour', SBT.fromState AS 'Status', COUNT(*) AS 'Number'
FROM StaffinBuildingTable AS SBT
INNER JOIN (
SELECT SBIJ.employeeID AS 'empID', MAX(SBIJ.timestamp) AS 'latest'
FROM StaffinBuildingTable AS SBIJ
WHERE DATE(SBIJ.timestamp) = CURDATE()
GROUP BY SBIJ.employeeID) AS LastEntry
ON SBT.employeeID = LastEntry.empID
AND SBT.timestamp = LastEntry.latest
GROUP BY SBT.fromState, HOUR(SBT.timestamp)
Replace CURDATE() with whatever date you are interested in.
Note this is non-optimal as it calculates the HOUR twice - once for the data and once for the group.
Again, you are using the INNER JOIN to limit the number of returned rows, this time to the last timestamp on a given day.
To me, your description of fromState and toState seems the wrong way round; I'd expect to do this based on toState. But assuming I'm wrong on that, the following should point you in the right direction:
First, I create a "Numbers" table containing 24 rows one for each hour of the day:
create table tblHours
(Number int);
insert into tblHours values
(0),(1),(2),(3),(4),(5),(6),(7),
(8),(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),(23);
Then for each date in your employee logging table, I create a row in another new table to contain your counts:
create table tblDailyHours
(
HourBucket datetime,
NumEmployees int
);
insert into tblDailyHours (HourBucket, NumEmployees)
select distinct
date_add(date(t.timeStamp), interval h.Number HOUR) as HourBucket,
0 as NumEmployees
from
tblEmployeeLogging t
CROSS JOIN tblHours h;
Then I update this table to contain all the relevant counts:
update tblDailyHours h
join
(select
h2.HourBucket,
sum(case when el.fromState = 'In' then 1 else -1 end) as cnt
from
tblDailyHours h2
join tblEmployeeLogging el on
h2.HourBucket >= el.timeStamp
group by h2.HourBucket
) cnt ON
h.HourBucket = cnt.HourBucket
set NumEmployees = cnt.cnt;
You can now retrieve the counts with
select *
from tblDailyHours
order by HourBucket;
The counts give the number on site at each of the times displayed. If you want the number present during the hour in question, we'd need to tweak this a little.
There is a working version of this code (using not very realistic data in the logging table) here: rextester.com/DYOR23344
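The running head count can also be reproduced without the intermediate UPDATE step. This is a small sketch under several assumptions: it runs on Python's built-in sqlite3 rather than MySQL, the log rows are invented, `hour_buckets` is a hypothetical pre-filled table standing in for tblDailyHours, and, as in the answer above, fromState = 'In' is read as an arrival.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblEmployeeLogging (fromState TEXT, toState TEXT, timeStamp TEXT);
    -- Invented log rows.
    INSERT INTO tblEmployeeLogging VALUES
        ('In',  'Out', '2024-01-01 08:15:00'),
        ('In',  'Out', '2024-01-01 08:45:00'),
        ('Out', 'In',  '2024-01-01 09:30:00'),
        ('In',  'Out', '2024-01-01 10:05:00');
    -- Hypothetical bucket table, one row per hour of interest.
    CREATE TABLE hour_buckets (HourBucket TEXT PRIMARY KEY);
    INSERT INTO hour_buckets VALUES
        ('2024-01-01 08:00:00'), ('2024-01-01 09:00:00'),
        ('2024-01-01 10:00:00'), ('2024-01-01 11:00:00');
""")

rows = conn.execute("""
    SELECT h.HourBucket,
           -- +1 per arrival, -1 per departure, over every event up to the bucket time.
           -- The simple CASE yields NULL for buckets with no events, hence COALESCE.
           COALESCE(SUM(CASE el.fromState WHEN 'In' THEN 1 WHEN 'Out' THEN -1 END), 0)
               AS NumEmployees
    FROM hour_buckets AS h
    LEFT JOIN tblEmployeeLogging AS el ON el.timeStamp <= h.HourBucket
    GROUP BY h.HourBucket
    ORDER BY h.HourBucket
""").fetchall()
for bucket, count in rows:
    print(bucket, count)
```

The LEFT JOIN keeps hours before the first event in the result with a count of 0; an inner join would drop them.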
Original Answer (Based on a single over all count)
If you're happy to search over all rows, and want the current "head count" you can use this:
select
sum(case when t.FromState = 'In' then 1 else -1 end) as Heads
from
MyTable t
But if you know that there will always be no-one there at midnight, you can add a where clause to prevent it looking at more rows than it needs to:
where
date(t.timestamp) = curdate()
Again, on the assumption that the head count reaches zero at midnight, you can generalise that method to get a headcount at any time as follows:
where
date(t.timestamp) = "CENSUS DATE" AND
t.timestamp <= "CENSUS DATETIME"
Obviously you'd need to replace my quoted strings with code which returned the date and datetime of interest. If the headcount doesn't return to zero at midnight, you can achieve the same by removing the first line of the where clause.

Average and count with conditions - mysql

table name is data.
Columns - 'date', 'location', 'fp', 'TV'
Under date I will have multiple different dates but each date has a number of rows with the same date. Same with location.
I am trying to work out the average of TV for every time the date and location are the same and fp = 1, and insert the result into a new column called avgdiff
So I might have a number of rows with the date 2016-12-08 and location LA, with different numbers under fp and TV. So when the date is 2016-12-08 and location is LA, fp might equal 1, 4 times, and TV for those 4 rows might be 7.4, 8.2, 1, -2. So the avg will be 3.65.
I think I need to use avg and count functions with conditions but I am having a lot of trouble with this. I hope this makes sense.
Thanks
You can query for the average using a GROUP BY:
SELECT `date`, `location`, AVG(`TV`) AS `avgtv`
FROM `data`
WHERE `fp` = 1
GROUP BY `date`, `location`
To update another table with your computed averages (which I strongly recommend against), you can use an UPDATE...JOIN with the above as a subquery:
UPDATE ratings r
JOIN ( /* paste above query here */ ) t
ON t.date = r.date AND t.location = r.location
SET r.avgtv = t.avgtv
If, for any reason, you cannot avoid storing aggregated data in the same table (thereby introducing redundancy and possibly incorrect/not up to date values), do an update statement of the following form:
update data,
(select t2.location, t2.date, avg(t2.TV) as avgTV2
from data t2
where t2.fp = 1
group by t2.location, t2.date) aggValues
set avgTV = avgTV2
where data.location = aggValues.location
and data.date = aggValues.date
and data.fp = 1
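The grouped average can be checked against the numbers given in the question. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL; the four LA rows are the question's example values, while the fp = 2 row and the NY row are invented to show that the filter and the grouping take effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (date TEXT, location TEXT, fp INTEGER, TV REAL);
    INSERT INTO data VALUES
        ('2016-12-08', 'LA', 1,  7.4),
        ('2016-12-08', 'LA', 1,  8.2),
        ('2016-12-08', 'LA', 1,  1.0),
        ('2016-12-08', 'LA', 1, -2.0),
        ('2016-12-08', 'LA', 2, 99.0),  -- invented row with fp <> 1, must be ignored
        ('2016-12-08', 'NY', 1,  5.0);  -- invented row for a second location
""")

rows = conn.execute("""
    SELECT date, location, AVG(TV) AS avgtv
    FROM data
    WHERE fp = 1
    GROUP BY date, location
    ORDER BY date, location
""").fetchall()

# Map (date, location) to the average for easy lookup.
averages = {(d, loc): avg for d, loc, avg in rows}
print(averages)
```

The LA group averages to roughly 3.65, matching the hand calculation in the question (up to floating-point rounding).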

Finding First Appearing Value in a List of Duplicate Values

I have a table that stores the statuses an application goes through. Some applications go through the same status multiple times. Each time an application goes through a status, the time of the status change is recorded.
How can I pull a list of applications based on the first time an application goes through a particular status within a specified date range? Below is what I have tried thus far:
SELECT d1.STATUS,
d1.APPL_ID
FROM APP_STATUS d1
LEFT JOIN APP_STATUS d2 ON d1.APPL_ID = d2.APPL_ID AND d1.STATUS = 'AT_CUSTOMER' AND d2.STATUS = 'AT_CUSTOMER'
WHERE DATE(d1.STATUS_CREATE_DT) >= '2014-10-26'
AND DATE(d1.STATUS_CREATE_DT) <= '2014-11-25'
AND d2.STATUS IS NULL
GROUP BY d1.APPL_ID;
To get the first time a status goes through, try this query:
select a.appl_id, min(a.STATUS_CREATE_DT) as first_dt
from APP_STATUS a
where a.STATUS = 'AT_CUSTOMER' and
a.STATUS_CREATE_DT >= '2014-10-26' and
a.STATUS_CREATE_DT < date('2014-11-25') + interval 1 day
group by a.appl_id;
I think this does what you need. If you want more columns, then you can join this back to APP_STATUS.
Note that I changed the date logic a bit. The date functions are only on the constant side of the dates. This allows the query to take advantage of an index on STATUS_CREATE_DT, if appropriate.
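One detail worth checking is which "first time" is meant. Filtering on the date in WHERE returns the first occurrence inside the range, even if the application reached the status earlier. If the requirement is instead that the first-ever occurrence falls inside the range, the date test can move to HAVING. The sketch below shows that variant only; it runs on Python's built-in sqlite3 instead of MySQL, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE APP_STATUS (APPL_ID INTEGER, STATUS TEXT, STATUS_CREATE_DT TEXT);
    INSERT INTO APP_STATUS VALUES
        (1, 'AT_CUSTOMER', '2014-10-01 09:00:00'),  -- first occurrence before the range
        (1, 'AT_CUSTOMER', '2014-11-01 09:00:00'),
        (2, 'AT_CUSTOMER', '2014-10-30 12:00:00'),  -- first occurrence inside the range
        (2, 'AT_CUSTOMER', '2014-11-10 12:00:00'),
        (3, 'SUBMITTED',   '2014-11-05 08:00:00');  -- never reaches the status
""")

rows = conn.execute("""
    SELECT APPL_ID, MIN(STATUS_CREATE_DT) AS first_dt
    FROM APP_STATUS
    WHERE STATUS = 'AT_CUSTOMER'
    GROUP BY APPL_ID
    -- Keep an application only if its first-ever occurrence lies in the range.
    -- The upper bound is exclusive, so all of 2014-11-25 is included.
    HAVING MIN(STATUS_CREATE_DT) >= '2014-10-26'
       AND MIN(STATUS_CREATE_DT) <  '2014-11-26'
    ORDER BY APPL_ID
""").fetchall()
print(rows)
```

Application 1 is excluded here because it first reached the status on 2014-10-01; with the date test in WHERE it would be returned with 2014-11-01 as its first date.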

MySQL Query - Include dates without records

I have a report that displays a graph. The X axis uses the date from the below query. Where the query returns no date, I am getting gaps and would prefer to return a value. Is there any way to force a date where there are no records?
SELECT
DATE(instime),
CASE
WHEN direction = 1 AND duration > 0 THEN 'Incoming'
WHEN direction = 2 THEN 'Outgoing'
WHEN direction = 1 AND duration = 0 THEN 'Missed'
END AS type,
COUNT(*)
FROM taxticketitem
GROUP BY
DATE(instime),
CASE
WHEN direction = 1 AND duration > 0 THEN 'Incoming'
WHEN direction = 2 THEN 'Outgoing'
WHEN direction = 1 AND duration = 0 THEN 'Missed'
END
ORDER BY DATE(instime)
One possible way is to create a table of dates and LEFT JOIN your table with them. The table could look something like this:
CREATE TABLE `datelist` (
`date` DATE NOT NULL,
PRIMARY KEY (`date`)
);
and filled with all dates between, say Jan-01-2000 through Dec-31-2050 (here is my Date Generator script).
Next, write your query like this:
SELECT datelist.date, COUNT(taxticketitem.id) AS c
FROM datelist
LEFT JOIN taxticketitem ON datelist.date = DATE(taxticketitem.instime)
WHERE datelist.date BETWEEN '2012-01-01' AND '2012-12-31'
GROUP BY datelist.date
ORDER BY datelist.date
Using a LEFT JOIN and counting the non-NULL values from the right table ensures that the count is correct (0 if no row exists for a given date).
You would need a set of dates to LEFT JOIN your table to. Unfortunately, MySQL versions before 8.0 lack a way to generate one on the fly (8.0 added recursive common table expressions).
You would need to prepare a table with, say, 100000 consecutive integers from 0 to 99999 (or how long you think your maximum report range would be):
CREATE TABLE series (number INT NOT NULL PRIMARY KEY);
and use it like this:
SELECT '2013-01-01' + INTERVAL s.number DAY AS r_date, CASE ... END AS type, COUNT(ti.instime)
FROM series s
LEFT JOIN
taxticketitem ti
ON ti.instime >= '2013-01-01' + INTERVAL s.number DAY
AND ti.instime < '2013-01-01' + INTERVAL s.number + 1 DAY
WHERE s.number <= DATEDIFF('2013-02-01', '2013-01-01')
GROUP BY
r_date, type
Had to do something similar before.
You need to have a subselect to generate a range of dates. All the dates you want. Easiest with a start date added to a number:-
SELECT DATE_ADD(SomeStartDate, INTERVAL (a.i + b.i * 10) DAY)
FROM integers a, integers b
Given a table called integers with a single column called i and 10 rows containing 0 to 9, that SQL will give you a range of 100 days starting at SomeStartDate.
You can then left join your actual data against that to get the full range.
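Put together, the integers trick and the LEFT JOIN can be sketched as follows. This is an illustration only: it runs on Python's built-in sqlite3, so SQLite's `date(..., '+N days')` stands in for MySQL's DATE_ADD, the range is cut to four days, and the ticket rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE integers (i INTEGER PRIMARY KEY);
    INSERT INTO integers VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
    CREATE TABLE taxticketitem (id INTEGER PRIMARY KEY, instime TEXT);
    -- Invented rows: two on 2013-01-01, none on 2013-01-02, one on 2013-01-03.
    INSERT INTO taxticketitem (instime) VALUES
        ('2013-01-01 10:00:00'), ('2013-01-01 11:30:00'), ('2013-01-03 09:00:00');
""")

rows = conn.execute("""
    SELECT d.day, COUNT(t.id) AS c
    FROM (
        -- Cross join two copies of 0..9 to get 0..99, then keep the first 4 days.
        SELECT date('2013-01-01', '+' || (a.i + b.i * 10) || ' days') AS day
        FROM integers AS a CROSS JOIN integers AS b
        WHERE a.i + b.i * 10 < 4
    ) AS d
    LEFT JOIN taxticketitem AS t ON date(t.instime) = d.day
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()
for day, count in rows:
    print(day, count)
```

Counting `t.id` rather than `*` is what makes the empty days come out as 0 instead of 1.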

Finding entire fluctuation in a dataset

I have a table of historic data for a set of tanks in a MySQL database. I want to find fluctuations in the volume of tank contents of greater than 200 gallons/hour. My SQL statement thus far is:
SELECT t1.tankhistid as start, t2.tankhistid as end
FROM
(SELECT * from tankhistory WHERE tankid = ? AND curtime BETWEEN ? AND ?) AS t1
INNER JOIN
(SELECT * from tankhistory WHERE tankid = ? AND curtime BETWEEN ? AND ?) AS t2
ON t1.tankid = t2.tankid AND t1.curtime < t2.curtime
WHERE TIMESTAMPDIFF(HOUR, t1.curtime, t2.curtime) < 1 AND ABS(t1.vol - t2.vol) > 200
ORDER BY t1.tankhistid, t2.tankhistid
In the code above, curtime is a timestamp at the time of inserting the record, tankhistid is the table integer primary key, tankid is the individual tank id, and vol is the volume reading.
This returns too many results, since data is collected every 5 minutes and fluctuations could take hours (multiple rows with the same id in an end column and then a start column) or just over 10 minutes (multiple rows with the same start or end id). Example output:
7514576,7515478
7515232,7515478
7515314,7515478
7515396,7515478
7515478,7515560
7515478,7515642
7515478,7515724
Note that all of these rows should just be one: 7514576,7515724. The query takes 4 minutes for just one day of a tank's data, so any speed up would be great as well. I am guessing there is a way to take the current query and use it as a subquery, but I am not sure how to do the filtering.