MySQL: self join to produce pairs of dates - mysql

I have a table with entries for Items being 'lost' and 'found'. Each row has a date for the event. I'm hoping to build a query that returns matching triples of 'itemid', 'lost date', 'found date' by joining the table to itself.
This works to a point: unfortunately if there are multiple lost and found pairs for a given item each 'lost date' will be joined with all the 'found dates' that follow it.
Still with me?
The query goes something like:
select c0.ItemId, c0.ChangeDate, c1.ChangeDate from Changes c0
join Changes c1 on
c0.ItemId = c1.ItemId and c1.ChangeDate >= c0.ChangeDate
where c0.ChangeType = 9 -- lost
and c1.ChangeType = 10; -- found
What I'm hoping to achieve is some form of a given 'lost date' paired with only the next 'found date' in sequence (or NULL if no 'found date' exists). I'm (pretty) sure this is possible but I'm not seeing the path.
I was wondering about putting a sub-select in the first join and using a LIMIT 1 to get only one record but I don't see how to join this to the appropriate row in the main part of the select. MySQL tells me it doesn't exist. Fair enough.

The trick here is to stipulate 'and there is no other lost or found date between the lost and found dates', or, in SQL:
SELECT c0.ItemId, c0.ChangeDate, c1.ChangeDate
FROM Changes AS c0
JOIN Changes AS c1 ON c0.ItemId = c1.ItemId AND c1.ChangeDate >= c0.ChangeDate
WHERE c0.ChangeType = 9 -- Lost
AND c1.ChangeType = 10 -- Found
AND NOT EXISTS(SELECT *
FROM Changes AS c2
WHERE c2.ItemId = c1.ItemID
AND c2.ChangeType IN (9, 10) -- Lost or Found
AND c2.ChangeDate BETWEEN c0.ChangeDate AND c1.ChangeDate
AND (c2.ChangeDate != c0.ChangeDate AND c2.ChangeDate != c1.ChangeDate)
);
Because that is a correlated sub-query, it tends to slow down the query, but it should produce the correct rows.
There is an important caveat about the way I've eliminated the c0 and c1 rows by stipulating that the ChangeDate for the row in c2 should be different from either the lost date or the found date. However, the main query seems to allow for an item to be found on the same day that it is lost. There might be some other column - such as a ChangeId column - that is not mentioned in the query yet that could be used instead:
AND c2.ChangeID NOT IN (c0.ChangeID, c1.ChangeID)
You'll need to think about what happens if an item is lost on, say, 2011-06-07, and lost again on 2011-06-14, and only found on 2011-06-21. And what about if it is also found on 2011-06-28? Such problems should be prevented by the data entry processing, so the query above assumes there won't be such issues.
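To make the behaviour of the NOT EXISTS query concrete, here is a minimal sketch that runs it against a tiny, made-up data set. It assumes a ChangeId column exists (as suggested above) and uses Python's sqlite3 module as a stand-in for MySQL; the sample rows are hypothetical. Item 2 is lost twice before being found, which is the awkward case described above.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Changes ("
    "ChangeId INTEGER PRIMARY KEY, ItemId INTEGER, "
    "ChangeType INTEGER, ChangeDate TEXT)"
)
conn.executemany(
    "INSERT INTO Changes (ItemId, ChangeType, ChangeDate) VALUES (?, ?, ?)",
    [
        (1, 9, "2011-06-01"),   # item 1 lost
        (1, 10, "2011-06-03"),  # item 1 found
        (1, 9, "2011-06-07"),   # item 1 lost again
        (1, 10, "2011-06-21"),  # item 1 found again
        (2, 9, "2011-06-07"),   # item 2 lost
        (2, 9, "2011-06-14"),   # item 2 lost a second time without being found
        (2, 10, "2011-06-21"),  # item 2 found
    ],
)

# Pair each lost row with a found row only when no other lost/found
# row for the same item sits between them.
pairs = conn.execute("""
    SELECT c0.ItemId, c0.ChangeDate, c1.ChangeDate
    FROM Changes AS c0
    JOIN Changes AS c1
      ON c0.ItemId = c1.ItemId AND c1.ChangeDate >= c0.ChangeDate
    WHERE c0.ChangeType = 9
      AND c1.ChangeType = 10
      AND NOT EXISTS (
          SELECT 1
          FROM Changes AS c2
          WHERE c2.ItemId = c1.ItemId
            AND c2.ChangeType IN (9, 10)
            AND c2.ChangeDate BETWEEN c0.ChangeDate AND c1.ChangeDate
            AND c2.ChangeId NOT IN (c0.ChangeId, c1.ChangeId)
      )
    ORDER BY c0.ItemId, c0.ChangeDate
""").fetchall()

for row in pairs:
    print(row)
```

With this data, the first 'lost' row for item 2 (2011-06-07) produces no pair at all, because the second 'lost' row falls between it and the 'found' date.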

Generally when dealing with pairs of dates (e.g. start/end for scheduling) the advice is don't put them on separate rows. Put them in two columns of the same row. See Joe Celko's SQL Programming Style.
But that said, you can solve it with your current schema by doing another self-join to search for a ChangeDate between the two. If none is found (that is, if c2.* is null because of the outer join), then c0 and c1 are "adjacent."
select c0.ItemId, c0.ChangeDate, c1.ChangeDate
from Changes c0
inner join Changes c1 on
c0.ItemId = c1.ItemId and c1.ChangeDate > c0.ChangeDate
left outer join Changes c2 on
c0.ItemId = c2.ItemId and c2.ChangeDate > c0.ChangeDate
and c2.ChangeDate < c1.ChangeDate
and c2.ChangeType IN (9,10) -- edit
where c0.ChangeType = 9 -- lost
and c1.ChangeType = 10 -- found
and c2.ItemId IS NULL;
In the above example, I've assumed that ChangeDate is unique, and I changed the >= to >. If ChangeDate is not unique, you'll have to come up with some other expression to test for c2 "between" c0 and c1.
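The question also asked for NULL when a lost item has not been found yet, which an inner join between c0 and c1 cannot return. One possible alternative, which is a different technique from the joins above, is a correlated scalar subquery that picks the earliest later 'found' date. The sketch below assumes unique ChangeDate values per item, uses hypothetical sample rows, and runs on Python's sqlite3 module as a stand-in for MySQL. Note that it does not check for a second 'lost' row in between.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Changes (ItemId INTEGER, ChangeType INTEGER, ChangeDate TEXT)"
)
conn.executemany(
    "INSERT INTO Changes VALUES (?, ?, ?)",
    [
        (1, 9, "2011-06-01"),   # lost
        (1, 10, "2011-06-03"),  # found
        (1, 9, "2011-06-07"),   # lost again, never found
    ],
)

# For every lost row, MIN() returns the next found date,
# or NULL when no later found row exists.
lost_found = conn.execute("""
    SELECT c0.ItemId,
           c0.ChangeDate AS LostDate,
           (SELECT MIN(c1.ChangeDate)
            FROM Changes AS c1
            WHERE c1.ItemId = c0.ItemId
              AND c1.ChangeType = 10
              AND c1.ChangeDate > c0.ChangeDate) AS FoundDate
    FROM Changes AS c0
    WHERE c0.ChangeType = 9
    ORDER BY c0.ItemId, c0.ChangeDate
""").fetchall()

for row in lost_found:
    print(row)
```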

Related

MySQL in clause slow with 10 or more items

This query takes 18 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDTO', 'BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
This query takes 4 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
All I have done is remove one item from the IN clause. I can remove any one of the items. It runs in 4 seconds as long as there are 9 items or fewer, and takes 18 seconds as soon as I increase to 10 items.
I thought MySQL limited the length of a command by size, i.e. 1MB.
More than just the EXPLAIN, use EXPLAIN FORMAT=JSON and get the "Optimizer trace" for the query. I suspect the length of the IN leads to picking a different query plan.
There is virtually no limit to the number of items in IN. I have seen as many as 70K.
That aside, you may be able to speed up even the 4-sec version...
I suggest having this index. Grrr... I can't tell which columns are in which tables. So, if these are all in one table, then make such an index:
INDEX(production_type, contract, project) -- in any order
If those are all in wd, then tack on a 4th column - any of week, itemgroup, days.
Be cautious about COUNT(wd.hold_code).
COUNT(x) checks x for being non-NULL; is that what you want? If not, then simply say COUNT(*).
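The difference between COUNT(x) and COUNT(*) is easy to see on a small example. This is a minimal sketch with made-up rows, run through Python's sqlite3 module rather than MySQL; the behaviour shown is standard SQL and applies to both.

```python
import sqlite3

# Hypothetical rows: one of the three has a NULL hold_code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly_data (week TEXT, hold_code TEXT)")
conn.executemany(
    "INSERT INTO weekly_data VALUES (?, ?)",
    [("201717", "A"), ("201717", None), ("201718", "B")],
)

# COUNT(*) counts rows; COUNT(hold_code) skips rows where hold_code IS NULL.
count_all, count_non_null = conn.execute(
    "SELECT COUNT(*), COUNT(hold_code) FROM weekly_data"
).fetchone()

print(count_all, count_non_null)
```

Since the query in the question already filters on hold_code IS NOT NULL, both forms would return the same number there.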
When JOINing, then GROUP BY, you get an "explode-implode". The number of intermediate rows is big; that is when the COUNT is performed.
It seems wrong to both COUNT(hold_code) and GROUP BY hold_code. What are you trying to do?
For further discussion, please provide SHOW CREATE TABLE and EXPLAIN.
Please note that the size of a MySQL IN clause is limited only by the max_allowed_packet value. You may want to check whether an equivalent NOT IN is faster. I also suggest passing the values for the IN clause as a single buffered string instead of comma-separated values and giving that a try.

How to Find First Valid Row in SQL Based on Difference of Column Values

I am trying to find a reliable query which returns the first instance of an acceptable insert range.
Research:
Some of the links below address similar questions, but I could get none of them to work for me.
Find first available date, given a date range in SQL
Find closest date in SQL Server
MySQL difference between two rows of a SELECT Statement
How to find a gap in range in SQL
and more...
Objective Query Function:
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
Where InsertRange(1) is the value the query should return. In other words, this would be the first instance where the above condition is satisfied.
Table Structure:
Primary Key: StartRange
StartRange(i-1) < StartRange(i)
StartRange(i-1) + EndRange(i-1) < StartRange(i)
Example Dataset
Below is an example User table (3 columns), with a set range distribution. StartRanges are always ordered in a strictly ascending way, UserID are arbitrary strings, only the sequences of StartRange and EndRange matters:
StartRange EndRange UserID
312 6896 user0
7134 16268 user1
16877 22451 user2
23137 25142 user3
25955 28272 user4
28313 35172 user5
35593 38007 user6
38319 38495 user7
38565 45200 user8
46136 48007 user9
My current Query
I am trying to use this query at the moment:
SELECT t2.StartRange, t2.EndRange
FROM user AS t1, user AS t2
WHERE (t1.StartRange - t2.StartRange+1) > NewValue
ORDER BY t1.EndRange
LIMIT 1
Example Case
Given the table, if NewValue = 800, then the returned answer should be 23137. This means, the first available slot would be between user3 and user4 (with an actual slot size = 813):
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
InsertRange = (StartRange(6) - EndRange(5)) > NewValue
23137 = 25955 - 25142 > 800
More Comments
My query above seemed to be working for the special case where StartRanges were tightly packed (i.e. StartRange(i) = StartRange(i-1) + EndRange(i-1) + 1). It no longer works with a less tightly packed set of StartRanges.
Keep in mind that SQL tables have no implicit row order. It seems fair to order your table by StartRange value, though.
We can start to solve this by writing a query to obtain each row paired with the row preceding it. In MySQL versions before 8.0, it's hard to do this beautifully because they lack a row-numbering function.
This works (http://sqlfiddle.com/#!9/4437c0/7/0). It may have nasty performance because it generates O(n^2) intermediate rows. There's no row for user0; it can't be paired with any preceding row because there is none.
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
Then, you can use that as a subquery, and apply your conditions, which are
gap >= 800: WHERE SB-EA >= 800
first matching row (lowest StartRange value): ORDER BY SB
just one: LIMIT 1
Here's the query (http://sqlfiddle.com/#!9/4437c0/11/0)
SELECT SB-EA Gap,
EA+1 Beginning_of_gap, SB-1 Ending_of_gap,
UserId UserID_after_gap
FROM (
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
) pairs
WHERE SB-EA >= 800
ORDER BY SB
LIMIT 1
Notice that you may actually want the smallest matching gap instead of the first matching gap. That's called best fit, rather than first fit. To get that you use ORDER BY SB-EA instead.
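To see first fit and best fit disagree, here is a minimal sketch that loads the example data from the question and runs the query with both orderings. It uses Python's sqlite3 module as a stand-in for MySQL, a hypothetical table name user_ranges, and a hypothetical helper find_gap. A threshold of 300 is used as well as 800, because with 800 both strategies happen to pick the same gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_ranges ("
    "StartRange INTEGER PRIMARY KEY, EndRange INTEGER, UserID TEXT)"
)
conn.executemany(
    "INSERT INTO user_ranges VALUES (?, ?, ?)",
    [
        (312, 6896, "user0"), (7134, 16268, "user1"),
        (16877, 22451, "user2"), (23137, 25142, "user3"),
        (25955, 28272, "user4"), (28313, 35172, "user5"),
        (35593, 38007, "user6"), (38319, 38495, "user7"),
        (38565, 45200, "user8"), (46136, 48007, "user9"),
    ],
)

# Each row b paired with the largest EndRange that precedes it.
PAIRS = """
    SELECT MAX(a.EndRange) AS EA, b.StartRange AS SB, b.UserID AS UserID
    FROM user_ranges AS a
    JOIN user_ranges AS b ON a.EndRange <= b.StartRange
    GROUP BY b.StartRange, b.EndRange, b.UserID
"""

def find_gap(min_gap, order_by):
    """Return (gap, user after the gap) for the first row in the given order."""
    # order_by is a trusted literal chosen by this script, never user input.
    sql = f"""
        SELECT SB - EA AS Gap, UserID
        FROM ({PAIRS}) AS pairs
        WHERE SB - EA >= ?
        ORDER BY {order_by}
        LIMIT 1
    """
    return conn.execute(sql, (min_gap,)).fetchone()

example_case = find_gap(800, "SB")     # the case from the question
first_fit = find_gap(300, "SB")        # lowest StartRange with a big enough gap
best_fit = find_gap(300, "SB - EA")    # smallest gap that is still big enough

print(example_case, first_fit, best_fit)
```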
Edit: There is another way to use MySQL to join adjacent rows, that doesn't have the O(n^2) performance issue. It involves employing user variables to simulate a row_number() function. The query involved is a hairball (that's a technical term). It's described in the third alternative of the answer to this question. How do I pair rows together in MYSQL?

Field showing null value when joining table

below is my query
select C.cName, DATE_FORMAT(CT.dTransDate,'%d-%M-%Y') as dTransDate,
(c.nOpBalance+IFNULL(CT.nAmount,0)) AS DrAMount,
IFNULL(CTR.nAmount,0) AS CrAMount,
((c.nOpBalance+IFNULL(CT.nAmount,0))-IFNULL(CTR.nAmount,0)) AS Balance,
CT.cTransRefType, CT.cRemarks,
cinfo.cCompanyName, cinfo.caddress1, cinfo.cPhoneOffice, cinfo.cMobileNo, cinfo.cEmailID, cinfo.cWebsite
from Customer C
LEFT JOIN Client_Transaction CT ON CT.nClientPk = C.nCustomerPk
AND CT.cTransRefType='PAYMENT' AND CT.cClientType='CUSTOMER'
AND CT.dTransDate between '' AND ''
LEFT JOIN Client_Transaction CTR ON CTR.nClientPk = C.nCustomerPk
AND CTR.cTransRefType='RECEIPT' AND CTR.cClientType='CUSTOMER'
AND CTR.dTransDate between '2015-05-01' AND '2015-05-29'
LEFT JOIN companyinfo cinfo ON cinfo.cCompanyName like '%Fal%'
Where C.nCustomerPk = 4
Order By dTransDate
It shows all the values, but dTransDate, cTransRefType and cRemarks come back as NULL.
One obvious thing jumps out at us:
CT.dTransDate BETWEEN '' AND ''
^^ ^^
Another thing that jumps out at us is that there's a semi-Cartesian join between rows from CT and rows from CTR. If 5 rows are returned from CT for a given customer, and 5 rows are returned from CTR, that's going to produce a total of 5*5 = 25 rows. That just doesn't seem like a resultset that you'd really want returned.
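The row multiplication is easy to reproduce. The sketch below is a minimal illustration with made-up amounts, using Python's sqlite3 module as a stand-in for MySQL and only the columns needed for the join: one customer with 2 'PAYMENT' rows and 3 'RECEIPT' rows yields 2*3 = 6 rows when the transaction table is joined twice, but 5 rows when it is joined once.

```python
import sqlite3

# Hypothetical data: customer 4 has 2 payments and 3 receipts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (nCustomerPk INTEGER PRIMARY KEY, cName TEXT);
    CREATE TABLE Client_Transaction (
        nClientPk INTEGER, cTransRefType TEXT, nAmount INTEGER
    );
    INSERT INTO Customer VALUES (4, 'Example customer');
    INSERT INTO Client_Transaction VALUES
        (4, 'PAYMENT', 100), (4, 'PAYMENT', 200),
        (4, 'RECEIPT', 10), (4, 'RECEIPT', 20), (4, 'RECEIPT', 30);
""")

# Joining the transaction table twice pairs every payment with every receipt.
double_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM Customer c
    LEFT JOIN Client_Transaction ct
      ON ct.nClientPk = c.nCustomerPk AND ct.cTransRefType = 'PAYMENT'
    LEFT JOIN Client_Transaction ctr
      ON ctr.nClientPk = c.nCustomerPk AND ctr.cTransRefType = 'RECEIPT'
    WHERE c.nCustomerPk = 4
""").fetchone()[0]

# Joining once returns each transaction exactly one time.
single_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM Customer c
    LEFT JOIN Client_Transaction t
      ON t.nClientPk = c.nCustomerPk
     AND t.cTransRefType IN ('PAYMENT', 'RECEIPT')
    WHERE c.nCustomerPk = 4
""").fetchone()[0]

print(double_join_rows, single_join_rows)
```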
Also, if more than one row is returned from cinfo, that's going to cause another semi-Cartesian join. If there are two rows returned from cinfo, the total number of rows in the resultset will be doubled. It's valid to do that in SQL, but this is an unusual pattern.
The calculation of the balance is also very strange. For each row, the nAmount is added/subtracted from opening balance. On the next row, the same thing, on the original opening balance. There's nothing invalid SQL-wise with doing that, but the result being returned just seems bizarre. (It seems much more likely that you'd want to show a running balance, with each transaction.)
Another thing that jumps out at us is that you are ordering the rows by a string representation of a DATE, with the day as the leading portion. (As long as all the rows have date values in the same year and month, that will probably work, but it just seems bizarre that we wouldn't sort on the DATE value, or a canonical string representation.)
I strongly suspect that you want to run a query that's more like this. (This doesn't do a "running balance" calculation. It does return the 'PAYMENT' and 'RECEIPT' rows as individual rows, without producing a semi-Cartesian result.)
SELECT c.cName
, DATE_FORMAT(t.dTransDate,'%d-%M-%Y') AS dTransDate
, c.nOpBalance
, IF(t.cTransRefType='PAYMENT',IFNULL(t.nAmount,0),0) AS DrAMount
, IF(t.cTransRefType='RECEIPT',IFNULL(t.nAmount,0),0) AS CrAMount
, t.cTransRefType
, t.cRemarks
, ci.*
FROM Customer c
LEFT
JOIN Client_Transaction t
ON t.nClientPk = c.nCustomerPk
AND t.cClientType = 'CUSTOMER'
AND t.dTransDate >= '2015-05-01'
AND t.dTransDate <= '2015-05-29'
AND t.cTransRefType IN ('PAYMENT','RECEIPT')
CROSS
JOIN ( SELECT cinfo.cCompanyName
, cinfo.caddress1
, cinfo.cPhoneOffice
, cinfo.cMobileNo
, cinfo.cEmailID
, cinfo.cWebsite
FROM companyinfo cinfo
WHERE cinfo.cCompanyName LIKE '%Fal%'
ORDER BY cinfo.cCompanyName
LIMIT 1
) ci
WHERE c.nCustomerPk = 4
ORDER BY t.dTransDate, t.cTransRefType, t.id

Calculating differences with SQL

I have been doing some research on this subject for a while, and thanks to a solution posted in another topic, I got close to solving this issue.
I am attempting to get the changes in a column of data: row(n) - row(n-1)
update Table tt1
left outer JOIN Table tt2
on tt1.name = tt2.name
and tt1.date-tt2.date=1
set tt1.delta = (tt1.amount-ifnull(tt2.amount, tt1.amount));
Output is
Date | Value | Delta
2013-03-30| 38651 | 393
2013-03-31| 39035 | 384
2013-04-01| 39459 | 0
2013-04-02| 39806 | 347
As you can see, the difference does not calculate for the first of April (the rest of the values are just fine). The same happens for the 1st day of every month.
My guess is that there is something to do with [and tt1.date-tt2.date=1], but I can't figure out exactly what.
Thanks for all your help in advance!
I made some changes to your statement... your error is either on the way you handle the dates or in the way you handle the delta...
update Table tt1
left outer JOIN Table tt2
on tt1.name = tt2.name
and tt2.date = date_sub(tt1.date, interval 1 day)
set tt1.delta = case when tt2.amount is not null then tt1.amount - tt2.amount else -1 end;
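Try DATEDIFF, which gives the difference between two dates in days:
and DATEDIFF(tt1.date, tt2.date) = 1
The plain subtraction fails because the dates are compared as numbers: 20130401 - 20130331 is 70, not 1, so no previous row is matched on the first day of a month and the delta comes out as 0. The comparison has to take the month into account as well, which DATEDIFF does.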
One guess would be that the date is not stored as date but rather has a time component. You can get around this by converting to date using date() or using datediff():
update Table tt1 left outer JOIN
Table tt2
on tt1.name = tt2.name and datediff(tt1.date, tt2.date) = 1
set tt1.delta = (tt1.amount-ifnull(tt2.amount, tt1.amount));
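The first-of-the-month symptom is consistent with MySQL treating a DATE as the number YYYYMMDD when it appears in arithmetic. The sketch below imitates that conversion in plain Python (the helper as_mysql_number is hypothetical and only mimics the behaviour; it is not a MySQL API) and compares it with a real calendar difference.

```python
from datetime import date

def as_mysql_number(d):
    """Mimic a DATE used in numeric context: YYYYMMDD as an integer."""
    return d.year * 10000 + d.month * 100 + d.day

march_30, march_31, april_1 = date(2013, 3, 30), date(2013, 3, 31), date(2013, 4, 1)

# Within a month the numeric subtraction happens to give 1.
within_month = as_mysql_number(march_31) - as_mysql_number(march_30)

# Across a month boundary it does not: 20130401 - 20130331.
across_month = as_mysql_number(april_1) - as_mysql_number(march_31)

# A true calendar difference, which is what DATEDIFF computes.
calendar_diff = (april_1 - march_31).days

print(within_month, across_month, calendar_diff)
```

So a join condition of the form tt1.date - tt2.date = 1 matches consecutive days inside a month but never across a month boundary, whereas a DATEDIFF-based condition matches both.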

MySQL schedule conflicts

Hey, I stumbled upon this site looking for solutions for event overlaps in MySQL tables. I was SO impressed with the solution (which is helping already) I thought I'd see if I could get some more help...
Okay, so Joe wants to swap shifts with someone at work. He has a court date. He goes to the shift swap form and it pulls up this week's schedule (or what's left of it). This is done with a DB query. No sweat. He picks a shift. From this point, it gets prickly.
So, first, the form passes the shift start and shift end to the script. It runs a query for anyone who has a shift that overlaps this shift. They can't work two shifts at once, so all user IDs from this query are put on a black list. This query looks like:
SELECT DISTINCT user_id FROM shifts
WHERE
FROM_UNIXTIME('$swap_shift_start') < shiftend
AND FROM_UNIXTIME('$swap_shift_end') > shiftstart
Next, we run a query for all shifts that are a) the same length (company policy), and b) don't overlap with any other shifts Joe is working.
What I currently have is something like this:
SELECT *
FROM shifts
WHERE shiftstart BETWEEN FROM_UNIXTIME('$startday') AND FROM_UNIXTIME('$endday')
AND user_id NOT IN ($busy_users)
AND (TIME_TO_SEC(TIMEDIFF(shiftend,shiftstart)) = '$swap_shift_length')
$conflict_dates
ORDER BY shiftstart, lastname
Now, you are probably wondering "what is $conflict_dates???"
Well, when Joe submits the swap shift, it reloads his shifts for the week in case he decides to check out another shift's potential. So when it does that first query, while the script is looping through and outputting his choices, it is also building a string that looks kind of like:
AND NOT(
'joe_shift1_start' < shiftend
AND 'joe_shift1_end' > shiftstart)
AND NOT(
'joe_shift2_start' < shiftend
AND 'joe_shift2_end' > shiftstart)
...etc
So that the database is getting a pretty long query along the lines of:
SELECT *
FROM shifts
WHERE shiftstart BETWEEN FROM_UNIXTIME('$startday') AND FROM_UNIXTIME('$endday')
AND user_id NOT IN ('blacklisteduser1', 'blacklisteduser2',...etc)
AND (TIME_TO_SEC(TIMEDIFF(shiftend,shiftstart)) = '$swap_shift_length')
AND NOT(
'joe_shift1_start' < shiftend
AND 'joe_shift1_end' > shiftstart)
AND NOT(
'joe_shift2_start' < shiftend
AND 'joe_shift2_end' > shiftstart)
AND NOT(
'joe_shift3_start' < shiftend
AND 'joe_shift3_end' > shiftstart)
AND NOT(
'joe_shift4_start' < shiftend
AND 'joe_shift4_end' > shiftstart)
...etc
ORDER BY shiftstart, lastname
So, my hope is that either SQL has some genius way of dealing with this in a simpler way, or that someone can point out a fantastic logical principle that accounts for the potential conflicts in a much smarter way. (Notice the use of 'start < end, end > start'; before I found that, I was using BETWEENs and had to subtract a minute off both ends.)
Thanks!
A
I think you should be able to exclude Joe's other shifts using an inner select instead of the generated string, something like:
SELECT *
FROM shifts s1
WHERE shiftstart BETWEEN FROM_UNIXTIME('$startday') AND FROM_UNIXTIME('$endday')
AND user_id NOT IN ($busy_users)
AND (TIME_TO_SEC(TIMEDIFF(shiftend,shiftstart)) = '$swap_shift_length')
AND (SELECT COUNT(1) FROM shifts s2
WHERE s2.user_id = $joes_user_id
AND s1.shiftstart < s2.shiftend
AND s2.shiftstart < s1.shiftend) = 0
ORDER BY shiftstart, lastname
Basically, each row has an inner query for the count of Joe's shifts which overlap, and makes sure that it's zero. Thus, only rows which don't overlap with any of Joe's existing shifts will be returned.
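Here is a minimal sketch of that correlated subquery on made-up data, using Python's sqlite3 module as a stand-in for MySQL. The user ids and shift times are hypothetical, and only the overlap condition is kept (the week, blacklist and length filters are left out). User 4's shift starts exactly when Joe's ends, which the strict < comparison does not treat as an overlap.

```python
import sqlite3

JOE = 1  # hypothetical user id for Joe

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shifts (user_id INTEGER, shiftstart TEXT, shiftend TEXT)"
)
conn.executemany(
    "INSERT INTO shifts VALUES (?, ?, ?)",
    [
        (JOE, "2024-01-01 09:00:00", "2024-01-01 17:00:00"),  # Joe's shift
        (2, "2024-01-01 12:00:00", "2024-01-01 20:00:00"),    # overlaps Joe
        (3, "2024-01-02 09:00:00", "2024-01-02 17:00:00"),    # different day
        (4, "2024-01-01 17:00:00", "2024-01-02 01:00:00"),    # starts as Joe ends
    ],
)

# Keep only shifts for which the count of overlapping Joe shifts is zero.
free_for_joe = [
    row[0]
    for row in conn.execute(
        """
        SELECT s1.user_id
        FROM shifts AS s1
        WHERE s1.user_id <> ?
          AND (SELECT COUNT(1)
               FROM shifts AS s2
               WHERE s2.user_id = ?
                 AND s1.shiftstart < s2.shiftend
                 AND s2.shiftstart < s1.shiftend) = 0
        ORDER BY s1.shiftstart
        """,
        (JOE, JOE),
    )
]

print(free_for_joe)
```

Passing Joe's id as a bound parameter, rather than interpolating it into the SQL string, also avoids the injection risk that comes with building the query by hand.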
You could load the joe_shift{1,2,3} values into a TEMPORARY table and then do a query to join against it, using an outer join to find only shifts that don't match any:
CREATE TEMPORARY TABLE joes_shifts (
shiftstart DATETIME,
shiftend DATETIME
);
INSERT INTO joes_shifts (shiftstart, shiftend) VALUES
('$joe_shift1_start', '$joe_shift1_end'),
('$joe_shift2_start', '$joe_shift2_end'),
('$joe_shift3_start', '$joe_shift3_end'),
('$joe_shift4_start', '$joe_shift4_end');
-- make sure you have validated these variables to prevent SQL injection
SELECT s.*
FROM shifts s
LEFT OUTER JOIN joes_shifts j
ON (j.shiftstart < s.shiftend AND j.shiftend > s.shiftstart)
WHERE j.shiftstart IS NULL
AND s.shiftstart BETWEEN FROM_UNIXTIME('$startday') AND FROM_UNIXTIME('$endday')
AND s.user_id NOT IN ('blacklisteduser1', 'blacklisteduser2',...etc)
AND (TIME_TO_SEC(TIMEDIFF(s.shiftend,s.shiftstart)) = '$swap_shift_length');
Because of the LEFT OUTER JOIN, when there is no matching row in joes_shifts, the columns are NULL.
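A minimal sketch of this anti-join on made-up data follows, using Python's sqlite3 module as a stand-in for MySQL. The shift times and user ids are hypothetical, and the week, blacklist and length filters are omitted. Both comparisons in the join condition must hold together (AND) for two shifts to overlap; a condition that accepts either comparison alone would match almost every shift and leave nothing to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shifts (user_id INTEGER, shiftstart TEXT, shiftend TEXT);
    CREATE TEMPORARY TABLE joes_shifts (shiftstart TEXT, shiftend TEXT);
""")
conn.executemany(
    "INSERT INTO shifts VALUES (?, ?, ?)",
    [
        (2, "2024-01-01 12:00:00", "2024-01-01 20:00:00"),  # clashes with Joe on Monday
        (3, "2024-01-02 09:00:00", "2024-01-02 17:00:00"),  # Joe is free on Tuesday
        (5, "2024-01-03 09:00:00", "2024-01-03 17:00:00"),  # clashes with Joe on Wednesday
    ],
)
# Bound parameters keep Joe's values out of the SQL text.
conn.executemany(
    "INSERT INTO joes_shifts VALUES (?, ?)",
    [
        ("2024-01-01 09:00:00", "2024-01-01 17:00:00"),
        ("2024-01-03 13:00:00", "2024-01-03 21:00:00"),
    ],
)

# Shifts with no overlapping row in joes_shifts come back with NULL j columns.
swappable = [
    row[0]
    for row in conn.execute("""
        SELECT s.user_id
        FROM shifts AS s
        LEFT OUTER JOIN joes_shifts AS j
          ON j.shiftstart < s.shiftend AND j.shiftend > s.shiftstart
        WHERE j.shiftstart IS NULL
        ORDER BY s.shiftstart
    """)
]

print(swappable)
```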