Find rows with incorrect dates in historic data - MySQL

I have a table that is a historic log. I recently fixed a bug that was writing incorrect dates into that table: the dates should be sequential, but in some cases a date was written that is much older than the previous one.
How can I get, for each entity_id, all the rows whose dates are out of sequence? In the example below I should get rows 5 and 10.
The table has millions of rows and thousands of different entities. I was thinking of comparing the results of ordering by date and ordering by id, but that is a lot of manual work.
| id | entity_id | time_stamp |
|--------|-------------|---------------|
| 1 | 7 | 2019-01-22 |
| 2 | 9 | 2019-01-05 |
| 3 | 6 | 2019-03-14 |
| 4 | 9 | 2019-04-20 |
| 5 | 6 | 2015-10-04 | WRONG
| 6 | 9 | 2019-07-15 |
| 7 | 3 | 2019-07-04 |
| 8 | 7 | 2019-06-01 |
| 9 | 6 | 2019-11-04 |
| 10 | 7 | 2019-03-04 | WRONG
Is there any function to compare against the previous date for the same entity_id? I'm completely lost here and not sure how to clean the data. The database is MySQL, by the way.

If you are running MySQL 8.0, you can use lag(); the idea is to order records by id within groups having the same entity_id, and then to filter on records where the current timestamp is smaller than the previous one:
select t.*
from (
    select t.*, lag(time_stamp) over(partition by entity_id order by id) lag_time_stamp
    from mytable t
) t
where time_stamp < lag_time_stamp
In earlier versions, one option is to use a correlated subquery to get the previous timestamp:
select t.*
from mytable t
where time_stamp < (
    select time_stamp
    from mytable t1
    where t1.entity_id = t.entity_id and t1.id < t.id
    order by id desc
    limit 1
)

This returns every row for which an earlier row (lower id) of the same entity has a later timestamp, i.e. rows 5 and 10 in the example:
SELECT s1.*
FROM sourcetable s1
WHERE EXISTS ( SELECT NULL
               FROM sourcetable s2
               WHERE s2.id < s1.id
                 AND s1.entity_id = s2.entity_id
                 AND s1.time_stamp < s2.time_stamp )
An index on (entity_id, id, time_stamp) or (entity_id, time_stamp, id) will improve performance.
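As a minimal sketch (assuming the sourcetable name from the query above; the index name is arbitrary), such an index could be added like this:
-- covering index so the EXISTS probe can be answered from the index alone
ALTER TABLE sourcetable ADD INDEX idx_entity_id_ts (entity_id, id, time_stamp);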

Related

How can I create a self-incrementing ID per day in MySQL?

I have a table:
CREATE TABLE bills
( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
, createdAt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
, idDay INT NULL
);
I want the idDay field of the first record of each day to be 1 and to keep incrementing from there within the day, for example:
| id | createdAt | idDay |
|----------|----------------|-------|
| 1 | 2021-01-10 | 1 |
| 2 | 2021-01-10 | 2 |
| 3 | 2021-01-11 | 1 |
| 4 | 2021-01-11 | 2 |
| 5 | 2021-01-11 | 3 |
| 6 | 2021-01-12 | 1 |
| 7 | 2021-01-13 | 1 |
| 8 | 2021-01-13 | 2 |
Is the idDay field necessary, or can I do this in the SELECT?
I think I could do this with a procedure, but how?
Thanks for the help. 😁
You can use the row_number() window function available since MySQL 8.
SELECT id,
       createdat,
       row_number() OVER (PARTITION BY date(createdat)
                          ORDER BY id) idday
FROM bills;
(Or ORDER BY createdat, if that defines the order, not the id.)
But since window functions are calculated after a WHERE clause is applied, the number might be different for a record if previous records for a day are filtered. It's not clear from your question if this is a problem or not. If it is a problem, you can use the query in a derived table or create a view with it and work on that.
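For instance, a minimal sketch of that derived-table variant (assuming the bills table above; the filter date is just an example):
SELECT *
FROM (SELECT id,
             createdat,
             row_number() OVER (PARTITION BY date(createdat)
                                ORDER BY id) idday
      FROM bills) numbered
WHERE date(createdat) = '2021-01-11';
Because the numbering happens inside the derived table, the outer WHERE does not change the idday values.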
Yet another option is a correlated subquery counting the "older" records.
SELECT b1.id,
       b1.createdat,
       (SELECT count(*) + 1
        FROM bills b2
        WHERE b2.createdat >= date(b1.createdat)
          AND b2.createdat < date_add(date(b1.createdat), INTERVAL 1 DAY)
          AND b2.id < b1.id) idday
FROM bills b1;
(If createdat defines the order, change b2.createdat < date_add(date(b1.createdat), INTERVAL 1 DAY) to b2.createdat <= b1.createdat.)
That would also work in lower MySQL versions and you can add a WHERE clause (to the outer query) without changing the numbers.
You can just calculate the number in a select (an index on createdAt helps this perform well):
select b.id, b.createdAt, count(b2.id)+1 as idDay
from bills b
left join bills b2 on b2.createdAt >= date(b.createdAt)
                  and b2.createdAt < date(b.createdAt) + interval 1 day
                  and b2.id < b.id
where ...
group by b.id

Using LIMIT in a subquery based on another field in MySQL

Is it possible to use LIMIT based on another column inside a subquery in MySQL? Here is a working query of what I mean.
SELECT id, name,
(SELECT AVG(value) FROM t2 WHERE t1id = t1.id ORDER BY value DESC LIMIT 4) as average
FROM t1
However, I'd like to replace the "4" with a field from t1.
Something like this where table t1 has fields id, name, size:
SELECT id, name,
(SELECT AVG(value) FROM t2 WHERE t1id = t1.id ORDER BY value DESC LIMIT t1.size) as average
FROM t1
I could join t1 and t2, but I'm not sure that works for this. Does it?
Edit:
Here's some sample data to show what I mean:
Table t1
| id | name | Size |
|----|------|------|
| 1 | Bob | 4 |
| 2 | Joe | 3 |
| 3 | Sam | 4 |
Table t2
| t1id | value |
|------|-------|
| 1 | 16 |
| 1 | 14 |
| 1 | 12 |
| 1 | 10 |
| 1 | 8 |
| 2 | 10 |
| 2 | 8 |
| 2 | 6 |
| 2 | 4 |
| 3 | 20 |
| 3 | 15 |
| 3 | 10 |
| 3 | 5 |
| 3 | 2 |
Expected result:
| id | name | avg |
|----|------|------|
| 1 | Bob | 13 |
| 2 | Joe | 8 |
| 3 | Sam | 12.5 |
Notice that the average is the average of only the top t1.size values. For example the average for Bob is 13 and not 12 (based on 4 values and not 5) and the average for Joe is 8 and not 7 (based on 3 values and not 4).
In MySQL, you have little choice other than LEFT JOIN and aggregation:
SELECT t1.id, t1.name, AVG(t2.value) as average
FROM t1 LEFT JOIN
     (SELECT t2.*,
             ROW_NUMBER() OVER (PARTITION BY t1id ORDER BY value DESC) as seqnum
      FROM t2
     ) t2
     ON t2.t1id = t1.id AND seqnum <= t1.size
GROUP BY t1.id, t1.name;
Here is a db<>fiddle.
No, you cannot use a column reference in a LIMIT clause.
https://dev.mysql.com/doc/refman/8.0/en/select.html has detailed documentation about MySQL's SELECT statement including all its clauses.
It says:
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. LIMIT takes one or two numeric arguments, which must both be nonnegative integer constants, with these exceptions:
Within prepared statements, LIMIT parameters can be specified using ? placeholder markers.
Within stored programs, LIMIT parameters can be specified using integer-valued routine parameters or local variables.
Expressions, including subqueries, are not mentioned as legal arguments for the LIMIT clause.
A simple solution would be to do your task in two queries: the first to get the size and then use that value as a constant value in the second query that includes the LIMIT.
Not every task needs to be done in a single SQL statement.
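If it does have to happen in SQL, here is a minimal sketch of that two-step idea for a single t1 row, using a user variable and a prepared statement (the ? placeholders are the documented exception quoted above; the row id 1 is just an example):
-- fetch the size for the row we care about
SET @id := 1;
SET @size := (SELECT size FROM t1 WHERE id = @id);
-- use it as a LIMIT parameter via a prepared statement
PREPARE avg_top FROM
  'SELECT AVG(value) AS average
     FROM (SELECT value FROM t2 WHERE t1id = ? ORDER BY value DESC LIMIT ?) top_values';
EXECUTE avg_top USING @id, @size;
DEALLOCATE PREPARE avg_top;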

Calculate average, minimum, maximum interval between date

I am trying to do this with SQL. I have a transaction table which contains a transaction_date column. After grouping by date, I got this list:
| transaction_date |
|------------------|
| 2019-03-01 |
| 2019-03-04 |
| 2019-03-05 |
| ... |
From these 3 transaction dates, I want to achieve:
Average = ((4-1) + (5-4)) / 2 = 2 days (calculating DATEDIFF between each date and the next one)
Minimum = 1 day
Maximum = 3 days
Is there any good syntax for this? Otherwise I would have to iterate over all the dates using a WHILE loop.
Thanks in advance
If your MySQL version doesn't support the LAG or LEAD functions, you can use a correlated subquery to get the next transaction date, then use DATEDIFF to get the gap in days.
Query 1:
SELECT avg(diffDt), min(diffDt), MAX(diffDt)
FROM (
    SELECT DATEDIFF((SELECT transaction_date
                     FROM T tt
                     WHERE tt.transaction_date > t1.transaction_date
                     ORDER BY tt.transaction_date
                     LIMIT 1
                    ), transaction_date) diffDt
    FROM T t1
) t1
Results:
| avg(diffDt) | min(diffDt) | MAX(diffDt) |
|-------------|-------------|-------------|
| 2 | 1 | 3 |
If your MySQL version is 8.0 or higher, you can use the LEAD window function instead of the subquery.
Query #1
SELECT avg(diffDt),min(diffDt),MAX(diffDt)
FROM (
SELECT DATEDIFF(LEAD(transaction_date) OVER(ORDER BY transaction_date),transaction_date) diffDt
FROM T t1
) t1;
| avg(diffDt) | min(diffDt) | MAX(diffDt) |
| ----------- | ----------- | ----------- |
| 2 | 1 | 3 |
View on DB Fiddle

SQL select rows which are identical in two values in a way that retains Edit features in output

Apologies if the answer is dead obvious but in spite of a lot of research and trying out different commands, the solution escapes me (I'm more of a lexicographer than a dev).
We have a table which for various reasons has ended up with some rows which have duplicated values in critical cells. A mockup looks like this:
Unique_ID | E_ID | Date | User_ID | V_value
1 | 500 | 2012-05-12 | 23 | 3
2 | 501 | 2012-05-12 | 23 | 3
3 | 501 | 2012-05-13 | 23 | 1
4 | 502 | 2012-05-13 | 23 | 2
5 | 503 | 2012-05-12 | 23 | 2
6 | 7721 | 2012-05-22 | 8845 | 3
7 | 7722 | 2012-05-22 | 8845 | 3
8 | 7722 | 2012-05-22 | 8845 | 3
9 | 7723 | 2012-05-22 | 8845 | 3
So the rows I need as output are Unique_ID 2 & 3 and 7 & 8 as they are identical as regards the E_ID and User_ID field. The values of the other fields are not relevant to our problem. So what I want is this, ideally:
Unique_ID | E_ID | Date | User_ID | V_value
2 | 501 | 2012-05-12 | 23 | 3
3 | 501 | 2012-05-13 | 23 | 1
7 | 7722 | 2012-05-22 | 8845 | 3
8 | 7722 | 2012-05-22 | 8845 | 3
For reasons to do with the data, I need the output to appear with the Edit features (in particular the tick-box or at least the Delete feature) because I need to go through the table manually and discard one or the other duplicate based on decisions/conditions that can't be determined with SQL commands.
The closest I have come is this:
SELECT *
FROM ( SELECT E_ID, User_ID, COUNT(Unique_ID) AS V_Count
       FROM TableName
       GROUP BY E_ID, User_ID
       ORDER BY E_ID ) AS X
WHERE V_Count > 1
ORDER BY User_ID ASC, E_ID ASC
which does give me the duplicated combinations, but because I'm creating the V_Count column to find the duplicates:
E_ID | User_ID | V_Count
501 | 23 | 2
7722 | 8845 | 2
the output does not give me the Delete option I need - it says it's because there is no unique ID and I get that, as it puts them together in the same row. Is there a way to do this without losing the Unique_ID so I don't lose the Delete function?
You can use aggregation to check, for a given user_id and e_id, whether there is more than one row. Then join that back to your table to get all the columns in the result.
select t1.*
from tablename t1
join (
    select e_id, user_id
    from tablename
    group by e_id, user_id
    having count(*) > 1
) t2 on t1.e_id = t2.e_id
    and t1.user_id = t2.user_id
Which can be more cleanly expressed using the USING clause as:
select *
from tablename t1
join (
    select e_id, user_id
    from tablename
    group by e_id, user_id
    having count(*) > 1
) t2 using (e_id, user_id)
A sort-of simple method uses exists:
select t.*
from tablename t
where exists (select 1
              from tablename t2
              where t2.e_id = t.e_id and
                    t2.user_id = t.user_id and
                    t2.unique_id <> t.unique_id
             );
An alternative way that puts each combination on a single row with all the ids is:
select e_id, user_id,
       group_concat(unique_id) as unique_ids
from tablename t
group by e_id, user_id
having count(*) > 1;
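With the sample data above, that last query would return something like:
e_id | user_id | unique_ids
501 | 23 | 2,3
7722 | 8845 | 7,8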

Select difference between row dates in MySQL

I want to calculate the difference in unique date fields between different rows in the same table.
For instance, given the following data:
id | date
---+------------
1 | 2011-01-01
2 | 2011-01-02
3 | 2011-01-15
4 | 2011-01-20
5 | 2011-01-10
6 | 2011-01-30
7 | 2011-01-03
I would like to generate a query that produces the following:
id | date | days_since_last
---+------------+-----------------
1 | 2011-01-01 |
2 | 2011-01-02 | 1
7 | 2011-01-03 | 1
5 | 2011-01-10 | 7
3 | 2011-01-15 | 5
4 | 2011-01-20 | 5
6 | 2011-01-30 | 10
Any suggestions for what date functions I would use in MySQL, or is there a subselect that would do this?
(Of course, I don't mind putting WHERE date > '2011-01-01' to ignore the first row.)
A correlated subquery could be of help:
SELECT
    id,
    date,
    DATEDIFF(
        date,
        (SELECT MAX(date) FROM atable WHERE date < t.date)
    ) AS days_since_last
FROM atable AS t
Something like this should work:
SELECT mytable.id, mytable.date, DATEDIFF(mytable.date, t2.date)
FROM mytable
LEFT JOIN mytable AS t2 ON t2.id = mytable.id - 1
However, this implies that your ids are continuous in your table; otherwise this won't work at all. And maybe MySQL will complain for the first row, since t2.date will be NULL, but I don't have time to check right now.
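On MySQL 8.0 or later, a window-function sketch (assuming the same mytable name as above) avoids the continuous-id assumption entirely:
SELECT id,
       date,
       DATEDIFF(date, LAG(date) OVER (ORDER BY date)) AS days_since_last
FROM mytable
ORDER BY date;
The first row gets NULL for days_since_last, matching the blank value in the expected output.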