Efficient SQL query to find gap in consecutive numeric data (MySQL)

I have a table with column "time" (INT unsigned), every row represents one second and I need to find gaps in time (missing seconds).
I have tried with this query (to find the first time before a gap):
SELECT t1.time
FROM `table` AS t1
LEFT JOIN `table` AS t2 ON t2.time=(t1.time+1)
WHERE t2.time IS NULL
ORDER BY TIME ASC
LIMIT 1
It works, but it is too slow for big tables (nearly 100M rows).
Is there a faster solution?
SHOW CREATE:
CREATE TABLE `candles` (
`time` int(10) unsigned NOT NULL,
`open` float unsigned NOT NULL,
`high` float unsigned NOT NULL,
`low` float unsigned NOT NULL,
`close` float unsigned NOT NULL,
`vb` int(10) unsigned NOT NULL,
`vs` int(10) unsigned NOT NULL,
`trades` int(10) unsigned NOT NULL,
PRIMARY KEY (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

If the MySQL version is 8.0 or later, a recursive common table expression can be used, for example (for long ranges, cte_max_recursion_depth, 1000 by default, has to be raised):
WITH RECURSIVE cte AS
(
SELECT MIN(time) AS n FROM tab -- start at the smallest time instead of 1
UNION ALL
SELECT n + 1
FROM cte
WHERE cte.n < (SELECT MAX(time) FROM tab)
)
SELECT n AS gaps
FROM cte
LEFT JOIN tab
ON n = time
WHERE time IS NULL

In MySQL 5.7, this is a use case where user variables might be helpful:
select max(time)
from (
select t.time, @rn := @rn + 1 as rn
from (select time from mytable order by time) t
cross join (select @rn := 0) r
) t
group by time - rn
This treats the question as a gaps-and-islands problem. The idea is to identify groups of records where time increments without gaps (the islands). To do this, we assign an incrementing row number to each row, ordered by time; whenever the difference between time and the row number changes, there is a gap.
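To make the "time minus row number" idea concrete, here is a minimal Python sketch of the same grouping. It is only an illustration, not the SQL itself: the function name island_ends and the sample values are made up, and it assumes the times are distinct (as they are with a primary key).

```python
from itertools import groupby


def island_ends(times):
    """Return the last value of each run of consecutive integers."""
    ordered = sorted(times)
    # Pair each value with its row number (1, 2, 3, ...).
    numbered = enumerate(ordered, start=1)
    ends = []
    # value - row number stays constant inside an island and changes at a gap.
    for _, island in groupby(numbered, key=lambda pair: pair[1] - pair[0]):
        ends.append(max(value for _, value in island))
    return ends


# Hypothetical sample: seconds 4, 7 and 8 are missing.
sample = [1, 2, 3, 5, 6, 9]
print(island_ends(sample))
```

Like the query, this returns the last time of every island, so the greatest time is always part of the result.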

With MySQL 8, you can use LEAD():
select time from (
select time, lead(time, 1) over (order by time) next_time
from `table`
) t
where time+1 != next_time
In earlier versions, I might do something like:
select prev_time as time from (
select @prev_time+0 as prev_time,if(@prev_time:=time,time,time) as time
from (select @prev_time:=null) initvars
cross join (select time from `table` order by time) t
) t
where time != prev_time+1
Neither query includes the greatest time, which your original query would have returned.
I think the group by required to treat it as a strict gaps and islands problem would be too expensive with that many records.
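The comparison of each row with its successor can be sketched in a few lines of Python. This is an illustration of the logic only; times_before_gaps and the sample list are hypothetical names and data.

```python
def times_before_gaps(times):
    """Return every time whose successor is not time + 1."""
    ordered = sorted(times)
    # zip pairs each value with the next one, much like
    # LEAD(time) OVER (ORDER BY time) does in the query.
    return [cur for cur, nxt in zip(ordered, ordered[1:]) if cur + 1 != nxt]


sample = [1, 2, 3, 5, 6, 9]
print(times_before_gaps(sample))
```

The last value has no successor, so it is never reported, which matches the behaviour of the two queries.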

Related

Different SQL writing methods cause different time cost

I'm trying to select every n-th row from MySQL, and I read this answer.
There is a table sys_request_log:
CREATE TABLE `sys_request_log`
(
`id` bigint(20) NOT NULL,
`user_id` bigint(20) DEFAULT NULL,
`ip` varchar(50) DEFAULT NULL,
`data` mediumtext,
`create_time` datetime DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE,
KEY `user_id` (`user_id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
It contains 11837 rows.
I want to select every 5th row from the table. First I executed:
SELECT
*
FROM
(SELECT @ROW := @ROW + 1 AS rownum, log.* FROM ( SELECT @ROW := 0 ) r, sys_request_log log ) ranked
WHERE
rownum % 5 = 1
The result is:
rownum id user_id create_time
-------------------------------------------------------------------
1 1271446699071639552 1 2020-06-12 22:18:10
6 1271446948980854784 1 2020-06-12 22:19:10
11 1271447016878247936 1269884071484461056 2020-06-12 22:19:26
It takes 1.001 s.
The result contains an unwanted rownum column, so I modified the SQL like this:
SELECT
log.*
FROM
(SELECT @ROW := @ROW + 1 AS rownum FROM (SELECT @ROW := 0) t) r,
sys_request_log log
WHERE
rownum % 5 = 1
Now the result is clean (no rownum), but it takes 2.516 s!
Why?
MySQL version: 5.7.26-log
In the first case, the row numbers are assigned while the rows of sys_request_log are being read. In the second case, the subquery r is evaluated on its own, and its result is then combined with the table in a Cartesian product (CROSS JOIN), pairing each rownum value with every row of the table.
If I understand correctly, you can do what you want by moving the variable assignment to the where clause:
select srl.*
from sys_request_log srl cross join
(select @rn := 0) params
where (@rn := (@rn + 1)) % 5 = 1;
Note that this happens to work in this case because the query needs to do a full table scan and evaluates the WHERE clause on each row. It might not work if the query has a JOIN, GROUP BY or ORDER BY.
The use of variables in this way is deprecated in MySQL now. You should upgrade and learn about window functions.
Your second query returns a different result from the first one: it returns all rows, so it takes much longer than the first.
To remove rownum from the first query, just name the fields in the SELECT clause.
Try this:
SELECT
ranked.id, ranked.user_id ,ranked.create_time
FROM
( SELECT @ROW := @ROW + 1 AS rownum, log.* FROM ( SELECT @ROW := 0 ) r, sys_request_log log ) ranked
WHERE
rownum % 5 = 1
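Why the second query returns every row can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the constant 1 stands in for the user variable that is incremented only once. The table name log and its 20 rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO log (id) VALUES (?)", [(i,) for i in range(1, 21)])

# The derived table r has exactly one row, so rownum is 1 for every
# joined row and the filter "rownum % 5 = 1" is true for all of them.
rows = conn.execute(
    "SELECT log.id FROM (SELECT 1 AS rownum) r, log WHERE rownum % 5 = 1"
).fetchall()
print(len(rows))
```

All 20 rows come back, not every fifth one, which is consistent with the longer run time reported in the question.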

Get rid of the subqueries for the sake of sorting grouped data

Tables
CREATE TABLE `aircrafts_in` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`city_from` int(11) NOT NULL COMMENT 'From',
`city_to` int(11) NOT NULL COMMENT 'To',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=91 DEFAULT CHARSET=utf8 COMMENT='Aircraft by route'
CREATE TABLE `aircrafts_in_parsed_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`price` int(11) NOT NULL COMMENT 'Price',
`airline` varchar(255) NOT NULL COMMENT 'Airline',
`date` date NOT NULL COMMENT 'Departure date',
`info_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `info_id` (`info_id`),
KEY `price` (`price`),
KEY `date` (`date`)
) ENGINE=InnoDB AUTO_INCREMENT=940682 DEFAULT CHARSET=utf8
date - departure date
CREATE TABLE `aircrafts_in_parsed_info` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`status` enum('success','error') DEFAULT NULL,
`type` enum('roundtrip','oneway') NOT NULL,
`date` datetime NOT NULL COMMENT 'Parse date',
`aircrafts_in_id` int(11) DEFAULT NULL COMMENT 'Route ID',
PRIMARY KEY (`id`),
KEY `aircrafts_in_id` (`aircrafts_in_id`)
) ENGINE=InnoDB AUTO_INCREMENT=577759 DEFAULT CHARSET=utf8
date - creation date (when the record was parsed)
Task
Get the lowest ticket price and its departure date for each month. Note that the minimum price must be up to date, not just the minimum. If several dates share the minimum price, we need the first one.
My solution
I think that there's something not quite right.
I don't like using subqueries for grouping. How can I solve this problem without them?
select *
from (
select * from (
select airline,
price,
pdata.`date` as `date`
from aircrafts_in_parsed_data `pdata`
inner join aircrafts_in_parsed_info `pinfo`
on pdata.`info_id` = pinfo.`id`
where pinfo.`aircrafts_in_id` = {$id}
and pinfo.status = 'success'
and pinfo.`type` = 'roundtrip'
and `price` <> 0
group by pdata.`date`, year(pinfo.`date`) desc, month(pinfo.`date`) desc, day(pinfo.`date`) desc
) base
group by `date`
order by price, year(`date`) desc, month(`date`) desc, day(`date`) asc
) minpriceperdate
group by year(`date`) desc, month(`date`) desc
It takes 0.015 s without cache; the table sizes can be seen from the AUTO_INCREMENT values.
SELECT MIN(price) AS min_price,
LEFT(date, 7) AS yyyy_mm
FROM aircrafts_in_parsed_data
GROUP BY LEFT(date, 7)
will get the lowest price for each month. But it can't say 'first'.
From my groupwise-max cheat-sheet, I derive this:
SELECT
yyyy_mm, date, price, airline -- The desired columns
FROM
( SELECT @prev := '' ) init
JOIN
( SELECT LEFT(date, 7) != @prev AS first,
@prev := LEFT(date, 7),
LEFT(date, 7) AS yyyy_mm, date, price, airline
FROM aircrafts_in_parsed_data
ORDER BY
LEFT(date, 7), -- The 'GROUP BY'
price ASC, -- ASC to do "MIN()"
date -- To get the 'first' if there are dup prices for a month
) x
WHERE first -- extract only the first of the lowest price for each month
ORDER BY yyyy_mm; -- Whatever you like
Sorry, but subqueries are necessary. (I avoided YEAR(), MONTH(), and DAY().)
You are right, your query is not correct.
Let's start with the innermost query: You group by pdata.date + pinfo.date, so you get one result row per date combination. As you don't specify which price or airline you are interested in for each date combination (such as MAX(airline) and MIN(price)), you get one airline arbitrarily chosen for a date combination and one price also arbitrarily chosen. These don't even have to belong to the same record in the table; the DBMS is free to choose one airline and one price matching the dates. Well, maybe the date combination of pdata.date and pinfo.date is already unique, but then you wouldn't have to group by at all. So however we look at this, this isn't proper.
In the next query you group by pdata.date only, thus again getting arbitrary matches for airline and price. You could have done that in the innermost query already. It makes no sense to say: "give me a randomly picked price per pdata.date and pinfo.date and from these give me a randomly picked price per pdata.date"; you could just as well say it directly: "give me a randomly picked price per pdata.date". Then you order your result rows. This is completely useless, as you are using the results as a subquery (derived table) again, which is considered an unordered set. So the ORDER BY gives the DBMS more work to do, but is in no way guaranteed to influence the main query's results.
In your main query then you group by year and month, again resulting in arbitrarily picked values.
Here is the same query a tad shorter and cleaner:
select
pdata.airline, -- some arbitrarily chosen airline matching year and month
pdata.price, -- some arbitrarily chosen price matching year and month
pdata.date -- some arbitrarily chosen date matching year and month
from aircrafts_in_parsed_data pdata
inner join aircrafts_in_parsed_info pinfo on pdata.info_id = pinfo.id
where pinfo.aircrafts_in_id = {$id}
and pinfo.status = 'success'
and pinfo.type = 'roundtrip'
and pdata.price <> 0
group by year(pdata.date), month(pdata.date)
order by year(pdata.date) desc, month(pdata.date) desc
As to the original task (as far as I understand it): Find the records with the lowest price per month. Per month means GROUP BY month. The lowest price is MIN(price).
select
min_price_record.departure_year,
min_price_record.departure_month,
min_price_record.min_price,
full_record.departure_date,
full_record.airline
from
(
select
year(`date`) as departure_year,
month(`date`) as departure_month,
min(price) as min_price
from aircrafts_in_parsed_data
where price <> 0
and info_id in
(
select id
from aircrafts_in_parsed_info
where aircrafts_in_id = {$id}
and status = 'success'
and type = 'roundtrip'
)
group by year(`date`), month(`date`)
) min_price_record
join
(
select
`date` as departure_date,
year(`date`) as departure_year,
month(`date`) as departure_month,
price,
airline
from aircrafts_in_parsed_data
where price <> 0
and info_id in
(
select id
from aircrafts_in_parsed_info
where aircrafts_in_id = {$id}
and status = 'success'
and type = 'roundtrip'
)
) full_record on full_record.departure_year = min_price_record.departure_year
and full_record.departure_month = min_price_record.departure_month
and full_record.price = min_price_record.min_price
order by
min_price_record.departure_year desc,
min_price_record.departure_month desc;
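Note that the join above returns every record that shares the monthly minimum, while the task asks for the first date only. The intended selection rule (lowest price per month, earliest date on ties) can be sketched in Python as follows. This is an illustration of the rule, not a replacement query: cheapest_per_month and the sample rows are hypothetical.

```python
from datetime import date


def cheapest_per_month(rows):
    """rows: iterable of (departure_date, price, airline) tuples."""
    best = {}
    for departure, price, airline in rows:
        key = (departure.year, departure.month)
        # Lower price wins; on equal price the earlier date wins.
        if key not in best or (price, departure) < best[key][:2]:
            best[key] = (price, departure, airline)
    # Newest month first, like ORDER BY year DESC, month DESC.
    return {key: best[key] for key in sorted(best, reverse=True)}


# Hypothetical sample data: two January dates share the minimum price.
rows = [
    (date(2015, 1, 10), 100, "A"),
    (date(2015, 1, 5), 100, "B"),
    (date(2015, 1, 20), 150, "C"),
    (date(2015, 2, 1), 90, "A"),
]
result = cheapest_per_month(rows)
print(result)
```

For January the sketch keeps the record from the 5th, because it is the earliest of the two dates with the minimum price.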

Load top 5 records per date

I have a table that stores the date-wise quiz scores of different users. I want to load the top 5 scorers for every date.
Table sample create statement:
CREATE TABLE `subscriber_score` (
`msisdn` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
`date` date NOT NULL,
`score` int(11) NOT NULL DEFAULT '0',
`total_questions_sent` int(11) NOT NULL DEFAULT '0',
`total_correct_answers` int(11) NOT NULL DEFAULT '0',
`total_wrong_answers` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`msisdn`,`date`),
KEY `fk_subscriber_score_subscriber1` (`msisdn`),
CONSTRAINT `fk_subscriber_score_subscriber1` FOREIGN KEY (`msisdn`) REFERENCES `subscriber` (`msisdn`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Query which I have tried:
SELECT subscriber.msisdn AS msisdn,subscriber.name AS name,subscriber.gender AS gender,tmp2.score AS score,tmp2.date AS winning_date
FROM subscriber,
(SELECT msisdn,tmp.date,tmp.score
FROM subscriber_score,
(SELECT date,MAX(score) AS score
FROM subscriber_score
WHERE date > '2014-10-10' AND date < '2014-11-10' GROUP BY date)
tmp
WHERE subscriber_score.date=tmp.date AND subscriber_score.score=tmp.score)
tmp2
WHERE subscriber.msisdn=tmp2.msisdn ORDER BY winning_date
Actual output: only one top scorer is shown for every date.
Wanted output: the top 5 (or, say, 10) records for every date.
I think you can do this using variables to assign each row a row number, then filter the top 5 for each date.
SELECT s.name AS name,
s.gender AS gender,
s.msisdn,
ss.winning_date,
ss.score
FROM ( SELECT ss.msisdn,
ss.score,
@r:= CASE WHEN ss.Date = @d THEN @r + 1 ELSE 1 END AS RowNum,
@d:= ss.date AS winning_date
FROM subscriber_score AS ss
CROSS JOIN (SELECT @d:= '', @r:= 0) AS v
WHERE ss.date > '2014-10-10'
AND ss.date < '2014-11-10'
ORDER BY ss.Date, ss.Score DESC
) AS ss
INNER JOIN Subscriber AS s
ON s.msisdn = ss.msisdn
WHERE ss.RowNum <= 5;
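The row-numbering logic of this query (sort by date and score, restart the counter whenever the date changes, keep the first n rows) can be sketched in Python. The function name top_n_per_date and the sample rows are made up, and n is set to 2 only to keep the example small.

```python
def top_n_per_date(rows, n=5):
    """rows: iterable of (msisdn, date, score); dates as ISO strings."""
    # Equivalent of ORDER BY date, score DESC.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    kept, prev_date, row_num = [], None, 0
    for msisdn, day, score in ordered:
        # Restart the counter on a new date, otherwise increment it.
        row_num = row_num + 1 if day == prev_date else 1
        prev_date = day
        if row_num <= n:
            kept.append((msisdn, day, score))
    return kept


# Hypothetical sample data.
rows = [
    ("a", "2014-10-11", 5),
    ("b", "2014-10-11", 9),
    ("c", "2014-10-11", 7),
    ("d", "2014-10-12", 1),
    ("e", "2014-10-12", 3),
]
print(top_n_per_date(rows, n=2))
```

Rows with equal scores are ordered arbitrarily here, just as in the query; a tie-breaking column would be needed for a deterministic result.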
Refer to this query. It is not complete, but I hope it helps:
SELECT SCORE
FROM table
WHERE date='somedate'
ORDER BY SCORE DESC LIMIT 5
select bc.msisdn msisdn,bc.name name,bc.gender gender,ab.score score,ab.date winning_date
from (
select msisdn,date,score,
dense_rank() over (partition by date order by score desc) rnk
from subscriber_score
) ab,subscriber bc
where bc.msisdn=ab.msisdn and ab.rnk<=5
order by winning_date ;
This is how you can solve your problem in Oracle SQL.
Try the query below:
SELECT subscriber.msisdn AS msisdn,subscriber.name AS name,subscriber.gender AS gender,tmp.score AS score,tmp.date AS winning_date
FROM subscriber inner join
(select msisdn,date, score, ROW_NUMBER() OVER(PARTITION BY date ORDER BY score DESC) AS rn
FROM subscriber_score
WHERE date > '2014-10-10' AND date < '2014-11-10')
tmp
on subscriber.msisdn=tmp.msisdn and tmp.rn<=5

Order by multiple conditions

I'm very noob and this became ungoogleable (is that a word?)
the rank is by time but..
time done with ( A=0 ) AND ( B=0 ) beat everyone
time done with ( A=0 ) AND ( B=1 ) beat everyone with ( A=1 )
time done with ( A=1 ) AND ( B=0 ) beat everyone with ( A=1 + B=1 )
rank example (track=desert)
pos  car     time  A    B
1    yellow  90    No   No
2    red     95    No   No
3    grey    78    No   Yes
4    orange  253   No   Yes
5    black   86    Yes  No
6    white   149   Yes  No
7    pink    59    Yes  Yes
8    blue    61    Yes  Yes
To make it even worse, the table accepts multiple records for the same car.
Here are the entries:
create table `rank`
(
`id` int not null auto_increment,
`track` varchar(25) not null,
`car` varchar(32) not null,
`time` int not null,
`a` boolean not null,
`b` boolean not null,
primary key (`id`)
);
insert into `rank` (track,car,time,a,b) values
('desert','red','95','0','0'),
('desert','yellow','89','0','1'),
('desert','yellow','108','0','0'),
('desert','red','57','1','1'),
('desert','orange','120','1','0'),
('desert','grey','85','0','1'),
('desert','grey','64','1','0'),
('desert','yellow','90','0','0'),
('desert','white','92','1','1'),
('desert','orange','253','0','1'),
('desert','black','86','1','0'),
('desert','yellow','94','0','1'),
('desert','white','149','1','0'),
('desert','pink','59','1','1'),
('desert','grey','78','0','1'),
('desert','blue','61','1','1'),
('desert','pink','73','1','1');
please, help? :p
ps: sorry about the example table
To prioritize a, then b, then time, use order by a, b, time.
You can use a not exists subquery to select only the best row per car.
Finally, you can add a Pos column using MySQL's variables, like @rn := @rn + 1.
Example query (rank is quoted with backticks because it is a reserved word in MySQL 8.0):
select @rn := @rn + 1 as pos
, r.*
from `rank` r
join (select @rn := 0) init
where not exists
(
select *
from `rank` r2
where r.car = r2.car
and (
r2.a < r.a
or (r2.a = r.a and r2.b < r.b)
or (r2.a = r.a and r2.b = r.b and r2.time < r.time)
)
)
order by
a
, b
, time
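As a sanity check of the ordering rule, the sketch below applies it to the entries from the question in plain Python: keep the best row per car by (a, b, time), then sort the survivors by the same key. The function name ranking is made up for the illustration.

```python
# (car, time, a, b) rows copied from the question's INSERT statement.
entries = [
    ("red", 95, 0, 0), ("yellow", 89, 0, 1), ("yellow", 108, 0, 0),
    ("red", 57, 1, 1), ("orange", 120, 1, 0), ("grey", 85, 0, 1),
    ("grey", 64, 1, 0), ("yellow", 90, 0, 0), ("white", 92, 1, 1),
    ("orange", 253, 0, 1), ("black", 86, 1, 0), ("yellow", 94, 0, 1),
    ("white", 149, 1, 0), ("pink", 59, 1, 1), ("grey", 78, 0, 1),
    ("blue", 61, 1, 1), ("pink", 73, 1, 1),
]


def ranking(rows):
    """Best row per car by (a, b, time), sorted by the same key."""
    best = {}
    for car, time, a, b in rows:
        key = (a, b, time)
        # Tuples compare left to right: a first, then b, then time.
        if car not in best or key < best[car]:
            best[car] = key
    return sorted(best.items(), key=lambda item: item[1])


for pos, (car, (a, b, time)) in enumerate(ranking(entries), start=1):
    print(pos, car, time, a, b)
```

The resulting order (yellow, red, grey, orange, black, white, pink, blue) matches the rank example given in the question.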

need select from two fields, unique in first based on highest of second

I have a table with three fields: an ID, a date (stored as a string), and an INT, like this:
+---------------------------
+BH|2012-09-01|56789
+BH|2011-09-01|56765
+BH|2010-08-01|67866
+CH|2012-09-01|58789
+CH|2011-09-01|56795
+CH|2010-08-01|67866
+DH|2012-09-01|52789
+DH|2011-09-01|56665
+DH|2010-08-01|67866
Essentially, for each ID, I need to return only the row with the highest date string. For this example, my results would need to be:
+---------------------------
+BH|2012-09-01|56789
+CH|2012-09-01|58789
+DH|2012-09-01|52789
SELECT t.id, t.date_column, t.int_column
FROM YourTable t
INNER JOIN (SELECT id, MAX(date_column) AS MaxDate
FROM YourTable
GROUP BY id) q
ON t.id = q.id
AND t.date_column = q.MaxDate
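The join against the per-id maximum can be tried out on the sample data from the question. The sketch below runs the same query through Python's built-in sqlite3 module, used here only as a convenient stand-in for MySQL; the column names follow the placeholder names used in the query, and an ORDER BY is added so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (id TEXT, date_column TEXT, int_column INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        ("BH", "2012-09-01", 56789), ("BH", "2011-09-01", 56765),
        ("BH", "2010-08-01", 67866), ("CH", "2012-09-01", 58789),
        ("CH", "2011-09-01", 56795), ("CH", "2010-08-01", 67866),
        ("DH", "2012-09-01", 52789), ("DH", "2011-09-01", 56665),
        ("DH", "2010-08-01", 67866),
    ],
)

# Same join as above: keep only rows whose date equals the max per id.
rows = conn.execute(
    """
    SELECT t.id, t.date_column, t.int_column
    FROM YourTable t
    INNER JOIN (SELECT id, MAX(date_column) AS MaxDate
                FROM YourTable
                GROUP BY id) q
        ON t.id = q.id
       AND t.date_column = q.MaxDate
    ORDER BY t.id
    """
).fetchall()
print(rows)
```

Because the dates are ISO-formatted strings, comparing them as text gives the same order as comparing them as dates.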
SELECT id, date, int
FROM ( SELECT id, date, int
FROM table_name
ORDER BY date DESC) AS h
GROUP BY id
Replace table_name and the column names with the right ones.
Assuming the following structure:
CREATE TABLE `stackoverflow`.`table_10357817` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Date` datetime NOT NULL,
`Number` int(11) NOT NULL,
`Code` char(2) NOT NULL,
PRIMARY KEY (`Id`) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=11 DEFAULT CHARSET=latin1
The following query will yield the expected results:
SELECT Code, Date, Number
FROM table_10357817
GROUP BY Code
HAVING Date = MAX(Date)
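The GROUP BY forces a single result per Code (you called it id), and the HAVING clause keeps only the groups whose Date matches the maximum date for that code/id. Note that Date is not aggregated here, so this relies on MySQL's non-standard GROUP BY handling (it is rejected when ONLY_FULL_GROUP_BY is enabled), and the Date value compared in HAVING comes from an arbitrary row of each group.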
Update
Used the following data script:
INSERT INTO table_10357817
(Code, Date, Number)
VALUES
('BH', '2012-09-01', 56789),
('BH', '2011-09-01', 56765),
('BH', '2010-08-01', 67866),
('CH', '2012-09-01', 58789),
('CH', '2011-09-01', 56795),
('CH', '2010-08-01', 67866),
('DH', '2012-09-01', 52789),
('DH', '2011-09-01', 56665),
('DH', '2010-08-01', 67866)