search by date mysql performance - mysql

I have a large table with about 100 million records, with start_date and end_date fields of DATE type. I need to count how many records overlap a given date range, say between 2013-08-20 and 2013-08-30, so I use:
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-08-20'
AND start_date <= '2013-08-30'
The date columns are indexed.
The important point is that the date ranges I am searching against for overlaps are always in the future, while most of the records in the table are in the past (say about 97-99 million).
So, will this query be faster if I add a TINYINT column is_future and check that condition first, like this:
SELECT COUNT(*) FROM myTable WHERE is_future = 1
AND end_date >= '2013-08-20' AND start_date <= '2013-08-30'
so that it excludes the 97 million or so past records and checks the date condition only for the remaining 1-3 million records?
I use MySQL
Thanks
EDIT
The MySQL engine is InnoDB, but would it matter considerably if it were, say, MyISAM?
Here is the CREATE TABLE:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL, -- type assumed here; it was truncated in the original
`start_date` date DEFAULT NULL,
`end_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
EDIT 2 (after @Robert Co's answer)
Partitioning looks like a good idea for this case, but it does not allow me to create partitions based on the is_future field unless I define that field as (part of) the primary key, and otherwise I would have to remove my main primary key, id, which I cannot do. So if I do define that field as the primary key, is there still any point in partitioning? Wouldn't the search already be fast simply because is_future is part of the primary key?
EDIT 3
The actual query where I need this selects restaurants that have some free tables for the given date range:
SELECT r.id, r.name, r.table_count
FROM restaurants r
LEFT JOIN orders o
ON r.id = o.restaurant_id
WHERE o.id IS NULL
OR (r.table_count > (SELECT COUNT(*)
FROM orders o2
WHERE o2.restaurant_id = r.id AND
end_date >= '2013-08-20' AND start_date <= '2013-08-30'
AND o2.status = 1
)
)
SOLUTION
After a lot more research and testing, the fastest way to count the rows in my case was to just add one more condition, that start_date is after the current date (because the date ranges being searched are always in the future):
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-09-01'
AND start_date >= '2013-08-20' AND start_date <= '2013-09-30'
It is also necessary to have a single composite index with the start_date and end_date fields (thank you @symcbean).
As a result, the execution time on a table with 10M rows dropped from 7 seconds to 0.050 seconds.
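For reference, a hedged sketch of that index on the orders table (the index name and column order are illustrative):
ALTER TABLE orders ADD INDEX idx_start_end (start_date, end_date);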
SOLUTION 2 (@Robert Co)
Partitioning worked in this case as well! It is perhaps a better solution than indexing, or the two can be applied together.
Thanks

This is a perfect use case for table partitioning. If the Oracle INTERVAL feature makes it to MySQL, then it will just add to the awesomeness.
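For illustration, here is a hedged sketch of what RANGE partitioning on the order dates might look like for the orders table above. The partition boundaries, the NOT NULL on start_date, and moving start_date into the primary key are assumptions (MySQL requires the partitioning column to be part of every unique key), not part of the original answer:
CREATE TABLE orders_partitioned (
  id BIGINT NOT NULL AUTO_INCREMENT,
  title VARCHAR(255) DEFAULT NULL,
  start_date DATE NOT NULL,
  end_date DATE DEFAULT NULL,
  PRIMARY KEY (id, start_date)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(start_date)) (
  PARTITION p_past   VALUES LESS THAN (TO_DAYS('2013-08-01')),
  PARTITION p_201308 VALUES LESS THAN (TO_DAYS('2013-09-01')),
  PARTITION p_201309 VALUES LESS THAN (TO_DAYS('2013-10-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);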

The date columns are indexed
What type of index? A hash-based index is no use for range queries. If it's not a BTREE index then change it now. And you've not shown us how they are indexed. Are both columns in the same index? Is there other stuff in there too? What order (end_date must appear as the first column)?
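For what it's worth, a quick way to check the index type and column order being asked about here:
-- Index_type should say BTREE (not HASH); Seq_in_index and Column_name
-- show which columns are in each index and in what order.
SHOW INDEX FROM orders;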
There are implicit type conversions in the query; these should be handled automatically by the optimizer, but it's worth checking:
SELECT COUNT(*) FROM myTable WHERE end_date >= 20130820000000
AND start_date <= 20130830235959
if I add a column is_future - TINYINT
First, in order to be of any use, this would require that the future dates be a small proportion of the total data stored in the table (less than 10%). And that's just to make it more efficient than a full table scan.
Secondly, it's going to require very frequent updates to the index to maintain it, which, in addition to the overhead of initial population, is likely to lead to fragmentation of the index and degraded performance (depending on how the index is constructed).
Thirdly, if this still has to process 3 million rows of data (and specifically, via an index lookup) then it's going to be very slow even with the data pegged in memory.
Further, the optimizer is never likely to use this index without being forced to (due to the low cardinality).

I have done a simple test, just created an index on the tinyint column. The structures may not be the same, but with an index it seems to work.
http://www.sqlfiddle.com/#!2/514ab/1/0
And for the count:
http://www.sqlfiddle.com/#!2/514ab/2/0
View the execution plan there to see that the SELECT scans just one row, which means it would process only the smaller number of records in your case.
So the simple answer is yes, with an index it would work.
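For reference, a minimal sketch of the column-plus-index setup being tested, applied to the table from the question (names are illustrative, and the flag would need to be maintained as time passes):
ALTER TABLE orders
  ADD COLUMN is_future TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_is_future (is_future);

-- Flag the rows whose dates are still in the future
UPDATE orders SET is_future = 1 WHERE start_date >= CURDATE();

SELECT COUNT(*)
FROM orders
WHERE is_future = 1
  AND end_date >= '2013-08-20'
  AND start_date <= '2013-08-30';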

What would cause performance issues in query A but not query B?

I have a table (MySQL 8) with approx. 100M records storing various bits of stock data (price, date, etc.). Query A below runs in < 1s, but query B takes over 2 minutes. Among other indices, I've got an index on the date, and the primary key for the table is (symbol, date). What would cause such a significant difference between the two queries, and what might speed up the poor performer?
Query A:
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01'
GROUP BY symbol
Query B
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol
The other challenge I'm facing is that at times I want to filter by market_cap, but other times by other numerical fields (gross_profit, total_assets, etc.). The query is being generated by a form with a number of optional inputs that are tied to params.
Table schema
CREATE TABLE IF NOT EXISTS cri_v0_995 (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
company_id MEDIUMINT UNSIGNED NOT NULL,
dt DATE NOT NULL,
price DECIMAL(18, 2),
market_cap DECIMAL(12, 4),
div_yield DECIMAL(4,2), -- this is always 0
price_to_nte DOUBLE,
price_to_mte DOUBLE,
nte_to_price DECIMAL(16, 10),
ante_to_price DECIMAL(16, 10),
ate_to_price DECIMAL(18, 10),
price_to_sales DOUBLE,
price_to_earnings DOUBLE,
cur_liq_to_mcap DECIMAL(4, 2), -- this is always 0
net_liq_to_mcap DOUBLE,
g3_rev_growth_and_trend DECIMAL(14, 10),
p_cri_score DECIMAL(14, 10),
f_cri_score DECIMAL(10, 7),
cri_score DECIMAL(14, 10),
PRIMARY KEY (id),
FOREIGN KEY (company_id) REFERENCES companies (id),
UNIQUE KEY (company_id, dt)
);
Note that there are a couple cols that I'm unsure about. They've always been zeros but I don't know what the intent may be behind them atm.
(edit 1 to address missing GROUP BYs)
(edit 2 adding table schema)
The first query could simply hop through the table:
For each symbol (which is conveniently the start of the PK)
Find the first row ("first" because the second part of the index is date) that is also >= the start date
Toss the result if that date is not <= the end date
The second query needs to look at each row to check market_cap; it can't jump through the table.
If, instead, you have current_market_cap in the Symbols table you could filter on market_cap before JOINing to this table.
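For illustration, a sketch of the shape of that filter-before-join; the Symbols table and its current_market_cap column are hypothetical here, as the answer itself notes:
SELECT s.symbol, MIN(t.date)
FROM Symbols s                     -- hypothetical one-row-per-symbol table
JOIN Stocks  t ON t.symbol = s.symbol
WHERE s.current_market_cap > 20    -- filter on the small table first
  AND t.date BETWEEN '2015-01-01' AND '2020-01-01'
GROUP BY s.symbol;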
Two ranges in the WHERE clause make it very difficult to optimize.
INDEXes are one-dimensional.
Using PARTITION BY TO_DAYS(date) requires a major structural change to the table. It may (or may not!) help your query run faster -- by using 'partition pruning' to limit how many rows need to be checked. (I say "may not" because the query is looking at a 5-year range, which might be a significant fraction of the entire data.)
More discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint and http://mysql.rjweb.org/doc.php/find_nearest_in_mysql -- The latter link discusses a different 2D problem (geographical 'find nearest'); it is something of a stretch to apply it to your query.
Since you have lots of columns that the end-user might filter on, and 100M rows, let's approach from another direction: Minimizing table size. This is especially important if the table cannot be fully cached in the buffer_pool -- leading to being I/O-bound. Show us SHOW CREATE TABLE; let's discuss each column, and whether it can be shrunk.
More
Changing symbol VARCHAR... to company_id MEDIUMINT UNSIGNED may have saved 1GB between the data and the index.
Get rid of id and promote UNIQUE(company_id, dt) to be the PK. That will save a few GB by eliminating the only secondary index. (Your change was probably beneficial.)
Most of those DOUBLEs are overkill? FLOAT would save 4 bytes each and still give you 6-7 significant digits.
You may want INDEX(dt) for some other queries.
The filter on market_cap probably gets in the way of groupwise max optimization.
Depending on disk space and other queries, it may be beneficial to PARTITION BY RANGE(TO_DAYS(dt)), but group by years. The (5 year + 1 day) span would hit 6 partitions. (cf "partition pruning") This would not actually change performance much.
(About 18 years ago, I worked with a dataset like this.)
price DECIMAL(18, 2) takes 9 bytes. It allows for a zillion dollars, which has not [yet] been reached. It has only 2 decimal places, so it won't precisely hold amounts before they switched to decimal (from /2, /4, /8, /16, etc).
market_cap DECIMAL(12, 4) (6 bytes) may not be big enough for some companies, and certainly not for indexes. And the 4 decimal places is probably a waste.
Suggest running SELECT MAX(market_cap), MAX(price), ... to see how big the numbers are now.
For both queries, you probably should be aggregating by the symbol. So, the second, currently non-performant query should be:
SELECT symbol, MIN(date)
FROM Stocks
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol;
The index you want here should at least cover the entire WHERE clause:
CREATE INDEX idx_date_market_cap ON Stocks (date, market_cap);
If you run EXPLAIN on both queries, after adding GROUP BY, you might find that your current single column index on date isn't even being used in the second query.
Maybe you need to force an index in Query B:
SELECT symbol, MIN(date)
FROM Stocks use index (`indexNameOfDate`)
WHERE date BETWEEN '2015-01-01' AND '2020-01-01' AND market_cap > 20
GROUP BY symbol
Or you can force it to use the PRIMARY KEY index.
Doing that may save the time the SQL engine spends choosing an index itself,
and you can find out which is faster as well.
What's more, if you usually filter data by date and market_cap, you may want to create an index covering both,
like @Tim Biegeleisen said:
CREATE INDEX idx_date_market_cap ON Stocks (date, market_cap);

MySQL Month and Year grouping slow performance

I've got a report in which I need to show the month-and-year profit from my transactions. The query I wrote works, but it is very slow, and I cannot figure out how to change it so that it takes less time to load.
SELECT MONTH(MT4_TRADES.CLOSE_TIME) as MONTH
, YEAR(MT4_TRADES.CLOSE_TIME) as YEAR
, SUM(MT4_TRADES.SWAPS) as SWAPS
, SUM(MT4_TRADES.VOLUME)/100 as VOLUME
, SUM(MT4_TRADES.PROFIT) AS PROFIT
FROM MT4_TRADES
JOIN MT4_USERS
ON MT4_TRADES.LOGIN = MT4_USERS.LOGIN
WHERE MT4_TRADES.CMD < 2
AND MT4_TRADES.CLOSE_TIME <> "1970-01-01 00:00:00"
AND MT4_USERS.AGENT_ACCOUNT <> "1"
GROUP
BY YEAR(MT4_TRADES.CLOSE_TIME)
, MONTH(MT4_TRADES.CLOSE_TIME)
ORDER
BY YEAR
This is the full query; any suggestion would be highly appreciated.
This is the result of explain:
Echoing the comment from @Barmar, look at the EXPLAIN output to see the query execution plan. Verify that suitable indexes are being used.
Likely the big rock in terms of performance is the "Using filesort" operation.
To get around that, we would need a suitable index available, and that would require some changes to the table. (The typical question on the "improve query performance" topic on SO comes with a restriction that we "can't add indexes or make any changes to the table".)
I'd be looking at a functional index (a feature added in MySQL 8.0; for MySQL 5.7, I'd be looking at adding generated columns and including those generated columns in a secondary index, a feature added in MySQL 5.7).
CREATE INDEX `MT4_TRADES_ix2` ON MT4_TRADES ((YEAR(close_time)),(MONTH(close_time)))
I'd be tempted to go with a covering index, and also change the grouping to a single expression e.g. DATE_FORMAT(close_time,'%Y-%m')
CREATE INDEX `MT4_TRADES_ix3` ON MT4_TRADES ((DATE_FORMAT(close_time,'%Y-%m'))
,swaps,volume,profit,login,cmd,close_time)
From the query, it looks like login is going to be UNIQUE in the MT4_USERS table; likely that's the PRIMARY KEY or a UNIQUE KEY, so an index is going to be available, but we're just guessing...
With suitable indexes available, we could do something like this:
SELECT DATE_FORMAT(close_time,'%Y-%m') AS close_year_mo
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.swaps ,NULL)) AS swaps
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.volume ,NULL))/100 AS volume
, SUM(IF(t.cmd < 2 AND t.close_time <> '1970-01-01', t.profit ,NULL)) AS profit
FROM MT4_TRADES t
JOIN MT4_USERS u
ON u.login = t.login
AND u.agent_account <> '1'
GROUP BY close_year_mo
ORDER BY close_year_mo
and we'd expect MySQL to do a loose index scan, with the EXPLAIN output showing "using index for group-by" and not showing "Using filesort".
EDIT
For versions of MySQL before 5.7, we could create new columns, e.g. year_close and month_close, and populate them with the results of the expressions YEAR(close_time) and MONTH(close_time) (we could create BEFORE INSERT and BEFORE UPDATE triggers to handle that automatically for us; see the sketch at the end of this answer).
Then we could create index with those columns as the leading columns
CREATE INDEX ... ON MT4_TRADES ( year_close, month_close, ... )
And then reference the new columns in the query
SELECT t.year_close AS `YEAR`
, t.month_close AS `MONTH`
FROM MT4_TRADES t
JOIN ...
WHERE ...
GROUP
BY t.year_close
, t.month_close
Ideally include in the index all of referenced columns from MT4_TRADES, to make a covering index for the query.
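As a hedged sketch of the pre-5.7 trigger approach mentioned above (the trigger names are illustrative, and the year_close and month_close columns are assumed to already exist):
DELIMITER $$
CREATE TRIGGER mt4_trades_bi BEFORE INSERT ON MT4_TRADES
FOR EACH ROW
BEGIN
  -- keep the derived columns in sync with close_time
  SET NEW.year_close  = YEAR(NEW.close_time);
  SET NEW.month_close = MONTH(NEW.close_time);
END$$
CREATE TRIGGER mt4_trades_bu BEFORE UPDATE ON MT4_TRADES
FOR EACH ROW
BEGIN
  SET NEW.year_close  = YEAR(NEW.close_time);
  SET NEW.month_close = MONTH(NEW.close_time);
END$$
DELIMITER ;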

Seek paginated query gets progressively slower on a big table

I have a biggish table of events (5.3 million rows at the moment). I need to traverse this table mostly from the beginning to the end in a linear fashion, with mostly no random seeks. The data currently covers about 5 days of these events.
Due to the size of the table I need to paginate the results, and the internet tells me that "seek pagination" is the best method.
However, while this method works great and fast for traversing the first 3 days, after that MySQL really begins to slow down. I've figured out it must be something I/O-bound, as my CPU usage actually falls as the slowdown starts.
I believe this has something to do with the 2-column sorting I do and the use of filesort; maybe MySQL needs to read all the rows to sort my results or something. Indexing correctly might be a proper fix, but I've so far been unable to find an index that solves my problem.
The complicating part of this database is the fact that the ids and timestamps are NOT perfectly in order. The software requires the data to be ordered by timestamps. However, when adding data to this database, some events are added 1 minute after they actually happened, so the auto-incremented ids are not in chronological order.
As of now, the slowdown is so bad that my 5-day traversal never finishes. It just gets slower and slower...
I've tried indexing the table in multiple ways, but MySQL does not seem to want to use those indexes, and EXPLAIN keeps showing "filesort". An index is used for the WHERE clause, though.
The workaround I'm currently using is to first do a full table traversal and load all the row ids and timestamps into memory. I sort the rows on the Python side of the software and then load the full data in smaller chunks from MySQL as I traverse (by ids only). This works fine, but is quite inefficient due to the total of 2 traversals of the same data.
The schema of the table:
CREATE TABLE `events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`server` varchar(45) DEFAULT NULL,
`software` varchar(45) DEFAULT NULL,
`timestamp` bigint(20) DEFAULT NULL,
`data` text,
`event_type` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index3` (`timestamp`,`server`,`software`,`id`),
KEY `index_ts` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=7410472 DEFAULT CHARSET=latin1;
The query (one possible line):
SELECT software,
server,
timestamp,
id,
event_type,
data
FROM events
WHERE ( server = 'a58b'
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) ) )
AND ( timestamp, id ) > ( 100, 100 )
AND timestamp <= 200
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
The query is based on https://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-the-seek-method/ (and some other postings with the same idea). I believe it is called "seek pagination with a seek predicate". The basic gist is that I have a starting timestamp and an ending timestamp, and I need to get all the events with the software on the servers I've specified OR only the server-specific events (software = NULL). The weirdish ( ) stuff is due to Python constructing the queries based on the parameters it is given; I left it visible in case, by some small chance, it has an effect.
I'm expecting the traversal to finish before the heat death of the universe.
First change
AND ( timestamp, id ) > ( 100, 100 )
to
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
This optimisation is suggested in the official documentation: Row Constructor Expression Optimization
Now the engine will be able to use the index on (timestamp). Depending on the cardinality of the columns server and software, that could already be fast enough.
An index on (server, timestamp, id) should improve the performance further.
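For reference, that index as DDL (the index name is illustrative):
ALTER TABLE events ADD INDEX idx_server_ts_id (server, `timestamp`, id);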
If that is still not fast enough, I would suggest a UNION optimization for
AND (software IS NULL OR software IN ('ASD', 'WASD'))
That would be:
(
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software IS NULL
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'ASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'WASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
)
ORDER BY timestamp ASC, id ASC
LIMIT 100
You will need to create an index on (server, software, timestamp, id) for this query.
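For reference, a hedged sketch of that index (the name is illustrative):
ALTER TABLE events ADD INDEX idx_server_sw_ts_id (server, software, `timestamp`, id);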
There are multiple complications going on.
The quick fix is
INDEX(server, timestamp, id) -- in this order
together with
WHERE server = 'a58b'
AND timestamp BETWEEN 100 AND 200
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) )
AND ( timestamp, id ) > ( 100, 100 )
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
Note that server needs to be first in the index, not after the thing you are doing a range on (timestamp). Also, I broke out timestamp BETWEEN ... to make it clear to the optimizer that the next column of the ORDER BY might make use of the index.
You said "pagination", so I assume you have an OFFSET, too? Add it back in so we can discuss the implications. My blog on "remembering where you left off" instead of using OFFSET may (or may not) be practical.

Partitioning or separating a very large table in mysql

We have a very large table in MySQL with 500,000,000 records in it and 100 requests (SELECT) per second.
This is the schema:
id(int),
user_id (int),
content(text),
date(datetime)
Since up to 90% of requests are for the last 6 months, my question is about increasing performance.
Is it a good idea to separate the records from the last 6 months into another table and SELECT from that, OR to use a partitioning method to get all records of the last 6 months quickly?
Or if there's a better way...
For instance, a query is this.
SELECT content,user_id FROM log
JOIN users ON users.id = log.user_id
WHERE date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
user_id and date are indexed in table Log.
There are 2 million users in table Users.
Your edit says you use queries like this at a rate of a third of a million per hour.
SELECT content,user_id
FROM log
JOIN users ON users.id = log.user_id
WHERE date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
I will take the liberty of rewriting this query to fully qualify your column selections.
SELECT log.content,
log.user_id
FROM log /* one half gigarow table */
JOIN users ON users.id = log.user_id /* two megarow table */
WHERE log.date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
(Please consider updating your question if this is not correct.)
Why are you joining the users table in this query? None of your results seem to come from it. Why won't this query do what you need?
SELECT log.content,
log.user_id
FROM log /* one half gigarow table */
WHERE log.date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
If you want to make this query faster, put a compound covering index on (date, user_id, content). This covering index will support a range scan and fast retrieval. If your content column is in fact of TEXT (a LOB) type, you need to put just (date, user_id) into the covering index, and your retrieval will be a little slower.
Are you using the JOIN to ensure that you get log entries returned which have a matching entry in users? If so, please explain your query better.
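A hedged sketch of the narrower version of that index (content is TEXT, so it is left out; the index name is illustrative):
ALTER TABLE log ADD INDEX idx_date_user (`date`, user_id);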
You definitely can partition your table based on date ranges. But you will need to either alter your table, or recreate and repopulate it, which will incur either downtime or a giant scramble.
http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html
Something like this DDL should then do the trick for you
CREATE TABLE LOG (
id INT NOT NULL AUTO_INCREMENT, /*maybe BIGINT? */
user_id INT NOT NULL,
`date` DATETIME NOT NULL,
content TEXT,
UNIQUE KEY (id, `date`),
KEY covering (`date`,user_id)
)
PARTITION BY RANGE COLUMNS(`date`) (
PARTITION p0 VALUES LESS THAN ('2012-01-01'),
PARTITION p1 VALUES LESS THAN ('2012-07-01'),
PARTITION p2 VALUES LESS THAN ('2013-01-01'),
PARTITION p3 VALUES LESS THAN ('2013-07-01'),
PARTITION p4 VALUES LESS THAN ('2014-01-01'),
PARTITION p5 VALUES LESS THAN ('2014-07-01'),
PARTITION p6 VALUES LESS THAN ('2015-01-01'),
PARTITION p7 VALUES LESS THAN ('2015-07-01')
);
Notice that there's some monkey business about the UNIQUE KEY. The column that goes into your partitioning function also needs to appear in the so-called primary key (and in every unique key).
Later on, when July 2015 (partition p7's cutoff date) draws near, you can run this statement to add a partition for the next six month segment of time.
ALTER TABLE `log`
ADD PARTITION (PARTITION p8 VALUES LESS THAN ('2016-01-01'))
But, seriously, none of this partitioning junk is going to help much if your queries have unnecessary joins or poor index coverage. And it is going to make your database administration more complex.

MySQL - turning data points into ranges

I have a database of measurements that indicate a sensor, a reading, and the timestamp the reading was taken. The measurements are only recorded when there's a change. I want to generate a result set that shows the range each sensor is reading a particular measurement.
The timestamps are in milliseconds but I'm outputting the result in seconds.
Here's the table:
CREATE TABLE `raw_metric` (
`row_id` BIGINT NOT NULL AUTO_INCREMENT,
`sensor_id` BINARY(6) NOT NULL,
`timestamp` BIGINT NOT NULL,
`angle` FLOAT NOT NULL,
PRIMARY KEY (`row_id`)
)
Right now I'm getting the results I want using a subquery, but it's fairly slow when there's a lot of datapoints:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`timestamp` > rm1.`timestamp`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
Essentially, to get the range, I need to get the very next reading (or use the current time if there isn't another reading). The subquery finds the minimum timestamp that is later than the current one but is from the same sensor.
This query isn't going to occur very often so I'd prefer to not have to add an index on the timestamp column and slow down inserts. I was hoping someone might have a suggestion as to an alternate way of doing this.
UPDATE:
The row_ids should increase along with the timestamps, but that can't be guaranteed due to network latency issues. So it's possible that an entry with a lower row_id occurs AFTER one with a higher row_id, though it's unlikely.
This is perhaps more appropriate as a comment than as a solution, but it is too long for a comment.
You are trying to implement the lead() function in MySQL, and MySQL does not, unfortunately, have window functions. You could switch to Oracle, DB2, Postgres, SQL Server 2012 and use the built-in (and optimized) functionality there. Ok, that may not be realistic.
So, given your data structure, you need to do either a correlated subquery or a non-equijoin (actually a partial equi-join, because there is a match on sensor_id). These are going to be expensive operations, unless you add an index. Unless you are adding measurements tens of times per second, the additional overhead on the index should not be a big deal.
You could also change your data structure. If you had a "sensor counter" that was a sequential number enumerating the readings, then you could use this for an equijoin (although for good performance you might still want an index). Adding this to your table would require a trigger, and that is likely to perform even worse than an index when inserting.
If you only have a handful of sensors, you could create a separate table for each one. Oh, I can feel the groans at this suggestion. But, if you did, then an auto-incremented id would perform the same role. To be honest, I would only do this if I could count the number of sensors on each hand.
In the end, I might suggest that you take the hit during insertion and have "effective" and "end" times on each record (as well as an index on sensor id and either timestamp or id). With these additional columns, you will probably find more uses for the table.
If you are doing this for just one sensor, then create a temporary table for the information and use an auto-incremented id column. Then insert the data into it:
insert into temp_rawmetric (orig_row_id, sensor_id, timestamp, angle)
select row_id, sensor_id, timestamp, angle
from raw_metric
order by sensor_id, timestamp;
Be sure your table has a temp_rawmetric_id column that is auto-incremented and the primary key (creates an index automatically). The order by makes sure this is incremented according to the timestamp.
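For completeness, a hypothetical sketch of the temporary table being described; the column types mirror raw_metric:
CREATE TEMPORARY TABLE temp_rawmetric (
  temp_rawmetric_id BIGINT NOT NULL AUTO_INCREMENT,
  orig_row_id BIGINT NOT NULL,
  sensor_id BINARY(6) NOT NULL,
  `timestamp` BIGINT NOT NULL,
  angle FLOAT NOT NULL,
  PRIMARY KEY (temp_rawmetric_id)
);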
Then you can do your query as:
select trm.sensor_id, trm.angle,
trm.timestamp as startTime, trmnext.timestamp as endTime
from temp_rawmetric trm left outer join
temp_rawmetric trmnext
on trmnext.temp_rawmetric_id = trm.temp_rawmetric_id+1;
This will require a pass through the original data to extract the data, and then a primary key join on the temporary table. The first might take some time. The second should be pretty quick.
Select rm1.row_id
,HEX(rm1.sensor_id)
,rm1.angle
,(COALESCE(rm2.timestamp, UNIX_TIMESTAMP() * 1000) - rm1.timestamp) as duration
from raw_metric rm1
left outer join
raw_metric rm2
on rm2.sensor_id = rm1.sensor_id
and rm2.timestamp = (
select min(timestamp)
from raw_metric rm3
where rm3.sensor_id = rm1.sensor_id
and rm3.timestamp > rm1.timestamp
)
If you use AUTO_INCREMENT for the primary key, you may replace timestamp with row_id in the query's condition part, like this:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
It should run somewhat faster.
You can also add one more subquery to quickly select the row id of the next sensor value. See:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT timestamp FROM raw_metric AS rm1a
WHERE row_id =
(
SELECT MIN(`row_id`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
)
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1