Slow self join on large table - mysql

I have a table with 70 million rows which currently is structured like:
id currentDatetime expiration delta strike bid ask
1 2022-01-03 09:30:00 2022-01-03 0.05 100 0.85 0.95
2 2022-01-03 11:30:00 2022-01-03 0.05 100 0.65 0.75
3 2022-01-03 13:30:00 2022-01-03 0.09 100 1.00 1.20
For the same expiration and strike, I want to get the delta/bid/ask columns at time A and time B (e.g. 09:30:00 and 13:45:00).
I have this query which hopefully does what I'm looking for, but part of the issue is that it's slow to run across 70m rows. I've indexed the columns involved already. Any suggestions?
SELECT a.id,
a.currentDatetime,
a.strike,
a.expirationDate,
a.callDelta,a.callBid,a.callAsk,
b.currentDatetime as dateTimeB,
b.callDelta as deltaB,
b.callBid as bidB,
b.callAsk as askB
FROM ticker a
INNER JOIN ticker b
ON (DATE(a.currentDatetime) = DATE(b.currentDatetime)
AND a.strike = b.strike
AND a.expirationDate = b.expirationDate)
WHERE (TIME(a.currentDatetime) = :fromTime
AND TIME(b.currentDatetime) = :toTime)

Add a multicolumn index on (currentDatetime, strike, expirationDate) to the table:
create index idx_ticket on ticker(currentDatetime, strike, expirationDate);
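Not part of the original answer, just a hedged sketch: because the query wraps currentDatetime in DATE() and TIME(), an index that leads with currentDatetime may not be usable for those predicates, so an index that leads with the plain equality columns could also be worth testing (the index name and column order here are only a suggestion, verify with EXPLAIN):
-- assumption: lead with the equality-join columns, datetime last
create index idx_ticker_strike_exp_dt on ticker (strike, expirationDate, currentDatetime);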


MySQL - Tables With Shared Keys

I have two tables that look something like this:
Table A
time_unix time_msecs measurementA
1000 329 3.14159
1000 791 9.32821
1001 227 138.3819
1001 599 -15.3289
Table B
time_unix time_msecs measurementB
1000 565 17.2938
1000 791 12348.132
1001 227 -128.3283
1001 293 225.12938
For both tables, I'm using a composite key (made up of time_unix, time_msecs). These measurement data (measurementA and measurementB) are in two different tables, because there are actually many, many columns (many thousands) - too many to keep in a single table.
I want to perform a query such that the result set is simply my keys and a select few columns combined from these two tables. Sometimes the times (keys) line up, sometimes they don't. If they don't, I just would like a null value returned for that column.
Desired Result
time_unix time_msecs measurementA measurementB
1000 329 3.14159 (null)
1000 565 (null) 17.2938
1000 791 9.32821 12348.132
1001 227 138.3819 -128.3283
1001 293 (null) 225.12938
1001 599 -15.3289 (null)
How can I achieve this? I don't think JOINs are the way to go. I have otherwise been combining datasets inside JavaScript, and it seems terribly cumbersome. There must be a way to do this on the database side.
You want a full outer join between the two tables:
SELECT a.time_unix,
a.time_msecs,
a.measurementA,
b.measurementB
FROM TableA a
LEFT JOIN TableB b
ON a.time_unix = b.time_unix AND
a.time_msecs = b.time_msecs
UNION ALL
SELECT b.time_unix,
b.time_msecs,
a.measurementA,
b.measurementB
FROM TableA a
RIGHT JOIN TableB b
ON a.time_unix = b.time_unix AND
a.time_msecs = b.time_msecs
WHERE a.time_unix IS NULL
ORDER BY 1, 2;
I used a UNION since MySQL has no FULL JOIN.
select TableA.time_unix
,TableA.time_msecs
,TableA.measurementA
,TableB.measurementB
from TableA
left join TableB on TableA.time_msecs = TableB.time_msecs and
TableA.time_unix = TableB.time_unix
union
select TableB.time_unix
,TableB.time_msecs
,TableA.measurementA
,TableB.measurementB
FROM TableA
right join TableB on TableA.time_msecs = TableB.time_msecs and
TableA.time_unix = TableB.time_unix
order by time_unix
time_unix time_msecs measurementA measurementB
1000 329 3.14159 null
1000 791 9.32821 12348.1
1000 565 null 17.2938
1001 227 138.382 -128.328
1001 599 -15.3289 null
1001 293 null 225.129
Fiddle

Select a distributed sample set of records from a MySQL set of many records

I have a table with many rows, with new rows arriving at a rate of 400-500 per minute (I know this isn't THAT many), and I need to do some sort of 'trend' analysis on the data that has been collected over the last minute.
Instead of pulling all records that have been entered and then processing each of those, I would really like to be able to select, say, 10 records that are spread somewhat evenly across the specified timeframe.
ID DEVICE_ID LA LO CREATED
-------------------------------------------------------------------
1 1 23.4 948.7 2018-12-13 00:00:01
2 2 22.4 948.2 2018-12-13 00:01:01
3 2 28.4 948.3 2018-12-13 00:02:22
4 1 26.4 948.6 2018-12-13 00:02:33
5 1 21.4 948.1 2018-12-13 00:02:42
6 1 22.4 948.3 2018-12-13 00:03:02
7 1 28.4 948.0 2018-12-13 00:03:11
8 2 23.4 948.8 2018-12-13 00:03:12
...
492 2 21.4 948.4 2018-12-13 00:03:25
493 1 22.4 948.2 2018-12-13 00:04:01
494 1 24.4 948.7 2018-12-13 00:04:02
495 2 27.4 948.1 2018-12-13 00:05:04
Considering this data set, instead of pulling all those rows, I would like to pull maybe one row out of every 50 records (about 10 rows for roughly ~500 rows returned).
This does not need to be exact; I just need a sample on which to perform some sort of linear regression.
Is this even possible? I can do it in my application code if need be, but I wanted to see if there was a function or something in MySQL that would handle this.
Edit
Here is the query I have tried, which works for now, but I would like the results more evenly distributed rather than chosen by RAND().
SELECT * FROM (
SELECT * FROM (
SELECT t.*, DATE_SUB(NOW(), INTERVAL 30 HOUR) as offsetdate
from tracking t
HAVING created > offsetdate) as parp
ORDER BY RAND()
LIMIT 10) as mastr
ORDER BY id ASC;
Do not ORDER BY RAND(): RAND() is calculated for every row, the whole set is re-sorted, and only then are a few records selected.
You can try something like this:
SELECT
*
FROM
(
SELECT
tracking.*
, @rownum := @rownum + 1 AS rownum
FROM
tracking
, (SELECT @rownum := 0) AS dummy
WHERE
created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
) AS s
WHERE
(rownum % 10) = 0
An index on created is a must.
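A minimal sketch of that index (the index name is just an example):
-- any name works; the point is that created is indexed
create index idx_tracking_created on tracking (created);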
Also, you might consider using something like AND (UNIX_TIMESTAMP(created) % 60 = 0), which is slightly different from what you asked for but might be OK (it depends on your insert distribution).
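For example, a rough sketch combining that filter with the 30-hour window from the question's query (how many rows it returns depends on how your inserts are distributed over time):
SELECT *
FROM tracking
WHERE created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
  -- keeps only rows whose timestamp falls on a whole minute
  AND (UNIX_TIMESTAMP(created) % 60 = 0);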

A simple mySQL query + Statistical Analysis

I'm looking for patterns in a database with around 1 million records. I've been experimenting a bit with Keras and TensorFlow, specifically LSTMs.
However, as I'm really new to the field, I found better results doing some very specific queries.
Having the following table with the following data:
round value class creacion
1 15.49 H 2018-01-27 14:03:54
2 7.42 H 2018-01-27 14:04:42
3 1.04 L 2018-01-27 14:39:28
4 2.71 H 2018-01-27 14:39:36
5 1.95 L 2018-01-27 14:39:59
6 4 H 2018-01-27 14:40:17
7 4.4 H 2018-01-27 14:40:45
8 1.52 L 2018-01-27 14:41:14
9 28.69 H 2018-01-27 14:41:28
10 7.44 H 2018-01-27 14:42:25
11 1.1 L 2018-01-27 14:43:02
12 1.1 L 2018-01-27 14:43:12
13 1.41 L 2018-01-27 14:43:21
14 1.04 L 2018-01-27 14:53:10
15 1.66 L 2018-01-27 14:53:19
16 8.44 H 2018-01-27 14:53:34
17 1.55 L 2018-01-27 14:54:13
18 2.39 H 2018-01-27 14:55:29
19 2.9 H 2018-01-27 14:55:50
20 1.66 L 2018-01-27 14:56:13
21 2.7 H 2018-01-27 14:56:29
22 7.53 H 2018-01-27 14:56:51
23 2.04 H 2018-01-27 14:57:28
24 1.97 L 2018-01-27 14:57:47
25 1.35 L 2018-01-27 14:58:05
As you can see, I'm classifying all values below 2 as 'L' (low) values, and larger ones as 'H' (high) values.
So the main goal here is trying to predict the next value.
I have been using the following query, which sums the last 100 values, counting high values as 2 and low values as 1, and produces one number as output. Assuming that number is lower than the median, we can predict that the chances of a high value are increased.
SELECT SUM(n)
FROM (
SELECT *, IF(value < 2, @nvalue := 1, @nvalue := 2) AS n
FROM crawler
ORDER BY round DESC
LIMIT 0, 100
) AS sub
So, the first question is about the query:
I would like to create a new column, adding the sum of the previous 100 values. Do you know how this could be done?
I can replicate the results doing the following query:
SELECT round, value, class, creacion, sum(n)
FROM (
SELECT *, if(value < 2, @nvalue := 1, @nvalue := 2) AS n
FROM crawler
ORDER BY round DESC
LIMIT 0, 100
) AS sub
However, it obviously displays the last record alone:
round value class creacion sum(n)
560894 3.24 hi 2018-06-22 22:58:59 162
However, I'm actually looking for the same result for every single record, with a limit to avoid long loading times.
The naive way to get the last hundred values is:
select c.*,
(select sum(c2.value)
from (select c3.*
from crawler c3
where c3.creacion <= c.creacion
order by c3.creacion desc
limit 100
) c2
) as sum_last100
from crawler c;
Because the correlation clause is two levels deep, MySQL does not accept this.
In MySQL 8+, this is much easier:
select c.*,
sum(value) over (order by creacion rows between 99 preceding and current row) as sum_last100
from crawler c;
At this point, I might suggest that you switch to either MySQL 8 or to some other database (such as Postgres). Getting your desired query to work efficiently on a million rows may not be worth the effort in older versions of MySQL.

How to get MAX count but keep the repeated calculated value if highest

I have the following table, I am using MYSQL
BayNo FixDateTime FixType
1 04/05/2015 16:15:00 tyre change
1 12/05/2015 00:15:00 oil change
1 12/05/2015 08:15:00 engine tuning
1 04/05/2016 08:11:00 car tuning
2 13/05/2015 19:30:00 puncture
2 14/05/2015 08:00:00 light repair
2 15/05/2015 10:30:00 super op
2 20/05/2015 12:30:00 wiper change
2 12/05/2016 09:30:00 denting
2 12/05/2016 10:30:00 wiper repair
2 12/06/2016 10:30:00 exhaust repair
4 12/05/2016 05:30:00 stereo unlock
4 17/05/2016 15:05:00 door handle repair
On any given day I need to find the highest number of fixes made on a given bay number, and if that calculated number is repeated then it should also appear in the result set.
So I would like to see the result set as follows:
BayNo FixDateTime noOfFixes
1 12/05/2015 00:15:00 2
2 12/05/2016 09:30:00 2
4 12/05/2016 05:30:00 1
4 17/05/2016 15:05:00 1
I managed to get the counts for each, but I'm struggling to get the max and keep the highest calculated value when it is repeated. Can someone help, please?
1. Calculate the fixes per day per BayNo
2. Find the max daily fixes per BayNo
3. Use the result from 2 to filter the result from 1
Something like this:
SELECT fixes.*
FROM (
#1
SELECT BayNo,DATE(FixDateTime) as day,count(*) as noOfFixes
FROM yourTable
GROUP BY BayNo,day
) as fixes
JOIN (
#2
SELECT MAX(noOfFixes) as maxNoOfFixes,BayNo
FROM (
#1
SELECT BayNo,DATE(FixDateTime) as day,count(*) as noOfFixes
FROM yourTable
GROUP BY BayNo,day
) as t
GROUP BY BayNo
) as maxfixes ON fixes.BayNo = maxfixes.BayNo
#3
WHERE fixes.noOfFixes = maxfixes.maxNoOfFixes
You can run the repeated query (1) separately and store the result in a temporary table if needed.
I'm assuming the FixDateTime column is an actual datetime or timestamp column. If it's not, you will need to use a different method to get the date from it.
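A rough sketch of the temporary-table variant mentioned above (table names are just examples; two temporary tables are used because MySQL does not allow the same temporary table to be referenced twice in one query):
-- step 1: fixes per day per BayNo
CREATE TEMPORARY TABLE daily_fixes AS
SELECT BayNo, DATE(FixDateTime) AS day, COUNT(*) AS noOfFixes
FROM yourTable
GROUP BY BayNo, day;

-- step 2: max daily fixes per BayNo
CREATE TEMPORARY TABLE max_fixes AS
SELECT BayNo, MAX(noOfFixes) AS maxNoOfFixes
FROM daily_fixes
GROUP BY BayNo;

-- step 3: keep only the days that hit the per-bay maximum
SELECT f.*
FROM daily_fixes f
JOIN max_fixes m ON f.BayNo = m.BayNo
WHERE f.noOfFixes = m.maxNoOfFixes;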

Mysql table design advice

I have a general question about MySQL database table design. I have a table that contains ~ 650 thousand records, with approximately 100 thousand added per year. The data is requested quite frequently, 1.6 times per second on average.
It has the following structure right now
id port_id date product1_price product2_price product3_price
1 1 2012-01-01 100.00 200.00 155.00
2 2 2012-01-01 NULL 150.00 255.00
3 3 2012-01-01 300.00 NULL 355.00
4 1 2012-01-02 200.00 250.00 355.00
5 2 2012-01-02 400.00 230.00 255.00
Wouldn't it be better to store the data in this manner?
id port_id date product price
1 1 2012-01-01 1 100
1 2 2012-01-01 1 200
1 3 2012-01-01 1 300
1 1 2012-01-02 1 240
Advantages of the alternative design:
with the second design we don't have to store NULL values (when a port has no price for a given product)
we can add new products easily, compared to the first design, where each new product requires a new column
Disadvantages of the alternative design:
The number of records will increase from 650 000 to 650 000 * number_of_products minus all NULL records; that will be approximately 2.1 million records.
In both cases we have the id column as PRIMARY KEY and a UNIQUE key on the combination of port_id and date.
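For illustration only, a sketch of what the second design's table could look like (the table name and types are assumptions, not from the question; note that the UNIQUE key would then need to include product as well):
CREATE TABLE port_price (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    port_id INT UNSIGNED NOT NULL,
    date    DATE NOT NULL,
    product TINYINT UNSIGNED NOT NULL,  -- e.g. 1, 2, 3, ...
    price   DECIMAL(10,2) NOT NULL,     -- missing prices are simply absent rows
    PRIMARY KEY (id),
    -- one price per port/date/product; product must be part of the unique key here
    UNIQUE KEY uq_port_date_product (port_id, date, product)
);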
So the question is: which way to go? Disk space does not matter; the speed of the queries is the most important aspect.
Thank you for your attention.
It seems that this will depend on the definition of the product table.
If the set of products is static and limited to at most three, then changing the current design won't help much.
The current design does smell bad, but whether to change it is a business-dependent decision.
Note that any change must be made with caution because of side effects on the product table and its usages.