I'm looking for patterns in a database with around 1 million records. I've been experimenting a bit with Keras and TensorFlow, specifically LSTMs.
However, as I'm really new to the field, I found better results doing some very specific queries.
Having the following table with the following data:
round value class creacion
1 15.49 H 2018-01-27 14:03:54
2 7.42 H 2018-01-27 14:04:42
3 1.04 L 2018-01-27 14:39:28
4 2.71 H 2018-01-27 14:39:36
5 1.95 L 2018-01-27 14:39:59
6 4 H 2018-01-27 14:40:17
7 4.4 H 2018-01-27 14:40:45
8 1.52 L 2018-01-27 14:41:14
9 28.69 H 2018-01-27 14:41:28
10 7.44 H 2018-01-27 14:42:25
11 1.1 L 2018-01-27 14:43:02
12 1.1 L 2018-01-27 14:43:12
13 1.41 L 2018-01-27 14:43:21
14 1.04 L 2018-01-27 14:53:10
15 1.66 L 2018-01-27 14:53:19
16 8.44 H 2018-01-27 14:53:34
17 1.55 L 2018-01-27 14:54:13
18 2.39 H 2018-01-27 14:55:29
19 2.9 H 2018-01-27 14:55:50
20 1.66 L 2018-01-27 14:56:13
21 2.7 H 2018-01-27 14:56:29
22 7.53 H 2018-01-27 14:56:51
23 2.04 H 2018-01-27 14:57:28
24 1.97 L 2018-01-27 14:57:47
25 1.35 L 2018-01-27 14:58:05
As you can see, I'm classifying all values below 2 as 'L' (low) values, and everything else as 'H' (high) values.
So the main goal here is to predict the next value.
I have been using the query below, which sums the last 100 values, counting high values as 2 and low values as 1. It produces a single number as output; if that number is lower than the median, we can predict that the chances of a high value are increased.
SELECT SUM(n)
FROM (
    SELECT *, IF(value < 2, @nvalue := 1, @nvalue := 2) AS n
    FROM crawler
    ORDER BY round DESC
    LIMIT 0, 100
) AS sub
So, the first question is about the query:
I would like to create a new column, adding the sum of the previous 100 values. Do you know how this could be done?
I can replicate the results doing the following query:
SELECT round, value, class, creacion, SUM(n)
FROM (
    SELECT *, IF(value < 2, @nvalue := 1, @nvalue := 2) AS n
    FROM crawler
    ORDER BY round DESC
    LIMIT 0, 100
) AS sub
However, it obviously returns only a single row, for the last record:
round value class creacion sum(n)
560894 3.24 hi 2018-06-22 22:58:59 162
What I'm actually looking for is the same result for every single record, with a limit to avoid long loading times.
The naive way to get the last hundred values is:
select c.*,
       (select sum(c2.value)
        from (select c3.*
              from crawler c3
              where c3.creacion <= c.creacion
              order by c3.creacion desc
              limit 100
             ) c2
       ) as sum_last100
from crawler c;
Because the correlation clause is two levels deep, MySQL does not accept this.
In MySQL 8+, this is much easier:
select c.*,
       sum(value) over (order by creacion rows between 99 preceding and current row) as sum_last100
from crawler c;
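If you want the rolling sum of the 1/2 encoding rather than of the raw values, the IF() can be folded straight into the window function. A minimal sketch for MySQL 8+, assuming the crawler table from the question:
-- Rolling sum of the encoded values (1 = low, 2 = high) over the last 100 rows
SELECT c.*,
       SUM(IF(value < 2, 1, 2)) OVER (
           ORDER BY creacion
           ROWS BETWEEN 99 PRECEDING AND CURRENT ROW
       ) AS sum_last100
FROM crawler c;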
At this point, I might suggest that you switch to either MySQL 8 or to some other database (such as Postgres). Getting your desired query to work efficiently on a million rows may not be worth the effort in older versions of MySQL.
Related
I have a table with 70 million rows which currently is structured like:
id currentDatetime expiration delta strike bid ask
1 2022-01-03 09:30:00 2022-01-03 0.05 100 0.85 0.95
2 2022-01-03 11:30:00 2022-01-03 0.05 100 0.65 0.75
3 2022-01-03 13:30:00 2022-01-03 0.09 100 1.00 1.20
For the same expiration and strike, I want to get the delta/bid/ask columns at time A and time B (e.g. 09:30:00 and 13:45:00).
I have this query which hopefully does what I'm looking for, but part of the issue is that it's slow to run across 70m rows. I've indexed the columns involved already. Any suggestions?
SELECT a.id,
a.currentDatetime,
a.strike,
a.expirationDate,
a.callDelta,a.callBid,a.callAsk,
b.currentDatetime as dateTimeB,
b.callDelta as deltaB,
b.callBid as bidB,
b.callAsk as askB
FROM ticker a
INNER JOIN ticker b
ON (DATE(a.currentDatetime) = DATE(b.currentDatetime)
AND a.strike = b.strike
AND a.expirationDate = b.expirationDate)
WHERE (TIME(a.currentDatetime) = :fromTime
AND TIME(b.currentDatetime) = :toTime)
Add a multicolumn index on (currentDatetime, strike, expirationDate):
create index idx_ticker on ticker(currentDatetime, strike, expirationDate);
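Note that DATE() and TIME() wrap the indexed column, so MySQL cannot use that index for those predicates directly. One option, sketched here assuming MySQL 5.7+ and that you can alter the table (quoteDate, quoteTime, and idx_ticker_parts are hypothetical names), is to materialize the date and time parts as generated columns and index those:
-- Materialize the DATE()/TIME() parts so the predicates become sargable
ALTER TABLE ticker
    ADD COLUMN quoteDate DATE AS (DATE(currentDatetime)) STORED,
    ADD COLUMN quoteTime TIME AS (TIME(currentDatetime)) STORED;
CREATE INDEX idx_ticker_parts
    ON ticker (quoteTime, quoteDate, strike, expirationDate);
The query can then filter on quoteTime = :fromTime and join on quoteDate instead of calling the functions on every row.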
I have a table with many rows, with new rows arriving at a rate of 400-500 per minute (I know this isn't THAT many), and I need to do some sort of 'trend' analysis on the data collected over the last minute.
Instead of pulling all the records that have been entered and then processing each of them, I would really like to select, say, 10 records which occur at a somewhat even distribution through the specified timeframe.
ID DEVICE_ID LA LO CREATED
-------------------------------------------------------------------
1 1 23.4 948.7 2018-12-13 00:00:01
2 2 22.4 948.2 2018-12-13 00:01:01
3 2 28.4 948.3 2018-12-13 00:02:22
4 1 26.4 948.6 2018-12-13 00:02:33
5 1 21.4 948.1 2018-12-13 00:02:42
6 1 22.4 948.3 2018-12-13 00:03:02
7 1 28.4 948.0 2018-12-13 00:03:11
8 2 23.4 948.8 2018-12-13 00:03:12
...
492 2 21.4 948.4 2018-12-13 00:03:25
493 1 22.4 948.2 2018-12-13 00:04:01
494 1 24.4 948.7 2018-12-13 00:04:02
495 2 27.4 948.1 2018-12-13 00:05:04
Considering this data set, instead of pulling all those rows, I would like to pull maybe one row from the set for every 50 records (10 rows for roughly ~500 rows returned).
This does not need to be exact; I just need a sample on which to perform some sort of linear regression.
Is this even possible? I can do it in my application code if need be, but I wanted to see if there was a function or something in MySQL that would handle this.
Edit
Here is the query I have tried, which works for now - but I would like the results more evenly distributed, not by RAND().
SELECT * FROM (
SELECT * FROM (
SELECT t.*, DATE_SUB(NOW(), INTERVAL 30 HOUR) as offsetdate
from tracking t
HAVING created > offsetdate) as parp
ORDER BY RAND()
LIMIT 10) as mastr
ORDER BY id ASC;
Do not ORDER BY RAND(): RAND() is calculated for every row, all the rows are then reordered, and only then are a few records selected.
You can try something like this:
SELECT
*
FROM
(
SELECT
tracking.*
, #rownum := #rownum + 1 AS rownum
FROM
tracking
, (SELECT #rownum := 0) AS dummy
WHERE
created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
) AS s
WHERE
(rownum % 10) = 0
An index on created is a must.
Also, you might consider using something like AND (UNIX_TIMESTAMP(created) % 60 = 0), which is slightly different from what you wanted but might be OK (it depends on your insert distribution).
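On MySQL 8+, the @rownum session-variable trick can be replaced by a window function. A sketch against the same tracking table, taking every 50th row to match the roughly 10-out-of-500 sampling asked for in the question:
-- Number the rows in the time window, then keep every 50th one
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY created) AS rownum
    FROM tracking t
    WHERE created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
) AS s
WHERE (rownum % 50) = 0
ORDER BY id ASC;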
I have a requirement where I need to check the sent date of various email threads against a date column in another table and determine the version number if the email sent date occurs after the date specified in the table. I used DATEDIFF() for this but I get negative values. I can't use ABS() because it doesn't make sense here. Is there any way I can get the desired result? A sample of the query I am using is:
select distinct
x.OPTENTION_DATE,
email.ID,
x.NAME,
email.SEND_DATE,
DATEDIFF(email.SEND_DATE, x.OPTENTION_DATE) AS DT_DIFF
from
classification_version x,
classification_element y,
email
where
x.id_project = y.ID_PROJECT and
x.ID_PROJECT = email.ID_PROJECT_AMENDMENT and
y.ID_PROJECT = 11 and
y.ID_COMPANY=1
order by
email.SEND_DATE ASC
The output for this query is
OPTENTION_DATE ID NAME SEND_DATE DT_DIFF
2014-11-05 3 Version 2 2014-01-13 14:09:34 -296
2015-02-18 3 Version 3 2014-01-13 14:09:34 -401
2014-01-09 3 Version 1 2014-01-13 14:09:34 4
2014-11-05 62 Version 2 2015-01-12 18:46:10 68
2015-02-18 62 Version 3 2015-01-12 18:46:10 -37
2014-01-09 62 Version 1 2015-01-12 18:46:10 368
2014-11-05 61 Version 2 2015-01-19 20:50:09 75
2015-02-18 61 Version 3 2015-01-19 20:50:09 -30
2014-01-09 61 Version 1 2015-01-19 20:50:09 375
My desired output is that for email id 3, Version 1 should be selected, and for email id 62, Version 2 should be selected. If I use ABS() and then MIN(), Version 3 will be selected, which is wrong because the send date is before the actual date. Can anyone suggest how to solve this?
You can add an email.SEND_DATE > x.OPTENTION_DATE condition to the WHERE clause and use MIN(DATEDIFF(email.SEND_DATE, x.OPTENTION_DATE)) with GROUP BY ID. It will find the minimum positive value for each id.
Try the query below:
select x.OPTENTION_DATE,
email.ID,
x.NAME,
email.SEND_DATE,
MIN(DATEDIFF(email.SEND_DATE, x.OPTENTION_DATE)) AS DT_DIFF
from
classification_version x,
classification_element y,
email
where
x.id_project = y.ID_PROJECT
and x.ID_PROJECT = email.ID_PROJECT_AMENDMENT
and y.ID_PROJECT = 11
and y.ID_COMPANY=1
and email.SEND_DATE > x.OPTENTION_DATE
group by
ID
order by
email.SEND_DATE ASC
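Note that selecting non-aggregated columns such as x.NAME alongside GROUP BY ID relies on MySQL's loose GROUP BY handling and is rejected when ONLY_FULL_GROUP_BY is enabled. A sketch of a deterministic variant, assuming the same tables and that ties on the minimum difference are not a concern:
select x.OPTENTION_DATE,
       email.ID,
       x.NAME,
       email.SEND_DATE,
       DATEDIFF(email.SEND_DATE, x.OPTENTION_DATE) AS DT_DIFF
from classification_version x
join classification_element y on x.id_project = y.ID_PROJECT
join email on x.ID_PROJECT = email.ID_PROJECT_AMENDMENT
where y.ID_PROJECT = 11
  and y.ID_COMPANY = 1
  and email.SEND_DATE > x.OPTENTION_DATE
  -- keep only the version whose date most closely precedes the send date
  and DATEDIFF(email.SEND_DATE, x.OPTENTION_DATE) = (
        select MIN(DATEDIFF(email.SEND_DATE, x2.OPTENTION_DATE))
        from classification_version x2
        where x2.ID_PROJECT = email.ID_PROJECT_AMENDMENT
          and email.SEND_DATE > x2.OPTENTION_DATE
      )
order by email.SEND_DATE ASC;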
I want a list of employees who
have worked on the activity that has the highest Total Pay value.
Don't use hard-coded values such as …where actid = 151…, etc.
• Note: Total Pay for an activity is the sum of (Total Hours Worked * matching Hourly Rate)
(e.g. Total Pay for Activity 151 is 10.5 hrs @ $50.75 + 11.5 hrs @ $25 + 3 hrs @ $33)
You must use a subquery in your
solution.
ACTID HRSWORKED HOURLYRATE Total Pay
163 10 45.5 455
163 8 45.5 364
163 6 45.5 273
151 5 50.75 253.75
151 5.5 50.75 279.125
155 10 30 300
155 10 30 300
165 20 25 500
155 10 30 300
155 8 27 216
151 11.5 25 287.5
151 1 33 33
151 1 33 33
151 1 33 33
Your time and effort are much appreciated. Thanks!!
Without knowledge of the schema, I can only provide a possible sketch (you'll have to compute total pay and provide all necessary JOINs and predicates):
SELECT DISTINCT(employee id) -- reconfigure if more than just employee id
FROM <table(s)>
[WHERE...]
{ WHERE | AND } total pay = (SELECT MAX(total pay)
FROM <table(s)>
[WHERE...]);
I used DISTINCT because it's possible to have more than one activity with the same MAX value and overlapping employees. If you're including ACTID in the output, then you won't need DISTINCT because the same employee shouldn't be on a project twice (unless they are tracked by roles on a project in which case a single employee might have multiple roles - it all depends on the data set).
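As a concrete illustration against a hypothetical single-table schema, worklog(empid, actid, hrsworked, hourlyrate) (the table and column names are assumed, not from the question), the subquery picks the activity with the highest computed Total Pay without hard-coding an actid:
-- worklog(empid, actid, hrsworked, hourlyrate) is a hypothetical schema
SELECT DISTINCT w.empid
FROM worklog w
WHERE w.actid = (
    -- activity with the highest total pay, no hard-coded id
    SELECT actid
    FROM worklog
    GROUP BY actid
    ORDER BY SUM(hrsworked * hourlyrate) DESC
    LIMIT 1
);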
I have the following table:
NAMES:
Fname | stime | etime | Ver | Rslt
x 4 5 1.01 Pass
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 4 8 1.01 Fail
y 9 10 1.01 Fail
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 1 2 1.01 Fail
m 4 6 1.01 Fail
The result I am trying to output is:
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 4 6 1.01 Fail
What the result means:
Fname identifies a test that is run. Each test was run on different software platforms (the version numbers), and some tests were run on the same platform twice: they passed the first time and failed the second time, or vice versa. My required output is basically the latest result of each test for each version. So the rows above are unique by their combination of Fname and Ver(sion), and within each group the row with the latest etime is selected.
The query I have so far is:
select Fname,stime,max(etime),ver,Rslt from NAMES group by Fname,Rslt;
This however, does not give me the required output.
The output I get is (wrong):
x 4 10 1.01 Fail
x 6 7 1.02 Pass
y 4 12 1.01 Pass
y 10 14 1.02 Fail
m 1 6 1.01 Fail
Basically it takes the max time, but it does not print the correct data: it shows the max etime, but the stime comes from some other row of the group instead of from the particular record that has that max etime.
I have tried for so long to fix this, but I seem to be going nowhere. I have a feeling there is a join somewhere in here, but I tried that too, with no luck.
Any help is appreciated,
Thank you.
Use a subquery to get the max ETime by FName and Ver, then join your main table to it:
SELECT
NAMES.FName,
NAMES.STime,
NAMES.ETime,
NAMES.Ver,
NAMES.Rslt
FROM NAMES
INNER JOIN (
SELECT FName, Ver, MAX(ETime) AS MaxETime
FROM NAMES
GROUP BY FName, Ver
) T ON NAMES.FName = T.FName AND NAMES.Ver = T.Ver AND NAMES.ETime = T.MaxETime
You could first find the latest, i.e. max(etime), for each test and each version:
select Fname,Ver,max(etime) from NAMES group by Fname,Ver;
From there, you can display the whole rows by joining back to the table:
select *
from
NAMES
inner join
(select Fname,Ver,max(etime) as etime from NAMES group by Fname,Ver ) sub1
using (Fname,Ver,etime)
order by fname,Ver;
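On MySQL 8+, a window function avoids the self-join entirely. A sketch using the NAMES table from the question; unlike the join approaches, it keeps exactly one row per group even when two rows tie on etime:
-- Rank rows within each (Fname, Ver) group by etime, newest first
SELECT Fname, stime, etime, Ver, Rslt
FROM (
    SELECT n.*,
           ROW_NUMBER() OVER (PARTITION BY Fname, Ver ORDER BY etime DESC) AS rn
    FROM NAMES n
) ranked
WHERE rn = 1
ORDER BY Fname, Ver;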