SQL: get only sampled data from large dataset - mysql

So I get a large amount of data from server using this SQL:
SELECT value,DATE_FORMAT(`time`,'%Y-%m-%dT%H:%i:%sZ') AS `time`
FROM history WHERE :id=reference AND
(time BETWEEN :start AND :end) ORDER BY time LIMIT 100;
Limit is set to fixed 100 entries.
But in given time range there could be 5 000 entries.
Here's my goal: I want to sample these entries at a fixed time interval.
For example, with an interval of 60 seconds (let's say it is a parameter), I would receive 100 entries (out of the 5000), and there would always be a one-minute difference between consecutive entries.
E.g.
value1,14:40:40
value2,14:41:40
...
value100,16:20:40
Is this doable via SQL? Or do I have to parse through this large data with PHP?
If it is not doable with SQL alone, is it possible to get these 100 entries spread equally across the 5000 entries? (So not by time, but by fixed position: id1, id50, id100, id150, ..., id5000.) Again, just with SQL.
Thanks!

Just as Kristof says in his answer: order the rows and take every nth row by applying a row number. This is how it is done in MySQL:
select
numbered.value,
date_format(numbered.`time`,'%Y-%m-%dT%H:%i:%sZ') AS `time`
from
(
select
@row_number := @row_number + 1 as rn,
history.*
from history
cross join (select @row_number := 0) as t
where reference = :id and `time` between :start and :end
order by `time`
) as numbered
cross join
(
select count(*) as cnt
from history
where reference = :id and `time` between :start and :end
) as rowcount
where mod(numbered.rn - 1, ceil(rowcount.cnt / 100)) = 0;
And this is how the same would look in another dbms, Oracle for instance, using analytic functions:
select
numbered.value,
to_char(numbered."time",'yyyy-mm-dd hh24:mi:ss') AS "time"
from
(
select
row_number() over (order by "time") as rown,
count(*) over () as cnt,
history.*
from history
where reference = :id and "time" between :start and :end
) numbered
where mod(numbered.rown - 1, ceil(numbered.cnt / 100)) = 0;
These queries return 100 records or slightly fewer, depending on exactly how many rows match. In MySQL you can also use TRUNCATE(rowcount.cnt / 100, 0) instead of CEIL(rowcount.cnt / 100), which gives 100 rows or slightly more, and then apply LIMIT 100 to get exactly 100 rows (provided at least 100 rows match).
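To see how the rounding choice changes the row count, here is a minimal Python sketch of the same "keep every nth row" rule. The function name sample_every_nth and its parameters are made up for illustration; it only mirrors the MOD(rn - 1, step) = 0 filter and is not a replacement for the SQL.

```python
import math

def sample_every_nth(rows, target=100, rounding=math.ceil):
    """Keep rows whose zero-based position is a multiple of the step.

    Mirrors MOD(rn - 1, step) = 0, where step is the row count divided
    by the target and rounded with `rounding` (ceil or floor).
    """
    if not rows:
        return []
    # max(1, ...) avoids a zero step when there are fewer rows than the target
    step = max(1, rounding(len(rows) / target))
    return [row for i, row in enumerate(rows) if i % step == 0]

rows = list(range(5001))                            # stand-in for 5001 matching rows
with_ceil = sample_every_nth(rows)                  # step 51: slightly fewer than 100
with_floor = sample_every_nth(rows, rounding=math.floor)  # step 50: slightly more
print(len(with_ceil), len(with_floor))
```

Rounding up never overshoots the target, while rounding down can, which is why the TRUNCATE variant needs the extra LIMIT 100.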

What you could do is select the row number and calculate the modulo of that row number.
I'm not sure how it would be done in MySQL, but in T-SQL it goes like this:
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY idField) % 50 AS selector, *
FROM history
) AS numbered
WHERE selector = 1
This will count the rows and reset the counter every 50th record, giving you a spread out result.

Related

finding a percentile value in mysql 5.7? [duplicate]

I have a table which contains thousands of rows and I would like to calculate the 90th percentile for one of the fields, called 'round'.
For example, select the value of round which is at the 90th percentile.
I don't see a straightforward way to do this in MySQL.
Can somebody provide some suggestions as to how I may start this sort of calculation?
Thank you!
First, let's assume that you have a table with a value column. You want to get the row with the 95th percentile value. In other words, you are looking for a value that is bigger than 95 percent of all values.
Here is a simple answer:
SELECT * FROM
(SELECT t.*, @row_num := @row_num + 1 AS row_num FROM YOUR_TABLE t,
(SELECT @row_num := 0) counter ORDER BY YOUR_VALUE_COLUMN)
temp WHERE temp.row_num = ROUND(.95 * @row_num);
Compare solutions:
Number of seconds it took on my server to get 99 percentile of 1.3 million rows:
LIMIT x,y with index and no where: 0.01 seconds
LIMIT x,y with no where: 0.7 seconds
LIMIT x,y with where: 2.3 seconds
Full scan with no where: 1.6 seconds
Full scan with where: 5.7 seconds
Fastest solution for large tables, using LIMIT x,y:
Get count of values: SELECT COUNT(*) AS cnt FROM t
Get nth value, where n = (cnt - 1) * (1 - 0.95): SELECT k FROM t ORDER BY k DESC LIMIT n,1
This solution requires two queries, because MySQL does not support variables in the LIMIT clause except inside stored procedures (so it can be reduced to one call with a stored procedure). The overhead of the additional query is usually very low.
This solution can be further optimized by adding an index on the k column and avoiding complex WHERE clauses (about 0.01 seconds for a table with 1 million rows, because no sorting is needed).
Implementation example in PHP (can calculate percentile not only of columns, but also of expressions):
function get_percentile($table, $where, $expr, $percentile) {
if ($where) $subq = "WHERE $where";
else $subq = "";
$r = query("SELECT COUNT(*) AS cnt FROM $table $subq");
$w = mysql_fetch_assoc($r);
$num = abs(round(($w['cnt'] - 1) * (100 - $percentile) / 100.0));
$q = "SELECT ($expr) AS prcres FROM $table $subq ORDER BY ($expr) DESC LIMIT $num,1";
$r = query($q);
if (!mysql_num_rows($r)) return null;
$w = mysql_fetch_assoc($r);
return $w['prcres'];
}
// Usage example
$time = get_percentile(
"state", // table
"service='Time' AND cnt>0 AND total>0", // some filter
"total/cnt", // expression to evaluate
80); // percentile
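The offset arithmetic from the two-query approach can be checked without a database. The following is a rough Python sketch, assuming an in-memory list stands in for the table; percentile_offset and percentile_value are hypothetical names, and the rounding is written to match PHP's round() for non-negative numbers.

```python
import math

def percentile_offset(row_count, percentile):
    """Zero-based offset into the values sorted in descending order.

    Same formula as the PHP helper: (cnt - 1) * (100 - percentile) / 100,
    rounded half up (Python's built-in round() rounds half to even).
    """
    raw = (row_count - 1) * (100 - percentile) / 100.0
    return int(math.floor(raw + 0.5))

def percentile_value(values, percentile):
    """Emulates: SELECT k FROM t ORDER BY k DESC LIMIT n,1."""
    ordered = sorted(values, reverse=True)   # ORDER BY k DESC
    if not ordered:
        return None                          # no rows, like the PHP null return
    return ordered[percentile_offset(len(ordered), percentile)]

print(percentile_value(range(1, 101), 95))
```

With the values 1 to 100 the offset is 5, so the sixth-largest value is returned.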
The SQL standard supports the PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions for precisely this job. Implementations are available in at least Oracle, PostgreSQL, SQL Server, Teradata. Unfortunately not in MySQL. But you can emulate PERCENTILE_DISC in MySQL 8 as follows:
SELECT DISTINCT first_value(my_column) OVER (
ORDER BY CASE WHEN p <= 0.9 THEN p END DESC /* NULLS LAST */
) x
FROM (
SELECT
my_column,
percent_rank() OVER (ORDER BY my_column) p
FROM my_table
) t;
This calculates the PERCENT_RANK for each row given your my_column ordering, and then finds the last row for which the percent rank is less or equal to the 0.9 percentile.
This only works on MySQL 8+, which has window function support.
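For readers who want to trace what that query computes, here is a small Python sketch of the same idea: compute PERCENT_RANK for every value, then take the value with the largest rank that does not exceed the requested fraction. The function names are invented for this illustration, and the list is assumed to fit in memory.

```python
import bisect

def percent_rank(values):
    """Return (value, percent_rank) pairs, with (rank - 1) / (n - 1) as in SQL.

    rank is 1 plus the number of strictly smaller values, so ties share a rank.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [(ordered[0], 0.0)]
    return [(v, bisect.bisect_left(ordered, v) / (n - 1)) for v in ordered]

def emulate_percentile_disc(values, fraction):
    """Value whose percent rank is the largest one not exceeding `fraction`."""
    candidates = [(p, v) for v, p in percent_rank(values) if p <= fraction]
    if not candidates:
        return None
    return max(candidates)[1]

print(emulate_percentile_disc(range(1, 12), 0.9))
```

For the eleven values 1 to 11 the percent ranks are 0.0, 0.1, ..., 1.0, so the 0.9 cut-off selects the value 10.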
I was trying to solve this for quite some time and then I found the following answer. Honestly brilliant. Also quite fast even for big tables (the table where I used it contained approx 5 mil records and needed a couple of seconds).
SELECT
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(field_name ORDER BY
field_name SEPARATOR ','), ',', 95/100 * COUNT(*) + 1), ',', -1) AS DECIMAL)
AS `95th Per`
FROM table_name;
As you can imagine, just replace table_name and field_name with your table's and column's names. Note that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so that setting has to be raised for large tables.
For further information, check Roland Bouman's original post.
In MySQL 8 there is the ntile window function you can use:
SELECT SomeTable.ID, SomeTable.Round
FROM SomeTable
JOIN (
SELECT ID, (NTILE(100) OVER w) AS Percentile
FROM SomeTable
WINDOW w AS (ORDER BY Round)
) AS SomeTablePercentile ON SomeTable.ID = SomeTablePercentile.ID
WHERE Percentile = 90
LIMIT 1
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_ntile
http://www.artfulsoftware.com/infotree/queries.php#68
SELECT
a.film_id ,
ROUND( 100.0 * ( SELECT COUNT(*) FROM film AS b WHERE b.length <= a.length ) / total.cnt, 1 )
AS percentile
FROM film a
CROSS JOIN (
SELECT COUNT(*) AS cnt
FROM film
) AS total
ORDER BY percentile DESC;
This can be slow for very large tables, because the correlated subquery runs once per row.
As per Tony_Pets' answer, but as I noted on a similar question, I had to change the calculation slightly: for the 90th percentile, for example, "90/100 * COUNT(*) + 0.5" instead of "90/100 * COUNT(*) + 1". Sometimes it was skipping two values past the percentile point in the ordered list instead of picking the next higher value. This may be due to the way rounding works in MySQL.
ie:
.... SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(fieldValue ORDER BY fieldValue SEPARATOR ','), ',', 90/100 * COUNT(*) + 0.5), ',', -1) as 90thPercentile ....
The most common definition of a percentile is a number where a certain percentage of scores fall below that number. You might know that you scored 67 out of 90 on a test. But that figure has no real meaning unless you know what percentile you fall into. If you know that your score is in the 95th percentile, that means you scored better than 95% of people who took the test.
This solution works also with the older MySQL 5.7.
SELECT *, @row_num as numRows, 100 - (row_num * 100/(@row_num + 1)) as percentile
FROM (
select *, @row_num := @row_num + 1 AS row_num
from (
SELECT t.subject, pt.score, p.name
FROM test t, person_test pt, person p, (
SELECT @row_num := 0
) counter
where t.id=pt.test_id
and p.id=pt.person_id
ORDER BY score desc
) temp
) temp2
-- optional: filter on a minimal percentile (uncomment below)
-- having percentile >= 80
An alternative solution that works in MySQL 8: generate a histogram of your data:
ANALYZE TABLE my_table UPDATE HISTOGRAM ON my_column WITH 100 BUCKETS;
And then just select the 95th record from information_schema.column_statistics:
SELECT v,c FROM information_schema.column_statistics, JSON_TABLE(histogram->'$.buckets',
'$[*]' COLUMNS(v VARCHAR(60) PATH '$[0]', c double PATH '$[1]')) hist
WHERE column_name='my_column' LIMIT 95,1
And voila! You will still need to decide whether you take the lower or upper limit of the percentile, or perhaps take an average - but that is a small task now. Most importantly - this is very quick, once the histogram object is built.
Credit for this solution: lefred's blog.

SQL Query to get distinct values from a table and the difference between ordered rows

I have a real time data table with time stamps for different data points
Time_stamp, UID, Parameter1, Parameter2, ....
I have 400 UIDs so each time_stamp is repeated 400 times
I want to write a query that uses this table to check if the real time data flow to the SQL database is working as expected - new timestamp every 5 minute should be available
For this what I usually do is query the DISTINCT values of time_stamp in the table and order descending - do a visual inspection and copy to excel to calculate the difference in minutes between subsequent distinct time_stamp
Any difference over 5 minutes means I have a problem. I am trying to figure out how I can do something similar in SQL, maybe to get a table that looks like the one below. I tried to use LEAD and DISTINCT together but could not write the code myself; I'm just getting started with SQL.
Time_stamp, LEAD over last timestamp
Thank you for your help
You can use the lag analytic function as follows:
select t.* from
(select t.*,
lag(Time_stamp) over (order by Time_stamp) as lg_ts
from your_Table t) t
where timestampdiff(minute, lg_ts, Time_stamp) > 5
Or you can use not exists as follows:
select t.*
from your_table t
where not exists
(select 1 from your_table tt
where tt.Time_stamp < t.Time_stamp
and timestampdiff(minute, tt.Time_stamp, t.Time_stamp) <= 5)
and t.Time_stamp <> (select min(tt.Time_stamp) from your_table tt)
lead() or lag() is the right approach (depending on whether you want to see the row at the start or end of the gap).
For the time comparison, I recommend direct comparisons:
select t.*
from (select t.*,
lead(Time_stamp) over (partition by uid order by Time_stamp) as next_time_stamp
from t
) t
where next_time_stamp > time_stamp + interval 5 minute;
Note: exactly 5 minutes seems unlikely. You might want a fudge factor such as:
where next_time_stamp > time_stamp + interval 5*60 + 10 second;
timestampdiff() counts the number of "boundaries" between two values. So, the difference in minutes between 00:00:59 and 00:01:02 is 1. And the difference between 00:00:00 and 00:00:59 is 0.
So, a difference of "5 minutes" could really be 4 minutes and 1 second or could be 5 minutes and 59 seconds.
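As a way to sanity-check the lag()/lead() logic outside the database, here is a short Python sketch that applies the same direct comparison to the distinct timestamps, as the question describes doing by hand in Excel. The function name find_gaps and the sample data are assumptions made for this example.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=5)):
    """Return (previous, next) pairs of distinct timestamps further apart than max_gap."""
    distinct = sorted(set(timestamps))       # SELECT DISTINCT ... ORDER BY
    return [
        (prev, cur)
        for prev, cur in zip(distinct, distinct[1:])   # pair each row with its successor
        if cur - prev > max_gap                        # direct comparison, no unit truncation
    ]

base = datetime(2024, 1, 1, 0, 0)
# Each timestamp appears once per UID; two UIDs are enough to illustrate.
readings = [base, base,
            base + timedelta(minutes=5), base + timedelta(minutes=5),
            base + timedelta(minutes=15), base + timedelta(minutes=15)]
for prev, cur in find_gaps(readings):
    print("gap between", prev, "and", cur)
```

Passing a slightly larger max_gap, such as timedelta(minutes=5, seconds=10), plays the role of the fudge factor mentioned above.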

MySQL Matching date-based First Instance of value

I have a table containing stock market data (open, hi, lo, close prices) but in a random order of date:
Date Open Hi Lo Close
12/10/2019 313.82 314.54 312.81 313.58
11/22/2019 311.09 311.24 309.85 310.96
11/25/2019 311.98 313.37 311.98 313.37
11/26/2019 313.41 314.28 313.06 314.08
11/27/2019 314.61 315.48 314.37 315.48
11/29/2019 314.86 315.13 314.06 314.31
12/2/2019 314.59 314.66 311.17 311.64
12/3/2019 308.65 309.64 307.13 309.55
I have another value in a PHP variable (say $BaseValue),and a start date and end date ($startdt and $enddt).
1) My requirement is to pick-up the value from the HI column, if it exceeds the $BaseValue on the very FIRST date in a chronological order between the given start and end dates.
For example, if the $BaseValue=314, startdt=11/22, enddt=12/2, then I want to retrieve the Date (11/26/19) as it is the earliest date on which the Hi value (314.28) exceeded the $Basevalue within the given date range. The select statement should return both the Hi value (314.28) and the Date (11/26/19).
2) Additionally, I also need to retrieve the HIGHEST value and date from the HI column during the given date duration. In the above scenario, it should return 315.48 and corresponding date 11/27.
The table is NOT in chronological order; it is filled in random order.
I am unable to get the first query at all with the use of MAX function and its various combinations. Makes me wonder if that is possible at all in SQL or not.
While the second is straightforward, I was wondering if it is more efficient and less complex to club the two queries and get the four values in one single shot.
Any ideas on how can I approach the need to fulfill this requirement please?
Thanks
You could use two subqueries for filtering, one per criterion, like:
select t.*
from mytable t
where
t.date = (
select min(t1.date)
from mytable t1
where t1.date between :startdt and :enddt and t1.hi >= :basevalue
)
or t.hi = (
select max(t1.hi)
from mytable t1
where t1.date between :startdt and :enddt and t1.hi >= :basevalue
)
Another option is to union two queries with order by and limit:
(
select t.*
from mytable t
where t.date between :startdt and :enddt and t.hi >= :basevalue
order by t.date
limit 1
)
union
(
select t.*
from mytable t
where t.date between :startdt and :enddt and t.hi >= :basevalue
order by t.hi desc, t.date
limit 1
)
Please note that both queries do not do exactly the same thing. If there are ties for the highest hi in the period, the first query will return all ties, while the second will pick the earliest one. It's up to you to decide which solution better fits your use case.
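To check the expected result against the sample data from the question, here is a minimal Python sketch of the two lookups. The function name first_exceed_and_peak is made up, the comparison uses >= like the queries above, and ties for the peak go to the earliest date, as in the union variant.

```python
from datetime import date

# (date, hi) pairs from the question, in the same non-chronological order
ROWS = [
    (date(2019, 12, 10), 314.54), (date(2019, 11, 22), 311.24),
    (date(2019, 11, 25), 313.37), (date(2019, 11, 26), 314.28),
    (date(2019, 11, 27), 315.48), (date(2019, 11, 29), 315.13),
    (date(2019, 12, 2), 314.66), (date(2019, 12, 3), 309.64),
]

def first_exceed_and_peak(rows, start, end, base_value):
    """Return (first row with hi >= base_value, row with the highest hi) in the range."""
    in_range = [(d, hi) for d, hi in rows if start <= d <= end]
    exceeding = [r for r in in_range if r[1] >= base_value]
    # ORDER BY date LIMIT 1
    first = min(exceeding, key=lambda r: r[0]) if exceeding else None
    # ORDER BY hi DESC, date LIMIT 1
    peak = max(in_range, key=lambda r: (r[1], -r[0].toordinal())) if in_range else None
    return first, peak

first, peak = first_exceed_and_peak(ROWS, date(2019, 11, 22), date(2019, 12, 2), 314)
print(first, peak)
```

For the range 11/22 to 12/2 and a base value of 314, this yields 11/26 with 314.28 as the first match and 11/27 with 315.48 as the peak, matching the values stated in the question.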

MySQL Display 4th smallest value for each team

I am using phpMyAdmin on MySQL 5.7
The code below selects the lowest values excluding any zero values and gives me a nice table of all the teamids with the lowest times in seconds next to them for that event (zid).
SELECT teamid, MIN(time) AS 'fastest time'
FROM data
WHERE time > 0 AND zid = 217456
GROUP BY teamid
How do I adapt it to get the 4th lowest values?
I have tried countless suggestions found via searching but none work
Table Headings:
id (AI column set as Primary Index)
zid (this is an event identification number)
teamid
name
time (given in seconds)
Could I add a position-in-team column, which would make this very easy? Then I would just ask MySQL for all rows where the position equals 4.
MySQL 8: Use Window functions.
Dense Rank
Window Function Concept & Syntax
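SELECT DISTINCT
teamid,
`time` AS `4th_Lowest`
FROM (
SELECT teamid, `time`,
DENSE_RANK() OVER (PARTITION BY teamid ORDER BY `time` ASC) AS rnk
FROM data
WHERE `time` > 0 AND zid = 217456
) AS ranked -- window functions are not allowed in WHERE, so rank in a derived table
WHERE rnk = 4;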
MySQL 5.7 and lower: We will use the following variables to calculate this on the sorted data (by teamid and then time):
rank - to set the rank for each unique (teamid, time)
c_time - whenever the time changes between two consecutive rows, we increase the rank. IF(@c_time = d.time, @rank, @rank:= @rank + 1)
c_team_id - we check whether two consecutive rows have the same team; if not, we reset the rank to 1. See the else part of IF(@c_team_id = d.teamid, ...,@rank:= 1)
SELECT
t.teamid,
t.`time`
FROM(
SELECT
d.teamid, -- Represent current row team id
d.`time`, -- Represent current row time
IF(@c_team_id = d.teamid, IF(@c_time = d.`time`, @rank, @rank:= @rank + 1), @rank:= 1) as rank, -- determine rank based on above explanation.
@c_team_id:= d.teamid, -- We are setting this variable to current row team id after using it in rank column, so rank column of next row will have this row team id for comparison using @c_team_id variable.
@c_time:= d.`time`
FROM `data` AS d,
(SELECT @c_time:= 0 as tim, @c_team_id:= 0 as tm_id, @rank:= 0 as rnk) AS t
WHERE d.`time` > 0 AND d.zid = 217456
ORDER BY d.teamid, d.`time` ASC -- Use to make sure we have same team records in sequence and with ascending order of time.
) AS t
WHERE t.rank = 4
GROUP BY t.teamid;
If your version supports window functions (since 8.0):
SELECT DISTINCT teamid, `time` AS fourth_time
FROM (
SELECT teamid, `time`,
DENSE_RANK() OVER (PARTITION BY teamid ORDER BY `time` ASC) AS rnk
FROM data
WHERE `time` > 0
AND zid = 217456
) AS ranked
WHERE rnk = 4
EDIT: dense_rank seems to fit better; it gives the fourth-best time, ignoring multiple appearances of the best to third-best times. The earlier version used row_number, which does not ignore multiple appearances. Thanks for mentioning this in the comments.
Since your version does not support window-functions, you can use a subselect with a LIMIT (I assume you have a field id, that is a primary key. If your primary key is another field, just replace this. If there is more than one field in your primary key, you will need to check all of them):
SELECT d.teamid, MIN(d.time) fourth_time
FROM data d
WHERE d.time > 0
AND d.zid = 217456
AND d.time > (SELECT DISTINCT d2.time -- third-lowest distinct time of this team
FROM data d2
WHERE d2.time > 0
AND d2.zid = 217456
AND d2.teamid = d.teamid
ORDER BY d2.time ASC
LIMIT 1
OFFSET 2)
GROUP BY d.teamid
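The dense-rank semantics used in these answers (the fourth-lowest distinct time per team, ignoring zero times) can be sketched in a few lines of Python for comparison. The function name nth_lowest_per_team and the sample rows are invented for this illustration; rows are assumed to be already filtered to one event (zid).

```python
from collections import defaultdict

def nth_lowest_per_team(rows, n=4):
    """Map each teamid to its nth-lowest distinct positive time.

    Teams with fewer than n distinct positive times are left out, just as
    the SQL returns no row for them.
    """
    times = defaultdict(set)
    for teamid, seconds in rows:
        if seconds > 0:                  # WHERE time > 0
            times[teamid].add(seconds)   # a set keeps distinct values, like DENSE_RANK
    return {team: sorted(ts)[n - 1] for team, ts in times.items() if len(ts) >= n}

rows = [(1, 10), (1, 10), (1, 20), (1, 30), (1, 40), (1, 0), (2, 5), (2, 6)]
print(nth_lowest_per_team(rows))
```

Team 1 has the distinct times 10, 20, 30 and 40, so its fourth-lowest is 40; team 2 has only two times and is omitted.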

SQL: Skip entries in an order without knowing total entry amount

The title is a bit confusing, but I'm wondering if there is a way to do a query like this:
SELECT * FROM table ORDER BY timestamp LIMIT 10
and then only take the ones after the 10th one (or none if there are less than or equal to 10 entries).
EDIT I guess another way to do this would be to order them by timestamp, descending, and then somehow limit to 0, (total-someNumber).
By specifying an OFFSET you can get the rows after a specified number. You combine this with limit.
In MySQL you achieve this with LIMIT [offset], limit.
Example - get 10 records after the oldest 10 records:
SELECT * FROM table ORDER BY timestamp LIMIT 10, 10; # Retrieve rows 11-20
Example - get 20 records after the newest 5 records:
SELECT * FROM table ORDER BY timestamp DESC LIMIT 5, 20; # Retrieve rows 6-25
If you want to get ALL rows after a certain number (eg. 10) then you pass an arbitrarily big number for the limit since it is required by the clause:
SELECT * FROM table ORDER BY timestamp LIMIT 10,18446744073709551615; # Retrieve rows 11-BIGINT
Note: 18446744073709551615 is the maximum of an unsigned BIGINT and is provided as the solution within the MySQL documentation.
See:
http://dev.mysql.com/doc/refman/5.5/en/select.html
I'd try something like this and then just add a where clause that skips the first n (n = 10 in this case) rows,
i.e. using the linked example:
SELECT
*
FROM
(select @n := @n + 1 RowNumber, t.* from (select @n:=0) initvars, tbl t) numbered
WHERE
RowNumber > 10