Is there a way to calculate entropy in SQL / MySQL?

I would like to calculate the entropy of a list in mysql.
Now I run this and move to python:
select group_concat(first_name), last_name
from table
group by last_name
What I am looking for would be the equivalent of
entropy(first_name)
Returning a single number for each.
Similar to the below usage for numericals:
std(age)/avg(age)
EDIT - Partially answered: thank you to commenter @IVO GELOV for a very efficient approximation:
SELECT LOG2(COUNT(DISTINCT column)) FROM Table
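Applied per group as in the question, that approximation gives one number per last_name (my own sketch of the usage; `table` is the placeholder name from the question, backticked because TABLE is a reserved word in recent MySQL versions):
SELECT last_name,
       LOG2(COUNT(DISTINCT first_name)) AS entropy_approx
FROM `table`
GROUP BY last_name;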

Based on the solution above and an approximation of the t-test, we arrive at a comparative weighted entropy. Hacky, but it works like a charm:
CASE
WHEN count(*)-1 < 6 THEN (1 + LOG2(COUNT(distinct first_name)))*5.61*power(count(*)-1,-0.71)
WHEN count(*)-1 >= 6 and count(*)-1 < 27 THEN (1 + LOG2(COUNT(distinct first_name)))*2.2*power(count(*)-1,-0.081)
ELSE (1 + LOG2(COUNT(distinct first_name)))*1.815*power(count(*)-1,-0.02)
END as entropy
Defined only for groups with count(*) > 1.
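For reference (my addition, not from the original thread): on MySQL 8+ the exact Shannon entropy per group can also be computed directly, at the cost of one extra aggregation pass. A minimal sketch, assuming the same placeholder table and column names as above:
SELECT last_name,
       -SUM((cnt / total) * LOG2(cnt / total)) AS entropy
FROM (
    SELECT last_name,
           COUNT(*) AS cnt,                                     -- occurrences of each first_name
           SUM(COUNT(*)) OVER (PARTITION BY last_name) AS total -- size of the last_name group
    FROM `table`
    GROUP BY last_name, first_name
) AS freq
GROUP BY last_name;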

Related

DATEDIFF vs (w1.date = w2.date +1) difference? MySQL syntax

I was working on a SQL database question using MySQL. The goal is to find all IDs where today is warmer than yesterday. I'll show you my original code, which passed 2 out of 3 test cases, and then a revised version which passes all 3.
What is the functional difference between these two? Is it a MySQL thing, a leetcode thing, or something else?
Original
SELECT DISTINCT w2.id
FROM weather w1, weather w2
WHERE w2.RecordDate = w1.RecordDate +1 AND w2.temperature > w1.temperature
Revised
SELECT DISTINCT w2.id
FROM weather w1, weather w2
WHERE DATEDIFF(w2.RecordDate,w1.RecordDate) =1 AND w2.temperature > w1.temperature
The only difference is the use of DATEDIFF versus w2.RecordDate = w1.RecordDate + 1.
I'd like to know: what is the difference between these two?
Edit: here's the LC problem https://leetcode.com/problems/rising-temperature/
This does not do what you want:
w2.RecordDate = w1.RecordDate + 1
Because you are using numeric arithmetic on dates, this expression implicitly converts the dates to numbers, adds 1 to one of them, and then compares the results. Depending on the exact dates it may happen to work, but it is simply the wrong approach. As an example, if your date is '2020-01-31', adding 1 to it produces the integer 20200132, which corresponds to no real date.
MySQL understands date arithmetic, so I would use:
w2.RecordDate = w1.RecordDate + interval 1 day
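Putting that back into the original query (my rewrite, using an explicit JOIN only for readability):
SELECT DISTINCT w2.id
FROM weather w1
JOIN weather w2
  ON w2.RecordDate = w1.RecordDate + INTERVAL 1 DAY
WHERE w2.temperature > w1.temperature;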

Speed up SQL SELECT with arithmetic and geometric calculations

This is a follow-up to my previous post How to improve wind data SQL query performance.
I have expanded the SQL statement to also perform the first part of the calculation of the average wind direction using circular statistics. This means that I want to calculate the averages of the cosines and sines of the wind direction. In my PHP script, I will then perform the second part: calculate the inverse tangent and add 180 or 360 degrees if necessary.
The wind direction is stored in my table as voltages read from the sensor in the field 'dirvolt' so I first need to convert it to radians.
The user can look at historical wind data by stepping backwards using a pagination function, hence the use of LIMIT, whose values are set dynamically in my PHP script.
My SQL statement currently looks like this:
SELECT ROUND(AVG(speed),1) AS speed_mean, MAX(speed) as speed_max,
MIN(speed) AS speed_min, MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC
LIMIT 0, 72
The query takes about 3-8 seconds to run depending on what value I use to group the data (300 in the code above).
In order for me to learn, is there anything I can do to optimize or improve the SQL statement otherwise?
SHOW CREATE TABLE table;
From that we can see whether you already have INDEX(dt) (or an equivalent). With such an index, we can modify the SELECT to be significantly faster.
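If the index turns out to be missing, adding it is a single statement (the index name here is my choice; `table` is the placeholder name from the question):
ALTER TABLE `table` ADD INDEX idx_dt (dt);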
But first, change the focus from "72 * 300 seconds worth of readings" to explicit datetime ranges; 72 * 300 seconds is 6 hours.
Let's look at this query:
SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...';
The '...' would be the same datetime in both places. Does that run fast enough with the index?
If yes, then let's build the final query using that as a subquery:
SELECT FORMAT(AVG(speed), 1) AS speed_mean,
MAX(speed) as speed_max,
MIN(speed) AS speed_min,
MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM
( SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...'
) AS x
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC;
Explanation: your original query could not use an index, so it had to scan the entire table (which keeps growing). The subquery can use the index, so it is much faster, and the outer query then only has to work on the modest number of rows the subquery returns.
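As a usage note (my sketch, not part of the original answer): with this structure, paging backwards through the history just means shifting the 6-hour window, so the PHP script only has to compute the two boundary datetimes; no LIMIT offset is needed. For example, the second page (the readings from 6 to 12 hours ago) could be fetched like this (only two of the aggregate columns shown; the rest are unchanged):
SELECT ROUND(AVG(speed),1) AS speed_mean,
       MAX(dt) AS last_dt
FROM
( SELECT * FROM `table`
     WHERE dt >= NOW() - INTERVAL 12 HOUR   -- page 1 starts 12 hours back...
       AND dt <  NOW() - INTERVAL 6 HOUR    -- ...and ends 6 hours back
) AS x
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC;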

SQL - Calculating variable moving average over variable lengths

FIRST: This question is NOT a duplicate. I have asked this on here already and it was closed as a duplicate. While it is similar to other threads on stackoverflow, it is actually far more complex. Please read the post before assuming it is a duplicate:
I am trying to calculate variable moving averages crossover with variable dates.
That is: I want to prompt the user for 3 values and 1 option. The input is through a web front end so I can build/edit the query based on input or have multiple queries if needed.
X = 1st moving average term (N day moving average. Any number 1-N)
Y = 2nd moving average term. (N day moving average. Any number 1-N)
Z = Number of days back from present to search for the occurrence of:
option = Over/Under: (> or <. X passing over Y, or X passing Under Y)
X day moving average passing over OR under Y day moving average
within the past Z days.
My database is structured:
tbl_daily_data
id
stock_id
date
adj_close
And:
tbl_stocks
stock_id
symbol
I have a btree index on:
daily_data(stock_id, date, adj_close)
stock_id
I am stuck on this query and having a lot of trouble writing it. If the variables were fixed it would seem trivial, but because X, Y and Z are all 100% independent of each other (one could look, for example, for a 5-day moving average within the past 100 days, or a 100-day moving average within the past 5), I am having a lot of trouble coding it.
Please help! :(
Edit: I've been told some more context might be helpful?
We are creating an open stock analytic system where users can perform trend analysis. I have a database containing 3500 stocks and their price histories going back to 1970.
This query will be running every day in order to find stocks that match certain criteria
for example:
10 day moving average crossing over 20 day moving average within 5
days
20 day crossing UNDER 10 day moving average within 5 days
55 day crossing UNDER 22 day moving average within 100 days
But each user may be interested in a different analysis so I cannot just store the moving average with each row, it must be calculated.
I am not sure if I fully understand the question ... but something like this might help you get where you need to go: sqlfiddle
SET @X := 5;
SET @Y := 3;
SET @Z := 25;
SET @option := 'under';
select * from (
SELECT stock_id,
datediff(current_date(), date) days_ago,
adj_close,
(
SELECT
AVG(adj_close) AS moving_average
FROM
tbl_daily_data T2
WHERE
(
SELECT
COUNT(*)
FROM
tbl_daily_data T3
WHERE
date BETWEEN T2.date AND T1.date
) BETWEEN 1 AND @X
) move_av_1,
(
SELECT
AVG(adj_close) AS moving_average
FROM
tbl_daily_data T2
WHERE
(
SELECT
COUNT(*)
FROM
tbl_daily_data T3
WHERE
date BETWEEN T2.date AND T1.date
) BETWEEN 1 AND @Y
) move_av_2
FROM
tbl_daily_data T1
where
datediff(current_date(), date) <= @Z
) x
where
case when @option = 'over'  and move_av_1 > move_av_2 then 1 else 0 end +
case when @option = 'under' and move_av_2 > move_av_1 then 1 else 0 end > 0
order by stock_id, days_ago
Based on the answer by @Tom H here: How do I calculate a moving average using MySQL?
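As a side note (my addition, not part of the answer above): on MySQL 8+ the two moving averages can be computed with window functions instead of correlated subqueries, which is usually far faster on a large price history. A sketch with X = 5, Y = 3, Z = 25 hard-coded, since window frame sizes must be literals (the web front end can interpolate them when building the query):
SELECT *
FROM (
    SELECT stock_id,
           DATEDIFF(CURRENT_DATE(), date) AS days_ago,
           adj_close,
           AVG(adj_close) OVER w_x AS move_av_1,   -- X-day (here 5-day) moving average
           AVG(adj_close) OVER w_y AS move_av_2    -- Y-day (here 3-day) moving average
    FROM tbl_daily_data
    WINDOW w_x AS (PARTITION BY stock_id ORDER BY date ROWS 4 PRECEDING),
           w_y AS (PARTITION BY stock_id ORDER BY date ROWS 2 PRECEDING)
) t
WHERE days_ago <= 25                 -- Z = 25
  AND move_av_1 > move_av_2          -- the 'over' option; flip the comparison for 'under'
ORDER BY stock_id, days_ago;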

Retrieving objects with condition on their duration

I have a model "Competition" with attributes start_at and end_at, both of type datetime. I would like to retrieve only competitions with a duration shorter than a given amount, for example 3 days. I would expect to do this easily:
Competition.where('end_at - start_at < ?', X)
The problem: what do I use for X?
In my database I have one competition object with a duration of slightly less than one day (start_at = 2011-04-27 00:00:00, end_at = 2011-04-27 23:59:59) and one slightly less than 3 days (start_at = 2013-02-05 00:00:00, end_at 2013-02-07 23:59:59), all others are much longer.
To retrieve the shorter of the two, I expected to use X = 60*60*24 (the number of seconds in 24 hours). That doesn't work. So I tried multiplying by an increasing factor and found that multiplying X by 2.7 will not retrieve it, but 2.8 will! So OK, I need to use this strange factor of 2.8...
But this does not work for retrieving objects shorter than 3 days. Here I need to multiply by 8.7. Does anybody know what's going on here?
The functions available depend on the database you're using. For example, with SQLite, you can use:
Competition.where('julianday(end_at) - julianday(start_at) < ?', 3)
Try this:
Competition.where('DATEDIFF(end_at, start_at) < ?', x)
For a more accurate comparison:
Competition.where('TIMESTAMPDIFF(SECOND, start_at, end_at) < ?', x*86400) #where 86400 is the seconds in a day and x is no. of days
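As for what was actually going on with the strange factors (my note, not part of either answer): subtracting two DATETIME values with the plain - operator makes MySQL coerce them to numbers of the form YYYYMMDDHHMMSS, so end_at - start_at for the one-day example is 235959, not 86,399 seconds; 86400 * 2.8 = 241,920 is the first tried multiple that exceeds it, which is where the 2.8 came from. A quick check that can be run directly in MySQL, using the dates from the question:
SELECT CAST('2011-04-27 23:59:59' AS DATETIME)
     - CAST('2011-04-27 00:00:00' AS DATETIME) AS numeric_diff,                            -- 235959
       TIMESTAMPDIFF(SECOND, '2011-04-27 00:00:00', '2011-04-27 23:59:59') AS seconds_diff; -- 86399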

Calculating the Median with Mysql

I'm having trouble with calculating the median of a list of values, not the average.
I found this article
Simple way to calculate median with MySQL
It has a reference to the following query which I don't understand properly.
SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2
If I have a time column and I want to calculate the median value, what do the x and y columns refer to?
I propose a faster way.
Get the row count:
SELECT CEIL(COUNT(*)/2) FROM data;
Then take the middle value in a sorted subquery:
SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;
I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.
This will find an arbitrary percentile by replacing the COUNT(*)/2 with COUNT(*)*n where n is the percentile (.5 for median, .75 for 75th percentile, etc).
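For example, the row-count step for the 75th percentile becomes:
SELECT CEIL(COUNT(*) * 0.75) FROM data;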
val is your time column, x and y are two references to the data table (you can write data AS x, data AS y).
EDIT:
To avoid computing your sums twice, you can store the intermediate results.
CREATE TEMPORARY TABLE average_user_total_time
(SELECT SUM(time) AS time_taken
FROM scores
WHERE created_at >= '2010-10-10'
and created_at <= '2010-11-11'
GROUP BY user_id);
Then you can compute median over these values which are in a named table.
EDIT: Temporary table won't work here. You could try using a regular table with "MEMORY" table type. Or just have your subquery that computes the values for the median twice in your query. Apart from this, I don't see another solution. This doesn't mean there isn't a better way, maybe somebody else will come with an idea.
First try to understand what the median is: it is the middle value in the sorted list of values.
Once you understand that, the approach is two steps:
sort the values in either order
pick the middle value (if there is an even number of values, take the average of the two middle values)
Example:
Median of 0 1 3 7 9 10: 5 (because (3+7)/2 = 5, the average of the two middle values)
Median of 0 1 3 7 9 10 11: 7 (because 7 is the middle value)
So, to sort dates you need a numerical value; you can get their time stamp (as seconds elapsed from epoch) and use the definition of median.
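For completeness (my addition): on MySQL 8+ that two-step approach maps directly onto window functions. A minimal sketch against the same data table and val column as in the question:
SELECT AVG(val) AS median
FROM (
    SELECT val,
           ROW_NUMBER() OVER (ORDER BY val) AS rn,   -- position in the sorted list
           COUNT(*) OVER ()                 AS cnt   -- total number of values
    FROM data
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2));  -- one middle row if cnt is odd, two if even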
Finding median in mysql using group_concat
Query:
SELECT
IF(count%2=1,
SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1),
(SUBSTRING_INDEX(substring_index(data_str,",",pos),",",-1)
+ SUBSTRING_INDEX(substring_index(data_str,",",pos+1),",",-1))/2)
as median
FROM (SELECT group_concat(val order by val) data_str,
CEILING(count(*)/2) pos,
count(*) as count from data)temp;
Explanation:
Sorting is done using order by inside group_concat function
The position (pos) and the total number of elements (count) are identified. Using CEILING for the position lets us use the substring_index function in the steps below.
Based on count, we decide whether there is an even or odd number of values.
Odd values: Directly choose the element belonging to the pos using substring_index.
Even values: Find the element belonging to the pos and pos+1, then add them and divide by 2 to get the median.
Finally the median is calculated.
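One practical caveat with this approach (my addition): GROUP_CONCAT silently truncates its result at group_concat_max_len, which defaults to only 1024 characters, so on anything but a small table the limit needs to be raised first or the median will be computed from a truncated list:
SET SESSION group_concat_max_len = 1000000;  -- allow up to ~1 MB of concatenated values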
If you have a table R with a column named A, and you want the median of A, you can do as follows:
SELECT A FROM R R1
WHERE ( SELECT COUNT(A) FROM R R2 WHERE R2.A < R1.A ) = ( SELECT COUNT(A) FROM R R3 WHERE R3.A > R1.A )
Note: This will only work if there are no duplicated values in A. Also, null values are not allowed.
The simplest way my friend and I have found out... ENJOY!!
SELECT count(*) INTO @c from station;
select ROUND((@c+1)/2) into @final;
SELECT round(lat_n,4) from station a where @final-1=(select count(lat_n) from station b where b.lat_n > a.lat_n);
Here is a solution that is easy to understand. Just replace Your_Column and Your_Table as per your requirement.
SET @r = 0;
SELECT AVG(Your_Column)
FROM (SELECT (@r := @r + 1) AS r, Your_Column FROM Your_Table ORDER BY Your_Column) Temp
WHERE
r = (SELECT CEIL(COUNT(*) / 2) FROM Your_Table) OR
r = (SELECT FLOOR((COUNT(*) / 2) + 1) FROM Your_Table)
Originally adapted from this thread.