correlation query in mysql

I have seen this link, which calculates the correlation between two columns. My question is different because my query is more complex: the correlation is not between two columns but between the results of two different conditions in a query.
I have a table that stores the search query history of a website. I want to calculate the correlation of search_no across different days. To calculate the number of search queries I have implemented the following:
select to_date(time), query, platform, count(query) as search_no
from search
where `_month` = 2 and time between '2021-02-05 00:00:00' and '2021-02-05 23:59:59' and platform = 'application'
group by to_date(time), query, platform
order by search_no desc limit 1000
It works perfectly. It calculates the number of searches as search_no for 2021-02-05. What I want to find is the correlation between two different dates, such as 2021-02-05 and 2021-01-29.
The correlation formula is as follows:
PS: x is data of the first day (2021-02-05) and y is the data of the second day (2021-01-29).
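The formula itself is missing here. Going by the attempted query below, it is presumably the sample Pearson correlation coefficient. In the sketch that follows, n (the number of (x, y) pairs), the bars (means) and s_x, s_y (sample standard deviations, as returned by stddev_samp) are assumed notation rather than the original's:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x\, s_y}
```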
What I have tried
select (sum((x.search_no - avg(x.search_no)) * (y.search_no - avg(y.search_no))) / ((count(x.search_no) - 1) * (stddev_samp(x.search_no) * stddev_samp(y.search_no))
from (
(
select to_date(time), query, platform, count(query) as search_no
from search
where `_month` = 2 and time between '2021-02-05 00:00:00' and '2021-02-05 23:59:59' and platform = 'application'
group by to_date(time), query, platform
order by search_no desc limit 1000
) as x,
(
select to_date(time), query, platform, count(query) as search_no
from search
where `_month` = 1 and time between '2021-01-29 00:00:00' and '2021-01-29 23:59:59' and platform = 'application'
group by to_date(time), query, platform
order by search_no desc limit 1000
) as y
)
I don't know how to implement it.

If I understand correctly, you want the correlation of the summaries from two different days. That would be starting with this data:
select query,
sum(date(time) = '2021-02-05') as x,
sum(date(time) = '2021-02-06') as y,
count(*) as cnt
from search
where `_month` = 2 and
time >= '2021-02-05' and
time < '2021-02-07' and
platform = 'application'
group by query;
You can then plug this directly into your formula:
with dataset as (
select query,
sum(date(time) = '2021-02-05') as x,
sum(date(time) = '2021-02-06') as y,
count(*) as cnt
from search
where `_month` = 2 and
time >= '2021-02-05' and
time < '2021-02-07' and
platform = 'application'
group by query
)
select (sum( (x - avg_x) * (y - avg_y) ) /
sqrt(nullif( sum(power(x - avg_x, 2)) * sum(power(y - avg_y, 2)), 0))
) as pearson_correlation
from (select d.*,
avg(x) over () as avg_x,
avg(y) over () as avg_y
from dataset d
) d;
Obviously, you need to adjust the date range in the where clause for whatever days you want. I see no reason to use limit -- that will just throw the population of queries off.
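One way to sanity-check the SQL result is to compute the same coefficient outside the database. The following is only a sketch in Python: the `pearson` helper and the sample counts are hypothetical, and a query that is missing on one day is assumed to count as 0, which is what the `sum(date(time) = ...)` columns above produce.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n == 0 or n != len(ys):
        return None
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from the means
    cov = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares
    ss_x = sum((x - avg_x) ** 2 for x in xs)
    ss_y = sum((y - avg_y) ** 2 for y in ys)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom else None


# Hypothetical per-query search counts for the two days
day_x = {"shoes": 10, "bags": 4, "hats": 1}
day_y = {"shoes": 8, "bags": 5, "socks": 2}

# Align both days on the union of queries; a missing query counts as 0
queries = sorted(set(day_x) | set(day_y))
xs = [day_x.get(q, 0) for q in queries]
ys = [day_y.get(q, 0) for q in queries]
r = pearson(xs, ys)
```

Returning `None` when either column has zero variance mirrors the `nullif(..., 0)` guard in the query.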

Related

Mysql improve sampling query speed

I have a table with 3,000,000 records. I tried to randomly extract 300,000 records using the following method, but it takes about 7 minutes.
SELECT * FROM mytable WHERE `class`='faq' ORDER BY RAND() LIMIT 300000
I want to improve the speed of random extraction, what should I do?
Mysql version is 5.6.

The cost is most likely due to sorting all the matching data. You don't specify how many rows match the condition, so this sort is likely to be some fraction of 3,000,000 rows.
If you can deal with approximately 300,000, you can use sampling logic in the WHERE clause:
SELECT t.*
FROM mytable t CROSS JOIN
(SELECT COUNT(*) as cnt
FROM mytable
WHERE class = 'faq'
) x
WHERE t.class = 'faq' AND
rand() < (300000 / cnt);
To be more precise, you can take a slightly larger random sample and then use order by/limit:
SELECT t.*
FROM mytable t CROSS JOIN
(SELECT COUNT(*) as cnt
FROM mytable
WHERE class = 'faq'
) x
WHERE t.class = 'faq' AND
rand() < (300000 / cnt) * 1.1
ORDER BY rand()
LIMIT 300000;
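The idea behind the second query (keep each row with a probability slightly above the target fraction, then shuffle only the survivors and trim) can be sketched outside SQL as well. This is an illustration only: the function name, the `oversample` parameter and the `rng` argument are hypothetical, not part of any MySQL feature.

```python
import random


def sample_about_k(rows, k, oversample=1.1, rng=random):
    """Pick up to k rows at random without sorting the whole input.

    Each row is kept with probability oversample * k / n, then only the
    much smaller set of kept rows is shuffled and cut to k.
    """
    n = len(rows)
    if k >= n:
        return list(rows)
    p = min(1.0, oversample * k / n)
    # One cheap pass over all rows, like rand() < (300000 / cnt) * 1.1
    picked = [row for row in rows if rng.random() < p]
    # Shuffle and trim the survivors, like ORDER BY rand() LIMIT k
    rng.shuffle(picked)
    return picked[:k]
```

Note that the first pass can, with small probability, keep fewer than k rows, in which case fewer than k are returned; a larger `oversample` makes that less likely at the cost of a bigger shuffle.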

SQL query with a major NOT IN not working

Does anyone know what's wrong with this query?
This works perfectly on its own:
SELECT * FROM
(SELECT * FROM data WHERE site = '".$id."'
AND disabled = '0'
AND carvotes NOT LIKE '0'
AND (time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car ORDER BY carvotes DESC LIMIT 0 , 10)
X order by time DESC
So does this:
SELECT * FROM data WHERE site = '".$id."' AND disabled = '0' GROUP BY car DESC ORDER BY time desc LIMIT 0 , 30
But combining them like this:
SELECT * FROM data WHERE site = '".$id."' AND disabled = '0' AND car NOT IN (SELECT * FROM
(SELECT * FROM data WHERE site = '".$id."'
AND disabled = '0'
AND carvotes NOT LIKE '0'
AND (time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car ORDER BY carvotes DESC LIMIT 0 , 10)
X order by time DESC) GROUP BY car DESC ORDER BY time desc LIMIT 0 , 30
Gives errors. Any ideas?

Please try the following...
$result = mysqli_query( $con,
"SELECT *
FROM data
WHERE site = '" . $id .
"' AND disabled = '0'
AND car NOT IN ( SELECT car
FROM ( SELECT car,
carvotes
FROM data
WHERE site = '" . $id .
"' AND disabled = '0'
AND carvotes NOT LIKE '0'
AND ( time > ( NOW( ) - INTERVAL 14 DAY ) )
GROUP BY car
ORDER BY carvotes DESC
LIMIT 10 ) X
)
GROUP BY car
ORDER BY time DESC
LIMIT 30" );
The main cause of your problem is that with car NOT IN ( SELECT * FROM ( SELECT *... you are trying to compare each record's value of car with each row returned by your subquery. IN requires you to have the same number of fields on both sides of the comparison. By using SELECT * at both levels of the subquery you were ensuring that the right side of the comparison had however many fields are in data versus your single field on the left, which confused MySQL.
Since you are aiming to compare to a single field, namely car, our subquery has to select just the car field from its dataset. Since the sort order of the subquery's results has no effect upon the IN comparison, and since our innermost query will be returning just car, I have removed the outer level of the subquery.
Beyond changing the first part of the subquery to SELECT car, the only other change that I have made to the subquery is to change LIMIT 0, 10 to LIMIT 10. The former means limit to the 10 records that are offset by 0 from the first record. This is useful if you want records 6 to 15, but redundant for 1 to 10, as LIMIT 10 has the same effect and is slightly simpler. Ditto for LIMIT 0, 30 at the end of your overall statement.
As for the main body of the statement, I have not made any attempt to specify what fields (or aggregate functions of those fields) should be returned, since you have made no statement indicating what your requirements / preferences are. If you are satisfied that GROUP BY has left you with a still valid set of values, then all well and good, but if not then I recommend that you rewrite your Question to be specific about that detail.
By default, MySQL sorts the data subjected to a GROUP BY into ascending order, but if an ORDER BY clause is also present then it overrides the GROUP BY's sort pattern. As such, there is no benefit to specifying DESC after either of your GROUP BY car clauses, so I have removed it where it occurs.
Interesting sidenote: you can override a GROUP BY's sort by specifying ORDER BY NULL.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Further Reading
https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html - on optimising your ORDER BY sorting
https://dev.mysql.com/doc/refman/5.7/en/select.html - on the SELECT statement's syntax - specifically the parts to do with LIMIT.
https://www.w3schools.com/php/php_mysql_select_limit.asp - a simpler explanation of LIMIT

This is your query:
SELECT *
FROM data
WHERE site = '".$id."' AND disabled = '0' AND
car NOT IN (SELECT *
FROM (SELECT *
FROM data
WHERE site = '".$id."' AND
disabled = '0' AND
carvotes NOT LIKE '0' AND
(time > ( now( ) - INTERVAL 14 DAY ))
GROUP BY car
ORDER BY carvotes DESC
LIMIT 0 , 10
) x
ORDER BY time DESC
)
GROUP BY car DESC
ORDER BY time desc
LIMIT 0 , 30 ;
Several comments:
Do not wrap integer constants in single quotes. This can mislead people. This can mislead optimizers.
Do not use string functions on integers (such as like). Same reason.
NOT IN with subqueries is dangerous. The construct does not handle NULL values the way you expect. Use NOT EXISTS or LEFT JOIN instead.
When using subqueries, ORDER BY is almost never appropriate.
Never use SELECT * with GROUP BY. It is just wrong. Happily, MySQL 5.7 has changed its defaults to reject this anti-pattern.
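The NULL problem with NOT IN is easy to reproduce. The sketch below uses SQLite through Python's `sqlite3` module rather than MySQL, on the assumption that both follow the standard three-valued logic here; the `cars` and `excluded` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars (car TEXT);
    CREATE TABLE excluded (car TEXT);
    INSERT INTO cars VALUES ('audi'), ('bmw'), ('fiat');
    INSERT INTO excluded VALUES ('audi'), (NULL);
""")

# A single NULL in the subquery makes NOT IN evaluate to unknown
# for every non-matching row, so nothing is returned at all.
not_in = conn.execute(
    "SELECT car FROM cars "
    "WHERE car NOT IN (SELECT car FROM excluded) ORDER BY car"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULL is harmless.
not_exists = conn.execute(
    "SELECT c.car FROM cars c "
    "WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE e.car = c.car) "
    "ORDER BY c.car"
).fetchall()

# LEFT JOIN anti-join: keep the rows that found no partner.
left_join = conn.execute(
    "SELECT c.car FROM cars c "
    "LEFT JOIN excluded e ON e.car = c.car "
    "WHERE e.car IS NULL ORDER BY c.car"
).fetchall()
```

Here `not_in` comes back empty, while the other two forms return the cars that are not excluded.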
So, a better way to write this query is something like this:
SELECT d.car, MAX(d.time) as time
FROM data d LEFT JOIN
(SELECT d2.car
FROM data d2
WHERE d2.site = '".$id."' AND
d2.disabled = 0 AND
d2.carvotes <> 0 AND
d2.time > now() - INTERVAL 14 DAY
GROUP BY d2.car
ORDER BY MAX(d2.carvotes) DESC
LIMIT 10
) car10
ON d.car = car10.car
WHERE d.site = '".$id."' AND d.disabled = 0 AND
car10.car IS NULL
GROUP BY d.car
ORDER BY MAX(d.time) desc
LIMIT 30 ;
Alternatively, use SELECT * and remove the GROUP BY in the outer query.

SQL select aggregate values in columns

I have a table in this structure:
editor_id
rev_user
rev_year
rev_month
rev_page
edit_count
Here is the SQL Fiddle: http://sqlfiddle.com/#!2/8cbb1/1
I need to surface the 5 most active editors during March 2011 for example - i.e. for each rev_user - sum all of the edit_count for each rev_month and rev_year to all of the rev_pages.
Any suggestions how to do it?
UPDATE -
updated fiddle with demo data

You should be able to do it like this:
Select the total using SUM and GROUP BY, filtering by rev_year and rev_month
Order by the SUM in descending order
Limit the results to the top five items
Here is how:
SELECT * FROM (
SELECT rev_user, SUM(edit_count) AS total_edits
FROM edit_count_user_date
WHERE rev_year='2006' AND rev_month='09'
GROUP BY rev_user
) x
ORDER BY total_edits DESC
LIMIT 5
Demo on sqlfiddle.
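The three steps (sum per user with a month filter, order by the sum, keep the top five) can be tried locally without the fiddle. This is a sketch using SQLite through Python's `sqlite3` module; the rows are made-up demo data, and only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edit_count_user_date (
        rev_user TEXT, rev_year TEXT, rev_month TEXT,
        rev_page INTEGER, edit_count INTEGER
    )
""")
# Hypothetical demo rows: (user, year, month, page, edits)
rows = [
    ("alice", "2011", "03", 1, 5),
    ("alice", "2011", "03", 2, 7),
    ("bob", "2011", "03", 1, 20),
    ("bob", "2011", "04", 1, 99),   # different month, must be ignored
    ("carol", "2011", "03", 3, 1),
]
conn.executemany(
    "INSERT INTO edit_count_user_date VALUES (?, ?, ?, ?, ?)", rows
)

# Filter to one month, total the edits per user, keep the five largest totals
top = conn.execute("""
    SELECT rev_user, SUM(edit_count) AS total_edits
    FROM edit_count_user_date
    WHERE rev_year = ? AND rev_month = ?
    GROUP BY rev_user
    ORDER BY total_edits DESC
    LIMIT 5
""", ("2011", "03")).fetchall()
```

Passing the year and month as parameters instead of concatenating them into the string also avoids quoting mistakes.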

Surely this is as straightforward as:
SELECT rev_user, SUM(edit_count) as TotalEdits
FROM edit_count_user_date
WHERE rev_month = 'March' and rev_year = '2014'
GROUP BY rev_user
ORDER BY TotalEdits DESC
LIMIT 5;
SqlFiddle here
May I also suggest using a more appropriate DATE type for the year and month storage?
Edit, re new Info
The below will return all edits for the given month for the 'highest' MonthTotal editor, and then re-group the totals by the rev_page.
SELECT e.rev_user, e.rev_page, SUM(e.edit_count) as TotalEdits
FROM edit_count_user_date e
INNER JOIN
(
SELECT rev_user, rev_year, rev_month, SUM(edit_count) AS MonthTotal
FROM edit_count_user_date
WHERE rev_month = '09' and rev_year = '2010'
GROUP BY rev_user, rev_year, rev_month
ORDER BY MonthTotal DESC
LIMIT 1
) as x
ON e.rev_user = x.rev_user AND e.rev_month = x.rev_month AND e.rev_year = x.rev_year
GROUP BY e.rev_user, e.rev_page;
SqlFiddle here - I've adjusted the data to make it more interesting.
However, if you need to do this across several months at a time, it will be more difficult on MySQL versions before 8.0, which lack PARTITION BY / analytic window functions.

sql calculate change and percent by year

I have a data set that simulates the rate of return for a trading account. There is an entry for each day showing the balance and the open equity. I want to calculate the yearly, quarterly, or monthly change and the percent gain or loss. I have this working for daily data, but for some reason I can't seem to get it to work for yearly data.
The code for daily data follows:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM tim_account_history p
CROSS JOIN
(SELECT @pequity:= NULL
FROM tim_account_history
ORDER by `Date` LIMIT 1) as a
ORDER BY `Date`) as b
ORDER by `Date` ASC
Grouping by YEAR(Date) doesn't seem to make the desired difference. I have tried everything I can think of, but it still seems to return daily rate of change even if you group by month or year, etc. I think I'm not using windowing correctly, but I can't seem to figure it out. If anyone knows of a good book about this sort of query I'd appreciate that also.
Thanks.
sqlfiddle example
Using what Lolo contributed, I have added some code so the data comes from the last day of the year, instead of the first. I also just need the Open_Equity, not the sum.
I'm still not certain I understand why this works, but it does give me what I was looking for. Using another select statement as a from seems to be the key here; I don't think I would have come up with this without Lolo's help. Thank you.
SELECT b.`yyyy`, b.Open_Equity,
concat('$',round(delta, 2)) as delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT (EXTRACT(YEAR FROM `Date`)) as `yyyy`,
(SUBSTRING_INDEX(GROUP_CONCAT(CAST(`Open_Equity` AS CHAR) ORDER BY `Date` DESC), ',', 1 )) AS `Open_Equity`
FROM tim_account_history GROUP BY `yyyy` ORDER BY `yyyy` DESC) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `yyyy` ) as b
ORDER by `yyyy` ASC

Try this:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - @pequity) as delta,
(Open_Equity - @pequity)/@pequity as delta_p,
(@pequity:= Open_Equity)
FROM (SELECT YEAR(`Date`) `Date`, SUM(Open_Equity) Open_Equity FROM tim_account_history GROUP BY YEAR(`Date`)) p
CROSS JOIN
(SELECT @pequity:= NULL) as a
ORDER BY `Date` ) as b
ORDER by `Date` ASC
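The reason the nested SELECT matters is that the work happens in two steps: first collapse the daily rows to one row per year, then compare each yearly row with the previous one. A rough Python sketch of those two steps follows. It assumes the year-end value (as in the asker's update) rather than the sum, and the function name and sample figures are hypothetical.

```python
from datetime import date


def yearly_change(history):
    """history: iterable of (date, open_equity) pairs.

    Returns a list of (year, year_end_equity, delta, delta_pct) tuples,
    with None where there is no previous year to compare against.
    """
    # Step 1: one value per year; later dates overwrite earlier ones,
    # so each year ends up holding its last recorded equity.
    last_of_year = {}
    for day, equity in sorted(history):
        last_of_year[day.year] = equity

    # Step 2: walk the years in order and carry the previous value
    # along, which is the role @pequity plays in the query.
    result = []
    prev = None
    for year in sorted(last_of_year):
        equity = last_of_year[year]
        delta = None if prev is None else equity - prev
        pct = None if not prev else delta / prev * 100
        result.append((year, equity, delta, pct))
        prev = equity
    return result


# Hypothetical sample data
history = [
    (date(2012, 12, 31), 100.0),
    (date(2013, 6, 30), 150.0),
    (date(2013, 12, 31), 120.0),
    (date(2014, 12, 31), 90.0),
]
changes = yearly_change(history)
```

Grouping by quarter or month would only change the key used in step 1, for example `(day.year, day.month)`.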

postgresql to mssql query conversion

Can anyone help me convert the following query, which currently works on PostgreSQL, to MSSQL?
The query takes the updated datetime of the report, in ascending order of date.
select
count(*) as count,
TO_CHAR(RH.updated_datetime,'dd-mm-YYYY') as date,
SUM(
extract (
epoch from (
RH.updated_datetime - PRI.procedure_performed_datetime
)
)
)/count(*) as average_reporting_tat
from
report R,
report_history RH,
study S,
procedure_runtime_information PRI,
priorities PP,
patient P,
procedure PR
where
RH.report_fk=R.pk and RH.pk IN (
select pk from (
select * from report_history where report_fk=r.pk order by revision desc limit 1
) as result
where old_status_fk IN (21, 27)
) AND R.study_fk = S.pk
AND S.procedure_runtime_fk = PRI.pk
AND PRI.procedure_fk = PR.pk
AND S.priority_fk = PP.pk
AND PRI.patient_fk = P.pk
AND RH.updated_datetime >= '2013-05-01'
AND RH.updated_datetime <= '2013-05-12'
group by date

If I read your query properly, your problem is that you need to list everything in the group by clause that is in your column list which is not part of an aggregate. So your group by needs to be:
GROUP BY RH.updated_datetime
If this doesn't fix it, please post the error message you are getting.
