I'm attempting to create a reddit-style score degradation system for entries on a system. I've got a MySQL view set up to calculate the total "Score" (sum of all up/down votes). I'm having trouble creating a simple but effective system for moving entries down the page (so that newer entries end up at the top, but a high score can move entries to the top that would otherwise have aged off)...
Here's the closest bit of SQL I've been able to create thus far:
(SUM(v.Score) - (TIMESTAMPDIFF(MINUTE, t.Genesis, NOW()) *
IF(TIMESTAMPDIFF(MINUTE, t.Genesis, NOW()) > 1440,
0.1, 0.003))
) as "Weight",
v.Score is a 1 or a -1 dependent on user votes. t.Genesis is the timestamp on the entry itself.
Any help or suggestions would be appreciated.
One solution could be to use a sort of exponential decay for the relevance of time as a ranking parameter. For example:
SELECT
article, ranking
FROM (
SELECT
article,
(upvotes + downvotes) AS Total,
(upvotes - downvotes) AS Score,
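-- column aliases (Score, Total) cannot be reused inside the same SELECT list, so the expressions are repeated
(EXP(-(Published - Genesis) * Constant / 86400) * ((upvotes - downvotes) / (upvotes + downvotes))) AS Ranking
-- MySQL requires an alias for a derived table
FROM Table) AS ranked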
ORDER BY ranking DESC
Here Published is the time of publishing, Genesis is some really early date, and Constant is a scaling factor that determines how quickly the weight drops toward zero.
For example, if you want posts to keep only a very small score advantage (say 0.1) after 7 days, then -ln(0.1) / 7 is your Constant.
Score / Total gives the average rating rather than the absolute value, and 86400 is one day in seconds (assuming that's how you're measuring your time).
Once again, apologies for my lack of knowledge of SQL functions. I know that EXP is definitely available; only the time-difference expression needs adjusting so that it yields the difference in seconds.
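To make the choice of Constant concrete, here is a minimal Python sketch of the same idea. It is not a translation of the query above: it assumes the decay is applied to a post's age in days measured from now, and the function names are made up for illustration.

```python
import math

def decay_constant(target_weight: float, days: float) -> float:
    """Constant k such that exp(-k * days) == target_weight."""
    return -math.log(target_weight) / days

def decayed_ranking(upvotes: int, downvotes: int, age_days: float, k: float) -> float:
    """Average rating (Score / Total) scaled by an exponential age penalty."""
    total = upvotes + downvotes
    if total == 0:
        return 0.0  # no votes yet: avoid dividing by zero
    score = upvotes - downvotes
    return math.exp(-k * age_days) * (score / total)

k = decay_constant(0.1, 7)               # weight falls to 0.1 after 7 days
fresh = decayed_ranking(8, 2, 0, k)      # 0.6, no age penalty yet
week_old = decayed_ranking(8, 2, 7, k)   # about 0.06
```

With these numbers a week-old post keeps a tenth of the ranking it would have had when new.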
You can implement the same ranking algorithm as Hacker News:
Implementing the Hacker News ranking algorithm in SQL
OMG Ponies' solution:
SELECT x.*
FROM POSTS x
JOIN (SELECT p.postid,
SUM(v.vote) AS points
FROM POSTS p
JOIN VOTES v ON v.postid = p.postid
GROUP BY p.postid) y ON y.postid = x.postid
ORDER BY (y.points - 1)/POW(((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(x.timestamp))/3600)+2, 1.5) DESC
LIMIT n
x.timestamp is your t.Genesis, v.vote is your v.Score
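To see how the ORDER BY expression behaves, here is the same formula as a small Python sketch. The function name and the gravity parameter name are my own labels for illustration; the 1.5 exponent and the +2 offset come straight from the query above.

```python
def hn_rank(points: int, age_hours: float, gravity: float = 1.5) -> float:
    """(points - 1) / (age_hours + 2) ** gravity, as in the ORDER BY above."""
    return (points - 1) / (age_hours + 2) ** gravity

new_post = hn_rank(points=5, age_hours=0)     # 4 / 2 ** 1.5
old_post = hn_rank(points=50, age_hours=48)   # 49 / 50 ** 1.5
```

Here the brand-new post with 5 points outranks the two-day-old post with 50 points, which is the kind of aging-off the question asks for.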
I have the following two tables:
movie_sales (provided daily)
movie_id
date
revenue
movie_rank (provided every few days or weeks)
movie_id
date
rank
The tricky thing is that every day I have data for sales, but only data for ranks once every few days. Here is an example of sample data:
`movie_sales`
- titanic (ID), 2014-06-01 (date), 4.99 (revenue)
- titanic (ID), 2014-06-02 (date), 5.99 (revenue)
`movie_rank`
- titanic (ID), 2014-05-14 (date), 905 (rank)
- titanic (ID), 2014-07-01 (date), 927 (rank)
And, because the movie_rank.date of 2014-05-14 is closer to the two sales dates, the output should be:
id       date        revenue  closest_rank
titanic  2014-06-01  4.99     905
titanic  2014-06-02  5.99     905
The following query works to get the results by getting the min date difference in the sub-select:
SELECT
id,
date,
revenue,
(SELECT rank from movie_rank where id=s.id ORDER BY ABS(DATEDIFF(date, s.date)) ASC LIMIT 1)
FROM
movie_sales s
But I'm afraid that this would have terrible performance as it will literally be doing millions of subselects...on millions of rows. What would be a better way to do this, or is there really no proper way to do this since an index can not be properly done with a DATEDIFF ?
Unfortunately, you are right. The movie rank table must be searched for each movie sale, and of all matching rows for that movie the closest one must be picked.
With an index on movie_rank(id) the DBMS finds the movie rows quickly, but an index on movie_rank(id, date) would be better, because the date could be read from the index and only the one best match would be read from the table.
But you also say that there are new ranks every few days. If a rank is guaranteed to exist within a certain range, e.g. for each date there will be at least one rank in the twenty days before and at least one rank in the twenty days after, you can limit the search accordingly. (The index on movie_rank(id, date) would be essential for this, though.)
SELECT
id,
date,
revenue,
(
select r.rank
from movie_rank r
where r.id = s.id
and r.date between s.date - interval 20 day
and s.date + interval 20 day
order by abs(datediff(r.date, s.date)) asc
limit 1
)
FROM movie_sales s;
This is difficult to get quick with SQL. In a programming language I would choose this algorithm:
1. Sort the two tables by date and point to the first rows.
2. Move the rank pointer forward until we match the sales date or are beyond it. (If we aren't there already.)
3. Compare the sales date with the rank date we are pointing at and with the rank date of the previous row. Take the closer one.
4. Move the sales pointer one row forward.
5. Go to 2.
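The pointer algorithm above could look roughly like this in Python. This is only a sketch for a single movie, assuming both lists are already sorted by date and the rank list is not empty; the function name and the tuple layout are made up for illustration. On a tie it keeps the earlier rank.

```python
from datetime import date

def closest_ranks(sales, ranks):
    """For each (date, revenue) sale, pick the rank whose date is nearest.

    Both lists hold rows for a single movie and must be sorted by date.
    """
    result = []
    i = 0  # rank pointer
    for sale_date, revenue in sales:
        # Advance until the rank date reaches or passes the sale date.
        while i < len(ranks) - 1 and ranks[i][0] < sale_date:
            i += 1
        best = ranks[i]
        # The previous rank may be closer than the one we stopped at.
        if i > 0:
            prev = ranks[i - 1]
            if abs((sale_date - prev[0]).days) <= abs((best[0] - sale_date).days):
                best = prev
        result.append((sale_date, revenue, best[1]))
    return result

sales = [(date(2014, 6, 1), 4.99), (date(2014, 6, 2), 5.99)]
ranks = [(date(2014, 5, 14), 905), (date(2014, 7, 1), 927)]
rows = closest_ranks(sales, ranks)  # both sales get rank 905
```

Because the rank pointer only ever moves forward, each list is walked once after sorting.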
With this algorithm we would already be in about the position we want to be. Let's see if we can do the same with SQL. Iterations are done with recursive queries in SQL. These are available in MySQL as of version 8.0.
We start with sorting the rows, i.e. giving them numbers. Then we iterate through both data sets.
with recursive
sales as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_sales
),
ranks as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_rank
),
cte (movie_id, revenue, srn, rrn, sdate, rdate, rrank, closest_rank) as
(
select
movie_id, s.revenue, s.rn, r.rn, s.date, r.date, r.ranking,
case when s.date <= r.date then r.ranking end
from (select * from sales where rn = 1) s
join (select * from ranks where rn = 1) r using (movie_id)
union all
select
cte.movie_id,
cte.revenue,
coalesce(s.rn, cte.srn),
coalesce(r.rn, cte.rrn),
coalesce(s.date, cte.sdate),
coalesce(r.date, cte.rdate),
coalesce(r.ranking, cte.rrank),
case when coalesce(r.date, cte.rdate) >= coalesce(s.date, cte.sdate) then
case when abs(datediff(coalesce(r.date, cte.rdate), coalesce(s.date, cte.sdate))) <
abs(datediff(cte.rdate, coalesce(s.date, cte.sdate)))
then coalesce(r.ranking, cte.rrank)
else cte.rrank
end
end
from cte
left join sales s on s.movie_id = cte.movie_id and s.rn = cte.srn + 1 and cte.closest_rank is not null
left join ranks r on r.movie_id = cte.movie_id and r.rn = cte.rrn + 1 and cte.rdate < cte.sdate
where s.movie_id is not null or r.movie_id is not null
-- where cte.closest_rank is null
)
select
movie_id,
sdate,
revenue,
closest_rank
from cte
where closest_rank is not null;
(BTW: I named the column ranking, because rank is a reserved word in SQL.)
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=e994cb56798efabc8f7249fd8320e1cf
This is probably still slow. The reason for this is: there are no pointers to a row in SQL. If we want to go from row #1 to row #2, we must search that row, while in a programming language we would really just move the pointer one step forward. If the tables had an ID, we could build a chain (next_row_id) instead of using row numbers. That could speed this process up. But well, I guess you already notice: this is not an algorithm made for SQL.
Another approach... Avoid the problem by cleansing the data.
Make sure the rank is available for every day. When a new date comes in, find the previous rank, then fill in all the rows for the intervening days.
(This will take some initial effort to 'fix' all the previous missing dates. After that, it is a small effort when a new list of ranks comes in.)
The "report" would be a simple JOIN on the date. You would probably need a 2-column INDEX(movie_id, date) or something like that.
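As a rough illustration of the fill-in step, here is a Python sketch that expands sparse ranks into one row per day by carrying the previous rank forward. The function name and the last_day parameter are assumptions made for the example; in practice this would be an INSERT into the rank table.

```python
from datetime import date, timedelta

def fill_daily_ranks(ranks, last_day):
    """Expand sparse (date, rank) rows into one row per day, carrying
    the previous rank forward until the next known rank or last_day."""
    filled = {}
    for idx, (start, rank) in enumerate(ranks):
        # Stop just before the next known rank, or just after last_day.
        if idx + 1 < len(ranks):
            end = ranks[idx + 1][0]
        else:
            end = last_day + timedelta(days=1)
        day = start
        while day < end:
            filled[day] = rank
            day += timedelta(days=1)
    return filled

ranks = [(date(2014, 5, 14), 905), (date(2014, 7, 1), 927)]
daily = fill_daily_ranks(ranks, last_day=date(2014, 7, 3))
```

Note that this uses the previous rank, not the closest one, so a sale one day before a new rank still gets the older value.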
The ultimate solution would be not to calculate all the ranks every time, but to store them (in a new column, or even in a new table if you don't want to change existing tables).
Each time you update, you could look for sales data without a rank and calculate it only for those rows.
With the above approach you always get the last available rank BEFORE the sales date (e.g. if you have ranks from 14 days before and 1 day after, the one before would still be used).
If you strictly need the ranking closest in time, then you also need to run the UPDATE for newly arrived ranking info. I believe it would still be more efficient in the long run.
I have a line with days on the X-axis and balance on the Y-axis.
I need to find the steepest slope.
I am thinking of trying
SELECT (MAX(balance)-MIN(balance)) / DATEDIFF(MAX(date),MIN(date)) AS time
FROM account
GROUP BY account_id
Does this work?
Does anyone have a simple way to solve it?
Thanks in advance :)
I don't think your answer will work.
It will get the max date, max balance, and min date, and min balance for a given account. Your max date will be the most recent date, but max balance may be from months ago. So your growth over time gets messed up. Your balance may be $1 on day 1, $1M on day 2, and $2 on day 1000. Your math will show $1M - $1 over 1000 days. I think you want it to show $2 - $1 over 1000 days.
I prefer the following. You may have to check it for syntax, but the logic is there. Basically, you partition the rows by account and number the balances within each account in date order. In the first subquery the most recent balance for each account gets row number 1, and the WHERE clause keeps only that row. You then join that to a second subquery that does the same thing with the order reversed, so it keeps only the oldest balance. So now you have the oldest and newest balance for each account.
Then you can select the newest balance less the oldest balance, and divide that by the number of days between those balances. The numerator is the growth, the denominator is the time. The largest growth factor that comes out will be your answer: the steepest growth over time.
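-- Window functions such as ROW_NUMBER() need MySQL 8.0 or later.
SELECT
mostRecentBalances.account_id,
mostRecentBalances.balance AS mostRecentBalance,
mostRecentBalances.date AS recentDate,
oldestBalances.balance AS oldestBalance,
oldestBalances.date AS oldestDate,
-- NULLIF avoids a division by zero for accounts with a single balance date
(mostRecentBalances.balance - oldestBalances.balance)
/ NULLIF(DATEDIFF(mostRecentBalances.date, oldestBalances.date), 0) AS growthFactor
FROM
-- newest balance per account
(SELECT account_id, balance, date
FROM (SELECT account_id, balance, date,
ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY date DESC) AS RowNumber
FROM account) newestFirst
WHERE RowNumber = 1) mostRecentBalances
INNER JOIN
-- oldest balance per account
(SELECT account_id, balance, date
FROM (SELECT account_id, balance, date,
ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY date ASC) AS RowNumber
FROM account) oldestFirst
WHERE RowNumber = 1) oldestBalances
ON mostRecentBalances.account_id = oldestBalances.account_id
ORDER BY
growthFactor DESC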
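To check the $1 / $1M / $2 example numerically, here is a small Python sketch comparing the two calculations for one account. The function names are invented for the example, and the dates are arbitrary ones spaced so that the last balance falls on day 1000.

```python
from datetime import date, timedelta

def growth_per_day(history):
    """(newest balance - oldest balance) / days between them."""
    ordered = sorted(history)  # sort (date, balance) rows by date
    (first_day, first_bal), (last_day, last_bal) = ordered[0], ordered[-1]
    days = (last_day - first_day).days
    return (last_bal - first_bal) / days if days else None

def naive_growth(history):
    """The MAX/MIN version from the question, for comparison."""
    balances = [b for _, b in history]
    dates = [d for d, _ in history]
    days = (max(dates) - min(dates)).days
    return (max(balances) - min(balances)) / days if days else None

start = date(2020, 1, 1)
history = [
    (start, 1),                                # day 1
    (start + timedelta(days=1), 1_000_000),    # day 2
    (start + timedelta(days=999), 2),          # day 1000
]
endpoint_slope = growth_per_day(history)   # 1 / 999 per day
max_min_slope = naive_growth(history)      # 999999 / 999 per day
```

The MAX/MIN version reports growth of about $1001 per day, while the endpoint version reports about a tenth of a cent per day.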
I have a SQL question. First of all I'd like to know is it even possible with just SQL, and if not does anyone know a good workaround.
We are building a site, where users can vote for videos.
The users can vote by SMS or directly on site after Facebook authentication.
We have to make a top list of all videos, and calculate the "position" on the list for each video.
So far, we have done that with a simple subquery, something like this:
SELECT v.video_id AS id,
(SELECT (COUNT(*)+1) FROM videos AS v2
WHERE (v2.SMS_votes + v2.facebook_votes) > (v.SMS_votes + v.facebook_votes)) AS total_position
FROM videos AS v
SMS_votes and facebook_votes are aggregated fields. There are separate tables for each kind of votes, with records for each vote, including the time the vote has been set.
This works fine, the positions are calculated... if 2 or more videos have the same number of votes, they "share" the position.
Unfortunately there can be no position sharing, and we have to resolve it by the following rules:
if 2 videos have the same number of votes, the one with more SMS votes has the advantage
if they also have the same number of SMS votes, the one which has more SMS votes in the last hour has the advantage
if they also have the same number of SMS votes in the last hour, they are compared by the hour before, and recursively like that, until there is a difference between the two
Is it possible to do this kind of recursive ordering only in SQL, or do we have to resolve this manually in code? All ideas are welcomed. Just to note, performance is important here, because the top list is used all over the site.
I don't think it's feasible to perform this kind of ordering with a recursive calculation (which is potentially unbounded), but if you're willing to limit the amount of time you look back, there are ways it could be done.
Here's one possibility.
SELECT video_id,
SMS_votes + facebook_votes AS total_votes,
SMS_votes,
COUNT(CASE WHEN time > NOW() - INTERVAL 1 HOUR THEN 1 END) AS h1,
COUNT(CASE WHEN time > NOW() - INTERVAL 2 HOUR THEN 1 END) AS h2,
COUNT(CASE WHEN time > NOW() - INTERVAL 3 HOUR THEN 1 END) AS h3
FROM videos
-- LEFT JOIN keeps videos that have no SMS votes at all (their hourly counts are 0)
LEFT JOIN SMS_votes USING(video_id)
GROUP BY video_id
ORDER BY total_votes DESC, SMS_votes DESC, h1 DESC, h2 DESC, h3 DESC;
This assumes you have a table called SMS_votes tracking each vote, with a video_id field and a time field.
For each video, it calculates the total votes, the SMS votes, the SMS votes in the past hour, the past two hours, and the past three hours. It then does an ORDER BY on all those values to get the correct position.
It's fairly easy to extend this to include a wider range of hours, but you might also want to consider using an increasing time range as you go back in time. For example, you first look at votes in the past hour, then the past day, then the past week, etc. I suspect that would lower your chance of videos having the same votes without having to add as many extra calculations.
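If you end up resolving the ties in application code instead, the same multi-level ordering is just a sort on a tuple. A minimal Python sketch follows, assuming each video carries a list of per-hour SMS counts (most recent hour first) rather than the cumulative h1/h2/h3 columns used above; the dict layout and function name are made up for illustration.

```python
def rank_videos(videos):
    """Sort by total votes, then SMS votes, then SMS votes per recent hour.

    Each video is a dict with 'id', 'sms', 'facebook' and 'hourly', where
    hourly[0] is the SMS count for the last hour, hourly[1] the hour before, etc.
    """
    def key(v):
        return (v["sms"] + v["facebook"], v["sms"], tuple(v["hourly"]))

    ordered = sorted(videos, key=key, reverse=True)
    return {v["id"]: pos for pos, v in enumerate(ordered, start=1)}

videos = [
    {"id": "a", "sms": 5, "facebook": 5, "hourly": [1, 2, 0]},
    {"id": "b", "sms": 5, "facebook": 5, "hourly": [1, 3, 0]},
    {"id": "c", "sms": 6, "facebook": 4, "hourly": [0, 0, 0]},
    {"id": "d", "sms": 9, "facebook": 9, "hourly": [0, 0, 0]},
]
positions = rank_videos(videos)  # d=1, c=2, b=3, a=4
```

Tuples compare element by element, so the hourly counts are only consulted when the totals and SMS counts are equal. Videos that tie on everything keep their input order, so a final tie-breaker such as the video id would still be needed to make positions fully deterministic.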
SQL Fiddle example
I am trying to calculate moving averages crossover with variable dates.
My database is structured:
id
stock_id
date
closing_price
And:
stock_id
symbol
For example, I'd like to find out if the average price going back X days ever gets greater than the average price going back Y days within the past Z days. Each of those time periods is variable. This needs to be run for every stock in the database (about 3000 stocks with prices going back 100 years).
I'm a bit stuck on this. What I currently have is a mess of SQL subqueries that don't work because they can't account for the fact that X, Y, and Z can all be any value (0-N). That is, in the past 5 days I could be looking for a stock where the 40-day average is greater than the 5-day average, or the 5-day greater than the 40-day. Or I could be looking over the past 40 days to find stocks where the 10-day moving average is greater than the 30-day moving average.
This question is different from the other questions because the short and long periods are variable, as is the term.
Please see these earlier posts on Stack Overflow:
How to calculated multiple moving average in MySQL
Calculate moving averages in SQL
These posts have solutions to your question.
I think the most direct way to do a moving average in MySQL is using a correlated subquery. Here is an example:
select p.*,
(select avg(closing_price)
from prices p2
where p2.stock_id = p.stock_id and
p2.date between p.date - interval x day and p.date
) as MvgAvg_X,
(select avg(closing_price)
from prices p2
where p2.stock_id = p.stock_id and
p2.date between p.date - interval y day and p.date
) as MvgAvg_Y
from prices p
You need to fill in the values for x and y.
For performance reasons, you will want an index on prices(stock_id, date, closing_price).
If you have an option for another database, Oracle, Postgres, and SQL Server 2012 all offer much better performing solutions for this problem.
In Postgres, you can write this as:
select p.*,
avg(p.closing_price) over (partition by stock_id order by date rows x preceding) as AvgX,
avg(p.closing_price) over (partition by stock_id order by date rows y preceding) as AvgY
from prices p
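To show how the X, Y and Z parameters fit together, here is a Python sketch for a single stock. It assumes one closing price per row in date order and counts rows rather than calendar days, and it checks whether the X average is above the Y average on any of the last Z rows (not whether a strict crossover event happened). The function names are made up for illustration.

```python
def moving_average(prices, window):
    """Trailing average over the last `window` prices; None until enough data."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

def crossed_above(prices, x, y, z):
    """True if the x-row average exceeded the y-row average on any of the last z rows."""
    if z <= 0:
        return False
    avg_x = moving_average(prices, x)
    avg_y = moving_average(prices, y)
    recent = list(zip(avg_x, avg_y))[-z:]
    return any(a is not None and b is not None and a > b for a, b in recent)

rising = [10, 10, 10, 10, 11, 12, 13]
falling = [13, 12, 11, 10, 10, 10, 10]
rising_hit = crossed_above(rising, x=2, y=4, z=3)     # True
falling_hit = crossed_above(falling, x=2, y=4, z=3)   # False
```

Running this per stock keeps X, Y and Z as ordinary parameters, which is the part that is awkward to express with fixed subqueries.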
I want to implement a 'logarithmic' score-decay based on aging, and I'm trying to figure out the SUM/LOG combination. Here you have a simplified version of the current query:
SELECT SUM(1) as total_score FROM posts
JOIN votes ON votes.post_id = posts.post_id
WHERE 1
GROUP BY post_id
ORDER BY total_score DESC
I'm currently doing SELECT 'SUM(1) as total_score' but I want to modify the query to take the date/age of the vote into consideration, so that a vote from today has a weight of 1, a vote from 15 days ago has a weight close to .8, and a vote from 30 days ago has a weight close to 0. I'm storing the date field on the votes table (vote_date) as a unix_timestamp.
I'm not really concerned about the WHERE clause; that's pretty straightforward. What I'm trying to figure out is the logarithmic aging part.
I think there are two parts to your answer. First, the weighting function and then the SQL implementation.
Weighting function:
According to your graph, you don't want a log weight but rather a parabolic one.
From this you have to solve
Xc = y
where
X = [1 1 1 ;
15^2 15 1;
30^2 30 1];
and
y = [1;.8;0];
you get c = X^(-1)y or in matlab
c = X\y
Now you have the appropriate weights of the quadratic function you depicted, namely y = ax^2+bx+c with (a,b,c) = (-.0013,.0073,.9941).
SQL part:
Your SELECT statement should look like this (assuming the column of interest is named "age"):
SELECT (-.0013*age*age + .0073*age + .9941) as age_weighted
Hope it helps
Cheers
Here's the complete Matlab code (also to double-check the solution):
X = [1 1 1 ;
15^2 15 1;
30^2 30 1];
y = [1;.8;0];
c = X\y;
x= (1:30)';
y = [x.^2 x ones(30,1)]*c;
figure(1)
clf;hold on
plot(x,y)
plot([1 15 30],[1 .8 0],'o')
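If you don't have Matlab at hand, the same 3x3 system can be solved with a few lines of plain Python. This is just a sketch using Gauss-Jordan elimination on exact fractions in place of the X\y solve; the function name is made up, and the three points are the ones from the question.

```python
from fractions import Fraction

def fit_quadratic(points):
    """Coefficients (a, b, c) of y = a*x^2 + b*x + c through three points,
    found by Gauss-Jordan elimination on exact fractions."""
    # Augmented matrix rows: [x^2, x, 1 | y]
    rows = [[Fraction(x) ** 2, Fraction(x), Fraction(1), Fraction(str(y))]
            for x, y in points]
    for col in range(3):
        # Pick a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, 3) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Clear this column in every other row.
        for r in range(3):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return tuple(float(rows[i][3]) for i in range(3))

a, b, c = fit_quadratic([(1, 1), (15, 0.8), (30, 0)])
```

Rounded to four decimals this reproduces the (-.0013, .0073, .9941) coefficients quoted above.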
Suppose you have a function WEIGHT(age) that gives the weight of a vote that's age days old.
Then your query would be
SELECT SUM(WEIGHT(DATEDIFF(CURRENT_DATE, votes.date_vote_cast))) as total_score,
posts.post_id
FROM posts
JOIN votes ON votes.post_id = posts.post_id
WHERE votes.date_vote_cast <= CURRENT_DATE
AND votes.date_vote_cast > CURRENT_DATE - INTERVAL 30 DAY
GROUP BY post_id
ORDER BY total_score DESC
I am afraid I don't know exactly what function you want for WEIGHT(age). But you do, and you can work it out.
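For illustration only, here is what the whole calculation could look like in Python for one post, using the rounded quadratic coefficients from the other answer as a stand-in for WEIGHT(age). That choice, the clamping to the 0..1 range, and the function names are assumptions of this sketch, not part of the query above.

```python
def weight(age_days: float) -> float:
    """Quadratic fit through (1, 1), (15, 0.8), (30, 0); clamped to [0, 1]."""
    value = -0.0013 * age_days ** 2 + 0.0073 * age_days + 0.9941
    return max(0.0, min(1.0, value))

def total_score(vote_ages):
    """Sum of weights for votes cast within the last 30 days."""
    return sum(weight(age) for age in vote_ages if 0 <= age < 30)

# Four votes on one post: 1, 15, 29 and 45 days old.
score = total_score([1, 15, 29, 45])
```

The 45-day-old vote is filtered out, just as the WHERE clause in the query drops votes older than 30 days.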
I haven't done the SQL part, but I found a function that will provide the decay you are after, mathematically at least:
y=(sqrt(900-(x^2)))/30
or in your case
score=(sqrt(900-(days^2)))/30
Hope it can help!
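A quick Python sketch of that function, in case you want to check its values before porting it to SQL. Treating votes of 30 days or older as weight 0 and clamping negative ages to 0 are assumptions I added to keep the square root defined; the function name is made up.

```python
import math

def circular_weight(days: float) -> float:
    """sqrt(900 - days^2) / 30, treated as 0 once a vote is 30+ days old."""
    days = max(0.0, days)
    if days >= 30:
        return 0.0
    return math.sqrt(900 - days ** 2) / 30

today = circular_weight(0)        # 1.0
mid_month = circular_weight(15)   # about 0.87
month_old = circular_weight(30)   # 0.0
```

Note that at 15 days this gives roughly 0.87 rather than the 0.8 asked for, so it matches the shape of the curve more than the exact midpoint.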