Ranking algorithm using likes / dislikes and average views per day - mysql

I'm currently ranking videos on a website using a Bayesian ranking algorithm. Each video has:
likes
dislikes
views
upload_date
Anyone can like or dislike a video, a video's view count is incremented by 1 every time it is viewed, and all videos have a unique upload_date.
Data Structure
The data is in the following format:
| id | title | likes | dislikes | views | upload_date |
|------|-----------|---------|------------|---------|---------------|
| 1 | Funny Cat | 9 | 2 | 18 | 2014-04-01 |
| 2 | Silly Dog | 9 | 2 | 500 | 2014-04-06 |
| 3 | Epic Fail | 100 | 0 | 200 | 2014-04-07 |
| 4 | Duck Song | 0 | 10000 | 10000 | 2014-04-08 |
| 5 | Trololool | 25 | 30 | 5000 | 2014-04-09 |
Current Weighted Ranking
The following weighted ratio algorithm is used to rank and sort the videos so that the best rated are shown first.
This algorithm takes the Bayesian average into account to give a better overall ranking.
Weighted Rating (WR) = ((AV * AR) + (V * R)) / (AV + V)
AV = Average number of total votes
AR = Average rating
V = This item's number of combined votes (likes + dislikes)
R = This item's current rating (likes - dislikes)
Example current MySQL Query
SELECT id, title, (((avg_vote * avg_rating) + ((likes + dislikes) * (likes - dislikes))) / (avg_vote + (likes + dislikes))) AS score
FROM video
INNER JOIN (SELECT ((SUM(likes) + SUM(dislikes)) / COUNT(id)) AS avg_vote FROM video) AS t1
INNER JOIN (SELECT ((SUM(likes) - SUM(dislikes)) / COUNT(id)) AS avg_rating FROM video) AS t2
ORDER BY score DESC
LIMIT 10
Note: views and upload_date are not factored in.
The Issue
The ranking currently works well but it seems we are not making full use of all the data at our disposal.
Having likes, dislikes, views and an upload_date but only using two of them seems a waste, because views and upload_date are not factored in to decide how much weight each like / dislike should carry.
For example, in the Data Structure table above, items 1 and 2 both have the same number of likes / dislikes, however item 2 was uploaded more recently, so its average daily views are higher.
Since item 2 gathered the same likes and dislikes in a shorter time, surely those likes / dislikes should be weighted more strongly?
New Algorithm Result
Ideally the new algorithm with views and upload_date factored in would sort the data into the following result:
Note: avg_views would equal (views / days_since_upload)
| id | title | likes | dislikes | views | upload_date | avg_views |
|------|-----------|---------|------------|---------|---------------|-------------|
| 3 | Epic Fail | 100 | 0 | 200 | 2014-04-07 | 67 |
| 2 | Silly Dog | 9 | 2 | 500 | 2014-04-06 | 125 |
| 1 | Funny Cat | 9 | 2 | 18 | 2014-04-01 | 2 |
| 5 | Trololool | 25 | 30 | 5000 | 2014-04-09 | 5000 |
| 4 | Duck Song | 0 | 10000 | 10000 | 2014-04-08 | 5000 |
The above is a simple representation; with more data it gets a lot more complex.
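For reference, avg_views can be computed directly in MySQL along these lines (a sketch assuming upload_date is a DATE column; GREATEST floors the age at 1 day so a video uploaded today doesn't divide by zero):
SELECT id, title,
       -- days since upload, floored at 1 so same-day uploads don't divide by zero
       views / GREATEST(DATEDIFF(CURDATE(), upload_date), 1) AS avg_views
FROM video
ORDER BY avg_views DESC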
The question
So to summarise, my question is: how can I factor views and upload_date into my current ranking algorithm in a way that improves how videos are ranked?
I think calculating avg_views, as in the example above, is a good way to go, but where should that value be added into the ranking algorithm I already have?
It's possible that better ranking algorithms may exist, if this is the case then please provide an example of a different algorithm that I could use and state the benefits of using it.

Taking a straight percentage of views doesn't give an accurate representation of the item's popularity, either. Although 9 likes out of 18 is "stronger" than 9 likes out of 500, the fact that one video got 500 views and the other got only 18 is a much stronger indication of the video's popularity.
A video that gets a lot of views usually means that it's very popular across a wide range of viewers. That it only gets a small percentage of likes or dislikes is usually a secondary consideration. A video that gets a small number of views and a large number of likes is usually an indication of a video that's very narrowly targeted.
If you want to incorporate views in the equation, I would suggest multiplying the Bayesian average you get from the likes and dislikes by the logarithm of the number of views. That should sort things out pretty well.
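As a rough sketch of that suggestion (my own illustration built on the query from the question, not a tested answer; LN(views + 1) is used so a video with zero views doesn't produce LN(0)):
SELECT id, title,
       (((avg_vote * avg_rating) + ((likes + dislikes) * (likes - dislikes)))
           / (avg_vote + (likes + dislikes))) * LN(views + 1) AS score
FROM video
INNER JOIN (SELECT ((SUM(likes) + SUM(dislikes)) / COUNT(id)) AS avg_vote FROM video) AS t1
INNER JOIN (SELECT ((SUM(likes) - SUM(dislikes)) / COUNT(id)) AS avg_rating FROM video) AS t2
ORDER BY score DESC
LIMIT 10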
That is, unless you want to go with multi-factor ranking, where likes, dislikes, and views are each counted separately and given individual weights. The math is more involved and it takes some tweaking, but it tends to give better results. Consider, for example, that people will often "like" a video that they find mildly amusing, but they'll only "dislike" it if they find it objectionable. A dislike is a much stronger signal than a like.

I can point you to a non-parametric way to get the best ordering with respect to a weighted linear scoring system without knowing exactly what weights you want to use (just constraints on the weights). First though, note that average daily views might be misleading because movies are probably downloaded less in later years. So the first thing I would do is fit a polynomial model (degree 10 should be good enough) that predicts total number of views as a function of how many days the movie has been available.

Once you have your fit, then for each date you get a predicted total number of views, which is what you divide by to get a "relative average number of views": a multiplier that tells you how many times more likely (or less likely) the movie is to be watched compared to what you expect on average given the data. So 2 would mean the movie is watched twice as much, and 1/2 would mean the movie is watched half as much. If you want 2 and 1/2 to be "negatives" of each other, which sort of makes sense from a scoring perspective, then take the log of the multiplier to get the score.
Now, there are several quantities you can compute to include in an overall score, like the (log) "relative average number of views" I mentioned above, plus (likes / total views) and (dislikes / total views). US News and World Report ranks universities each year, and they just use a weighted sum of 7 different category scores to get an overall score for each university that they rank by. So using a weighted linear combination of category scores is definitely not a bad way to go. (Note that you may want to do something like a log transform on some categories before taking the linear combination of scores.)

The problem is you might not know exactly what weights to use to give the "most desirable" ranking. The first thing to note is that if you want the weights on the same scale, then you should normalize each category score so that it has standard deviation equal to 1 across all movies. Then, e.g., if you use equal weights, each category is truly weighted equally.

So the question is what kinds of weights you want to use. Clearly the weights for relative number of views and proportion of likes should be positive, and the weight for proportion of dislikes should be negative, so multiply the dislike score by -1 and then you can assume all weights are positive. If you believe each category should contribute at least 20%, then each weight is at least 0.2 times the sum of the weights. If you believe that dislikes are more important than likes, then you can say (dislike weight) >= c * (like weight) for some c > 1, or (dislike weight) >= c * (sum of weights) + (like weight) for some c > 0. Similarly you can define other linear constraints on the weights that reflect your beliefs about what the weights should be, without picking exact values for the weights.
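As a loose illustration of such a weighted linear combination (my own sketch, not the answerer's code: it uses raw average daily views in place of the polynomial-adjusted figure described above, the 0.4 / 0.4 / 0.2 weights are placeholders for whatever weights or constraints you settle on, and GREATEST guards against division by zero):
SELECT v.id, v.title,
       (  0.4 * LOG(v.views / GREATEST(DATEDIFF(CURDATE(), v.upload_date), 1) + 1) / s.sd_log_views
        + 0.4 * (v.likes    / GREATEST(v.views, 1)) / s.sd_like_rate
        - 0.2 * (v.dislikes / GREATEST(v.views, 1)) / s.sd_dislike_rate) AS score
FROM video v
CROSS JOIN (
    -- scale each category to standard deviation 1 across all videos
    SELECT STDDEV_POP(LOG(views / GREATEST(DATEDIFF(CURDATE(), upload_date), 1) + 1)) AS sd_log_views,
           STDDEV_POP(likes    / GREATEST(views, 1)) AS sd_like_rate,
           STDDEV_POP(dislikes / GREATEST(views, 1)) AS sd_dislike_rate
    FROM video
) s
ORDER BY score DESC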
Now here comes the fun part, which is the main thrust of my post. If you have linear inequality constraints on the weights, all of the form that a linear combination of the weights is greater than or equal to 0, but you don't know what weights to use, then you can simply compute all possible top-10 or top-20 rankings of movies that you can get for any choice of weights that satisfy your constraints, and then choose the top-k ordering which is supported by the largest VOLUME of weights, where the volume of weights is the solid angle of the polyhedral cone of weights which results in that particular top-k ordering. Then, once you've chosen the "most supported" top-k ranking, you can restrict the scoring parameters to be in the cone that gives you that ranking, remove the top k movies, and compute all possibilities for the next top-10 or top-20 ranking of the remaining movies when the weights are restricted to respect the original top-k movies' ranking.

Computing all obtainable top-k rankings of movies for restricted weights can be done much, much faster than enumerating all n(n-1)...(n-k+1) possible top-k rankings and trying them all out. If you have two or three categories then, using polytope construction methods, the obtainable top-k rankings can be computed in linear time in terms of the output size, i.e. the number of obtainable top-k rankings. The polyhedral computation approach also gives the inequalities that define the cone of scoring weights that give each top-k ranking, also in linear time if you have two or three categories. Then to get the volume of weights that give each ranking, you triangulate the cone, intersect it with the unit sphere, and compute the areas of the spherical triangles that you get. (Again, linear complexity if the number of categories is 2 or 3.)

Furthermore, if you scale your categories to be in a range like [0, 50] and round to the nearest integer, then you can prove that the number of obtainable top-k rankings is actually quite small if the number of categories is 5 or less (even if you have a lot of movies and k is high). And when you fix the ordering for the current top group of movies and restrict the parameters to be in the cone that yields that fixed top ordering, this further restricts the output size for the obtainable next-best top-k movies. The output size does depend (polynomially) on k, which is why I recommended setting k = 10 or 20: compute the top-k movies, choose the best (largest volume) ordering and fix it, then compute the next best top-k movies that respect the ordering of the original top-k, and so on.
Anyway if this approach sounds appealing to you (iteratively finding successive choices of top-k rankings that are supported by the largest volume of weights that satisfy your weight constraints), let me know and I can produce and post a write-up on the polyhedral computations needed as well as a link to software that will allow you to do it with minimal extra coding on your part. In the meantime here is a paper http://arxiv.org/abs/0805.1026 I wrote on a similar study of 7-category university ranking data where the weights were simply restricted to all be non-negative (generalizing to arbitrary linear constraints on weights is straightforward).

A simple approach would be to come up with a suitable scale factor for each average - and then sum the "weights". The difficult part would be tweaking the scale factors to produce the desired ordering.
From your example data, a starting point might be something like:
Weighted Rating = (AV * (1 / 50)) + (AL * 3) - (AD * 6)
Key & Explanation
AV = Average views per day:
5000 is high so divide by 50 to bring the weight down to 100 in this case.
AL = Average likes per day:
100 in 3 days = 33.33 is high so multiply by 3 to bring the weight up to 100 in this case.
AD = Average dislikes per day:
10,000 seems an extreme value here - would agree with Jim Mischel's point that dislikes may be more significant than likes so am initially going with a negative scale factor of twice the size of the "likes" scale factor.
This gives the following results (see SQL Fiddle Demo):
ID TITLE SCORE
-----------------------------
3 Epic Fail 60.8
2 Silly Dog 4.166866
1 Funny Cat 1.396528
5 Trololool -1.666766
4 Duck Song -14950
[Am deliberately keeping this simple to present the idea of a starting point - but with real data you might find linear scaling isn't sufficient - in which case you could consider bandings or logarithmic scaling.]
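For reference, the fiddle's query presumably looks something like the following (a sketch of the same formula rather than the exact fiddle code; the day count is floored at 1):
SELECT id, title,
       (  (views    / GREATEST(DATEDIFF(CURDATE(), upload_date), 1)) * (1 / 50)    -- AV, scaled down
        + (likes    / GREATEST(DATEDIFF(CURDATE(), upload_date), 1)) * 3           -- AL, scaled up
        - (dislikes / GREATEST(DATEDIFF(CURDATE(), upload_date), 1)) * 6) AS score -- AD, penalised
FROM video
ORDER BY score DESC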

Every video has:
likes
dislikes
views
upload_date
So we can deduce the following parameters from them (a SQL sketch follows this list):
like_rate = likes/views
dislike_rate = dislikes/views
view_rate = views/number_of_website_users
video_age = count_days(upload_date, today)
avg_views = views/video_age
avg_likes = likes/video_age
avg_dislikes = dislikes/video_age
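A sketch of how these parameters could be computed in one query (my own illustration; it assumes a users table for the user count and floors the age at 1 day to avoid division by zero):
SELECT v.id,
       v.likes    / GREATEST(v.views, 1) AS like_rate,
       v.dislikes / GREATEST(v.views, 1) AS dislike_rate,
       v.views    / u.user_count         AS view_rate,
       DATEDIFF(CURDATE(), v.upload_date) AS video_age,
       v.views    / GREATEST(DATEDIFF(CURDATE(), v.upload_date), 1) AS avg_views,
       v.likes    / GREATEST(DATEDIFF(CURDATE(), v.upload_date), 1) AS avg_likes,
       v.dislikes / GREATEST(DATEDIFF(CURDATE(), v.upload_date), 1) AS avg_dislikes
FROM video v
CROSS JOIN (SELECT COUNT(*) AS user_count FROM users) u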
Before we can set the formula to be used, we need to specify how video popularity should behave. One way is to list, point by point, the properties of a popular video:
A popular video is a recent one in most cases
The older a video gets, the higher avg_views it requires to become popular
A video with a like_rate over like_rate_threshold, or a dislike_rate over dislike_rate_threshold, can compete against its age by the margin above that threshold
A high view_rate for a video is a good indicator for suggesting that video to a user who has not watched it before
If avg_likes or avg_dislikes make up most of avg_views, the video is considered currently active; for active videos we don't really need to check how old they are
Conclusion: I don't have a ready formula, but one can be constructed by converting one unit into another's axis, for example discounting a video's age in days based on a calculation made from avg_likes, avg_dislikes, and avg_views.

Since no one has pointed it out yet (and I'm a bit surprised), I'll do it. The problem with any ranking algorithm we might come up with is that it's based on our point of view. What you're certainly looking for is an algorithm that accommodates the median user's point of view.
This is no new idea. Netflix had it some time ago, only they personalized it, basing theirs on individual selections. We are looking - as I said - for the median user's best ranking.
So how to achieve it? As others have suggested, you are looking for a function R(L,D,V,U) that returns a real number for the sort key. R() is likely to be quite non-linear.
This is a classical machine learning problem. The "training data" consists of user selections. When a user selects a movie, it's a statement about the goodness of the ranking: selecting a high-ranked one is a vote of confidence. A low-ranked selection is a rebuke. Function R() should revise itself accordingly. Initially, the current ranking system can be used to train the system to mirror its selections. From there it will adapt to user feedback.
There are several schemes and a huge research literature on machine learning for problems like this: regression modeling, neural networks, representation learning, etc. See for example the Wikipedia page for some pointers.
I could suggest some schemes, but won't unless there is interest in this approach. Say "yes" in comments if this is true.
Implementation will be non-trivial - certainly more than just tweaking your SELECT statement. But on the plus side you'll be able to claim your customers are getting what they're asking for in very good conscience!

Related

Multinomial Logistic Regression Predictors Set Up

I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race, using each horse's previous average speed.
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 42.034929 39.004813 43.830139 5
2 39.482081 42.199627 41.034929 41.004813 40.830139 4
I am stuck on how to handle the independent variables for each horse, given that any of the 5 horses' average speeds can be placed in any of H1_SPEED through H5_SPEED.
For each race I can put any of the 5 horses under H1_SPEED, meaning there is no real relationship between H1_SPEED from RACE_ID 1 and H1_SPEED from RACE_ID 2 other than the arbitrary position I selected.
Would there be any difference if the dataset looked like this -
For RACE_ID 1 I swapped H3_SPEED and H5_SPEED and changed WINNING_HORSE from 5 to 3
For RACE_ID 2 I swapped H4_SPEED and H1_SPEED and changed WINNING_HORSE from 4 to 1
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 43.830139 39.004813 42.034929 3
2 41.004813 42.199627 41.034929 39.482081 40.830139 1
Is this an issue, if so how should this be handled? What if I wanted to add more independent features per horse?
You cannot change your dataset in that way, because each feature (column) has a meaning and probably depends on the values of the other features. You can imagine each row as a point in a six-dimensional space: if you change the value of a feature, the position of the point changes; it does not remain stationary.
If you deem that a feature is useless for solving your problem (i.e. it is independent of the target), you can drop it or avoid using it during the training phase of your model.
Edit
To solve your specific problem you may add a parameter for each speed column that identifies the specific horse running at that speed. It is a sort of data augmentation, in order to add more problem-related features to your model.
RACE_ID H1_SPEED H1_HORSE H2_SPEED H2_HORSE ... WINNING_HORSE
1 40.482081 1 44.199627 2 ... 5
2 39.482081 3 42.199627 5 ... 4
I've invented the number associated with each horse, but it seems that this information is present in your dataset.

Sql Popularity algorithm with weighted score

I'm implementing an algorithm that returns the posts that are popular at the moment, based on their likes and dislikes.
To do this, for each post I add up all its likes (1) and dislikes (-1) to get its score, but each like/dislike is weighted: the more recent, the heavier. For example, at the moment a user likes a post, that like weighs 1. After 1 day it weighs 0.95 (or -0.95 if it's a dislike), after 2 days 0.90, and so on, down to a minimum of 0.01 reached after 21 days. (PS: these are approximate values.)
Here is how my tables are made:
Posts table
id | Title | user_id | ...
-------------------------------------------
1 | Random post | 10 | ...
2 | Another post | 36 | ...
n | ... | n | ...
Likes table
id | vote | post_id | user_id | created
----------------------------------------
1 | 1 | 2 | 10 | 2014-08-18 15:34:20
2 | -1 | 1 | 24 | 2014-08-15 18:54:12
3 | 1 | 2 | 54 | 2014-08-17 21:12:48
Here is the SQL query I'm currently using, which does the job:
SELECT Post.*,
       SUM(`Like`.vote *
           (1 - IF((TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21 > 0.99,
                   0.99,
                   (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
       ) AS score
FROM posts Post
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id)
GROUP BY Post.id
ORDER BY score DESC
PS: I'm using TIMESTAMPDIFF with MINUTE rather than DAY directly because I'm calculating the day count myself; otherwise it returns an integer and I want a float value, so the weight decays gradually over time and not day by day. So TIMESTAMPDIFF(MINUTE, Like.created, NOW()) / 60 / 24 just gives me the number of days passed since the like was created, including the decimal part.
Here are my questions:
Look at the IF(expr1, expr2, expr3) part: it is necessary in order to set a minimal value for the like's weight, so it will not go under 0.01 or become negative (that way a like, even an old one, still has a little weight). But I'm calculating the same thing twice: the age expression in expr1 is the same as expr3. Isn't there a way to avoid this duplicate expression?
I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on a big Post and Like table. Is the cache really necessary or not? I'm aiming to run this query on a table with 50,000 entries, each with around 200 associated likes (that makes a 10,000,000-row Like table).
Should I create an index in the Like table for post_id? And for created?
Thank you!
EDIT: Imagine a Post can have multiple tags, and each tag can belong to multiple posts. If I want to get popular Posts given a Tag or multiple Tags, I can't cache each query, as there is a large number of possible combinations. Is the query still viable in that case?
EDIT FOR FINAL SOLUTION: I finally did some tests. I created a table Post with 30 000 entries and Like with 250 000 entries.
Without indexes, the query was incredibly long (timed out at > 10 min), but with indexes on Post.id (primary), Like.id (primary) and Like.post_id it took ~0.5 s.
So I'm not caching the data, nor updating it every 5 minutes. If the table keeps growing, caching is still a possible solution (anything over 1 s is not acceptable).
2: I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on a big Post and Like table. Is the cache really necessary or not? I'm aiming to run this query on a table with 50,000 entries, each with around 200 associated likes (that makes a 10,000,000-row Like table).
10000 and 50000 are considered small on current hardware. With those table sizes you probably won't need any cache, unless the query will run several times per second.
Anyway, I would do a performance test before deciding to have a cache.
3: Should I create an index in the Like table for post_id? And for created?
I would create an index for (post_id, created, vote). That way the query can get all information from the index and doesn't need to read the table at all.
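For reference, assuming the table is literally named likes, that covering index could be created with:
CREATE INDEX idx_likes_post_created_vote ON likes (post_id, created, vote);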
Edit (response to comments):
An extra index will slow down inserts/updates slightly. In the end, the path you choose will dictate the characteristics of what you need in terms of CPU/RAM/Disk I/O.
If you have enough RAM for the DB so that you expect the entire Like table to be cached in RAM then you might be better off with an index on just post_id.
In terms of total load you need to consider the ratio between insert and select and the relative cost of insert and select with or without the index.
My gut feeling is that the total load will be lower with the index.
Regarding your question on concurrency (selecting and inserting simultaneously): what happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unnecessary things between the start of the insert and the commit, you should be fine.
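Regarding question 1, which isn't covered above: one possible way to avoid repeating the age expression is MySQL's LEAST(). A sketch, not a tested query (the alias is renamed to lk to sidestep the reserved word LIKE):
SELECT Post.*,
       -- weight = 1 - min(days / 21, 0.99), so it never drops below 0.01
       SUM(lk.vote *
           (1 - LEAST(TIMESTAMPDIFF(MINUTE, lk.created, NOW()) / 60 / 24 / 21, 0.99))
       ) AS score
FROM posts Post
LEFT JOIN likes lk ON Post.id = lk.post_id
GROUP BY Post.id
ORDER BY score DESC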

Selecting rows if the total sum of a row is equal to X

I have a table that holds items and their "weight" and it looks like this:
items
-----
id weight
---------- ----------
1 1
2 5
3 2
4 9
5 8
6 4
7 1
8 2
What I'm trying to get is a group of rows where sum(weight) is exactly X, while honouring the order in which they were inserted.
For example, if I were looking for X = 3, this should return:
id weight
---------- ----------
1 1
3 2
Even though the sum of ids 7 and 8 is 3 as well.
Or if I were looking for X = 7 should return
id weight
---------- ----------
2 5
3 2
Although the weights of ids 1, 3 and 6 also sum to 7.
I'm kind of lost with this problem and haven't been able to come up with a query that does even something similar, but thinking it through, it might get extremely complex for the RDBMS to handle. Could this be done with a query? If not, what's the best way I can query the database to get the minimum amount of data to work with?
Edit: As Twelfth says, I need to return the sum, regardless of the amount of rows it returns, so if I were to ask for X = 20, I should get:
id weight
---------- ----------
1 1
3 2
4 9
5 8
This could turn out to be very difficult in SQL. What you're attempting to do is solve the knapsack problem, which is non-trivial.
The knapsack problem is interesting from the perspective of computer science for many reasons:
The decision problem form of the knapsack problem (Can a value of at least V be achieved without exceeding the weight W?) is NP-complete, thus there is no possible algorithm both correct and fast (polynomial-time) on all cases, unless P=NP.
While the decision problem is NP-complete, the optimization problem is NP-hard, its resolution is at least as difficult as the decision problem, and there is no known polynomial algorithm which can tell, given a solution, whether it is optimal (which would mean that there is no solution with a larger V, thus solving the NP-complete decision problem).
There is a pseudo-polynomial time algorithm using dynamic programming.
There is a fully polynomial-time approximation scheme, which uses the pseudo-polynomial time algorithm as a subroutine, described below.
Many cases that arise in practice, and "random instances" from some distributions, can nonetheless be solved exactly.
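For illustration only, the exact-sum search itself can be written as a recursive CTE (a sketch assuming MySQL 8.0+; it enumerates candidate subsets in id order and does not settle which qualifying subset should be preferred when several match):
WITH RECURSIVE subsets AS (
    -- start a subset at each item that does not already exceed the target
    SELECT id AS last_id, weight AS total, CAST(id AS CHAR(200)) AS ids
    FROM items
    WHERE weight <= 3
    UNION ALL
    -- extend each subset with a later item, pruning branches that overshoot
    SELECT i.id, s.total + i.weight, CONCAT(s.ids, ',', i.id)
    FROM subsets s
    JOIN items i ON i.id > s.last_id
    WHERE s.total + i.weight <= 3
)
SELECT ids, total
FROM subsets
WHERE total = 3;  -- X = 3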

get row associated with the nearest result in mysql

I am trying to build a recipe scaling method in my app which will return nicer measurements adapted to the amount the user is serving.
For example, the recipe for 6 people calls for 1 cup of flour. If you scale that for one person, 1/6 a cup of flour is 2.5 tablespoons, which is a nicer way of saying it (why search for and dirty a measuring cup when you can just use a spoon?).
So I have in the db a weights table with weight in grams, corresponding measurement and amount.
e.g. for flour
amount | measure | grams
----------+-------------+---------
1 | cup | 160
1 | tbsp | 10
1 | pound | 454
In my app (preferably using ActiveRecord) I'm trying to get the best-fit measurement for each ingredient in the recipe.
#ingredients = Recipe.select('food_names.name,
ABS(ingredients.grams-weights.grams) as nearest_weight,
weights.amount,
weights.measure'
).joins(
{:ingredients=> :food_name},
{:food_names=> :weights}
).where(
"recipes.recipe_id", :recipe_id
).order(
:nearest_weight
).reverse_order
The nearest_weight expression finds the closest match among the weights in the database, but I need the weight and measure associated with that row, and at the moment I'm getting all the rows returned.
What I need to do is somehow limit nearest_weight to one row, and then fetch that row so I know what the weight and measure are, and I'm hoping I can do that all in one query.
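In raw SQL, the best-fit row for a single ingredient can be picked by ordering on the absolute difference and keeping the first row; the ActiveRecord call would wrap the same idea. A sketch (the 26.7 g figure and the food_name_id column are illustrative assumptions):
SELECT amount, measure, grams,
       ABS(grams - 26.7) AS nearest_weight   -- 26.7 g ≈ 1/6 cup of flour, the scaled ingredient amount
FROM weights
WHERE food_name_id = 1                       -- assumed foreign key to the ingredient's food_name
ORDER BY nearest_weight ASC
LIMIT 1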

Should I worry about 1B+ rows in a table?

I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to keep track of daily views, each day, for every article. If I have 1,000 user-written articles, the number of rows would compute to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let's say the number of articles grows to 1M and, as time passes, we accumulate 3 years of data. The number of rows would then compute to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, over time, this table will keep growing, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs commonly handle situations like this?
I plan on using the views data in our reports, broken down by month or even by year. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5-minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------|------
2011 | 12 | 1 | 5 | day
2011 | 12 | 2 | 7 | day
2011 | 12 | 3 | 10 | day
2011 | 12 | 4 | 50 | day
You would then, at regular intervals, create new record(s) that summarise these data; in this example, just the total count for the month:
year | month | day | count | type
-----+-------+-----+-------|------
2011 | 12 | 0 | 72 | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course you may need some flag to indicate the "summarised" status of the data, in this case I've used a 'type' column for finding the "raw" records and the processed records, allowing you to purge out the day records as required.
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, max(day), sum(count), 'month'
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month, type
(I haven't tested that query, it's just an example)
The answer is "it depends", but yes, it will probably be a lot to deal with.
However, this is generally a "cross that bridge when you need to" problem. It's a good idea to think about what you could do if this becomes a problem for you in the future, but it's probably too early to actually implement any suggestions until they're necessary.
My suggestion, if it ever occurs, is to not keep the individual records for longer than X-months (where you adjust X according to your needs). Instead, you'd store the aggregated data that you currently feed into your reports. What you'd do is run, say, a daily script that looks at your records and grabs any that are over X months old... and create a "daily_stats" object of some sort, then delete the originals (or better yet, archives them somewhere).
This will ensure that only X-months worth of data are ever in the db - but you still have quick access to an aggregated form of the stats for long-timeline reports.
It's not something you need to worry about if you can put some practices in place.
Partition the table; this should make archiving easier to do (a DDL sketch follows this list)
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table has the right build, perhaps in terms of data types and indexes
Schedule for a time when you will archive partitions that meet the aging requirements
Schedule for index checking (and other table checks)
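As a sketch of the partitioning idea (hypothetical DDL; it assumes the table can use a composite primary key, since MySQL requires the partitioning column to be part of every unique key):
CREATE TABLE article_views (
    article_id  INT      NOT NULL,
    year        SMALLINT NOT NULL,
    month       TINYINT  NOT NULL,
    day         TINYINT  NOT NULL,
    views_count INT      NOT NULL DEFAULT 0,
    PRIMARY KEY (article_id, year, month, day)
)
PARTITION BY RANGE (year) (
    PARTITION p2011    VALUES LESS THAN (2012),
    PARTITION p2012    VALUES LESS THAN (2013),
    PARTITION p2013    VALUES LESS THAN (2014),
    PARTITION pfuture  VALUES LESS THAN MAXVALUE
);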
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
Also, as is done in many data warehouses, and as @Taryn's post (which I agree with) suggests, store aggregated data as well. What to aggregate follows quickly from the data you keep in the table involved. If you have to deal with possible editing/updating of records, then that makes it even clearer that you will have to set restrictions, such as how much raw data to keep (i.e. the window of data that can still be modified), and have procedures and jobs in place to ensure that the aggregated data is checked/updated daily and can be updated/checked manually when any changes are made. This way, data integrity is maintained. Discuss with your DBA what other approaches you can take.
By the way, in case you didn't already know: aggregated data is normally needed for weekly or monthly reports, and many other reports based on an interval. Choose the granularity of your aggregation as needed, but not so fine that it becomes tedious or excessive.