I am building a small app where users can create collections, and I want to include a rating system. To cover all my bases, let's pretend that I have a lot of visitors, so performance comes into play, especially with the ratings.
Let's suppose that I have a rates table with id, game_id, user_id and rate. The data is simple: one entry per user per game. Let's also suppose that 1000 users will rate one game, and I want to print the average rate on that game's subpage (and elsewhere, like on the games list). For now, I see two scenarios:
Getting AVG each time the game is displayed.
Creating another column in games, called temprate, and storing the rate for the game there. It would be updated every time someone votes.
Both scenarios have obvious flaws. The first is harder on my host, since it will definitely consume more of the machine's resources. The second means more work while rating (getting all the game data, submitting the rate, getting the new AVG).
Please advise me: which scenario should I go with? Or maybe you have some other ideas?
I work with PDO and no framework.
So I've finally managed to solve this issue. I used file caching based on dumping arrays into files; I just go with something like if (cache) { $var = cache } else { $var = db }. I am using JG Cache for now, and I'll probably write something similar myself soon, but for now it's a great solution.
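Roughly, the pattern looks like this. The helper functions below are a minimal stand-in for illustration, not JG Cache's actual API, and $pdo and $gameId are assumed to exist already:

// Tiny file-cache helpers: dump a serialized value into a file, reuse it while fresh.
function cache_get($key, $ttl = 300) {
    $file = sys_get_temp_dir() . '/cache_' . md5($key);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }
    return false;
}

function cache_set($key, $value) {
    file_put_contents(sys_get_temp_dir() . '/cache_' . md5($key), serialize($value), LOCK_EX);
}

// if (cache) { $var = cache } else { $var = db }
$avg = cache_get("avg_rate_$gameId");
if ($avg === false) {
    $stmt = $pdo->prepare('SELECT AVG(rate) FROM rates WHERE game_id = ?');
    $stmt->execute([$gameId]);
    $avg = (float) $stmt->fetchColumn();
    cache_set("avg_rate_$gameId", $avg);
}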
I'd have gone with a variation of your "number 2" solution (update a separate rating column), maybe in a separate table just for this.
If the number of writes becomes a problem, that will happen well after select avg(foo) from ... does, and there are lots of ways to mitigate it, such as updating the stored average rating periodically or processing new votes in batches every so often.
Eventually you likely can't just do an avg() anyway, because you have to check each vote for fraud, calculate a sort score, and who knows what else.
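As a hedged sketch of the periodic variant: a cron job refreshes a summary table every few minutes, and page views only read that table. The game_ratings table and the credentials below are invented for illustration:

// Cron job, e.g. every 5 minutes. Assumes a summary table
// game_ratings(game_id PRIMARY KEY, avg_rate, votes).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials
$pdo->exec("
    INSERT INTO game_ratings (game_id, avg_rate, votes)
    SELECT game_id, AVG(rate), COUNT(*) FROM rates GROUP BY game_id
    ON DUPLICATE KEY UPDATE
        avg_rate = VALUES(avg_rate),
        votes    = VALUES(votes)
");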
I'm having a hard time wrapping my head around an ELO-like score calculation for a large number of users on our platform.
For example: for every user in a large set of users, a complex formula based on varying amounts of "things done" results in a score for each user, used for a match-making-like mechanism.
In our situation, it's based on the number of posts posted, connections accepted, messages sent, the number of sessions within one month, and other things done, etc.
I had two ideas for how to go about this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns I have about these two:
Real-time: This would mean a huge number of queries and calculations for each action a user performs. If, let's say, 500 users are active and all of them are performing actions, the database would have a hard time, I think. There would then also have to be a script to re-calculate the score for inactive users (to lower their score).
Once a week: If we have, for example, 5,000 users (for our first phase), that would mean running the calculation formula 5,000 times, which could take a long time and will only grow as more users join.
The calculation queries for a single variable in the formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are things like counting "all connections of my connections", which takes a few joins.
I started by "logging" every action into a table for this purpose: just the counter values, increased/decreased with every action, and running the formula on these values (one record per week). This works, but can't be applied to every variable (like the connections of connections).
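For concreteness, that per-action counter logging looks roughly like this; the table and column names are simplified, and a unique key on (user_id, week) is assumed:

// Called from the code path that handles a new post; other actions bump other columns.
function bumpPostCounter(PDO $pdo, $userId) {
    $stmt = $pdo->prepare("
        INSERT INTO user_counters (user_id, week, posts)
        VALUES (?, YEARWEEK(NOW()), 1)
        ON DUPLICATE KEY UPDATE posts = posts + 1
    ");
    $stmt->execute([$userId]);
}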
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder', which uses a similar kind of algorithm for match making (maybe with less complex variables, because they don't have groups and communities that you can join).
I'm wondering if they run that in real time on every swipe, every setting change, etc., or if they have a script that runs continuously over a small batch of users each time.
What it all comes down to: what would be the most efficient, non-table-locking way to do this, keeping in mind that at some point we will have, for example, 50,000 users?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing.
Still slow? Move the algorithm to a separate asynchronous process and have the process run whenever there's an update (sketched below). Really this is the same thing as step 1, but it doesn't slow down PHP requests, and if it gets busy it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
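A rough sketch of step 3, assuming actions push the affected user ID onto a Redis list (you mention Redis is already running) and a separate long-running CLI worker drains it; recalculateScore() is just a placeholder for your 12-variable formula:

// Web request side: enqueue the user whose action changed something.
$redis = new Redis();                   // phpredis extension assumed
$redis->connect('127.0.0.1', 6379);
$redis->rPush('score:dirty', (string) $userId);

// Worker side (separate CLI process): drain the queue and recalculate.
// Assumes $pdo (PDO connection) and recalculateScore() exist; both are placeholders.
while (true) {
    $item = $redis->blPop(['score:dirty'], 5);   // block up to 5 seconds for work
    if ($item) {
        recalculateScore($pdo, (int) $item[1]);  // run the 12-variable formula for this user
    }
}

Batching (step 4) then just means popping several IDs before recalculating.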
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50,000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
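As a hedged illustration of that last point: fetch the raw counts in one query and do the arithmetic in PHP. The weights, tables and columns below are placeholders, not the real formula:

// $pdo is an existing PDO connection; weights are made up for illustration.
$rows = $pdo->query("
    SELECT u.id,
           (SELECT COUNT(*) FROM posts p    WHERE p.user_id = u.id) AS posts,
           (SELECT COUNT(*) FROM messages m WHERE m.user_id = u.id) AS messages
    FROM users u
")->fetchAll(PDO::FETCH_ASSOC);

$scores = [];
foreach ($rows as $row) {
    // the "math" happens here, in PHP, not in SQL
    $scores[$row['id']] = 2.0 * $row['posts'] + 0.5 * $row['messages'];
}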
I have to paginate and display a large number of articles ordered by "score". This score is not saved anywhere; it is calculated on page load.
The score of an article depends on a lot of things like hits, shares, likes and favorites, which means I can't put the logic in an SQL query.
So, what I did is: get the full data -> calculate the score for every article -> order by score -> display as array chunks (with a Laravel custom paginator).
$Articles = DB::table('articles')->get();
// For testing, I can run a loop here and print the values
foreach ($Articles as $Article) {
    echo "Article id: " . $Article->id;
    // at this point I can't print $Article->score yet
}
$Articles = $this->likesScoreFunction1($Articles);
$Articles = $this->scorefunction2($Articles);
$Articles = $this->scorefunction3($Articles);
What I am doing in the score functions is adding an extra "score" value to each article; each subsequent function updates it with the latest score, and so on. After these functions, for testing, I can run a loop like the one below:
foreach ($Articles as $Article) {
    echo "Article id: " . $Article->id . " score is " . $Article->score . "\n";
}
usort($Articles, function ($a, $b) {
    // descending by score; avoids truncating small float differences to 0
    if ($a->score == $b->score) return 0;
    return ($b->score > $a->score) ? 1 : -1;
});
Finally, I have the articles sorted by score in $Articles.
Then I pass the first array chunk to the view.
I know this is not a good method because I load all the articles into memory. Can anyone recommend a better approach?
The score can change from one minute, or even one second, to the next.
One possibility is to run a cron job each minute and update stored score fields, but it's not practical because it would process all articles from every organization in the system.
My client also doesn't want that method, because when a user posts a new article it has the highest score and should appear at the top; with the cron approach we wouldn't see the new article on top until the cron completes its job.
I had a similar problem for a complex ranking algorithm a few weeks ago and tried various approaches you described.
Pure SQL was difficult to debug and maintain and slow to execute. The database must select all articles, join in all relevant votes/scores/etc, sort the entire collection then return the paginated result. In some cases it can't use indexes for this. The database cache was being filled with (to me) irrelevant records.
A front-end cache was applied to the pure SQL version. We couldn't find a comfortable caching age - either ranks were outdated or too many requests were missing the cache.
SQL and PHP for realtime calculations wasn't attempted. It could only have been slower than the pure SQL method, I think.
To get timely results we had to store the calculated rank values ahead of time. There are a few ways to achieve it, as you're already exploring:
Re-calculate the entire catalog as a cron job - which, as you know, is too slow and clunky
Event listeners re-calculate the ranks of specific articles when an event takes place (a vote, social share, etc) - we used this to good effect, but too many events were causing too many simultaneous calculations in our scenario
An event queue keeps track of all recent events. A process or cron job re-calculates the ranks of articles which have the most activity currently in the queue, thereby wiping out the most possible queued jobs and using CPU cycles for most effect.
The event queue worked for us, so I stopped looking for other solutions. We did add a 1-3 second microcache with Varnish to relieve the database of the traffic burden and keep a good balance between realtime work and rank calculations.
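Simplified, and with invented table and function names, the queue consumer can look something like this: pick the articles with the most pending events, re-rank them, then clear their events.

// Hypothetical queue table: rank_events(article_id, created_at), one row per event.
// $pdo is an existing PDO connection.
$busiest = $pdo->query("
    SELECT article_id
    FROM rank_events
    GROUP BY article_id
    ORDER BY COUNT(*) DESC
    LIMIT 50
")->fetchAll(PDO::FETCH_COLUMN);

foreach ($busiest as $articleId) {
    recalculateRank($pdo, $articleId);   // placeholder for your scoring functions
    $pdo->prepare('DELETE FROM rank_events WHERE article_id = ?')
        ->execute([$articleId]);
}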
SQL is still good for time-based calculations; I wouldn't fire regular events on articles just to degrade their ranks over time, for example. If ranks are a combination of events and time, I would store the pre-calculated event-based rank and the posted time, and have the SQL query combine the two.
When our ranks got that complicated we used a separate table purely for ranks (fixed-width data columns, clever indexing) with a one-to-one relationship to articles. Or a many-to-one relationship if users can choose how to rank them. This way the database doesn't need to load all article data and trawl through inefficient indexes each time. Simply join in article data after the rank pagination.
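Under that layout the pagination query is a sketch along these lines; article_ranks is an invented name and $page is assumed to come from the request:

// Hypothetical table: article_ranks(article_id PRIMARY KEY, score), one row per article.
$perPage = 20;
$offset  = ((int) $page - 1) * $perPage;
$articles = $pdo->query("
    SELECT a.*, r.score
    FROM article_ranks r
    JOIN articles a ON a.id = r.article_id
    ORDER BY r.score DESC
    LIMIT $perPage OFFSET $offset
")->fetchAll(PDO::FETCH_OBJ);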
If you're using Doctrine watch out for greedy hydration.
HTH!
I currently have a system where users can register and bet on the scores of soccer games.
Right now I have over 20k users and more than 3 million bets. Every week I have to generate a ranking, so I have a query that loads all the users with their bets into memory, like this:
from u in context.Set<User>().Include("Bets").OrderByDescending(x => x.Points)
select u
Where Points is the sum of the points earned from each guess.
After this query completes, I save each user and his position in another table to build my ranking.
The thing is, this query consumes too much memory: over 4 GB! And I need all the users and bets to calculate the ranking.
The first alternative I tried was to create another table to store each user and their points. I would iterate over the query loading 500 users at a time, then calculate and save, but I was still stuck with the memory problem:
int page = 0;
int step = 500;
// count = total number of users, computed beforehand
while (page * step < count)
{
    // Skip/Take requires an explicit ordering in LINQ to Entities;
    // Id is assumed to be the key column here.
    foreach (var u in context.Set<User>()
                             .OrderBy(usr => usr.Id)
                             .Skip(page * step)
                             .Take(step)
                             .ToList())
    {
        // Saves in another table
    }
    page++;
}
// Sorts based on the data from this other table
Since this didn't work, I gave up and then I tried to do something like this:
var users = (from u in context.Set<User>().Include("Bets")
             select u).ToList();
context.Dispose();
var sortedUsers = from u in users.OrderByDescending(x => x.Points)
                  select u;
But that didn't solve it either...
I guess the problem is the context holding all the tracking information.
Does anyone have any clues? How do you handle large amounts of data with EF 4.1?
Thanks
Another thing that I noticed.
Let's say I have user A and user B, each with only one Bet on the same Match. I expected to have something like this:
User A ---> Bet
\
Match
/
User B ---> Bet
But I'm getting two different instances of Match with the same data.
Is there a way to avoid this?
Why I am not willing to put this in a stored procedure:
The ranking is based on the guesses, and there are some rules about this sorting.
A user has N bets. Each bet associated with a Game which has a Score.
The first sorting criterion is the points. So I would need to calculate the points for every bet (each user has about 200 bets and will have about 300 by the end of the championship). That's the first join.
To calculate the points of each bet I need the final score of the match. That's another join.
After computing the sum of points for each bet (which involves about 10 conditionals) and sorting by it, I still have to sort based on:
No. of correct bets,
No. of bets where the winner was guessed,
No. of bets where one score was guessed,
Date of the last bet,
Date of registering.
So that's a huge sort with about 6 criteria, about 3 joins, and lots of logic. Calculating this in LINQ is fairly trivial, whereas putting it in a stored procedure would take a lot of time and be more error-prone. (I've never tried TDD or even unit testing with stored procedures... this ranking has tests for everything.)
I agree with @Allan that this would ideally be done in a stored procedure. If you could post the details of the calculation, maybe others could suggest ways to do it in a stored proc.
If you want to keep things as they are, there are a couple of things you could try:
Use AsNoTracking to avoid caching: context.Set<User>().AsNoTracking() // etc
If your User or Bet classes have a lot of properties that you don't need for the calculation, project them into anonymous types that only have the properties you need.
Hope this helps, and if you do try AsNoTracking, I would be curious to know how much difference it makes.
Dude, I think it would be wiser if you just computed that data in a stored procedure and not in your C# code. There's no need to save this data if it can be computed from existing data. Saving it in another table would introduce data redundancy and violate the rules of good database normalization.
This is a follow-up to my last question: MySQL - Best method to saving and loading items
Anyway, I've looked at some other examples and sources, and most of them use the same method of saving items: first they delete all the rows already in the database that reference the character, then they insert new rows for the items the character currently has.
I just wanted to ask if this is a good approach, and whether it would cause a performance hit if I were to save around 500 items per character. If you have a better solution, please tell me!
Thanks in advance, AJ Ravindiran.
It would help if you talked about your game so we could get a better idea of your data requirements.
I'd say it depends. :)
Are the slot/bank updates happening constantly as the person plays, or just when the person saves their game and leaves? Also, does the order of the slots really matter for the bank slots? Constantly deleting and inserting 500 records certainly can have a performance hit, but there may be a better way to do it: possibly you could just update the 500 records without deleting them.
Possibly your first idea of 0=4151:54;1=995:5000;2=521:1; wasn't SO bad, if the database is only being used for storing that information and the game itself manages it once it's loaded. But if you might want to use it for other things, like "Which players have item X?" or "What is the total value of the items in player Y's bank?", then storing it like that won't let you ask the database; it would have to be computed by the game.
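For the update-in-place idea: with a unique key on (character_id, slot), the 500 slots can be upserted rather than deleted and re-inserted. A rough sketch in PHP/PDO purely for illustration; the table and column names are guesses:

// Hypothetical table: bank_items(character_id, slot, item_id, amount,
//                                UNIQUE KEY (character_id, slot)).
// $items maps slot => [itemId, amount], e.g. 0 => [4151, 54], 1 => [995, 5000].
$stmt = $pdo->prepare("
    INSERT INTO bank_items (character_id, slot, item_id, amount)
    VALUES (?, ?, ?, ?)
    ON DUPLICATE KEY UPDATE item_id = VALUES(item_id), amount = VALUES(amount)
");
foreach ($items as $slot => $item) {
    $stmt->execute([$characterId, $slot, $item[0], $item[1]]);
}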
Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple of ways mapped out below.
Option A:
Would it be better to record each pageview as someone visits the site, creating a new row for every pageview with a timestamp? So 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that holds the count; every time someone visits the site the count goes up. So 50,000 views = 1 row of data per day, with a new row created each day.
Is either of the options above the correct way of doing what I want? Or is there a better, more efficient way?
Thanks.
Option C would be to parse access logs from the web server. No extra storage needed, all sorts of extra information is stored, and even requests to images and JavaScript files are stored.
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways on how to deal with that.
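A minimal sketch of the Option A insert in PHP/PDO; the pageviews table layout is a guess:

// Hypothetical table: pageviews(id, viewed_at, page, user_agent).
// One row per view keeps full detail for later per-hour or per-day reports.
$stmt = $pdo->prepare(
    'INSERT INTO pageviews (viewed_at, page, user_agent) VALUES (NOW(), ?, ?)'
);
$stmt->execute([
    isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/',
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
]);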
If you care about how your pageviews vary over the course of a day, option A keeps that info (though you might still do some bucketing, say per hour, to reduce overall data size -- but you could do that "later, off-line" while archiving all the details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, when in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away ;-). So, aggregating is riskier...
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...
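A sketch of that per-day table idea in PHP/PDO; the column set is just illustrative:

// Build the PVyyyymmdd-style table name for today, e.g. PV20090529.
$table = 'PV' . date('Ymd');

// Create today's table on first use, then log into it.
$pdo->exec("
    CREATE TABLE IF NOT EXISTS $table (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        viewed_at DATETIME NOT NULL,
        page VARCHAR(255) NOT NULL
    )
");
$pdo->prepare("INSERT INTO $table (viewed_at, page) VALUES (NOW(), ?)")
    ->execute([isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/']);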