MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated by its own algorithm based on the similarity between the two given profiles. For example, User A has a "match rating" of 85 with User B and 79 with User C; B and C have a rating of 94; and so on...
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I came up with 2 solutions:
My first solution was to have a nightly batch job that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I can then join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2 there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
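For illustration, a minimal sketch of what that pair table and the filtered, rating-ordered search could look like (table and column names are made up; storing each pair in both directions doubles the n * (n - 1) / 2 figure, but keeps the query simple):

```sql
-- A rough sketch of the pre-computed pair table (all names are illustrative).
CREATE TABLE user_rating (
  user1_id INT UNSIGNED NOT NULL,   -- the user doing the search
  user2_id INT UNSIGNED NOT NULL,   -- the matched profile
  rating   TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY (user1_id, user2_id),
  KEY idx_user1_rating (user1_id, rating)
);

-- Attribute filter plus personal ordering for user 42.
SELECT u.*, r.rating
FROM user_rating r
JOIN users u ON u.id = r.user2_id
WHERE r.user1_id = 42
  AND u.city = 'New York'
ORDER BY r.rating DESC
LIMIT 100;
```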
The second solution was to leave MySQL out of it and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I search for every user living in New York sorted by rating, I have to load EVERY user living in NY into my app (let's say 3,000), apply the algorithm and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with them.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?

If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is note which users have a change in their profiles (whenever any of the parameters on which the matching is based changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10,000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
Other strategies would be to only run the main algorithm for records which have the greatest chance of being compared: in your example, "same city". Or, when updating records (this would require storing (user_1, user_2, ranking, last_calculated)), only recalculate those records with a high ranking, a very old calculation, or no calculation at all. The lowest-ranked matches aren't likely to change so much that they float to the top in a short time.
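One way this could be wired up, assuming a needs_rematch flag on users and the (user_1, user_2, ranking, last_calculated) columns mentioned above (a sketch, not a definitive design):

```sql
-- Hypothetical flag set whenever a match-relevant attribute changes.
UPDATE users SET needs_rematch = 1 WHERE id = 42;

-- Nightly batch: select only the pairs worth refreshing.
SELECT r.user_1, r.user_2
FROM user_rating r
WHERE r.user_1 IN (SELECT id FROM users WHERE needs_rematch = 1)
   OR r.user_2 IN (SELECT id FROM users WHERE needs_rematch = 1)
   OR r.last_calculated IS NULL
   OR r.last_calculated < NOW() - INTERVAL 30 DAY
ORDER BY r.ranking DESC, r.last_calculated ASC
LIMIT 10000;   -- cap the batch; low-ranked, recently calculated rows can wait
```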
UPDATE
The other part of the problem is the O(N^2) storage space.
How to reduce this space? I can see two approaches. One is to not put some information in the match table at all. The "match" function makes more sense the more rigid and steep it is; having ten thousand "good matches" would mean that matching means very little. So we would still need lots of recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
The other strategy would be to recalculate the match on the fly. Then we would need some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; this could entail rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1 and DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL's capabilities and offload some calculations to the MySQL engine.
To this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
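To make the spatial idea concrete, here is a sketch assuming two of the numeric profile attributes (say age and height) can be packed into a 2-D POINT at write time; the column, index and coordinate values are all made up:

```sql
-- Hypothetical: two numeric profile attributes mapped to a point at input time.
ALTER TABLE users ADD COLUMN profile_point POINT;
-- ...backfill profile_point for existing rows, then:
ALTER TABLE users
  MODIFY profile_point POINT NOT NULL,
  ADD SPATIAL INDEX idx_profile_point (profile_point);

-- Shortlist for a searcher mapped to (30, 180): only profiles inside a small box
-- around that point are fed to the expensive rating function.
SELECT id
FROM users
WHERE MBRContains(
        ST_GeomFromText('POLYGON((25 170, 35 170, 35 190, 25 190, 25 170))'),
        profile_point);
```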

If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n) rather than O(n^2), and updates should be as well. If someone really wants to see matches past the first 100 (and you want to let them), then you have the option of running the query in real time at that point.
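A sketch of the pruning step under that scheme (table and column names are illustrative; the extra derived table works around MySQL's restriction on LIMIT in IN subqueries and on referencing the target table of a DELETE):

```sql
-- After recomputing user 42's matches, keep only the 100 best rows.
DELETE FROM user_rating
WHERE user1_id = 42
  AND user2_id NOT IN (
        SELECT user2_id FROM (
            SELECT user2_id
            FROM user_rating
            WHERE user1_id = 42
            ORDER BY rating DESC
            LIMIT 100
        ) AS top_matches
      );
```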

I agree with everything @Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
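A rough sketch of what that could look like for a user with id 42, assuming the rating algorithm can be expressed as a stored function (here called match_score, a made-up name):

```sql
-- Sketch of the per-login table; match_score() is a stand-in for the real algorithm
-- (it is not a built-in MySQL function).
CREATE TEMPORARY TABLE my_ratings (
  other_user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  rating        TINYINT UNSIGNED NOT NULL,
  KEY idx_rating (rating)
);

INSERT INTO my_ratings (other_user_id, rating)
SELECT u.id, match_score(42, u.id)   -- hypothetical stored function or UDF
FROM users u
WHERE u.id <> 42;

-- Searches during the session join against the small per-user table.
SELECT u.*, m.rating
FROM my_ratings m
JOIN users u ON u.id = m.other_user_id
WHERE u.city = 'New York'
ORDER BY m.rating DESC
LIMIT 100;
```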
This should work reasonably well for one user logging in. Passably for two... but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.

Related

Should you scale through tables or computation in MySQL?

I have a project with customers buying a product with platform-based tokens. I have a MySQL table that tracks a customer buying x amount and one tracking customer consumption (-x amount). In order to display the amount of tokens they have left on the platform, and to check the funds left before spending, I wanted to query (buys - consumed). But I remembered that people always say space is cheaper than computation (not just $ but query time as well). Should I have a separate table for the amount that gets updated with each buy or consume?
So far I have always tried to use as few tables as possible, to keep things simple and easy to oversee, but I'm starting to question whether that is right...
There is no single right answer; keep in mind the goal of the application and the software updates that are likely to happen.
If these two tables hold the individual transactions a user may have, then the extra column would be necessary, because you would have to sum rows to get the balance. If there is one row per user (likely your case), then 90% of the time you should just use those two tables.
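For reference, the on-the-fly (buys - consumed) query from the question could look roughly like this (table and column names are guesses):

```sql
-- Computes the balance on the fly from the two transaction tables.
SELECT COALESCE(b.total, 0) - COALESCE(c.total, 0) AS tokens_left
FROM (SELECT SUM(amount) AS total FROM purchases    WHERE customer_id = 7) AS b,
     (SELECT SUM(amount) AS total FROM consumptions WHERE customer_id = 7) AS c;
```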
I would suggest you not have that extra column. In my experience, that kind of setup has the downside that, the bigger the project becomes, the more difficult it is for you and the other developers to remember to update the extra column, because it is a dependent (derived) value.
Also, whenever the user buys products or consumes tokens, you have to update that column too, so it costs extra time and effort as well.
You can store the (buys - consumed) value in the session and update it when needed (if real-time updates across multiple devices are not a requirement).
If you need continuous updates, i.e. many such queries over time, then the repeated computation costs more than the extra storage, so you should have that third table / column.

Database too large - store as a row or serialise data?

I have a Quiz App that consists of many Modules containing Questions. Each question has many Categories (many-to-many). Every time a quiz is completed, the user's score is sent to the Scores table. (I've attached an entity-relationship diagram for clarification purposes.)
I have been thinking of breaking down the user scores according to categories (i.e. a user when completing a quiz will get an overall quiz score along with score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the Scores table. With multiple users, the Scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the Scores table. For example, if I wanted to calculate the average score for a user for a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the scores as JSON, but of course, this has its limitations.
The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?) and creation date; a rough sketch follows these suggestions.
As your table grows, partition it. Potential candidates could be creation-date buckets (by year or year/month would probably work well), or, if students are in particular classes, you could have class buckets.
As your table grows even more, you could move the partitions to different disks (how you partitioned the data will be even more crucial here, because if a query has to go across too many partitions you may end up hurting performance instead of helping).
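A rough sketch of the indexing and partitioning suggestions above, with column names guessed from the question:

```sql
ALTER TABLE scores
  ADD INDEX idx_user_scorable (user_id, scorable_id, scorable_type),
  ADD INDEX idx_created_at (created_at);

-- Range partitioning by year; note MySQL requires the partitioning column to be
-- part of every unique key on the table, including the primary key.
ALTER TABLE scores
  PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );
```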
Beyond that, another suggestion would be to break the scores table into two: score and scoreDetail. The score table would contain top-level data like user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet 80% of the time people only care about the top score anyway. This way you only reach for the bigger table when someone wants to see the details of their score in a particular exam.
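One possible shape for that split (all names are illustrative):

```sql
-- Parent table: one row per completed quiz.
CREATE TABLE score (
  id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id    INT UNSIGNED NOT NULL,
  exam_id    INT UNSIGNED NOT NULL,
  total      DECIMAL(5,2) NOT NULL,
  created_at DATETIME NOT NULL,
  KEY idx_user_exam (user_id, exam_id)
);

-- Child table: one row per category per completed quiz.
CREATE TABLE scoreDetail (
  score_id    BIGINT UNSIGNED NOT NULL,
  category_id INT UNSIGNED NOT NULL,
  points      DECIMAL(5,2) NOT NULL,
  PRIMARY KEY (score_id, category_id)
);
```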
Finally, you probably want to have the score by category in rows rather than columns to make it easier to do analysis and aggregations, but this is not necessarily a performance booster and really depends on how you plan to use the data.
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest just creating a random data set that represents a few years' worth of data and playing with that.
I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you would need to deserialize every time you want to read from your table, which eliminates the possibility of using SQL statements for sorting, filtering, JOINing, etc.
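For instance, kept as rows, the aggregation mentioned in the question remains a plain query (column names are guesses):

```sql
-- Average score for one user in one specific category.
SELECT AVG(score) AS avg_score
FROM scores
WHERE user_id = 7
  AND category_id = 3;
```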
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.

Effectively fetching large number of tuples from Solr

I am stuck on a rather tricky problem. I am implementing a feature on my website wherein a person gets all the results matching a particular criterion. The matching criterion can be anything; however, for the sake of simplicity, let's call it 'age'. That means the feature will return all the students' names from the database (which holds hundreds of thousands of them), with the student whose age matches the supplied parameter 'most' closely on top.
My approaches:
1- I have a Solr server. Since I need to implement this in a paginated way, I would need to query Solr several times (my Solr page size is 10) to find the 'near-absolute' matching students in real time. This is computationally very intensive. The problem boils down to effectively fetching this large number of tuples from Solr.
2- I tried processing it in a batch (and increasing the Solr page size to 100). The data received this way is not guaranteed to be up to date when somebody uses my feature. Also, to make it optimal, I would need learning algorithms to find out which users are 'most likely' to use my feature today, and then batch-process them on priority. Please remember that the number of users is so high that I cannot run this batch for all the users every day.
On one hand, if I want to show results in real time, I have to compromise on performance (hitting Solr multiple times, which is barely feasible); on the other hand, my result set wouldn't be real-time if I do batch processing, plus I can't run the batch every day for all the users.
Can someone correct my seemingly faulty approaches?
Solr indexing is done on MySQL db contents.
As I understand it, your users are not interested in 100K results. They only want the top-10 (or top-100 or a similar low number) results, where the person's age is closest to a number you supply.
This sounds like a case for Solr function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries. For the age example, that would be something like sort=abs(sub(37, age)) asc, score desc, which would return the persons with age closest to 37 first and prioritize by score in case of ties.
I think what you need is Solr cursors (also known as deep paging), which will enable you to paginate effectively through large result sets.

Managing set of users

We have a website with many users. To track users who transacted on a given day, we use Redis and store a bitmap as the value. For instance, if our system had five users, and users 2 and 5 transacted on 2nd January, our key for 2nd January would look like '01001'. This also helps us determine unique users over a given period and new users using simple bit operations. However, with a growing number of users, we are running out of memory to store all these keys.
Is there any alternative database that we can use to store the data in a similar manner? If not, how should we store the data to get similar performance?
Redis' memory usage can be affected by many parameters, so I would also try looking in INFO ALL for starters.
With every user represented by a bit, 400K daily visitors should take at least 50KB per value, but due to sparsity in the bitmap that could be much larger. I'd also suspect that, since newer users are more active, the majority of your bitmaps' "active" flags are towards their end, causing each bitmap to reach close to its maximal size (i.e. the total number of users). So the question you should be trying to answer is how to store these 400K visits efficiently without sacrificing the functionality you're using. That really depends on what you're doing with the recorded visits.
For example, if you're only interested in total counts, you could consider using the HyperLogLog data structure to count your transacting users with a low error rate and small memory/resources footprint. On the other hand, if you're trying to track individual users, perhaps keep a per user bitmap mapped to the days since signing up with your site.
Furthermore, there are bitmap compression techniques that you could consider implementing in your application code/Lua scripting/hacking Redis. The best answer would depend on what you're trying to do of course.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
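The statement such a job would run could be as simple as this (the date column name is a guess):

```sql
-- Nightly cleanup: drop ranking rows older than the 90-day window.
DELETE FROM RankHistory
WHERE log_date < CURDATE() - INTERVAL 90 DAY;
```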
Another option: you can create a "roll-up" aggregate per user based on whatever the criteria are... counts, sales, whatever, all stored by user + date of activity. Then you could keep your pre-aggregated rollups in a much smaller table for however far back in history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also help ease moving such records to a historical archive database.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
With 222k+ users, I would seriously consider how important the "ranking" really is when someone is in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter???