Event feed implementation - will it scale? - mysql

Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods for generating the feeds, and I would like to ask which is better in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to expire after 5 minutes. Scaling is achieved by varying this timeout.
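For illustration, the per-request query is something like the sketch below (the column names, e.g. created_at and friends(user_id, friend_id), are assumptions on my part):

```python
import pymysql  # any MySQL driver would do; pymysql is just an example

conn = pymysql.connect(host="localhost", user="app", password="secret", database="social")

def build_feed(viewer_id, limit=50):
    """Standard method: join event_log against friends at request time, then cache the result."""
    sql = """
        SELECT e.*
        FROM event_log AS e
        INNER JOIN friends AS f ON f.friend_id = e.user_id
        WHERE f.user_id = %s
        ORDER BY e.created_at DESC
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (viewer_id, limit))
        return cur.fetchall()  # this result set is what gets cached for ~5 minutes
```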
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
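Continuing the same sketch, the background fan-out would be one INSERT ... SELECT per unprocessed event (user_feed and the processed flag are illustrative names):

```python
def fan_out_event(event_id, actor_id):
    """Hypothesised method: copy one event into user_feed once for every friend of the actor."""
    sql = """
        INSERT INTO user_feed (user_id, event_id)
        SELECT f.user_id, %s
        FROM friends AS f
        WHERE f.friend_id = %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (event_id, actor_id))
        cur.execute("UPDATE event_log SET processed = 1 WHERE id = %s", (event_id,))
    conn.commit()
```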
The problems with the standard method are well known – what if a lot of people's caches expire at the same time? The solution also does not scale well – the brief is for feeds to update as close to real time as possible.
The hypothesised solution seems much better to me: all processing is done offline, so no user waits for a page to generate, and there are no joins, so the database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is the worst-case scenario mentioned above problematic? That is, does table size have an impact on MySQL performance, and are there any issues with mass-inserting data for each event?
Is there anything else I have missed?

I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed grow with every event multiplied by the poster's friend count, so they escalate rapidly as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1,000 users each enter a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update which contains a single row for each user and a timestamp for the last time that user's feed was changed.
Then although the full join and query would be required to refresh the feed, a quick query to the last_user_feed_update table will tell whether a refresh is required or not. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
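A minimal sketch of that check, assuming a last_user_feed_update(user_id, updated_at) table and some per-user cache (all names are illustrative):

```python
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="social")
feed_cache = {}  # stand-in for memcached/Redis: user_id -> (updated_at the feed was built from, rows)

def feed_is_stale(user_id):
    """One cheap indexed lookup decides whether the expensive join needs to be re-run."""
    with conn.cursor() as cur:
        cur.execute("SELECT updated_at FROM last_user_feed_update WHERE user_id = %s", (user_id,))
        row = cur.fetchone()
    cached = feed_cache.get(user_id)
    if cached is None or row is None:
        return True
    seen_updated_at, _rows = cached
    return row[0] != seen_updated_at  # the background thread bumped the timestamp since we cached
```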

The hypothesized method works better when you limit the maximum number of friends; a lot of sites set a safe upper boundary, including Facebook IIRC. It limits the 'hiccups' when your user with 100K friends generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the cost that these inactive users impose on you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and when reading a feed you fetch and aggregate all your friends' keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.
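To make the idea concrete, here is a very rough, untested sketch of it, assuming memcached via pymemcache and JSON-encoded lists (the key naming and item limit are arbitrary):

```python
import json
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))
MAX_ITEMS = 20  # keep only each user's latest few status items

def push_status(user_id, item):
    """Write path: the posting user appends to "their key" (read-modify-write, so not atomic)."""
    key = f"feed:{user_id}"
    raw = mc.get(key)
    items = json.loads(raw) if raw else []
    mc.set(key, json.dumps(([item] + items)[:MAX_ITEMS]))

def read_feed(friend_ids):
    """Read path: fetch every friend's key and aggregate; sorting/merging policy is up to you."""
    merged = []
    for fid in friend_ids:
        raw = mc.get(f"feed:{fid}")
        if raw:
            merged.extend(json.loads(raw))
    return merged
```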

Related

How should something like SO's vote count be stored in a database?

I'm assuming votes on StackOverflow are relations between users and posts. It would be expensive to count the votes for each page load, so I'm assuming it's cached somewhere. Is there a best practice for storing values that can be computed from other DB data?
I could store it in something like Redis, but then it'll be expensive to sort questions by votes.
I could store it as a new column in the posts table, but it'll be confusing to other engineers because derived values aren't typically stored with actual data.
I could create an entity-attribute-value table just for derived data, so I could join it with the posts table. There's a slight performance hit for the join, and I don't like the idea of a table filled with unstructured data, since it would easily end up accumulating unused values.
I'm using MySQL 8; are there other options?
One more consideration is that this data doesn't need to be strictly consistent; it's OK if the vote total is off slightly. So when a vote is created, the total doesn't need to be updated immediately; a job can run periodically to update it.
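For example, if the total lives in a denormalized posts.vote_count column (option 2 above), the periodic job could be little more than this sketch (all table and column names are assumptions):

```python
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="forum")

def refresh_vote_counts():
    """Recompute the derived totals from the votes table; run this from cron every few minutes."""
    sql = """
        UPDATE posts AS p
        JOIN (
            SELECT post_id, SUM(value) AS total
            FROM votes
            GROUP BY post_id
        ) AS v ON v.post_id = p.id
        SET p.vote_count = v.total
    """
    with conn.cursor() as cur:
        cur.execute(sql)
    conn.commit()
```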
"Best practice" is very much situational, and often based on opinion. Here's how I look at it.
Your question seems to be about how to make a database-driven application perform at scale, and what trade-offs are acceptable.
I'd start by sticking to the relational, normalized data model for as long as you can. You say "It would be expensive to count the votes for each page load" - probably not that expensive, because you'll be joining on foreign keys, and unless you're talking about very large numbers of records and/or requests, that should scale pretty well.
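To put a number on "probably not that expensive": in the normalized model the per-post total is a single indexed aggregate, roughly like this sketch (schema names assumed, conn is a DB-API connection):

```python
def votes_for_post(conn, post_id):
    """Count on demand via an index on votes.post_id; no derived column needed."""
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(SUM(value), 0) FROM votes WHERE post_id = %s", (post_id,))
        return cur.fetchone()[0]
```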
If scalability and performance are challenges, I'd build a test rig, and optimize those queries, subject them to load and performance testing and add hardware capacity before doing anything else.
This is because normalized databases and applications without duplication/caching are easier to maintain, less likely to develop weird bugs, and easier to extend in future.
If you reach the point where that doesn't work anymore, I'd look at caching. There is a range of options here - you mention three. The challenge is that once the normalized database becomes a performance bottleneck, there are usually lots of potential queries which become the bottleneck - if you optimize the "how many votes does a post have?" query, you move the problem to the "how many people have viewed this post?" query.
So, at this point I typically try to limit the requests to the database by caching in the application layer. This can take the form of a Redis cache. In descending order of effectiveness, you can:
Cache entire pages. This reduces the number of database hits dramatically, but is hard to do with a personalized site like SO.
Cache page fragments, e.g. the SO homepage has a few dozen questions; you could cache each question as a snippet of HTML, and assemble those snippets to render the page. This allows you to create a personalized page, by assembling different fragments for different users.
Cache query results. This means the application server would need to interpret the query results and convert to HTML; you would do this for caching data you'd use to assemble the page. For SO, for instance, you might cache "Leo Jiang's avatar path is x, and they are following tags {a, b, c}".
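As a sketch of the third option, assuming redis-py, an arbitrary 60-second TTL, and illustrative key and column names:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_question_summary(conn, question_id, ttl=60):
    """Serve query results from Redis; fall back to MySQL and repopulate the key on a miss."""
    key = f"question:{question_id}:summary"
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    with conn.cursor() as cur:
        cur.execute("SELECT title, vote_count FROM posts WHERE id = %s", (question_id,))
        row = cur.fetchone()
    summary = {"title": row[0], "votes": row[1]} if row else None
    r.setex(key, ttl, json.dumps(summary))
    return summary
```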
The problem with caching, of course, is invalidation and the trade-off between performance and up-to-date information. You can also get lots of weird bugs with caches being out of sync across load balancers.

Managing a set of users

We have a website with many users. To track which users transacted on a given day, we use Redis and store a string of binary digits as the value. For instance, if our system had five users and users 2 and 5 transacted on 2nd January, the value stored under the key for 2nd January would look like '01001'. This also helps us determine unique users over a given period, and new users, using simple bit operations. However, with a growing number of users, we are running out of memory to store all these keys.
Is there any alternative database that we can use to store the data in a similar manner? If not, how should we store the data to get similar performance?
Redis' memory usage can be affected by many parameters, so I would also try looking at INFO ALL for starters.
With every user represented by a bit, 400K daily visitors should take at least 50KB per value, but because each value is sized by the highest user id that is set, it could be much larger. I'd also suspect that since newer users are more active, the majority of each bitmap's "active" flags are towards its end, causing it to reach close to its maximal size (i.e. the total number of users). So the question you should be trying to answer is how to store these 400K visits efficiently without sacrificing the functionality you're using. That really depends on what you're doing with the recorded visits.
For example, if you're only interested in total counts, you could consider using the HyperLogLog data structure to count your transacting users with a low error rate and small memory/resources footprint. On the other hand, if you're trying to track individual users, perhaps keep a per user bitmap mapped to the days since signing up with your site.
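Rough sketches of both ideas with redis-py (key names are arbitrary):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Idea 1: HyperLogLog - approximate distinct transacting users per day in ~12 KB per key.
def record_transaction_hll(day, user_id):
    r.pfadd(f"txn:{day}", user_id)

def approx_unique_users(days):
    return r.pfcount(*[f"txn:{d}" for d in days])  # approximate count of the union over several days

# Idea 2: one bitmap per user, one bit per day since that user signed up.
def record_transaction_per_user(user_id, days_since_signup):
    r.setbit(f"user:{user_id}:days", days_since_signup, 1)

def days_active(user_id):
    return r.bitcount(f"user:{user_id}:days")
```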
Furthermore, there are bitmap compression techniques that you could consider implementing in your application code/Lua scripting/hacking Redis. The best answer would depend on what you're trying to do of course.

Using Plone 4 and pas.plugins.sqlalchemy with many users

I've been using pas.plugins.sqlalchemy to provide an RDBMS backend for authentication and memberdata storage, using MySQL. Authentication works perfectly, and member data is correctly stored and retrieved on the RDBMS. There are currently over 20,000 users.
However, user enumeration takes ages. I have checked the "Many users" option in the Plone Control Panel / Users and Groups section, but even a simple user search takes a near-infinite amount of time. By debugging the plugin.py script, I noticed that enumerateUsers() is called as many times as there are users stored; therefore, an enormous amount of CPU time is needed to complete a simple search request, as the query is matched against each username, one user at a time, one query at a time.
Am I missing something here? Isn't pas.plugins.sqlalchemy useful especially when you have a very large number of users? Currently, I have the SQL plugin as top priority in my *acl_users/plugins/User Enumeration* setup. Should I change this?
I've pretty much inherited maintenance of pas.plugins.sqlalchemy - but I haven't personally used it for more than a handful of users, yet. If you file a bug at https://github.com/auspex/pas.plugins.sqlalchemy/issues, I'll see what I can do.
I don't think it can make much difference in what order the enumeration occurs - it still has to enumerate all the users in the SQL db, so it either does them before the ones found in the ZODB, or after. It sounds as if the problem begins with Zope - calling enumerateUsers() once per user seems excessive - but even so, it shouldn't be necessary to make a request to the relational db per enumeration.

Considerations for very large SQL tables?

I'm building, basically, an ad server. This is a personal project that I'm trying to impress my boss with, and I'd love any form of feedback about my design. I've already implemented most of what I describe below, but it's never too late to refactor :)
This is a service that delivers banner ads (http://myserver.com/banner.jpg links to http://myserver.com/clicked) and provides reporting on subsets of the data.
For every ad impression served and every click, I need to record a row that has (id, value) [where value is the cash value of this transaction; e.g. -$.001 per served banner ad at $1 CPM, or +$.25 for a click]; my output is all based on earnings per impression [abbreviated EPC]: (SUM(value)/COUNT(impressions)), but on subsets of the data, like "Earnings per impression where browser == 'Firefox'". The goal is to output something like "Your overall EPC is $.50, but where browser == 'Firefox', your EPC is $1.00", so that the end user can quickly see significant factors in their data.
Because there's a very large number of these subsets (tens of thousands), and reporting output only needs to include the summary data, I'm precomputing the EPC-per-subset with a background cron task, and storing these summary values in the database. Once in every 2-3 hits, a Hit needs to query the Hits table for other recent Hits by a Visitor (e.g. "find the REFERER of the last Hit"), but usually, each Hit only performs an INSERT, so to keep response times down, I've split the app across 3 servers [bgprocess, mysql, hitserver].
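For context, the precompute is essentially one grouped aggregate per dimension, along the lines of this sketch (epc_summary, hit_type and the connection object are illustrative, not my actual schema):

```python
def precompute_epc_by_browser(conn):
    """Cron task: roll Hits up into one EPC row per browser (repeat per dimension/subset)."""
    sql = """
        INSERT INTO epc_summary (dimension, dimension_value, epc, computed_at)
        SELECT 'browser', browser,
               SUM(value) / NULLIF(SUM(hit_type = 'impression'), 0),
               NOW()
        FROM Hits
        GROUP BY browser
        ON DUPLICATE KEY UPDATE epc = VALUES(epc), computed_at = VALUES(computed_at)
    """
    with conn.cursor() as cur:
        cur.execute(sql)
    conn.commit()
```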
Right now, I've structured all of this as 3 normalized tables: Hits, Events and Visitors. Visitors are unique per visitor session, a Hit is recorded every time a Visitor loads a banner or makes a click, and Events map the distinct many-to-many relationship from Visitors to Hits (e.g. an example Event is "Visitor X at Banner Y", which is unique, but may have multiple Hits). The reason I'm keeping all the hit data in the same table is because, while my above example only describes "Banner impressions -> clickthroughs", we're also tracking "clickthroughs -> pixel fires", "pixel fires -> second clickthrough" and "second clickthrough -> sale page pixel".
My problem is that the Hits table is filling up quickly, and slowing down ~linearly with size. My test data only has a few thousand clicks, but already my background processing is slowing down. I can throw more servers at it, but before launching the alpha of this, I want to make sure my logic is sound.
So I'm asking you SO gurus: how would you structure this data? Am I crazy to try to precompute all these tables? Since we rarely need to access Hit records older than one hour, would I benefit from splitting the Hits table into ProcessedHits (with all historical data) and UnprocessedHits (with ~the last hour's data), or does having the Hit.at Date column indexed make this superfluous?
This probably needs some elaboration - sorry if I'm not clear; I've been working on it for the past ~3 weeks straight :) TIA for all input!
You should be able to build an app like this in a way that it won't slow down linearly with the number of hits.
From what you said, it sounds like you have two main potential performance bottlenecks. The first is inserts. If you can have your inserts happen at the end of the table, that will minimize fragmentation and maximize throughput. If they're in the middle of the table, performance will suffer as fragmentation increases.
The second area is the aggregations. Whenever you do a significant aggregation, be careful that you don't cause all in-memory buffers to get purged to make room for the incoming data. Try to minimize how often the aggregations have to be done, and be smart about how you group and count things, to minimize disk head movement (or maybe consider using SSDs).
You might also be able to do some of the accumulations at the web tier based entirely on the incoming data rather than on new queries, perhaps with a fallback of some kind if the server goes down before the collected data is written to the DB.
Are you using InnoDB or MyISAM?
Here are a few performance principles:
Minimize round-trips from the web tier to the DB
Minimize aggregation queries
Minimize on-disk fragmentation and maximize write speeds by inserting at the end of the table when possible
Optimize hardware configuration
Generally you have detailed "accumulator" tables where records are written in realtime. As you've discovered, they get large quickly. Your backend usually summarizes these raw records into cubes or other "buckets" from which you then write reports. Your cubes will probably define themselves once you map out what you're trying to report and/or bill for.
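A rough shape of that roll-up, combined with the ProcessedHits/UnprocessedHits idea from the question (the bucket table, its columns and the one-hour cutoff are all assumptions):

```python
from datetime import datetime, timedelta

def roll_up_and_archive(conn):
    """Summarise raw hits older than an hour into buckets, then move them out of the hot table."""
    cutoff = datetime.utcnow() - timedelta(hours=1)
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO HitBuckets (bucket_date, bucket_hour, browser, hits, revenue)
            SELECT DATE(`at`), HOUR(`at`), browser, COUNT(*), SUM(value)
            FROM Hits
            WHERE `at` < %s
            GROUP BY DATE(`at`), HOUR(`at`), browser
            ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits), revenue = revenue + VALUES(revenue)
        """, (cutoff,))
        cur.execute("INSERT INTO ProcessedHits SELECT * FROM Hits WHERE `at` < %s", (cutoff,))
        cur.execute("DELETE FROM Hits WHERE `at` < %s", (cutoff,))
    conn.commit()
```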
Don't forget fraud detection if this is a real project.

Forum Schema: should the "Topics" table contain topic_starter_Id? Or is it redundant information?

I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely, but possible), and I can't know which one came first. This is also normalized but becomes expensive as the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that ids are assigned in insertion order, i.e. will a post inserted later always have a higher id than earlier rows? What if I delete a post? Would the database reuse that post_id later? I'm using MySQL.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
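A sketch of that single code path (the topics/posts column names are hypothetical, conn is a DB-API connection):

```python
def add_post(conn, topic_id, user_id, user_name, body):
    """The one and only place posts are inserted, so the denormalized topic fields stay correct."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO posts (topic_id, user_id, body, created_at) VALUES (%s, %s, %s, NOW())",
            (topic_id, user_id, body),
        )
        cur.execute(
            """
            UPDATE topics
            SET last_poster_id = %s, last_poster_name = %s, last_post_at = NOW(),
                starter_id = COALESCE(starter_id, %s), starter_name = COALESCE(starter_name, %s)
            WHERE id = %s
            """,
            (user_id, user_name, user_id, user_name, topic_id),
        )
    conn.commit()
```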
I must somewhat disagree with Charles on the point that the only way to save on performance is to de-normalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and the attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small - let's say <1000 users for the sake of argument; it depends on your scale (our apps use this approach with 10k+ mappings).
Namely, have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g. one with data expiration facilities). Then, when you need to print the first/last user's name, look it up in that cache on the server side.
This avoids an extra query for every page view, as you only need to retrieve the full user list ONCE per N page views - when the cache expires, or when user data is updated, which should cause cache expiration.
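Something like this sketch, with an arbitrary five-minute expiry and a hypothetical users table:

```python
import time

_user_cache = {"loaded_at": 0.0, "names": {}}
CACHE_TTL = 300  # seconds; i.e. the full user list is re-read at most once per TTL

def display_name(conn, user_id):
    """Resolve a poster's display name from an in-process cache instead of joining users per query."""
    if time.time() - _user_cache["loaded_at"] > CACHE_TTL:
        with conn.cursor() as cur:
            cur.execute("SELECT id, display_name FROM users")
            _user_cache["names"] = dict(cur.fetchall())
        _user_cache["loaded_at"] = time.time()
    return _user_cache["names"].get(user_id, "unknown")
```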
It adds a wee bit of CPU time and memory usage on the web server, but in Yet Another Holy War (i.e. spend more resources on the DB side or the app server side) I'm firmly in the "don't waste DB resources" camp, seeing how scaling up a DB is vastly harder than scaling up a web or app server.
And yes, if that (or some other equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (fewer headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm agnostic in that particular Holy War, I just go with whatever gives better marginal benefits (e.g. how much performance is lost vs. how much cost/risk comes from de-normalization).