Considerations for very large SQL tables? - mysql

I'm building, basically, an ad server. This is a personal project that I'm trying to impress my boss with, and I'd love any form of feedback about my design. I've already implemented most of what I describe below, but it's never too late to refactor :)
This is a service that delivers banner ads (http://myserver.com/banner.jpg links to http://myserver.com/clicked) and provides reporting on subsets of the data.
For every ad impression served and every click, I need to record a row with (id, value), where value is the cash value of the transaction (e.g. -$.001 per served banner ad at $1 CPM, or +$.25 for a click). My output is all based on earnings per impression (abbreviated EPC): SUM(value)/COUNT(impressions), computed over subsets of the data, like "earnings per impression where browser == 'Firefox'". The goal is to output something like "Your overall EPC is $.50, but where browser == 'Firefox', your EPC is $1.00", so that the end user can quickly see the significant factors in their data.
Because there is a very large number of these subsets (tens of thousands), and the reporting output only needs the summary data, I'm precomputing the EPC per subset with a background cron task and storing these summary values in the database. Once every 2-3 hits, a Hit needs to query the Hits table for other recent Hits by the same Visitor (e.g. "find the REFERER of the last Hit"), but usually each Hit only performs an INSERT, so to keep response times down I've split the app across 3 servers [bgprocess, mysql, hitserver].
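For illustration, the precomputed summary side currently looks roughly like this; table and column names (subset_stats, hit_type, browser) are simplified stand-ins for this sketch, not the exact schema:

    -- Summary table the reports read from (names are illustrative only).
    CREATE TABLE subset_stats (
      dimension   VARCHAR(32)   NOT NULL,  -- e.g. 'browser'
      dim_value   VARCHAR(64)   NOT NULL,  -- e.g. 'Firefox'
      impressions INT UNSIGNED  NOT NULL,
      earnings    DECIMAL(12,4) NOT NULL,
      epc         DECIMAL(12,6) NOT NULL,
      PRIMARY KEY (dimension, dim_value)
    );

    -- Background cron task: recompute one dimension in a single pass.
    REPLACE INTO subset_stats (dimension, dim_value, impressions, earnings, epc)
    SELECT 'browser', browser,
           SUM(hit_type = 'impression'),
           SUM(value),
           SUM(value) / SUM(hit_type = 'impression')
    FROM hits
    GROUP BY browser;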
Right now, I've structured all of this as 3 normalized tables: Hits, Events and Visitors. Visitors are unique per visitor session, a Hit is recorded every time a Visitor loads a banner or makes a click, and Events map the distinct many-to-many relationship from Visitors to Hits (e.g. an example Event is "Visitor X at Banner Y", which is unique, but may have multiple Hits). The reason I'm keeping all the hit data in the same table is that, while my above example only describes "Banner impressions -> clickthroughs", we're also tracking "clickthroughs -> pixel fires", "pixel fires -> second clickthrough" and "second clickthrough -> sale page pixel".
My problem is that the Hits table is filling up quickly, and slowing down ~linearly with size. My test data only has a few thousand clicks, but already my background processing is slowing down. I can throw more servers at it, but before launching the alpha of this, I want to make sure my logic is sound.
So I'm asking you SO gurus: how would you structure this data? Am I crazy to try to precompute all these tables? Since we rarely need to access Hit records older than one hour, would I benefit from splitting the Hits table into ProcessedHits (with all historical data) and UnprocessedHits (with ~the last hour's data), or does having the Hit.at Date column indexed make this superfluous?
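For reference, here is roughly the single-table shape I have in mind; column names are simplified, and the partitioning clause is only one possible way to keep recent rows cheap to scan (the partitions would need rotating with ALTER TABLE ... REORGANIZE PARTITION):

    CREATE TABLE hits (
      id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      visitor_id BIGINT UNSIGNED NOT NULL,
      value      DECIMAL(8,4)    NOT NULL,
      referer    VARCHAR(255)    NULL,
      at         DATETIME        NOT NULL,
      PRIMARY KEY (id, at),                 -- partition column must be in the PK
      KEY idx_at (at),
      KEY idx_visitor_at (visitor_id, at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(at)) (
      PARTITION p_old    VALUES LESS THAN (TO_DAYS('2012-01-01')),
      PARTITION p_recent VALUES LESS THAN MAXVALUE
    );

    -- "Other recent Hits by this Visitor" should then only touch the index:
    SELECT referer
    FROM hits
    WHERE visitor_id = 12345
      AND at > NOW() - INTERVAL 1 HOUR
    ORDER BY at DESC
    LIMIT 1;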
This probably needs some elaboration; sorry if I'm not clear, I've been working on it for the past ~3 weeks straight :) TIA for all input!

You should be able to build an app like this in a way that it won't slow down linearly with the number of hits.
From what you said, it sounds like you have two main potential performance bottlenecks. The first is inserts. If you can have your inserts happen at the end of the table, that will minimize fragmentation and maximize throughput. If they're in the middle of the table, performance will suffer as fragmentation increases.
The second area is the aggregations. Whenever you do a significant aggregation, be careful that you don't cause all in-memory buffers to get purged to make room for the incoming data. Try to minimize how often the aggregations have to be done, and be smart about how you group and count things, to minimize disk head movement (or maybe consider using SSDs).
You might also be able to do some of the accumulations at the web tier based entirely on the incoming data rather than on new queries, perhaps with a fallback of some kind if the server goes down before the collected data is written to the DB.
Are you using InnoDB or MyISAM?
Here are a few performance principles:
Minimize round-trips from the web tier to the DB
Minimize aggregation queries
Minimize on-disk fragmentation and maximize write speeds by inserting at the end of the table when possible
Optimize hardware configuration
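As a small illustration of the first point, batching several hits into one statement cuts round-trips (this assumes a simple hits table like the one sketched in the question; adjust to your actual columns):

    -- One round-trip instead of three; an AUTO_INCREMENT key keeps inserts at the end.
    INSERT INTO hits (visitor_id, value, at) VALUES
      (101, -0.0010, NOW()),
      (102, -0.0010, NOW()),
      (103,  0.2500, NOW());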

Generally you have detailed "accumulator" tables where records are written in realtime. As you've discovered, they get large quickly. Your backend usually summarizes these raw records into cubes or other "buckets" from which you then write reports. Your cubes will probably define themselves once you map out what you're trying to report and/or bill for.
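A minimal sketch of that summarisation step, with purely illustrative table names (hits as the accumulator, hourly_stats as the bucket):

    CREATE TABLE hourly_stats (
      bucket_hour DATETIME      NOT NULL,
      browser     VARCHAR(64)   NOT NULL,
      hits        INT UNSIGNED  NOT NULL,
      earnings    DECIMAL(12,4) NOT NULL,
      PRIMARY KEY (bucket_hour, browser)
    );

    -- Backend job: fold the last hour of raw rows into their bucket.
    REPLACE INTO hourly_stats (bucket_hour, browser, hits, earnings)
    SELECT DATE_FORMAT(at, '%Y-%m-%d %H:00:00'), browser, COUNT(*), SUM(value)
    FROM hits
    WHERE at >= NOW() - INTERVAL 1 HOUR
    GROUP BY DATE_FORMAT(at, '%Y-%m-%d %H:00:00'), browser;

Reports then read hourly_stats (or further rollups of it) instead of scanning the raw table.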
Don't forget fraud detection if this is a real project.

Related

How should something like SO's vote count be stored in a database?

I'm assuming votes on StackOverflow are relations between users and posts. It would be expensive to count the votes for each page load, so I'm assuming it's cached somewhere. Is there a best practice for storing values that can be computed from other DB data?
I could store it in something like Redis, but then it'll be expensive to sort questions by votes.
I could store it as a new column in the posts table, but it'll be confusing to other engineers because derived values aren't typically stored with actual data.
I could create an entity-attribute-value table just for derived data, so I could join it with the posts table. There's a slight performance hit for the join and I don't like the idea of a table filled with unstructured data, since it would easily end up being filled with unused data.
I'm using MySQL 8; are there other options?
One more consideration is that this data doesn't need to be strictly consistent; it's OK if the vote total is off slightly. So when a vote is created, the total doesn't need to be updated immediately; a periodic job can update it later.
"Best practice" is very much situational, and often based on opinion. Here's how I look at it.
Your question seems to be about how to make a database-driven application perform at scale, and what trade-offs are acceptable.
I'd start by sticking to the relational, normalized data model for as long as you can. You say "It would be expensive to count the votes for each page load" - probably not that expensive, because you'll be joining on foreign keys, and unless you're talking about very large numbers of records and/or requests, that should scale pretty well.
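For example, counting votes at page load is typically just a grouped join on an indexed foreign key; something like this sketch (posts, votes and vote_value are assumed names, not your schema):

    SELECT p.id, p.title, COALESCE(SUM(v.vote_value), 0) AS score
    FROM posts p
    LEFT JOIN votes v ON v.post_id = p.id
    WHERE p.id = 42
    GROUP BY p.id, p.title;

With an index on votes (post_id, vote_value) this touches only the index entries for that post, which stays fast until the numbers get very large.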
If scalability and performance are challenges, I'd build a test rig, and optimize those queries, subject them to load and performance testing and add hardware capacity before doing anything else.
This is because normalized databases and applications without duplication/caching are easier to maintain, less likely to develop weird bugs, and easier to extend in future.
If you reach the point where that doesn't work anymore, I'd look at caching. There is a range of options here; you mention three. The challenge is that once you reach the point where the normalized database becomes a performance bottleneck, there are usually lots of potential queries which become bottlenecks: if you optimize the "how many votes does a post get?" query, you move the problem to the "how many people have viewed this post?" query.
So, at this point I typically try to limit the requests to the database by caching in the application layer. This can take the form of a Redis cache. In descending order of effectiveness, you can:
Cache entire pages. This reduces the number of database hits dramatically, but is hard to do with a personalized site like SO.
Cache page fragments, e.g. the SO homepage has a few dozen questions; you could cache each question as a snippet of HTML, and assemble those snippets to render the page. This allows you to create a personalized page, by assembling different fragments for different users.
Cache query results. This means the application server would need to interpret the query results and convert to HTML; you would do this for caching data you'd use to assemble the page. For SO, for instance, you might cache "Leo Jiang's avatar path is x, and they are following tags {a, b, c}".
The problem with caching, of course, is invalidation and the trade-off between performance and up-to-date information. You can also get lots of weird bugs with caches being out of sync across load balancers.

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around an ELO-score-like calculation for a large number of users on our platform.
For example: for every user in a large set of users, a complex formula, based on varying amounts of "things done", results in a score for each user, used for a matchmaking-like principle.
In our situation, it's based on the number of posts made, connections accepted, messages sent, the number of sessions in a one-month period, and other things done.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns I have about these two:
Real-time: This would be overkill in terms of queries and calculations for each action a user performs. If, let's say, 500 users are active and all of them are performing actions, I think the database would have a hard time. There would then also have to be a script to re-calculate the score for inactive users (to lower their score).
Once a week: If we have, for example, 5.000 users (for our first phase), that would mean running the calculation formula 5.000 times, which could take a long time and will only take longer as more users join.
The calculation queries for a single variable in the formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are things like counting "all connections of my connections", which takes a few joins.
I started by "logging" every action into a table for this purpose: just the counter values, increased/decreased with every action, with the formula run against these values (one record per week). This works but can't be applied to every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder', which uses a similar kind of algorithm for matchmaking (maybe with less complex variables, because they don't have groups and communities that you can join).
I'm wondering if they run that in real time on every swipe and every setting change, or if they have a script that runs continuously over a small batch of users at a time.
What it all comes down to: what would be the most efficient, non-table-locking way to do this, keeping in mind that at some point we'll have, say, 50.000 users?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing.
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
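For what it's worth, even the heavier "connections of my connections" count the question mentions is usually just a self-join; a rough sketch, assuming a symmetric connections (user_id, friend_id) table:

    SELECT COUNT(DISTINCT c2.friend_id) AS second_degree
    FROM connections c1
    JOIN connections c2 ON c2.user_id = c1.friend_id
    WHERE c1.user_id = 123
      AND c2.friend_id <> 123;

With an index on (user_id, friend_id) that stays index-only; feed the resulting counts into the formula in PHP rather than trying to express the whole formula in SQL.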

Approach to update report (or summary) table?

I have a log table that contains a large number of user transactions (logs). I am trying to create a webpage that displays statistics (count, average, and some more complex calculations...) over the user transactions, but I want to fetch from a Statistics table instead of querying the original transaction table, because of performance concerns. One possible way is to update the Statistics table whenever a row is inserted; another is to update it periodically.
Both options sound inefficient, so I am wondering whether there is an established method for achieving this in common database systems?
If you don't need statistics in real time (and near real time is OK for you, as it usually is for most people), a common approach for reports that need complex calculations is to generate them periodically (say every X minutes; it depends on how big your data is, of course).
This way your users access static data, which is easy to serve, and you won't push too much load onto your analytics server.
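A minimal sketch of such a periodic refresh, run from cron every X minutes (transaction_log and statistics are placeholder names; statistics needs user_id as its primary key for REPLACE to act as an upsert):

    REPLACE INTO statistics (user_id, txn_count, txn_avg, last_refreshed)
    SELECT user_id, COUNT(*), AVG(amount), NOW()
    FROM transaction_log
    GROUP BY user_id;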

Getting top line metrics FAST from a large MySQL DB?

I'm painfully aware there probably isn't a magic bullet for this, but it's becoming a problem. Each user has hundreds of thousands of rows of metrics data across 3 tables, and this is updated on a second-by-second basis.
When a user logs in, I want to quickly deliver them top-line stats for a number of their assets (i.e. alongside each asset in the nav they have top-level stats).
I've tried a number of ideas, but if someone has advice or experience in this area it'd be great. Stuff tried or looked into so far:
Produce static versions of top-line stats every hour or so. This is intensive across all users and all assets, so I'm not sure how it can be done regularly.
Call stats via AJAX, so they can be processed and filled in once the page has loaded (getting top-level stats right now can take up to 10 seconds for a larger user). This could also cache stats in the session to save redoing queries on each page load.
Run the query at 30-minute intervals, i.e. you log on, it queries, and then it hopefully uses the query cache every time the page is loaded (only 1/2 seconds) until the next 30-minute interval.
The first one seems to have the most legs, but I'm not sure how to do it, given that only a small number of users will need those stats; it seems awfully expensive to do it for everyone all the time.
Your options 1 and 3 are what other databases call a materialized view. MySQL doesn't currently support these natively, but the concept can be implemented by hand (the link provides examples).
Hundreds of thousands of records isn't that much; good indexes and the use of analytic queries will get you quite far. Sadly that concept isn't fully implemented either, but there are workarounds, as also indicated in the link provided.
It really depends on the top-line stats: do you want real-time data down to the second, or are 10-20 or even 30-minute intervals acceptable? Using the Event Scheduler you can schedule the creation/update of reporting tables containing summarized data that is faster to query. That data is then available with sub-second delivery times, because all the heavy lifting has already been done. Your focus can then be on indexing those tables to improve performance, without worrying about the impact on production tables.
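Roughly what that Event Scheduler refresh could look like, with placeholder table names (metrics, metrics_summary); it needs event_scheduler turned ON:

    CREATE EVENT refresh_metrics_summary
    ON SCHEDULE EVERY 10 MINUTE
    DO
      REPLACE INTO metrics_summary (asset_id, total, avg_value, refreshed_at)
      SELECT asset_id, SUM(value), AVG(value), NOW()
      FROM metrics
      GROUP BY asset_id;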
With your setup you are in the data-warehousing domain. This means that not all of the usual normal-form (NF) rules apply. So my approach would be to use triggers to fill a separate stats table.
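For example, something along these lines, again with placeholder names (metrics as the raw table, asset_stats as the stats table):

    CREATE TRIGGER metrics_after_insert
    AFTER INSERT ON metrics
    FOR EACH ROW
      INSERT INTO asset_stats (asset_id, total, samples)
      VALUES (NEW.asset_id, NEW.value, 1)
      ON DUPLICATE KEY UPDATE
        total   = total + NEW.value,
        samples = samples + 1;

Bear in mind that a trigger adds work to every insert on the raw table, so measure it against the periodic-refresh approach above.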

Event feed implementation - will it scale?

Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods for generating the feeds, and I would like to ask which is best in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
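To make both concrete, here is roughly what I mean; the table names come from above, but the column names (user_id, friend_id, created_at, processed) are simplified for this sketch:

    -- Standard method: build the feed at read time.
    SELECT e.*
    FROM friends f
    JOIN event_log e ON e.user_id = f.friend_id
    WHERE f.user_id = 123
    ORDER BY e.created_at DESC
    LIMIT 50;

    -- Hypothesised method: background task fans each new event out to friends' feeds.
    INSERT INTO user_feed (user_id, event_id, created_at)
    SELECT f.user_id, e.id, e.created_at
    FROM event_log e
    JOIN friends f ON f.friend_id = e.user_id
    WHERE e.processed = 0;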
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, and the brief is for feeds to update as close to real time as possible.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed seem to escalate exponentially as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1,000 users each enter a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions where a background thread updates a table last_user_feed_update which contains a single row for each user and a timestamp for the last time that users feed was changed.
Then although the full join and query would be required to refresh the feed, a quick query to the last_user_feed table will tell if a refresh is required or not. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties but that background thread still has a lot of work to do.
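A rough sketch of that compromise (column names are illustrative):

    CREATE TABLE last_user_feed_update (
      user_id    INT UNSIGNED NOT NULL PRIMARY KEY,
      updated_at DATETIME     NOT NULL
    );

    -- Background thread: bump the timestamp for everyone whose feed just changed
    -- (here 123 is the user who generated the event).
    INSERT INTO last_user_feed_update (user_id, updated_at)
    SELECT f.user_id, NOW()
    FROM friends f
    WHERE f.friend_id = 123
    ON DUPLICATE KEY UPDATE updated_at = NOW();

    -- Page request: cheap check before deciding whether to rebuild the cached feed.
    SELECT updated_at FROM last_user_feed_update WHERE user_id = 123;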
The hypothesized method works better when you limit the maximum number of friends; a lot of sites set a safe upper bound, including Facebook IIRC. It limits 'hiccups' when your user with 100K friends generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and then hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the cost these inactive users impose on you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached and each user pushes what their latest few status items are to "their key" (and in a feed reading activity you fetch and aggregate all your friend's keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.