Getting top line metrics FAST from a large MySQL DB? - mysql

I'm painfully aware there probably isn't a magic bullet to this, but it's becoming a problem. Each user has hundreds of thousands of rows of metrics data across 3 tables, this is updated on a second by second basis.
When a user logs in, I want to quickly deliver them top line stats for a number of their assets (i.e. alongside each asset in navi they have top level stats).
I've tried a number of ideas; but please - if someone has some advice or experience in this area it'd be great. Stuff tried or looked into so far:-
Produce static versions of top line stats every hour or so - This is intensive across all users and all assets. So how this can be done regularly, I'm not sure.
Call stats via AJAX, so they can be processed and fill in (getting top level stats right now can take up to 10 seconds for a larger user) once page has loaded. This could also cache stats in session to save redoing queries each page load.
Query run at 30 min intervals, i.e. you log on, it'll query and then it'll hopefully use query cache every time it's loaded (only 1/2 seconds) until the next 30min interval.
The first one seems to have most legs, but I'm not sure how to do this, given only a small number of users will be needing those stats - it seems awfully expensive to do it for everyone all the time.

Produce static versions of top line stats every hour or so - This is
intensive across all users and all assets. So how this can be done
regularly, I'm not sure.
Call stats via AJAX, so they can be processed and fill in (getting
top level stats right now can take up to 10 seconds for a larger
user) once page has loaded. This could also cache stats in session to
save redoing queries each page load.
Query run at 30 min intervals, i.e. you log on, it'll query and then
it'll hopefully use query cache every time it's loaded (only 1/2
seconds) until the next 30min interval.
Your option 1 and 3 in mySQL is known as a materialized view MySQL doesn't currently support them but the concept can be completed link provides examples
hundreds of thousands of records isn't that much. good indexes and the use of analytic queries will get you quite far. Sadly this concept isn't implemented in full but there are workarounds as well as indicated in the link provided.
It really depends on top line stats. are you wanting real time data down to the second or are 10-20 or even 30 minute intervals acceptable? Using event scheduler one can schedule the creation/update of reporting table(s) which contain summarized data faster to query. This data then is available at fractions of seconds delivery time as all the heavy lifting has already been completed. Your focus can then be on indexing these tables to improve performance without worrying about impacts to production tables.

You are in the datawarehousing domain with your setup. This means, that not all the NF1 rules apply. So my approach would be to use triggers to fill a seperate stats table.

Related

Running a cron to update 1 million records in every hour fails

We have an E-commerce system with more than 1 million users with a total or 4 to 5 million records in order table. We use codeigniter framework as back end and Mysql as database.
Due to this excessive number of users and purchases, we use cron jobs to update the order details and referral bonus points in every hour to make the things work.
Now we have a situation that these data updates exceeds one hour and the next batch of updates reach before finishing the previous one, there by leading into a deadlock and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get rid of this situation. We are using only the monolithic architecture to run this application.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERT. (A one INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but can be done with IODKU.
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect you way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or that's how many you have in total for all time? Obviously just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once. I'm pretty sure there's a proper term for that but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example. For every user in a large set of users, a complex formule, based on variable amounts of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the amount of posts posted, connections accepted, messages sent, amount of sessions in a time period of one month, .. other things done etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If let's say, 500 users are active, all of them are performing actions, the database would be having a hard time I think. There would them also run a script to re-calculate the score for inactive users (to lower their score)
Once a week: If we have for example 5.000 users (for our first phase), than that would result into running the calculation formula 5.000 times and could take a long time and will increase in time when more users join.
The calculation-queries for a single variable in a the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections" which takes a few joins.
I started with "logging" every action into a table for this purpose, just the counter values and increase/decrease them with every action and running the formula with these values (a record per week). This works but can't be applied for every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that real-time on every swipe, every setting change, .. or if they have like a script that runs continiously for a small batch of users each time.
Where it all comes down to. What would be the most efficient/non-database-table-locking way to do this, with keeping the idea in mind that there will be a moment that we're having 50.000 users for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.

How does pagination of results in databases work?

This is a general question that applies to MySQL, Oracle DB or whatever else might be out there.
I know for MySQL there is LIMIT offset,size; and for Oracle there is 'ROW_NUMBER' or something like that.
But when such 'paginated' queries are called back to back, does the database engine actually do the entire 'select' all over again and then retrieve a different subset of results each time? Or does it do the overall fetching of results only once, keeps the results in memory or something, and then serves subsets of results from it for subsequent queries based on offset and size?
If it does the full fetch every time, then it seems quite inefficient.
If it does full fetch only once, it must be 'storing' the query somewhere somehow, so that the next time that query comes in, it knows that it has already fetched all the data and just needs to extract next page from it.
In that case, how will the database engine handle multiple threads? Two threads executing the same query?
I am very confused :(
I desagree with #Bill Karwin. First of all, do not make assumptions in advance whether something will be quick or slow without taking measurements, and complicate the code in advance to download 12 pages at once and cache them because "it seems to me that it will be faster".
YAGNI principle - the programmer should not add functionality until deemed necessary.
Do it in the simplest way (ordinary pagination of one page), measure how it works on production, if it is slow, then try a different method, if the speed is satisfactory, leave it as it is.
From my own practice - an application that retrieves data from a table containing about 80,000 records, the main table is joined with 4-5 additional lookup tables, the whole query is paginated, about 25-30 records per page, about 2500-3000 pages in total. Database is Oracle 12c, there are indexes on a few columns, queries are generated by Hibernate.
Measurements on production system at the server side show that an average time (median - 50% percentile) of retrieving one page is about 300 ms. 95% percentile is less than 800 ms - this means that 95% of requests for retrieving a single page is less that 800ms, when we add a transfer time from the server to the user and a rendering time of about 0.5-1 seconds, the total time is less than 2 seconds. That's enough, users are happy.
And some theory - see this answer to know what is purpose of Pagination pattern
Yes, the query is executed over again when you run it with a different OFFSET.
Yes, this is inefficient. Don't do that if you have a need to paginate through a large result set.
I'd suggest doing the query once, with a large LIMIT — enough for 10 or 12 pages. Then save the result in a cache. When the user wants to advance through several pages, then your application can fetch the 10-12 pages you saved in the cache and display the page the user wants to see. That is usually much faster than running the SQL query for each page.
This works well if, like most users, your user reads only a few pages and then changes their query.
Re your comment:
By cache I mean something like Memcached or Redis. A high-speed, in-memory key/value store.
MySQL views don't store anything, they're more like a macro that runs a predefined query for you.
Oracle supports materialized views, so that might work better, but querying the view would have the overhead of interpreting an SQL query.
A simpler in-memory cache should be much faster.

Considerations for very large SQL tables?

I'm building, basically, an ad server. This is a personal project that I'm trying to impress my boss with, and I'd love any form of feedback about my design. I've already implemented most of what I describe below, but it's never too late to refactor :)
This is a service that delivers banner ads (http://myserver.com/banner.jpg links to http://myserver.com/clicked) and provides reporting on subsets of the data.
For every ad impression served and every click, I need to record a row that has (id, value) [where value is the cash value of this transaction; e.g. -$.001 per served banner ad at $1 CPM, or +$.25 for a click); my output is all based on earnings per impression [abbreviated EPC]: (SUM(value)/COUNT(impressions)), but on subsets of the data, like "Earnings per impression where browser == 'Firefox'". The goal is to output something like "Your overall EPC is $.50, but where browser == 'Firefox', your EPC is $1.00", so that the end user can quickly see significant factors in their data.
Because there's a very large number of these subsets (tens of thousands), and reporting output only needs to include the summary data, I'm precomputing the EPC-per-subset with a background cron task, and storing these summary values in the database. Once in every 2-3 hits, a Hit needs to query the Hits table for other recent Hits by a Visitor (e.g. "find the REFERER of the last Hit"), but usually, each Hit only performs an INSERT, so to keep response times down, I've split the app across 3 servers [bgprocess, mysql, hitserver].
Right now, I've structured all of this as 3 normalized tables: Hits, Events and Visitors. Visitors are unique per visitor session, a Hit is recorded every time a Visitor loads a banner or makes a click, and Events map the distinct many-to-many relationship from Visitors to Hits (e.g. an example Event is "Visitor X at Banner Y", which is unique, but may have multiple Hits). The reason I'm keeping all the hit data in the same table is because, while my above example only describes "Banner impressions -> clickthroughs", we're also tracking "clickthroughs -> pixel fires", "pixel fires -> second clickthrough" and "second clickthrough -> sale page pixel".
My problem is that the Hits table is filling up quickly, and slowing down ~linearly with size. My test data only has a few thousand clicks, but already my background processing is slowing down. I can throw more servers at it, but before launching the alpha of this, I want to make sure my logic is sound.
So I'm asking you SO-gurus, how would you structure this data? Am I crazy to try to precompute all these tables? Since we rarely need to access Hit records older than one hour, would I benefit to split the Hits table into ProcessedHits (with all historical data) and UnprocessedHits (with ~last hour's data), or does having the Hit.at Date column indexed make this superfluous?
This probably needs some elaboration, sorry if I'm not clear, I've been working for past ~3 weeks straight on it so far :) TIA for all input!
You should be able to build an app like this in a way that it won't slow down linearly with the number of hits.
From what you said, it sounds like you have two main potential performance bottlenecks. The first is inserts. If you can have your inserts happen at the end of the table, that will minimize fragmentation and maximize throughput. If they're in the middle of the table, performance will suffer as fragmentation increases.
The second area is the aggregations. Whenever you do a significant aggregation, be careful that you don't cause all in-memory buffers to get purged to make room for the incoming data. Try to minimize how often the aggregations have to be done, and be smart about how you group and count things, to minimize disk head movement (or maybe consider using SSDs).
You might also be able to do some of the accumulations at the web tier based entirely on the incoming data rather than on new queries, perhaps with a fallback of some kind if the server goes down before the collected data is written to the DB.
Are you using INNODB or MyISAM?
Here are a few performance principles:
Minimize round-trips from the web tier to the DB
Minimize aggregation queries
Minimize on-disk fragmentation and maximize write speeds by inserting at the end of the table when possible
Optimize hardware configuration
Generally you have detailed "accumulator" tables where records are written in realtime. As you've discovered, they get large quickly. Your backend usually summarizes these raw records into cubes or other "buckets" from which you then write reports. Your cubes will probably define themselves once you map out what you're trying to report and/or bill for.
Don't forget fraud detection if this is a real project.

Gathering pageviews MySQL layout

Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple ways mapped out below.
Option A:
Would it be better to count pageviews each time someone visits a site and create a new row for every pageview with a time stamp. So, 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that counts the pageviews. every time someone visits the site the count goes up. So, 50,000 views = 1 row of data per day. Every day a new row will be created.
Are any of the options above the correct way of doing what I want? or is there a better more efficient way?
Thanks.
Option C would be to parse access logs from the web server. No extra storage needed, all sorts of extra information is stored, and even requests to images and JavaScript files are stored.
..
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways on how to deal with that.
If you care about how your pageviews vary with time in a day, option A keeps that info (though you might still do some bucketing, say per-hour, to reduce overall data size -- but you might do that "later, off-line" while archiving all details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away;-). So, aggregating is riskier...
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...