Database design for statistics of views/likes/downloads? - MySQL

So I am trying to work out a database design to keep track of views, likes and downloads. The number of entries to track is expected to be 1 million or more, so normally I would just track each entry daily, but at that volume I have concerns about performance and possibly disk space.
The customer's wish is to be able to show top statistics for ranges like last week, last month and last year. So I am not sure whether I should roll the daily numbers up into weeks or months and delete everything that is no longer relevant, or keep it more flexible by tracking all of it and being able to query whatever statistics are needed.
Database: MySQL
I have to save the counters once a day.

I think you should track it daily and then, at the end of the week, roll the daily stats up into a weekly figure and delete the daily rows; do the same for the weekly data into months, and for the months into years, if you want to keep long-term stats. You can create a scheduled job to do this. I hope this helps.

Since the operation only appends data, store it in a time-series format.
Whenever click/view data comes in, count it and store the result at the smallest granularity you can. If you can summarize it hourly, that is the best choice for performance. Whenever you need data at a higher granularity, just sum it up. Don't compute the statistic on demand; scanning 10 million rows just to count one day of data is a heavy operation.
With this approach you save two things:
Storage: you can archive data older than a certain age (for example, more than 3 months), so the database stays compact.
Performance.
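As a rough sketch of what such a time-series table could look like in MySQL (the table and column names here are just examples, not taken from the question):

    CREATE TABLE item_stats_hourly (
        item_id   INT UNSIGNED NOT NULL,
        stat_hour DATETIME     NOT NULL,              -- timestamp truncated to the hour
        views     INT UNSIGNED NOT NULL DEFAULT 0,
        likes     INT UNSIGNED NOT NULL DEFAULT 0,
        downloads INT UNSIGNED NOT NULL DEFAULT 0,
        PRIMARY KEY (item_id, stat_hour)
    ) ENGINE=InnoDB;

    -- Count an incoming view without reading the row first
    INSERT INTO item_stats_hourly (item_id, stat_hour, views)
    VALUES (42, DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
    ON DUPLICATE KEY UPDATE views = views + 1;

    -- Higher granularity on demand: daily totals for one item over the last week
    SELECT DATE(stat_hour) AS stat_day, SUM(views) AS views, SUM(likes) AS likes
    FROM item_stats_hourly
    WHERE item_id = 42
      AND stat_hour >= NOW() - INTERVAL 7 DAY
    GROUP BY DATE(stat_hour);

The composite primary key keeps each item/hour pair to a single row, so the upsert stays cheap even with a million items.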

Related

How to store logging data?

I have built an app which does random stuff, and I want to collect various statistics that I want to display in graphs. My problem is I'm not sure how to store the data in a database other than writing each log entry into a new row, which seems very inefficient.
Example Data (5 minute averages):
Number of Users
Number of Online Users
Number of Actions
What would be the best way to store this information? Do I need a separate table for each thing that I'm logging, or could they all go into one table?
Often you don't need the full-resolution data kept forever, and you can periodically re-process it into lower-resolution data to save space. For example, you could store one day of full resolution (5-minute averages) but periodically re-average that data into 1-hour bins, 1-day bins, 1-month bins and so on, while also culling the old full-resolution rows.
This allows you to have the data you need to display nice graphs of the activity over different time ranges (hour, day, week, month, etc) while limiting the number of rows to just what your application requires.
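A minimal sketch of that kind of roll-up, assuming a stats_5min table with a sample_time column and a stats_hourly table with the same metric columns (these names are made up for illustration):

    -- Re-average everything older than today into hourly bins, then cull it
    INSERT INTO stats_hourly (sample_hour, users, online_users, actions)
    SELECT DATE_FORMAT(sample_time, '%Y-%m-%d %H:00:00'),
           AVG(users), AVG(online_users), AVG(actions)
    FROM stats_5min
    WHERE sample_time < CURDATE()
    GROUP BY DATE_FORMAT(sample_time, '%Y-%m-%d %H:00:00');

    DELETE FROM stats_5min
    WHERE sample_time < CURDATE();

Run it once a day (cron or a MySQL event) and the full-resolution table never holds much more than a day of rows.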
There are also excellent applications to store and display time-series data. MRTG and RRDTool come to mind. See this tutorial for good food for thought:
rrdtutorial

How to display totals per user after sign in - database with 25,000 users - millions of rows?

I'm dealing with a database for about 25,000 users who add about 6 rows on average every day (employees keeping logs for work orders). Basically the database grows indefinitely and contains millions of rows (divided among these 25,000 users).
After a user logs in, I would like the system to display some of their totals such as miles driven in truck number xyz for their entire work career, total time worked on order item xyz and so on. Basically, every time a user logs in, these totals need to be present instantly. In addition, once a user adds a row for a work order, the totals need to reflect this change instantly.
Is it advised to build a totals table per user that gets updated with every entry? Or should I just query the database and have it calculate the totals on the fly each time a user logs in (no totals tables)? Would that, however, create a bottleneck if users log in every second and the database needs to spit out a total based on millions of rows? How does Google do it? :)
Thanks.
You might find that a simple query is fast enough with an appropriate index (e.g. an index on user_id). This should reduce the number of rows that need to be scanned.
But if this is not fast enough, you could calculate the result for all users overnight, and cache this result in another table. You can then do the following:
Get the total up to the last cache update directly from the cache table.
Get the total since the last cache update from the main table.
Add these two numbers to get the overall total.
Another option is to use triggers to keep the pre-calculated result accurate, even when rows are inserted, updated or deleted.
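A rough sketch of the cache-plus-delta approach, with invented table and column names (work_logs, user_totals, miles, created_at, cached_at):

    -- Nightly job: rebuild the per-user cache
    -- (user_totals has user_id as its primary key)
    REPLACE INTO user_totals (user_id, total_miles, cached_at)
    SELECT user_id, SUM(miles), NOW()
    FROM work_logs
    GROUP BY user_id;

    -- At login: cached total plus anything logged since the cache was built
    SELECT t.total_miles
         + (SELECT IFNULL(SUM(w.miles), 0)
              FROM work_logs w
             WHERE w.user_id = t.user_id
               AND w.created_at > t.cached_at) AS total_miles
    FROM user_totals t
    WHERE t.user_id = 123;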
Rather than doing a join against the million-row table, I think you can create a summary table.
It can be populated by a cron job running at night, for example.
If you want it "instant", then stay away from keeping the totals in tables, as you would then have to worry about updating them through some process every time the underlying data changes.
As long as your indexes are good, and you have some decent hardware then I don't see a problem with querying for these totals every time.
As for Google, they have lots and lots of servers, basically keep their entire index in RAM, and have virtually unlimited computing power.
If you actually find that after indexing your tables the search/update is too slow for your liking, consider splitting the logs table into several. Depending on your design and your interest in speeding things up, it could be split multiple ways:
log_truck_miles (driver, truck_id, miles)
log_work_times (worker, job_id, minutes) ...etc.
Another way you could split is to quantize worker IDs: log entries for user_id below 5,000 go into table log_0_5, 5,000 to 10,000 go into log_5_10, and so on.
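If the MySQL version in use supports it, native range partitioning can give roughly the same effect without maintaining separate tables by hand; a sketch with made-up column names:

    CREATE TABLE work_logs (
        user_id    INT UNSIGNED NOT NULL,
        job_id     INT UNSIGNED NOT NULL,
        minutes    INT UNSIGNED NOT NULL,
        created_at DATETIME     NOT NULL,
        PRIMARY KEY (user_id, created_at, job_id)   -- partition key must be part of the PK
    ) ENGINE=InnoDB
    PARTITION BY RANGE (user_id) (
        PARTITION p0 VALUES LESS THAN (5000),
        PARTITION p1 VALUES LESS THAN (10000),
        PARTITION p2 VALUES LESS THAN MAXVALUE
    );

Queries that filter on user_id then only touch the partition holding that user's rows.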

Right design for MySQL database

I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was to simply grab all values of each top-1500 player every hour with a PHP script and insert them into MySQL as one row each. So every day the database will grow by 36,000 rows. I will have a second script that deletes every row older than 28 days, otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query that should return all players who didn't gain points1 during a certain time period counting back from the latest entry. This should return a list of players that didn't gain points (for the last 24h, for example).
The third is a query that should list all players who lost a certain amount or more of points2 in a certain time period counting back from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads, but will considerably slow down your writes, especially since your DB will have over 1 million rows in that one table at any given time. Since your writes are happening via cron, you should be okay as long as you insert your 1,500 rows in batches rather than making one round trip to the DB for every row. I'd also look into query compiling (prepared statements) so that you save that overhead as well.
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use its primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing those are more or less enumerated values that you can store in other tables to normalize your design and return in your results with the appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
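A sketch of that normalized layout (the table and column names below are illustrative, and only a couple of the stat columns are shown):

    CREATE TABLE players (
        player_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(64) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE rankings (
        player_id   INT UNSIGNED NOT NULL,
        snapshot_at DATETIME     NOT NULL,
        points1     INT NOT NULL,
        points2     INT NOT NULL,
        PRIMARY KEY (player_id, snapshot_at),
        KEY idx_snapshot (snapshot_at),
        CONSTRAINT fk_rankings_player
            FOREIGN KEY (player_id) REFERENCES players (player_id)
    ) ENGINE=InnoDB;

    -- Query 1 from the question: every hourly entry for one player
    SELECT r.snapshot_at, r.points1, r.points2
    FROM rankings r
    JOIN players p ON p.player_id = r.player_id
    WHERE p.name = 'SomePlayer'
    ORDER BY r.snapshot_at;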
Your design may also be flawed regarding the ranking position. Can't that be calculated by the DB when you select your rows? If not, can it be done in PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/pricing/etc. can be used to calculate the order total.
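For example, if the position is simply the order of points1, it can be derived at read time. On MySQL 8+ a window function does this directly (on older versions you would number the sorted rows with a user variable or in PHP). This reuses the illustrative tables from the sketch above and a placeholder timestamp:

    SELECT p.name,
           r.points1,
           RANK() OVER (ORDER BY r.points1 DESC) AS position
    FROM rankings r
    JOIN players p ON p.player_id = r.player_id
    WHERE r.snapshot_at = '2023-01-01 12:00:00';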
With all the adding and deleting, I'd be sure to run OPTIMIZE TABLE frequently and keep good backups. MyISAM tables can become corrupted easily in heavy write/delete scenarios; InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
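To give a feel for the size involved, a minimal per-user-per-day table might look like this (names are illustrative): three small fixed-width columns per row, and the 90-day graph becomes a short primary-key range scan.

    CREATE TABLE rank_history (
        user_id   INT UNSIGNED       NOT NULL,
        rank_date DATE               NOT NULL,
        rank_pos  MEDIUMINT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, rank_date)
    ) ENGINE=InnoDB;

    -- The graph for one user's last 90 days
    SELECT rank_date, rank_pos
    FROM rank_history
    WHERE user_id = 123
      AND rank_date >= CURDATE() - INTERVAL 90 DAY
    ORDER BY rank_date;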
Another possibility, though the specifics are somewhat beyond my realm of expertise: you seem to have an ideal candidate for an OLAP cube here. You have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
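For example, with the event scheduler enabled this could be a MySQL event instead of a cron job (the rank_date column name is an assumption; adjust to the real schema):

    CREATE EVENT purge_old_rank_history
    ON SCHEDULE EVERY 1 DAY
    DO
      DELETE FROM RankHistory
      WHERE rank_date < CURDATE() - INTERVAL 90 DAY;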
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), all stored by user + date of activity? Then you could keep your pre-aggregated roll-ups in a much smaller table for however far back in history you need. Triggers, or nightly procedures, can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also make it easier to move such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important the "ranking" is when you have someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter?

Storing affiliate leads and conversions

I've created an affiliate system that tracks leads and conversions. The leads and conversions records will go into the millions, so I need a good way to store them. Users will need to track the stats hourly, daily, weekly and monthly.
What's the best way to store the leads and conversions?
For this type of system, you need to keep all of the detail records, the reason being that at some point someone is going to contest an invoice.
However, you should have some roll-up tables. Each hour, compute the current hour's work and store the results. Do the same for daily, weekly and monthly.
If some skew is okay, you can compute the daily amounts off the 24 hourly records, and the weekly amounts off the last 7 daily records. For monthly you might want to compute from the hourly records again, because a month doesn't quite add up to 4 full weeks. It also helps reduce noise from any averaging you might be doing.
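A sketch of the hourly roll-up step, with invented table and column names (conversions as the detail table, conversions_hourly as the roll-up):

    -- Run just after the top of each hour: summarize the previous hour's detail rows
    INSERT INTO conversions_hourly (affiliate_id, stat_hour, conversions)
    SELECT affiliate_id,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00'),
           COUNT(*)
    FROM conversions
    WHERE created_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
      AND created_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP BY affiliate_id, DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00');

    -- Daily figures summed from the hourly roll-up
    SELECT affiliate_id, DATE(stat_hour) AS stat_day, SUM(conversions) AS conversions
    FROM conversions_hourly
    WHERE stat_hour >= CURDATE()
    GROUP BY affiliate_id, DATE(stat_hour);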
I'd recommend a two step archival process. The first one should run once a day and move the records into a separate "hot" database. Try to keep 3 months hot for any type of research queries you need to do.
The second archive process is up to you. You could simply move any records older than 3 months into some type of csv file and simply back it up. After some period of time (a year?) delete them depending on your data retention agreements.
Depending on the load, you may need to have multiple web servers handling the lead and conversion pixels firing. One option is to store the raw data records on each web/mysql server, and then run an archival process every 5-10 minutes that stores them in a highly normalized table structure, and which performs any required roll-ups to achieve the performance you are looking for.
Make sure you keep the row size as small as possible: store IPs as unsigned ints, store referrers as INTs that reference lookup tables, and so on.
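A sketch of what such a narrow detail row could look like, using MySQL's built-in INET_ATON/INET_NTOA for IPv4 addresses and a lookup table for referrers (all names are illustrative):

    CREATE TABLE referrers (
        referrer_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        url         VARCHAR(255) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE leads (
        lead_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        affiliate_id INT UNSIGNED NOT NULL,
        referrer_id  INT UNSIGNED NOT NULL,
        ip           INT UNSIGNED NOT NULL,      -- IPv4 stored as an unsigned int
        created_at   DATETIME     NOT NULL,
        KEY idx_affiliate_time (affiliate_id, created_at)
    ) ENGINE=InnoDB;

    INSERT INTO leads (affiliate_id, referrer_id, ip, created_at)
    VALUES (7, 1, INET_ATON('203.0.113.9'), NOW());

    SELECT INET_NTOA(ip) FROM leads WHERE lead_id = 1;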