Storing affiliate leads and conversions - mysql

I've created an affilaite system that tracks leads and conversions. The leads and conversions records will go into the millions so I need a good way to store them. Users will need to track the stats hourly, daily, weekly and monthly.
Whats the best way to store the leads and conversions?

For this type of system, you need to keep all of the detail records. Reason being at some point someone is going to contest an invoice.
However, you should have some roll up tables. Each hour, compute current hours work and store the results. Do the same for daily, weekly, and monthly.
If some skewing is okay, you can compute the daily amounts off of the 24 hourly computed records. Weekly, off of the last 7 daily records. For monthly you might want to compute back off of the hourly records, because each month doesn't quite add up to 4 full weeks.. Also, it helps reduce noise from any averaging you might be doing.
I'd recommend a two step archival process. The first one should run once a day and move the records into a separate "hot" database. Try to keep 3 months hot for any type of research queries you need to do.
The second archive process is up to you. You could simply move any records older than 3 months into some type of csv file and simply back it up. After some period of time (a year?) delete them depending on your data retention agreements.

Depending on the load, you may need to have multiple web servers handling the lead and conversion pixels firing. One option is to store the raw data records on each web/mysql server, and then run an archival process every 5-10 minutes that stores them in a highly normalized table structure, and which performs any required roll-ups to achieve the performance you are looking for.
Make sure you keep row size as small as possible, store IP's as unsigned ints, store referees as INTs that reference lookup tables, etc.

Related

Database design for heavy timed data logging - Car Tracking System

I am a making a car tracking system and i want to store data that each car sends after every 5 seconds in a MySql database. Assuming that i have 1000 cars transmitting data to my system after 5 seconds, and the data is stored in one table. At some point i would want to query this table to generate reports for specific vehicle. I am confused between logging all the vehicles data in one table or creating a table for each vehicle (1000 tables). Which is more efficient?
OK 86400 seconds per day / 5 = 17280 records per car and day.
Will result in 17,280,000 records per day. This is not an issue for MYSQL in general.
And a good designed table will be easy to query.
If you go for one table for each car - what is, when there will be 2000 cars in future.
But the question is also: how long do you like to store the data?
It is easy to calculate when your database is 200 GB, 800GB, 2TB,....
One table, not one table per car. A database with 1000 tables will be a dumpster fire when you try to back it up or maintain it.
Keep the rows of that table as short as you possibly can; it will have many records.
Index that table both on timestamp and on (car_id, timestamp) . The second index will allow you to report on individual cars efficiently.
Read https://use-the-index-luke.com/
This is the "tip of the iceberg". There are about 5 threads here and on dba.stackexchange relating to tracking cars/trucks. Here are some further tips.
Keep datatypes as small as possible. Your table(s) will become huge -- threatening to overflow the disk, and slowing down queries due to "bulky rows mean that fewer rows can be cached in RAM".
Do you keep the "same" info for a car that is sitting idle overnight? Think of how much disk space this is taking.
If you are using HDD disks, plain on 100 INSERTs/second before you need to do some redesign of the ingestion process. (1000/sec for SSDs.) There are techniques that can give you 10x, maybe 100x, but you must apply them.
Will you be having several servers collecting the data, then doing simple inserts into the database? My point is that that may be your first bottleneck.
PRIMARY KEY(car_id, ...) so that accessing data for one car is efficient.
Today, you say the data will be kept forever. But have you computed how big your disk will need to be?
One way to shrink the data drastically is to consolidate "old" data into, say, 1-minute intervals after, say, one month. Start thinking about what you want to keep. For example: min/max/avg speed, not just instantaneous speed. Have an extra record when any significant change occurs (engine on; engine off; airbag deployed; etc)
(I probably have more tips.)

Database design for statistics of views/likes/downloads?

So i am trying to work out a database design to keep track of views, likes and downloads. Now the amount of entries to keep track of is expected to be 1m or more, so normally i would just track each entry daily, but with 1m i am having concerns about performance and maybe even size on harddisk.
Customers wish is being able to show top statistics in ranges like last week, last month, last year. So i am not sure if should split the data up by adding up numbers from days to weeks or months and delete everything else that isn't relevant any more or just keep it a bit more flexible by tracking all of it, being able to query needed statistics freely.
Database: MySQL
I have to take save the counters once a day.
I think you need to keep track of it daily and then by the end of the week you delete the daily stats and you do so for the weekly data by the second week and do on for the months and year if you want to keep your last stats. You can create a job to make this job. I hope this may help you.
Due the operation only appending the data, store it on time series format.
Whenever you receive the clicks/views data goes in, count it and put the result as smallest dimension as we can. If we can summarizing it hourly, it's best choice ~ good for performance. Whenever we need to know data from higher dimension, just sum it up. Don't counting it (the statistic) on demand, for example : Scanning 10 millions of rows just for counting 1 day data, this is heavy operation.
With this approach you will save two thing :
Storage, you can backup older data that more than x times ( example more than 3 months), so the db size keep compact
Performance

Best Way to logging sensor data with MySQL

I am new to SQL, please advise.
I wish to logging incoming data from sensor every 5 seconds for future graph plotting
What is the best way to design database in MySQL?
Could i log with timestamp and use AVG functions when i like to display graph by hour, day, week, month ?
Or Could I log and make average every minute, hour, day to reduce database size
Is it possible to use trigger function to make average when collect data over 1 minute ?
The answer is that it depends on how much data you are actually going to be logging, how often you are going to be querying it, and how fast your response time needs to be. If it's just one sensor, every 5 seconds, you could probably go on for eternity without running into too many problems with regular sql queries to pull out averages, sums, etc. in a reasonable period of time.
I will say that from experience, you can do a lot with SQL and time series data, but you have to be very careful how you design your queries. I've worked with time series tables with billions of rows and tens of thousands of individual sensors among those rows; it's possible to achieve very fast execution over that many time series rows, but you might spend a week trying to fine-tune the database. It's definitely a trade-off between flexibility and speed.
Again, for your purposes, it probably is not going to make very much difference if you are just talking about one sensor; just write a regular SQL query. However, if you anticipate adding several hundred more sensors or increasing the sample rate, you may want to consider doing periodic "rollup" functions as you suggest. And in that case, I would be more inclined to write a custom solution using a NoSQL database (e.g. Cassandra, Couchbase, etc.) and using a program that runs periodically to do the rollup. If you are interested, I can provide details, but I really don't think you will need to go that far.
This post has a pretty good discussion on storing time series data in SQL vs NOSQL: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
You should read about RRDtool.
From RRDtool website:
RRDtool is the OpenSource industry standard, high performance data
logging and graphing system for time series data.
http://oss.oetiker.ch/rrdtool/
If you don't want to use it (it may be too complicated, too big for your application etc.) - take a look how is this made, how information is stored etc.

How to store logging data?

I have built and app which does random stuff and I want to collect various statistics which I want to display in graphs. My problem is I'm not sure how to store the data in a database except writing each log into new row which seems very inefficient.
Example Data (5 minute averages):
Number of Users
Number of Online Users
Number of Actions
What would be the best way to store this information? Do I need a separate table for thing that I'm logging or could they all go into one table?
Often you don't need the full resolution data kept for all time and you can re-process it periodically into lower resolution data to save space. For example you could store one day of full resolution (5 minute averages) but periodically re-average that data into 1 hour bins/1 day bins/ 1 month bins/ etc while also culling the data.
This allows you to have the data you need to display nice graphs of the activity over different time ranges (hour, day, week, month, etc) while limiting the number of rows to just what your application requires.
There are also excellent applications to store and display time-series data. MRTG and RRDTool come to mind. See this tutorial for good food for thought:
rrdtutorial

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
Another option, do can you create some "roll-up" aggregate per user based on whatever the criteria is... counts, sales, whatever and it is all stored based on employee + date of activity. Then you could have your pre-aggregated rollups in a much smaller table for however long in history you need. Triggers, or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without dealing with performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider that importance it is for "Ranking" when you have someone in the 222,222nd place. I would pair the daily ranking down to say the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000 does it really matter???