Storing Click Data in MongoDB

My application tracks clicks from adverts shown on remote sites, and redirects users to a product sales page.
I'm currently using MySQL to store click information (date, which link was used, IP address, custom data sent from the advertiser, etc.). The table has grown so large that it no longer meets our needs, which are:
High throughput (the app is processing 5 - 10M clicks per day and this is projected to grow)
Ability to report on the data by date range (e.g. how many clicks for link 1 over the past month grouped by country)
My initial idea was to move clicks into Redis (we only need to store them for 30 days, at which point they expire if they don't lead to a sale) and then make a new MySQL table to store generated stats by day, where we just update a counter per link when it's clicked.
When we started using the statistics table, the database quickly fell over because of the volume of queries hitting it.
Would it be best to keep the clicks in Redis and have a separate MongoDB (or other NoSQL DB) for the reporting? Or could Mongo be used to store the whole click record (just like we've been doing in MySQL), or is the volume too high?
Also, I remember reading that MongoDB is not good at reclaiming space from deleted records. Would this cause us issues, since 90% of the clicks would be deleted after 30 days anyway?
Thanks

MongoDB alone is enough to solve this problem; there is no need to store clicks in Redis first and then move them into MongoDB. Since the amount of data is very large, create indexes on the timestamp or on another high-cardinality field. That keeps queries fast, and MongoDB's aggregation framework helps with generating the reports. I don't think there is any issue with deletion.
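A minimal sketch of that setup with pymongo, assuming a clicks collection whose documents carry timestamp, link_id, and country fields (the connection string and all names here are illustrative):

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ASCENDING

# Assumed connection string and collection names -- adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
clicks = client["tracking"]["clicks"]

# TTL index: MongoDB removes documents roughly 30 days after their "timestamp" value.
clicks.create_index("timestamp", expireAfterSeconds=30 * 24 * 3600)

# Compound index to support "clicks for link X in a date range" reports.
clicks.create_index([("link_id", ASCENDING), ("timestamp", ASCENDING)])

# Example report: clicks for link 1 over the past month, grouped by country.
since = datetime.now(timezone.utc) - timedelta(days=30)
pipeline = [
    {"$match": {"link_id": 1, "timestamp": {"$gte": since}}},
    {"$group": {"_id": "$country", "clicks": {"$sum": 1}}},
    {"$sort": {"clicks": -1}},
]
for row in clicks.aggregate(pipeline):
    print(row["_id"], row["clicks"])
```

The TTL index gives you the 30-day expiry from the Redis plan directly; space freed by expired documents is generally reused for new inserts even if it is not immediately returned to the operating system.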

Related

How to efficiently handle the backend of popcat.click

I'm wondering how the makers of popcat.click handle storing and retrieving the number of clicks per country with huge traffic and quick response times. From the network tab I can see that clicks are batched and posted to the backend (instead of one POST per click).
My theory so far is that the batched click data is dumped into a table, and a cron job is used to periodically get the number of clicks from this table to calculate and increment the country counter in a different table which can then be queried quickly.
I think multiple "dump" tables would have to be used to avoid data loss, clearing the data from one when it's processed and dumping data into the next.
Am I along the right lines? What other approaches/services could be used?
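A minimal sketch of the batch-and-aggregate flow described above (SQLite stands in for whatever popcat.click actually runs; the table layout and batch format are assumptions):

```python
import sqlite3

# Illustrative schema: a raw dump table for batched click posts,
# plus a small counters table that is cheap to read.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE click_dump (country TEXT NOT NULL, clicks INTEGER NOT NULL);
CREATE TABLE country_counts (country TEXT PRIMARY KEY, clicks INTEGER NOT NULL);
""")

def ingest_batch(batch):
    # Each POST from the page carries pre-batched counts, e.g. {"US": 173, "PL": 9}.
    conn.executemany("INSERT INTO click_dump VALUES (?, ?)", batch.items())
    conn.commit()

def roll_up():
    # Periodic job: fold the dump into the counters, then clear it,
    # both inside one transaction.
    with conn:
        conn.execute("""
            INSERT INTO country_counts (country, clicks)
            SELECT country, SUM(clicks) FROM click_dump
            WHERE true            -- disambiguates the upsert clause for SQLite's parser
            GROUP BY country
            ON CONFLICT(country) DO UPDATE SET clicks = clicks + excluded.clicks
        """)
        conn.execute("DELETE FROM click_dump")

ingest_batch({"US": 173, "PL": 9})
roll_up()
print(conn.execute("SELECT * FROM country_counts").fetchall())
```

Whether one dump table is enough or you need to rotate several, as you suggest, comes down to making sure the roll-up never deletes rows it hasn't aggregated; the single transaction above covers that for SQLite, while a server database might need rotation or a processed-marker column.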

Backup and replace MySQL database each day for speed

I was thinking of making a GPS tracking app that stores clients' GPS coordinates every second in a MySQL database. I would have one row per GPS entry (lat, lon, speed, time, elevation, user id, item id, etc.) each second for each user.
My calculation is that it could take up to 3 to 5 MB of MySQL space per day per user to store all their GPS data.
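For what it's worth, that estimate is roughly consistent with a back-of-the-envelope check (the ~50-byte row size below is an assumption, before index and row overhead):

```python
# Assumed raw payload per row: lat, lon, speed, elevation as 8-byte doubles,
# an 8-byte timestamp, and two 4-byte integer IDs -- about 48-50 bytes.
bytes_per_row = 50
rows_per_day = 24 * 60 * 60             # one row per second per user
mb_per_user_per_day = bytes_per_row * rows_per_day / 1_000_000
print(round(mb_per_user_per_day, 1))    # ~4.3 MB/day, in line with the 3-5 MB guess
```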
I don't know how MySQL works, but I'm assuming it would certainly slow down if I have thousands of users running this poor database into the ground each day, and the size would reach terabytes.
So I thought I would back up the whole database each day, delete it, and start a new database. This would create a bunch of smaller databases, which may be better than having one extremely large database whose size keeps growing forever.
But then I'm wondering whether duplicate IDs across the databases may be a problem: when a new database is created, all the IDs would start over, conflicting with all the previous IDs in the other databases.
Does anybody know how the big boys do this kind of thing? I'm sure they don't have one incredibly large database with everything in it that keeps growing by terabytes each day. Does database access slow down as the size grows larger and larger? I don't really know how it works.
Thanks

Incremental/decremental DB design

My question is about a good design for a DB that will hold information about a list of items that can be incremented or decremented every X seconds. The idea is to optimize it so there won't be duplicate information.
Example:
I have a script running every 5 seconds collecting information about the computers connected to a WiFi network, and I want to store this information in a DB. I don't want to save anything in the DB when scan n contains the same users as scan n-1.
Is there any specific DB design that would be useful for storing information about a new WiFi client connecting to the network, or about an existing WiFi client leaving it?
What kind of DB is better for this kind of incremental/decremental use case?
Thank you
If this is like "up votes" or "likes", then the one piece of advice is to use a 'parallel' table to hold just the counter and the id of the item. This will minimize interference (locking) whenever doing the increment/decrement.
Meanwhile, if you need the rest of the info on an item, plus the counter, then it is simple enough to JOIN the two tables.
If you have fewer than a dozen increments/decrements per second, it doesn't really matter.
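A minimal sketch of the parallel counter table, using SQLite for a self-contained example (the item and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide table with the rest of the item's info.
CREATE TABLE items (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Narrow 'parallel' table holding only the hot counter,
-- so increments/decrements touch (and lock) as little as possible.
CREATE TABLE item_counts (
    item_id INTEGER PRIMARY KEY REFERENCES items(id),
    n       INTEGER NOT NULL DEFAULT 0
);
""")
conn.execute("INSERT INTO items VALUES (1, 'laptop-42')")
conn.execute("INSERT INTO item_counts (item_id) VALUES (1)")

# An increment (or a decrement with -1) touches only the narrow table.
conn.execute("UPDATE item_counts SET n = n + 1 WHERE item_id = ?", (1,))

# When you need the item plus its counter, JOIN the two tables.
row = conn.execute("""
    SELECT items.name, item_counts.n
    FROM items JOIN item_counts ON item_counts.item_id = items.id
    WHERE items.id = ?
""", (1,)).fetchone()
print(row)   # ('laptop-42', 1)
```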

How to store logging data?

I have built an app which does random stuff, and I want to collect various statistics which I want to display in graphs. My problem is I'm not sure how to store the data in a database except writing each log entry into a new row, which seems very inefficient.
Example Data (5 minute averages):
Number of Users
Number of Online Users
Number of Actions
What would be the best way to store this information? Do I need a separate table for each thing that I'm logging, or could they all go into one table?
Often you don't need the full-resolution data kept for all time, and you can re-process it periodically into lower-resolution data to save space. For example, you could store one day of full resolution (5-minute averages) but periodically re-average that data into 1-hour, 1-day, 1-month, etc. bins while culling the older full-resolution data.
This allows you to have the data you need to display nice graphs of the activity over different time ranges (hour, day, week, month, etc) while limiting the number of rows to just what your application requires.
There are also excellent applications to store and display time-series data. MRTG and RRDTool come to mind. See this tutorial for good food for thought:
rrdtutorial
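A sketch of that re-averaging step in plain Python, assuming the raw rows are (timestamp, metric, value) tuples holding the 5-minute averages (in practice this could equally be a scheduled SQL job or an RRD):

```python
from collections import defaultdict
from datetime import datetime

# Raw 5-minute samples: (timestamp, metric, value). Names are illustrative.
samples = [
    (datetime(2023, 5, 1, 12, 0), "online_users", 40),
    (datetime(2023, 5, 1, 12, 5), "online_users", 44),
    (datetime(2023, 5, 1, 13, 0), "online_users", 52),
]

def rebin_hourly(rows):
    """Average 5-minute samples into 1-hour bins, per metric."""
    buckets = defaultdict(list)
    for ts, metric, value in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, metric)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

for (hour, metric), avg in sorted(rebin_hourly(samples).items()):
    print(hour, metric, avg)
```

Once a raw span has been rolled into its bin, the 5-minute rows older than your full-resolution window can be deleted, keeping the row count close to what the graphs actually need.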

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: if you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data; there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 and 200,000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility, though the specifics are somewhat beyond my realm of expertise: you seem to have an ideal candidate for an OLAP cube here. You have a fact (rank) that you want to view in the context of two dimensions (user and date), and there are tools out there for managing this sort of scenario efficiently, even on very large datasets.
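For a concrete picture of how small each record can be, here is a sketch of the per-user-per-day table with a composite primary key that also serves the 90-day graph query (SQLite stands in for MySQL; all names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One narrow row per user per day: two integers and a date.
CREATE TABLE rank_history (
    user_id   INTEGER NOT NULL,
    rank_date TEXT    NOT NULL,   -- 'YYYY-MM-DD'
    user_rank INTEGER NOT NULL,
    PRIMARY KEY (user_id, rank_date)
);
""")
conn.execute("INSERT INTO rank_history VALUES (42, date('now'), 1337)")

# The graph for one user scans a single index range: one user_id, 90 contiguous dates.
rows = conn.execute("""
    SELECT rank_date, user_rank
    FROM rank_history
    WHERE user_id = ? AND rank_date >= date('now', '-90 days')
    ORDER BY rank_date
""", (42,)).fetchall()
print(rows)
```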
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
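A sketch of such a cleanup job, reusing the hypothetical rank_history table and sqlite3 connection from the sketch above (a real deployment would run the equivalent DELETE against MySQL from cron or the event scheduler):

```python
from datetime import date, timedelta

def purge_old_ranks(conn, keep_days=90):
    """Scheduled job: drop rank rows older than keep_days."""
    cutoff = (date.today() - timedelta(days=keep_days)).isoformat()
    with conn:  # one transaction: committed on success, rolled back on error
        conn.execute("DELETE FROM rank_history WHERE rank_date < ?", (cutoff,))

# purge_old_ranks(conn)   # e.g. invoked nightly by a scheduler
```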
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are (counts, sales, whatever), all stored per user + date of activity? Then you could have your pre-aggregated roll-ups in a much smaller table for however long a history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without dealing with performance issues. This would also help ease moving such records to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important "ranking" really is when someone is in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter???