How to efficiently handle the backend of popcat.click - mysql

I'm wondering how the makers of popcat.click handle storing and retrieving the number of clicks per country with huge traffic and quick response times. From the network tab I see that clicks are batched and posted to the backend (instead of 1 post per click).
My theory so far is that the batched click data is dumped into a table, and a cron job is used to periodically get the number of clicks from this table to calculate and increment the country counter in a different table which can then be queried quickly.
I think multiple "dump" tables would have to be used to avoid data loss, clearing the data from one when it's processed and dumping data into the next.
Am I along the right lines? What other approaches/services could be used?
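For what it's worth, here is a minimal sketch of the dump-and-aggregate flow I have in mind; the table and column names are my own assumptions, not anything taken from popcat.click itself:

    -- Staging table that the batched POSTs get dumped into (hypothetical schema)
    CREATE TABLE click_dump (
      id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      country_code CHAR(2) NOT NULL,
      clicks       INT UNSIGNED NOT NULL,   -- size of the batch sent by the client
      created_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      PRIMARY KEY (id)
    );

    -- Pre-aggregated counters that the site can read cheaply
    CREATE TABLE country_clicks (
      country_code CHAR(2) NOT NULL,
      total_clicks BIGINT UNSIGNED NOT NULL DEFAULT 0,
      PRIMARY KEY (country_code)
    );

    -- Cron job: fold the staged rows into the counters, then clear only what
    -- was processed, so rows that arrive mid-run are not lost
    SET @max_id = (SELECT COALESCE(MAX(id), 0) FROM click_dump);

    INSERT INTO country_clicks (country_code, total_clicks)
      SELECT country_code, SUM(clicks)
      FROM click_dump
      WHERE id <= @max_id
      GROUP BY country_code
    ON DUPLICATE KEY UPDATE total_clicks = total_clicks + VALUES(total_clicks);

    DELETE FROM click_dump WHERE id <= @max_id;

Bounding the DELETE by the captured id would be one way to avoid the data-loss problem without rotating between multiple dump tables.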

Related

Sync data from multiple local mysql instances to one cloud database

I am looking for a solution to sync data from multiple small instances to one big cloud instance.
I have many devices gathering data logs, and every device has its own database, so I need a solution to sync data from them to one instance. The delay is not critical, but I want to sync the data with a max delay of 5-10 minutes.
Is there any ready solution for it?
Assuming all the data is independent, INSERT all the data into a single table. That table would, of course, have a device_id column to distinguish where the numbers are coming from.
What is the total number of rows per second you need to handle? If it is less than 1000/second, there should be no problem inserting the rows into the same table as they arrive.
Are you using HTTP? Or something else to do the INSERTs? PHP? Java?
With this, you will rarely see more than a 1 second delay between the reading being taken and the table having the value.
I recommend
PRIMARY KEY(device_id, datetime)
And the use of Summary tables rather than slogging through that big Fact table to do graphs and reports.
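Here is a minimal sketch of that layout, assuming a simple numeric reading per row (sensor_log and readings_by_hour are placeholder names):

    CREATE TABLE sensor_log (
      device_id SMALLINT UNSIGNED NOT NULL,
      datetime  DATETIME NOT NULL,
      value     FLOAT NOT NULL,
      PRIMARY KEY (device_id, datetime)
    ) ENGINE=InnoDB;

    CREATE TABLE readings_by_hour (
      device_id SMALLINT UNSIGNED NOT NULL,
      hr        DATETIME NOT NULL,
      row_count INT UNSIGNED NOT NULL,
      sum_value DOUBLE NOT NULL,   -- store SUM; compute AVG as sum_value / row_count
      PRIMARY KEY (device_id, hr)
    ) ENGINE=InnoDB;

    -- Run once per hour (e.g. from cron) to roll up the hour that just ended
    INSERT INTO readings_by_hour (device_id, hr, row_count, sum_value)
      SELECT device_id,
             DATE_FORMAT(datetime, '%Y-%m-%d %H:00:00') AS hr,
             COUNT(*),
             SUM(value)
      FROM sensor_log
      WHERE datetime >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
        AND datetime <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
      GROUP BY device_id, hr;

Graphs and reports then read readings_by_hour, which stays small even as sensor_log grows.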
Provide more details if you would like further advice.

Graphing login events, but need extra data. Rewrite history or post-process?

We have been tracking user login events for a while now in a MongoDB collection. Each event contains the userID, datetime, and a couple other fundamental attributes about the event.
For a new feature, we want to present a graph of these login events, with different groups representing cohorts related to the user who did the event. Specifically, we want to group by the "Graduation Year" attribute of the user.
In our event log, we do not record the Graduation Year of the user who's logging in, so we cannot easily query that directly. We see two ways to go forward, plus a 3rd "in-between" option:
Option 1: Instead of making a single MongoDB query to get the logins, we make that query PLUS a second one to our Relational DB to get the secondary user data we require, and merge the two together.
We could optionally query for all the users, load them into memory, and loop through the Events, or we could go through the events and find only the User IDs that logged in and query for those specific User IDs. (Then loop again, merging them in.)
The post-processing could be done on the server-side or we could send all the data to the client. (Currently our plan is to just send the raw event data to the client for processing into the graph.)
Upsides: The event log is made to track events. User "Graduation Year" is not relevant to the event in question; it's relevant to the user who did the event. This seems to separate concerns more properly. As well, if we later decide we want to group on a different piece of metadata (let's say: male vs female), it's easy to just join that data in as well.
Downsides: Part of the beauty of our event log is that it can quickly spit out tons of aggregate data that's ready to use. If there are 10,000 users, we may have 100,000 logins. It seems crazy to need to loop through 100,000 logins every time this data is requested fresh (as in, not cached).
Option 2: We can write a script that does a one-time load of all the events (presumably in batches), then requests the user metadata and merges it in, rewriting the Event Log to include the relevant data.
Upsides: The event log is our single point of interaction when loading the data. Client requests all the logins; gets 100,000 rows; sorts them and groups them according to Graduation Year; [Caches it;] and graphs it. Will have a script ready to re-add more data if it came to that, down the road.
Downsides: We're essentially rewriting history. We're polluting our event log with secondary data that isn't explicitly about the event we claim to be tracking. Need to rewrite or modify the script to add more data that we didn't know we wanted to track, if we had to, down the road.
Option 3: We replicate the Users table in MongoDB, perhaps only as-needed (say, when an event's metadata is unavailable), and do a join (I guess that's a "$lookup" in Mongo) to this table.
Upsides: MongoDB does the heavy lifting of merging the data.
Downsides: We need to replicate, and somehow keep up to date, a secondary collection of our Users' relevant metadata. I don't think MongoDB's $lookup works like a join in MySQL, and maybe it isn't really any more performant at all? Although I'd look into this before we implemented it.
For the sake of estimation, let's just say that any given visitor to our site will never have to load more than 100,000 logins and 10,000 users.
For what it's worth, Option #2 seems most preferable to me, even though it involves rewriting history, for performance reasons. Although I am aware that, at some point, if we were sending a user's browser multiple years of login data (that is, all 100,000 imaginary logins), maybe that's already too much data for their browser to process and render quickly, and perhaps we'd already be better off grouping it and aggregating it as some sort of regularly-scheduled process on the backend. (I don't know!)
By Data Warehouse standards, 100K rows is quite small.
Performance in a DW depends on building and maintaining "Summary Tables". This makes a pre-determined set of possible queries very efficient, without having to scan the entire 'Fact' table. My discussion of Summary Tables (in MySQL): http://mysql.rjweb.org/doc.php/summarytables
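As an illustrative sketch only (the schema and names are assumptions, and it presumes the login events are reachable from MySQL, e.g. exported from MongoDB), a logins-per-day-per-graduation-year rollup might look like:

    CREATE TABLE logins_by_grad_year (
      login_date  DATE NOT NULL,
      grad_year   SMALLINT UNSIGNED NOT NULL,
      login_count INT UNSIGNED NOT NULL,
      PRIMARY KEY (login_date, grad_year)
    );

    -- Daily rollup: join the raw login events to the users table to pick up Graduation Year
    INSERT INTO logins_by_grad_year (login_date, grad_year, login_count)
      SELECT DATE(e.event_datetime), u.grad_year, COUNT(*)
      FROM login_events e
      JOIN users u ON u.user_id = e.user_id
      WHERE e.event_datetime >= CURDATE() - INTERVAL 1 DAY
        AND e.event_datetime <  CURDATE()
      GROUP BY DATE(e.event_datetime), u.grad_year;

    -- The graph then reads a few thousand summary rows instead of 100K raw events
    SELECT login_date, grad_year, login_count
    FROM logins_by_grad_year
    WHERE login_date >= CURDATE() - INTERVAL 1 YEAR;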

Incremental/decremental DB design

My question is about a good design for a DB that will hold information about a list of items that can be incremented or decremented every X seconds. The idea is to optimize it so there won't be duplicate information.
Example:
I have a script running every 5 seconds collecting information about the computers connected to a WiFi network, and I want to store this information in a DB. I don't want to save anything to the DB when scan n contains the same users as scan n-1.
Is there any specific DB design that is useful for storing information about a new WiFi client connecting to the network, or about an existing WiFi client leaving it?
What kind of DB is better for this kind of incremental/decremental use case?
Thank you
If this is like "up votes" or "likes", then the one piece of advice is to use a 'parallel' table to hold just the counter and the id of the item. This will minimize interference (locking) whenever doing the increment/decrement.
Meanwhile, if you need the rest of the info on an item, plus the counter, then it is simple enough to JOIN the two tables.
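A minimal sketch of that layout, with made-up names:

    -- Main table: one row per item, with the descriptive columns
    CREATE TABLE item (
      item_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      name    VARCHAR(100) NOT NULL,
      -- ... other attributes ...
      PRIMARY KEY (item_id)
    );

    -- Parallel table: nothing but the id and the counter,
    -- so the hot UPDATEs only lock these tiny rows
    CREATE TABLE item_counter (
      item_id INT UNSIGNED NOT NULL,
      counter INT NOT NULL DEFAULT 0,
      PRIMARY KEY (item_id)
    );

    -- Increment / decrement
    UPDATE item_counter SET counter = counter + 1 WHERE item_id = 42;
    UPDATE item_counter SET counter = counter - 1 WHERE item_id = 42;

    -- When you need the rest of the info plus the counter
    SELECT i.*, c.counter
    FROM item AS i
    JOIN item_counter AS c USING (item_id)
    WHERE i.item_id = 42;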
If you have fewer than a dozen increments/decrements per second, it doesn't really matter.

Storing Click Data in MongoDB

My application tracks clicks from adverts shown on remote sites, and redirects users to a product sales page.
I'm currently using MySQL to store click information (date, which link was used, ip address, custom data sent from the advertiser etc). The table is getting so big that it no longer fits our needs, which are:
High throughput (the app is processing 5 - 10M clicks per day and this is projected to grow)
Ability to report on the data by date range (e.g. how many clicks for link 1 over the past month grouped by country)
My initial idea was to move clicks into Redis (we only need to store them for 30 days, at which point they expire if they don't lead to a sale) and then make a new MySQL table to store generated stats by day, where we just update a counter per link when it's clicked.
When we started using the statistics table, the database quickly fell over because of the number of queries to that table.
Would it be best to keep the clicks in Redis and have a separate MongoDB (or other NoSQL DB) for the reporting? Or could Mongo be used to store the whole click (just like we've been doing in MySQL), or is the volume too high?
Also I remember reading that MongoDB is not good at reclaiming space from deleted records, would this cause us issues since 90% of the clicks would be deleted after 30 days anyway?
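For reference, the per-day stats table I had in mind was roughly this (a sketch with made-up names, not our actual schema):

    -- One counter row per link, per day, per country
    CREATE TABLE click_stats_daily (
      link_id      INT UNSIGNED NOT NULL,
      click_date   DATE NOT NULL,
      country_code CHAR(2) NOT NULL,
      clicks       INT UNSIGNED NOT NULL DEFAULT 0,
      PRIMARY KEY (link_id, click_date, country_code)
    );

    -- Bump the counter on each click
    INSERT INTO click_stats_daily (link_id, click_date, country_code, clicks)
    VALUES (1, CURDATE(), 'GB', 1)
    ON DUPLICATE KEY UPDATE clicks = clicks + 1;

    -- Example report: clicks for link 1 over the past month, grouped by country
    SELECT country_code, SUM(clicks) AS clicks
    FROM click_stats_daily
    WHERE link_id = 1
      AND click_date >= CURDATE() - INTERVAL 1 MONTH
    GROUP BY country_code;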
Thanks
MongoDB is enough to solve this problem on its own, compared to storing in Redis and then moving to MongoDB. Since the amount of data is very large, you can create indexes on the timestamp or on a field with high cardinality. This makes your queries fast, and MongoDB also provides aggregation, which helps in generating the reports. I don't think there is any issue with deletion.

Read vs Write tables database design

I have a user activity tracking log table where I log all user activity as it occurs. This is an extremely high-write table due to the in-depth, click-by-click tracking. Up to here the database design is fine. The problem is the next step.
I need to output the data for the business folks, and these people can also query it to fetch past activity data. Hence there is medium-to-high read traffic as well. I do not like the idea of reading and writing from the same high-traffic table.
So ideally I want to split the tables: the first one for quick writes (few to no FKs), then copy that data over, fully formatted and with all the labels for the IDs pulled in, to a read table used purely for reads.
So questions:
1) Is this the best approach for me?
2) If I do keep 2 tables, how do I keep them in sync? I can't copy the data to the read table the instant it is written to the write table - that would defeat the whole purpose of having separate tables - nor can I let the read table get stale, because the tracked activity data links with other user data like session_id, etc., so if these IDs are not ready when their use case calls for them, the writes will fail.
I am using MySQL for user data and HBase for some large tables, with PHP CodeIgniter for my app.
Thanks.
Yes, having 2 separate tables is the best approach. I had the same problem to solve a few months ago, though for a daemon-type application and not a website.
Eventually I ended up with 1 MEMORY table keeping "live" data, which is inserted/updated/deleted on almost every event, and another table holding duplicates of the live data rows but without the unnecessary system columns - my history table, which is used for reading only, per request.
The live table is only relevant to the running process, so I don't care if the contained data is lost due to a server failure - whatever data needs to be read later is already stored in the history table. So ... there's no problem in duplicating the data in the two tables - your goal is performance, not normalization.
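A rough sketch of that split, with invented names and columns (actions is a hypothetical lookup table for the labels; the MEMORY engine loses its contents on restart, which is fine here because the history table keeps everything durable):

    -- "Live" table: hot inserts/updates/deletes, tiny rows, kept in RAM
    CREATE TABLE activity_live (
      event_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      session_id BIGINT UNSIGNED NOT NULL,
      user_id    INT UNSIGNED NOT NULL,
      action_id  SMALLINT UNSIGNED NOT NULL,
      created_at DATETIME NOT NULL,
      PRIMARY KEY (event_id)
    ) ENGINE=MEMORY;

    -- History table: durable, denormalized copy used only for reads/reports
    CREATE TABLE activity_history (
      user_id      INT UNSIGNED NOT NULL,
      action_label VARCHAR(50) NOT NULL,   -- label already pulled in, so reports need no joins
      created_at   DATETIME NOT NULL,
      KEY (user_id, created_at)
    ) ENGINE=InnoDB;

    -- Periodic flush (e.g. every minute): copy live rows into history
    -- with the labels resolved, then delete only what was copied
    SET @cutoff = NOW() - INTERVAL 1 MINUTE;

    INSERT INTO activity_history (user_id, action_label, created_at)
      SELECT l.user_id, a.label, l.created_at
      FROM activity_live AS l
      JOIN actions AS a ON a.action_id = l.action_id
      WHERE l.created_at < @cutoff;

    DELETE FROM activity_live WHERE created_at < @cutoff;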