My question is about a good design for a DB that will hold information about a list of items that can be incremented or decremented every X seconds. The idea is to optimize it so that no duplicate information is stored.
Example:
I have a script running every 5 seconds that collects information about the computers connected to a WiFi network, and I want to store this information in a DB. I don't want to save anything in the DB if scan n contains the same users as scan n-1.
Is there any specific DB design that is useful for storing information about a new WiFi client that joined the network, or about an existing WiFi client that left it?
What kind of DB is better for this kind of incremental/decremental use case?
Thank you
If this is like "up votes" or "likes", then the one piece of advice is to use a 'parallel' table to hold just the counter and the id of the item. This will minimize interference (locking) whenever doing the increment/decrement.
Meanwhile, if you need the rest of the info on an item, plus the counter, then it is simple enough to JOIN the two tables.
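A minimal sketch of that layout (table and column names are placeholders, not from your schema):

-- Wide item table: descriptive data, rarely updated.
CREATE TABLE item (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    -- ... other descriptive columns ...
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Narrow parallel table: just the id and the counter, updated constantly.
CREATE TABLE item_counter (
    item_id INT UNSIGNED NOT NULL,
    votes   INT NOT NULL DEFAULT 0,
    PRIMARY KEY (item_id)
) ENGINE=InnoDB;

-- Increment/decrement touches only the narrow table:
UPDATE item_counter SET votes = votes + 1 WHERE item_id = 42;

-- When the rest of the item's info is needed, JOIN the two tables:
SELECT i.*, c.votes
FROM item AS i
JOIN item_counter AS c ON c.item_id = i.id
WHERE i.id = 42;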
If you have fewer than a dozen increments/decrements per second, it doesn't really matter.
Related
I am looking for a solution to sync data from multiple small instances to one big cloud instance.
I have many devices gathering data logs; every device has its own database, so I need a solution to sync data from them to one instance. Latency is not critical, but I want the data synced with a maximum delay of 5-10 minutes.
Is there any ready-made solution for it?
Assuming all the data is independent, INSERT all the data into a single table. That table would, of course, have a device_id column to distinguish where the numbers are coming from.
What is the total number of rows per second you need to handle? If less than 1000/second, there should be no problem inserting the rows into the same table as they arrive.
Are you using HTTP? Or something else to do the INSERTs? PHP? Java?
With this, you will rarely see more than a 1 second delay between the reading being taken and the table having the value.
I recommend
PRIMARY KEY(device_id, datetime)
And the use of Summary tables rather than slogging through that big Fact table to do graphs and reports.
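A rough sketch of what that could look like, assuming hourly summaries and a single numeric reading per row (all names and types are illustrative):

-- Fact table: one row per reading per device.
CREATE TABLE readings (
    device_id SMALLINT UNSIGNED NOT NULL,
    datetime  DATETIME NOT NULL,
    value     FLOAT NOT NULL,
    PRIMARY KEY (device_id, datetime)
) ENGINE=InnoDB;

-- Summary table: one row per device per hour, used for graphs and reports.
CREATE TABLE readings_hourly (
    device_id SMALLINT UNSIGNED NOT NULL,
    hr        DATETIME NOT NULL,
    row_count INT UNSIGNED NOT NULL,
    sum_value DOUBLE NOT NULL,
    PRIMARY KEY (device_id, hr)
) ENGINE=InnoDB;

-- Cron job run shortly after the top of each hour:
-- roll the previous complete hour into the summary table.
INSERT INTO readings_hourly (device_id, hr, row_count, sum_value)
SELECT device_id,
       DATE_FORMAT(datetime, '%Y-%m-%d %H:00:00') AS hr,
       COUNT(*),
       SUM(value)
FROM readings
WHERE datetime >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND datetime <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY device_id, hr;

Reports and graphs then read readings_hourly instead of scanning the big Fact table.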
Provide more details if you would like further advice.
The concept of DB sharding at a high level makes sense: split the data across DB nodes so that no single node is responsible for all of the persistent data. However, I'm a little confused about what constitutes the "shard". Does each shard hold a copy of every table, or usually just a slice of a single one?
For instance, if we take Twitter as an example, at the most basic level we need a users and a tweets table. If we shard based on user ID, with 10 shards, it would follow that the shard function is userID mod 10 === shard location. However, what does this mean for the tweets table? Is that kept separate (a single DB table), or is every single tweet divided up between the 10 shards, based on whichever user ID created the tweet?
If it is the latter, and say we shard on something other than user ID, tweet created timestamp for example, how would we know where to look up info relating to the user if all tables are sharded based on tweet creation time (which the user has no concept of)?
Sharding is splitting the data across multiple servers. The choice of how to split is critical, and is not always obvious.
At first glance, splitting tweets by userid sounds correct. But what other things are there? Is there any "grouping" or do you care who "receives" each tweet?
A photo-sharing site is probably best split on Userid, with meta info for the user's photos also on the same server with the user. (Where the actual photos live is another discussion.) But what do you do with someone who manages to upload a million photos? Hopefully that won't blow out the disk on whichever shard he is on.
One messy case is Movies. Should you split on movies? Reviews? Users who write reviews? Genres?
Sure, "mod 10" is convenient for saying which shard a user is on. That is, until you need an 11th shard! I prefer a compromise between "hashing" and "dictionary". First do mod 4096, then lookup in a 'dictionary' that maps 4096 values to 10 shards. Then, write a robust tool to move one group of users (all with the same mod-4096 value) from one shard to another. In the long run, this tool will be immensely convenient for handling hardware upgrades, software upgrades, trump-sized tweeters, or moving everyone else out of his way, etc.
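A minimal sketch of that dictionary, assuming it lives on some central (or replicated) server; the 4096 bucket count comes from the text above, everything else is illustrative:

-- Maps 4096 hash buckets onto however many shards currently exist.
CREATE TABLE shard_map (
    bucket   SMALLINT UNSIGNED NOT NULL,  -- user_id MOD 4096
    shard_id TINYINT UNSIGNED NOT NULL,   -- which physical server holds this bucket
    PRIMARY KEY (bucket)
) ENGINE=InnoDB;

-- Routing a request for user 123457:
SELECT shard_id FROM shard_map WHERE bucket = 123457 MOD 4096;

-- Adding an 11th shard later: the "robust tool" copies one bucket's users,
-- verifies them, then flips that bucket's dictionary entry.
UPDATE shard_map SET shard_id = 11 WHERE bucket = 1729;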
If you want to discuss sharding tweets further, please provide the main tables that are involved. Also, I have strong opinions on how you could issue unique ids for the tweets, if you need them. (There are ways to do it that end in fiasco.)
We have a website with many users. To track users who transacted on a given day, we use Redis and store a string of bits as the value. For instance, if our system had five users, and users 2 and 5 transacted on 2nd January, our key for 2nd January would look like '01001'. This also helps us determine unique users over a given period and new users using simple bit operations. However, with a growing number of users, we are running out of memory to store all these keys.
Is there any alternative database that we can use to store the data in a similar manner? If not, how should we store the data to get similar performance?
Redis' memory usage can be affected by many parameters, so I would also try looking at INFO ALL for starters.
With every user represented by a bit, 400K daily visitors should take at least 50KB per value, but because the bitmap is sparse it could be much larger. I'd also suspect that since newer users are more active, the majority of your bitmaps' "active" flags are towards the end, causing each bitmap to reach close to its maximal size (i.e. the total number of users). So the question you should be trying to answer is how to store these 400K visits efficiently without sacrificing the functionality you're using. That actually depends on what you're doing with the recorded visits.
For example, if you're only interested in total counts, you could consider using the HyperLogLog data structure to count your transacting users with a low error rate and a small memory/resource footprint. On the other hand, if you're trying to track individual users, perhaps keep a per-user bitmap mapped to the days since signing up with your site.
Furthermore, there are bitmap compression techniques that you could consider implementing in your application code/Lua scripting/hacking Redis. The best answer would depend on what you're trying to do of course.
Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods for generating the feeds and I would like to ask which is better in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
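As a sketch of what that could look like, assuming friends has (user_id, friend_id) columns and event_log has (id, user_id, created); adjust to your actual schema:

-- Pre-computed feed: one row pairs one event with one recipient.
CREATE TABLE user_feed (
    user_id  INT UNSIGNED NOT NULL,   -- the friend who will see the event
    event_id INT UNSIGNED NOT NULL,
    created  DATETIME NOT NULL,
    PRIMARY KEY (user_id, created, event_id)
) ENGINE=InnoDB;

-- Background task, for each new unprocessed event:
INSERT INTO user_feed (user_id, event_id, created)
SELECT f.friend_id, e.id, e.created
FROM event_log AS e
JOIN friends   AS f ON f.user_id = e.user_id
WHERE e.id = @unprocessed_event_id;

-- Serving a feed is then a single-table range scan, no join:
SELECT event_id
FROM user_feed
WHERE user_id = @viewer_id
ORDER BY created DESC
LIMIT 50;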
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, and the brief is for feeds to update as close to real time as possible.
The hypothesised solution seems much better to me: all processing is done offline, so no user waits for a page to generate, and there are no joins, so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed grow roughly with the number of users times the average number of friends, and both presumably keep increasing as the network becomes larger and more interconnected. Secondly, consider that if in the course of a minute 1000 users each entered a new message and each had 100 friends, your background thread has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update which contains a single row for each user and a timestamp for the last time that user's feed was changed.
Then, although the full join and query would be required to refresh the feed, a quick query to the last_user_feed_update table will tell whether a refresh is required or not. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
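A minimal sketch of that compromise, assuming friends has (user_id, friend_id) columns; the other names are illustrative:

-- One row per user: when was this user's feed last invalidated?
CREATE TABLE last_user_feed_update (
    user_id      INT UNSIGNED NOT NULL,
    last_changed DATETIME NOT NULL,
    PRIMARY KEY (user_id)
) ENGINE=InnoDB;

-- Background thread, per new event: bump every friend of the event's author.
UPDATE last_user_feed_update AS l
JOIN friends AS f ON f.friend_id = l.user_id
SET l.last_changed = NOW()
WHERE f.user_id = @event_author_id;

-- On page view: compare this timestamp with the cached feed's build time
-- and only re-run the expensive join if the cache is stale.
SELECT last_changed FROM last_user_feed_update WHERE user_id = @viewer_id;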
The hypothesized method works better when you limit the maximum number of friends; a lot of sites set a safe upper boundary, including Facebook IIRC. It limits "hiccups" from when your 100,000-friend user generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden that these inactive users place on you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and in a feed-reading activity you fetch and aggregate all your friends' keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.
The way I currently do it in MySQL is:
UPDATE table SET hits=hits+1 WHERE id = 1;
This keeps live stats on the site but, as I understand it, isn't the best way to go about doing this.
Edit:
Let me clarify... this is for counting hits on specific item pages. I have a listing of movies, and I want to count how many views each movie page has gotten. After it +1s, it adds the movie ID to a session var, which stores the IDs of all the pages the user viewed. If the ID of the page is in that array, it won't +1 it.
If your traffic is high enough, you shouldn't hit the database on every request. Try keeping the count in memory and sync the database on a schedule (for example, update the database on every 1000 requests or every minute.)
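For the database side of that flush, one possible sketch (assuming a narrow counter table keyed by movie id; the literal VALUES rows stand in for whatever the application accumulated in memory since the last flush):

-- Narrow counter table, one row per movie.
CREATE TABLE movie_hits (
    movie_id INT UNSIGNED NOT NULL,
    hits     INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (movie_id)
) ENGINE=InnoDB;

-- Periodic flush (every minute / every 1000 requests), all in one statement:
INSERT INTO movie_hits (movie_id, hits)
VALUES (1, 37), (2, 5), (9, 112)
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);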
You could take an approach similar to Stack Overflow's view count. This basically increments the counter when the image is loaded. This has a few useful aspects:
* Robots often don't download images, so these don't increment the view count.
* Browsers cache images, so when you go back to a page, you're not causing work for the server.
* The potentially slower code is run async from the rest of the page, so it doesn't slow down the page from being visible.
To optimize the updates:
* Keep the counter in a single, narrow table, with a clustered index on the key.
* Have the table served up by a different database server / host.
* Use memcached and/or a queue to allow the write to either be delayed or run async.
If you don't need to display the view count in real time, then your best bet is to include the movie id in your URL somewhere, and use log scraping to populate the database at the end of the day.
Not sure which web server you are using.
If your web server logs requests to the site, say one line per request in a text file, then you could just count the lines in your log files.
Your solution has a problem in that it locks the row in the database, so concurrent requests that try to increment the same counter have to wait for each other.
It really depends on whether you want hits or views:
1 view from 1 IP = 1 person looking at a page
1 person refreshing the same page = multiple hits but only one view
I always prefer Google Analytics etc. for something like this. You need to make sure that this DB update is only done once per view, or you could quite easily be flooded.
I'm not sure what you're using, but you could set a cron job to automatically update the count every x minutes in Google App Engine. I think you'd use memcache to save the counts until your cron job is run. Although... GAE does have some stat reporting, but you'd probably want to have your own data also. I think you can use memcache on other systems, and set cron jobs on them too.
Use logging software. Google Analytics is pretty and feature-filled (and generates zero load on your servers), but it'll miss non-JavaScript hits. If every single hit is important, use a server log analyzer like webalizer or awstats.
In general with MySQL:
If you use a MyISAM table: writes lock the whole table, so you are better off doing an INSERT into a separate table. Then, with a cron job, you UPDATE the values in your movie table.
If you use an InnoDB table: the lock is on the row, so you can UPDATE the value directly.
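A sketch of the MyISAM-friendly variant described above, assuming a movie table with id and hits columns; all names are illustrative:

-- Append-only hit log; INSERTs don't fight over the movie rows.
CREATE TABLE hit_log (
    movie_id INT UNSIGNED NOT NULL,
    hit_time DATETIME NOT NULL
) ENGINE=MyISAM;

-- Per page view:
INSERT INTO hit_log (movie_id, hit_time) VALUES (1, NOW());

-- Cron job: fold everything logged so far into the movie table,
-- then delete only what was counted.
SET @cutoff := NOW();
UPDATE movie AS m
JOIN (SELECT movie_id, COUNT(*) AS n
      FROM hit_log
      WHERE hit_time < @cutoff
      GROUP BY movie_id) AS h ON h.movie_id = m.id
SET m.hits = m.hits + h.n;
DELETE FROM hit_log WHERE hit_time < @cutoff;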
That said, depending on the "maturity" and success of your project, you may need to implement a different solution, so:
1st advice: benchmark, benchmark, benchmark.
2nd advice: Using the data from the 1st advice, identify the bottleneck and select the solution for the issue you face but not a future issue you think you might have.
Here is a great video on this: http://www.youtube.com/watch?v=ZW5_eEKEC28
Hope this helps. :)