Scaling Pinterest user table sharding and ensuring consistency when opening new shards - mysql

So this is very much a conceptual question (as much as I'd love to build a billion user app I don't think it's going to happen).
I've read the article by Pinterest on how they scaled their MySQL fleet a number of times ( https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f ) and I still don't get how they would "open up new shards" without affecting existing users.
The article states that every table is on every shard, including the User table.
So I'm assuming that when a user registers and is assigned a random shard, this has to be done via a function that will always return the same result regardless of the number of shards.
e.g. if I sign up with test@example.com, they would potentially use that email to work out the shard id, and this would have to take into consideration the number of currently 'open' shards. My initial assumption was that they would use something like the mod shard they mentioned later on in the article, e.g.
md5($email) % number_of_shards
But as they open up new shards, the result of that function would change.
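To make the problem concrete, here is a minimal Python sketch of that kind of mod-based shard function (the function name and shard counts are purely illustrative, not anything Pinterest describes): the moment the shard count changes, the same email usually maps to a different shard, so almost all existing users would appear to be on the "wrong" shard.

    import hashlib

    def shard_for(email: str, number_of_shards: int) -> int:
        # md5(email) % number_of_shards, as in the snippet above
        digest = hashlib.md5(email.encode("utf-8")).hexdigest()
        return int(digest, 16) % number_of_shards

    email = "test@example.com"
    print(shard_for(email, 16))  # shard assigned while 16 shards are open
    print(shard_for(email, 17))  # after opening a 17th shard, a different shard usually comes back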
I then thought perhaps they had a separate DB to hold purely user info for authentication purposes, which would also contain a column with the assigned shard_id; but as I say, the article implies that even the user table is on each shard.
Does anyone else have any ideas or insights into how something like this might work?

You are sharding on "user", correct? I see 3 general ways to split up the users.
The modulo approach to sharding has a big problem. When you add a shard, suddenly most users need to move to a different shard.
At the other extreme (from modulo) is the "dictionary" approach. You have some kind of lookup that says which shard each user is on. With millions of users, maintenance of the dictionary becomes a costly headache.
I prefer a hybrid:
1. Do modulo 4096 (or some suitably large number).
2. Use a dictionary with 4096 entries. This maps the 4096 values onto the current number of shards.
3. Have a package to migrate users from one shard to another. (This is a vital component of the system -- you will use it for upgrading, serious crashes, load balancing, etc.)
4. Adding a shard involves moving a few of the 4096 values to the new shard and changing the dictionary. The users to move would probably come from the 'busiest' shards, thereby relieving the pressure on them.
Yes, item 4 impacts some users, but only a small percentage of them. You can soften the blow by picking 'idle' or 'small' or 'asleep' users to move. This would involve computing some metric for each of the 4096 clumps.
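A minimal Python sketch of that hybrid (the bucket and shard counts are illustrative, and the dictionary would really live in a shared table or config, not in process memory):

    import hashlib

    BUCKETS = 4096

    # bucket -> shard id; starts out as a simple spread over 8 shards
    bucket_to_shard = {b: b % 8 for b in range(BUCKETS)}

    def bucket_for(user_key: str) -> int:
        return int(hashlib.md5(user_key.encode("utf-8")).hexdigest(), 16) % BUCKETS

    def shard_for(user_key: str) -> int:
        # the modulo picks the bucket, the dictionary picks the shard
        return bucket_to_shard[bucket_for(user_key)]

    # Opening a 9th shard: reassign a handful of buckets (ideally from busy shards).
    # Only users in those buckets move; everyone else keeps the same shard.
    for b in (17, 42, 99):
        bucket_to_shard[b] = 8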

Best way to shard MySQL database

I have a huge number of users, so I need to shard the database into n shards. To proceed with this, I have the options below.
1. Divide my data into n shards based on a userId modulo n operation, i.e. if I have 10 shards, userId 1999 will be sent to shard 1999 % 10 = 9.
Problem:
The problem with this approach is that if the number of shards increases in the future, the references to the previous shards will not be maintained.
2. I can maintain a table with UserId and ShardId.
Problem:
If my users grow to billions in the future, I'll need this mapping table to be sharded as well, which doesn't seem to be a good solution.
3. I can maintain a static mapping in code, like 0-10000 in Shard 1 and so on.
Problem:
With the increase in shards and users, the code needs to be changed more often.
If any specific user in a shard has huge data, it'd be difficult to split out that shard.
So, these are the three ways I could find, but all have some problem. What would be an alternative or better approach to sharding the MySQL tables that can cope with an increasing number of shards and users in the future?
I prefer a hybrid of 1 and 2:
Hash the UserId into, say, 4096 values.
Look up that number in a 'dictionary' that has shard numbers in it.
If a shard gets too full, migrate all the users with some hash number to another shard.
If you add a shard, migrate a few hash numbers to it - preferably from busy shards.
This forces you to write a script for moving users, and to make it robust (a rough sketch follows the list below). Once you have that, a lot of other admin tasks become 'simple':
Retire a machine
Upgrade the OS (one by one across shards)
Upgrade whatever software is on the machines
Migrate a hash number that is bulky but not busy to an old, slow shard that has a big disk. Similarly, migrate small and busy ones to a shard with more cores and faster disks.
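As a rough illustration of that move-one-bucket script, here is a Python sketch in which shards are modeled as plain dicts; a real tool would copy rows between MySQL servers, deal with writes arriving during the move, and only then flip the dictionary entry. All names and bucket numbers are illustrative.

    NUM_BUCKETS = 4096
    shards = {0: {}, 1: {}}                                  # shard_id -> {user_id: row}
    dictionary = {b: b % 2 for b in range(NUM_BUCKETS)}      # hash bucket -> shard_id

    def bucket(user_id: int) -> int:
        return user_id % NUM_BUCKETS                         # or hash(user_id) % NUM_BUCKETS

    def migrate_bucket(b: int, target_shard: int) -> None:
        source_shard = dictionary[b]
        if source_shard == target_shard:
            return
        moving = {uid: row for uid, row in shards[source_shard].items()
                  if bucket(uid) == b}
        shards[target_shard].update(moving)                  # copy the bucket's rows over
        for uid in moving:
            del shards[source_shard][uid]                    # remove them from the source
        dictionary[b] = target_shard                         # flip the dictionary entry last

    # Adding a shard means creating it and pulling a few buckets onto it:
    shards[2] = {}
    for b in (5, 17, 901):                                   # e.g. buckets from the busiest shards
        migrate_bucket(b, target_shard=2)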
Each shard could be an HA cluster (Galera, Group Replication, etc.) of servers for both reliability and read-scaling. (Sharding gives you write-scaling.)
There would need to be a way to distribute the dictionary to all clients "promptly".
All of this works well if you have, say, each hash in 3 different shards for HA. Each of the 3 would be at a different geographic location for robustness. The dictionary would have 4 columns to say where the copies are; the 4th would be used during migrations.
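For illustration only, one entry of such a dictionary might carry the hash value plus the 4 columns described (3 copies in different locations, and a slot that is only set while a copy is being migrated); the field names here are made up.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DictionaryEntry:
        hash_value: int                      # 0..4095
        shard_a: int                         # copy in location A
        shard_b: int                         # copy in location B
        shard_c: int                         # copy in location C
        migrating_to: Optional[int] = None   # the 4th column, used only during a migration

    entry = DictionaryEntry(hash_value=17, shard_a=3, shard_b=21, shard_c=40)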

How does database sharding work?

The concept of DB sharding at high level makes sense, split up DB nodes so not a single one is responsible for all of the persistent data. However I'm a little confused on what constitutes the "shard". Does it duplicate entire tables across shards, or usually just a single one?
For instance, if we take Twitter as an example, at the most basic level we need a users and a tweets table. If we shard based on user ID, with 10 shards, it would stand to reason that the shard function is userID mod 10 === shard location. However, what does this mean for the tweets table? Is that separate (a single DB table), or is every single tweet divided up between the 10 shards based on whichever user ID created the tweet?
If it is the latter, and say we shard on something other than user ID, tweet creation timestamp for example, how would we know where to look up info relating to the user if all tables are sharded based on tweet creation time (which the user has no concept of)?
Sharding is splitting the data across multiple servers. The choice of how to split is very critical, and may not be obvious.
At first glance, splitting tweets by userid sounds correct. But what other things are there? Is there any "grouping" or do you care who "receives" each tweet?
A photo-sharing site is probably best split on Userid, with meta info for the user's photos also on the same server as the user. (Where the actual photos live is another discussion.) But what do you do with someone who manages to upload a million photos? Hopefully that won't blow out the disk on whichever shard he is on.
One messy case is Movies. Should you split on movies? Reviews? Users who write reviews? Genres?
Sure, "mod 10" is convenient for saying which shard a user is on. That is, until you need an 11th shard! I prefer a compromise between "hashing" and "dictionary". First do mod 4096, then look up in a 'dictionary' that maps the 4096 values to the 10 shards. Then, write a robust tool to move one group of users (all with the same mod-4096 value) from one shard to another. In the long run, this tool will be immensely convenient for handling hardware upgrades, software upgrades, trump-sized tweeters, or moving everyone else out of his way, etc.
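To show why co-locating tweets with their author helps, here is a hedged Python sketch (the mod-4096 dictionary, table names, and helper functions are illustrative, not a real Twitter schema): because the tweets table exists on every shard and each tweet is stored on its author's shard, "all tweets of user X" is answered by a single server.

    import hashlib

    BUCKETS, SHARDS = 4096, 10
    dictionary = {b: b % SHARDS for b in range(BUCKETS)}     # bucket -> shard

    def shard_for_user(user_id: int) -> int:
        h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return dictionary[h % BUCKETS]

    def insert_tweet(user_id: int, body: str) -> int:
        shard = shard_for_user(user_id)
        # on that shard: INSERT INTO tweets (user_id, body) VALUES (...)
        return shard

    def tweets_of(user_id: int) -> int:
        # the user's row and all of their tweets live on the same shard
        return shard_for_user(user_id)

    print(insert_tweet(1999, "hello"), tweets_of(1999))      # same shard number both times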
If you want to discuss sharding tweets further, please provide the main tables that are involved. Also, I have strong opinions on how you could issue unique ids, if you need them, for the tweets. (There are fiasco ways to do it.)

Managing a set of users

We have a website with many users. To manage users who transacted on a given day, we use Redis and store a string of bits as the value for each day's key. For instance, if our system had five users, and users 2 and 5 transacted on 2nd January, our key for 2nd January would look like '01001'. This also helps us to determine unique users over a given period, and new users, using simple bit operations. However, with a growing number of users, we are running out of memory to store all these keys.
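For reference, the setup described above roughly corresponds to this redis-py sketch (key names and ids are illustrative); each daily bitmap grows with the highest user id, which is the memory problem being hit:

    import redis

    r = redis.Redis()

    def mark_transacted(user_id: int, day: str) -> None:
        r.setbit(f"transacted:{day}", user_id, 1)    # one bit per user id

    mark_transacted(2, "2024-01-02")
    mark_transacted(5, "2024-01-02")

    # unique transacting users across several days: bitwise OR of the daily keys
    r.bitop("OR", "transacted:2024-01", "transacted:2024-01-01", "transacted:2024-01-02")
    print(r.bitcount("transacted:2024-01"))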
Is there any alternative database that we can use to store the data in a similar manner? If not, how should we store the data to get similar performance?
Redis' memory usage can be affected by many parameters, so I would also try looking in INFO ALL for starters.
With every user represented by a bit, 400K daily visitors should take at least 50KB per value, but due to sparsity in the bitmap index it could be much larger. I'd also suspect that since newer users are more active, the majority of each bitmap's 'active' flags are towards its end, causing it to reach close to its maximal size (i.e. the total number of users). So the question you should be trying to answer is how to store these 400K visits efficiently without sacrificing the functionality you're using. That actually depends on what you're doing with the recorded visits.
For example, if you're only interested in total counts, you could consider using the HyperLogLog data structure to count your transacting users with a low error rate and a small memory/resources footprint. On the other hand, if you're trying to track individual users, perhaps keep a per-user bitmap mapped to the days since signing up with your site.
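If approximate totals are enough, the HyperLogLog suggestion could look like this with redis-py (again, key names are illustrative); each HLL key stays at roughly 12 KB no matter how many users are added:

    import redis

    r = redis.Redis()

    def mark_transacted(user_id: int, day: str) -> None:
        r.pfadd(f"hll:transacted:{day}", user_id)

    mark_transacted(2, "2024-01-02")
    mark_transacted(5, "2024-01-02")

    print(r.pfcount("hll:transacted:2024-01-02"))    # approximate unique users that day
    # approximate unique users over a period = cardinality of the union of the daily keys
    print(r.pfcount("hll:transacted:2024-01-01", "hll:transacted:2024-01-02"))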
Furthermore, there are bitmap compression techniques that you could consider implementing in your application code/Lua scripting/hacking Redis. The best answer would depend on what you're trying to do of course.

MySQL optimization for simple records - what is best?

I am developing a system that will eventually have millions of users. Each user of the system may have access to different 'tabs' in the system. I am tracking this with a table called usertabs. There are two ways to handle this.
Way 1: A single row for each user containing userid and tab1-tab10 as int columns.
The advantage of this system is that the query to get a single row by userid is very fast, while the disadvantage is that the 'empty' columns take up space. Another disadvantage is that when I need to add a new tab, I would have to re-org the entire table, which could be tedious if there are millions of records. But this wouldn't happen very often.
Way 2: A single row contains userid and tabid and that is all. There would be up to 10 rows per user.
The advantage of this system is easy sharding or other mechanisms for optimized storage, and no wasted space. Rows only exist when necessary. The disadvantage is that up to 10 rows must be read every time I access a record. If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
My programmer side is leaning towards Way 1 while my big data side is leaning towards Way 2.
Which would you choose? Why?
Premature optimization, and all that...
Option 1 may seem "easier", but you've already identified the major downside - extensibility is a huge pain.
I also really doubt that it would be faster than option 2 - databases are pretty much designed specifically to find related bits of data, and finding 10 records rather than 1 record is almost certainly not going to make a difference you can measure.
"Scattered" records don't really matter, the database uses indices to be able to retrieve data really quickly, regardless of their physical location.
This does, of course, depend on using indices for foreign keys, as @Barmar comments.
If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
They don't have to be scattered if you use clustering correctly.
InnoDB tables are always clustered, and if your child table's PK¹ looks similar to: {user_id, tab_id}², this will automatically store tabs belonging to the same user physically close together, minimizing I/O when querying for "tabs of the given user".
OTOH, if your child PK is: {tab_id, user_id}, this will store users connected to the same tab physically close together, making queries such as: "give me all users connected to given tab" very fast.
Unfortunately MySQL doesn't support leading-edge index compression (a la Oracle), so you'll still pay the storage (and cache) price for repeating all these user_ids (or tab_ids in the second case) in the child table, but despite that, I'd still go for solution (2) for flexibility and (probably) ease of querying.
¹ Which InnoDB automatically uses as the clustering key.
² I.e. the user's PK is at the leading edge of the child table's PK.

How can I fix this scaling issue with soft deleting items?

I have a database where most tables have a delete flag. So the system soft-deletes items (they are no longer accessible, except by admins for example).
What worries me is that in a few years, when the tables are much larger, the overall speed of the system is going to be reduced.
What can I do to counteract effects like that?
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without the deleted items?
So over time the deleted partition will grow large, and the occasional fetches from it will be slow (and slower over time).
Would the speed difference be something I should worry about? Since I fetch most (if not all) data by some key value (some are searches, but they can be slow for this setup).
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in another place, but from SQL's point of view the table remains the same.
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. IF, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can do it manually. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks or months, depending on your constraints and suspected usage patterns. Newer data then lives in a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (e.g., most online banking applications let you browse transactions but show only the first 30 days of history unless you request otherwise).
Hopefully you can avoid having to go a step further and sharding on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (e.g., keeping your forum_posts on a separate machine from your sales_records), as it is FAR easier to set up and maintain. If you end up needing to shard on user_id, I suggest using Google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.