How does database sharding work? - mysql

The concept of DB sharding at a high level makes sense: split the data across DB nodes so that no single one is responsible for all of the persistent data. However, I'm a little confused about what constitutes the "shard". Does every table get split across the shards, or usually just a single one?
For instance, if we take Twitter as an example, at the most basic level we need a users and a tweets table. If we shard based on user ID, with 10 shards, it stands to reason that the shard function is userID mod 10 === shard location. However, what does this mean for the tweets table? Is that separate (a single DB table), or is every single tweet divided up between the 10 shards, based on whichever user ID created the tweet?
If it is the latter, and say we shard on something other than user ID, tweet creation timestamp for example, how would we know where to look up info relating to the user if all tables are sharded based on tweet creation time (which the user has no concept of)?

Sharding is splitting the data across multiple servers. The choice of how to split is critical, and may not be obvious.
At first glance, splitting tweets by userid sounds correct. But what other things are there? Is there any "grouping" or do you care who "receives" each tweet?
A photo-sharing site is probably best split on Userid, with meta info for the user's photos also on the same server with the user. (Where the actual photos live is another discussion.) But what do you do with someone who manages to upload a million photos? Hopefully that won't blow out the disk on whichever shard he is on.
One messy case is Movies. Should you split on movies? Reviews? Users who write reviews? Genres?
Sure, "mod 10" is convenient for saying which shard a user is on. That is, until you need an 11th shard! I prefer a compromise between "hashing" and "dictionary". First do mod 4096, then lookup in a 'dictionary' that maps 4096 values to 10 shards. Then, write a robust tool to move one group of users (all with the same mod-4096 value) from one shard to another. In the long run, this tool will be immensely convenient for handling hardware upgrades, software upgrades, trump-sized tweeters, or moving everyone else out of his way, etc.
If you want to discuss sharding tweets further, please provide the main tables that are involved. Also, I have strong opinions on how you could issue unique ids for the tweets, if you need them. (There are ways to do it that end in fiasco.)


Scaling Pinterest User tables sharding and ensuring consistency when opening new shards

So this is very much a conceptual question (as much as I'd love to build a billion user app I don't think it's going to happen).
I've read the article by Pinterest on how they scaled their MySQL fleet a number of times ( https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f ) and I still don't get how they would "open up new shards" without affecting existing users.
The article states that every table is on every shard, including the User table.
So I'm assuming that when a user registers and is assigned a random shard, this has to be done via a function that will always return the same result regardless of the number of shards.
e.g. if I sign up with test@example.com, they would potentially use that email to work out the shard id, and this would have to take into consideration the number of currently 'open' shards. My initial assumption was that they would use something like the mod shard they mentioned later on in the article, e.g.
md5($email) % number_of_shards
But as they open up more shards, the result of that function would change.
I then thought perhaps they had a separate DB to hold purely user info for authentication purposes and this would also contain a column with the assigned shard_id, but as I say the article implies that even the user table is on each shard.
Does anyone else have any ideas or insights into how something like this might work?
You are sharding on "user", correct? I see 3 general ways to split up the users.
The modulo approach to sharding has a big problem. When you add a shard, suddenly most users need to move to a different shard.
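A quick way to see the scale of the churn (a throwaway illustration, assuming MySQL 8's recursive CTEs):

-- Of 1000 user ids, how many land on a different shard when the
-- shard count grows from 10 to 11?
WITH RECURSIVE ids (id) AS (
  SELECT 0
  UNION ALL
  SELECT id + 1 FROM ids WHERE id < 999
)
SELECT COUNT(*) AS moved FROM ids WHERE (id MOD 10) <> (id MOD 11);
-- moved = 900: roughly 10 out of every 11 users change shards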
At the other extreme (from modulo) is the "dictionary" approach. You have some kind of lookup that says which shard each user is on. With millions of users, maintenance of the dictionary becomes a costly headache.
I prefer a hybrid:
Do modulo 4096 (or some suitably large number)
Use a dictionary with 4096 entries. This maps 4096 values into the current number of shards.
You have a package to migrate users from one shard to another. (This is a vital component of the system -- you will use it for upgrades, serious crashes, load balancing, etc.)
Adding a shard involves moving a few of the 4096 to the new shard and changing the dictionary. The users to move would probably come from the 'busiest' shards, thereby relieving the pressure on them.
Yes, item 4 impacts some users, but only a small percentage of them. You can soften the blow by picking 'idle' or 'small' or 'asleep' users to move. This would involve computing some metric for each of the 4096 clumps.
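Using a dictionary table like the illustrative shard_map sketched earlier, adding a shard (item 4) boils down to a handful of dictionary updates once the migration package has copied those buckets' rows:

-- Hypothetical example: move four busy buckets onto new shard 10.
-- Only users whose id MOD 4096 falls in this list are affected.
UPDATE shard_map SET shard = 10 WHERE bucket IN (17, 345, 2048, 3001);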

What is faster: traversing through one large record or an indexed field across many small records?

Situation: I'm trying to create a page which has members that are allowed to view its content, and each member is defined by an ID. To access the information for the page, I want to make sure the user is a member of the page.
Question: Is it faster to have one large record that has to be traversed through to look up IDs, or to create one indexed field across several (billions of) records that are each very small?
RDBMS servers like MySQL are optimized for looking up a few short records from among many. If you structure your data that way and index it correctly, you'll get decent performance.
On the other hand, long lists of data requiring linear search will not scale up well.
By the way, plan for tens of millions of records, not billions. If you actually need to handle billions, you'll probably have the resources to scale up your system. In the meantime RAM and disk storage are still on a Moore's-law cost curve, even if CPUs aren't.
That scenario would typically be implemented with three tables. A simplified version would look something like this:
users(id, name, password, ... )
pages(id, title, .... )
page_access(user_id, page_id, permission)
where user_id and page_id are references to the users and pages tables respectively. In your application you would then check the current page and user against the page_access table to see if permission should be granted or denied.
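The check itself is then a point lookup. A sketch, assuming a composite primary key on (user_id, page_id); the ?s are placeholders for values supplied by the application:

SELECT permission
FROM page_access
WHERE user_id = ? AND page_id = ?;

One row back means access (with its permission level); no row means deny. With that composite key in place, this stays a single index lookup no matter how large the table grows.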
MySQL has no trouble processing millions upon millions of well indexed rows, so long as you have sufficient hardware powering it. You want to aim for normalised data, not cramming everything into a single row in a single table.

Is it necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would carry store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separate tables, each dedicated to its purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings comes in... it is in the detail for each record. The store table has name, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data, adding up to far more space consumption than occurs by simply linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
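A sketch of what those dedicated tables might look like (the columns are illustrative):

CREATE TABLE stores       (id INT PRIMARY KEY, name VARCHAR(100), address VARCHAR(200), phone VARCHAR(30));
CREATE TABLE registers    (id INT PRIMARY KEY, store_id INT, number INT, manager VARCHAR(100));
CREATE TABLE transactions (id INT PRIMARY KEY, register_id INT, customer_id INT, sold_at DATETIME, tax DECIMAL(8,2));
CREATE TABLE items_sold   (id INT PRIMARY KEY, transaction_id INT, item_id INT, quantity INT);

Each items_sold row now carries one small transaction_id instead of repeating the store, register, and transaction detail a million times.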
Additionally, the amount of space consumed, as well as the size of the overall table, inversely impacts the speed of querying that data. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly improve response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
Short answer: in a relational database like MySQL, yes. Check this out about referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use a relational database for your project. In fact the trend is to use non-relational databases (NoSQL), like MongoDB, to achieve the same results with better performance. More about RDBMS vs NoSQL: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think that with this example you will understand better:
Let's say we want to create an on-line store. We have at minimum Users, Payments and Events (events about the pages where the user navigates, or other actions). In this scenario we want to link the Users with the Payments in a secure, relational way. We do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the Users and Payments tables and link them with proper foreign keys. However, for the events, there are going to be a lot of them per user (maybe millions), and we need to track them in a fast way without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
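A minimal sketch of that Users-Payments link (names illustrative):

CREATE TABLE users (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);
CREATE TABLE payments (
  id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id INT UNSIGNED NOT NULL,
  amount  DECIMAL(10, 2) NOT NULL,
  FOREIGN KEY (user_id) REFERENCES users (id)  -- a payment can never reference a missing user
);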
To sum up: you can use a hybrid of SQL and NoSQL, but whether you use one, the other, or both kinds of solutions, do it properly.

How to handle massive storage of records in database for user authorization purposes?

I am using Ruby on Rails 3.2.2 and MySQL. I would like to know if it is "advisable" / "desirable" to store, in a database table related to one class, all records related to two other classes for each "combination" of their instances.
That is, I have User and Article models. In order to store all user-article authorization objects, I would like to implement an ArticleUserAuthorization model so that
given N users and M articles there are N*M ArticleUserAuthorization records.
That way, I can declare and use ActiveRecord::Associations as follows:
class Article < ActiveRecord::Base
  has_many :user_authorizations, :class_name => 'ArticleUserAuthorization'
  has_many :users, :through => :user_authorizations
end

class User < ActiveRecord::Base
  has_many :article_authorizations, :class_name => 'ArticleUserAuthorization'
  has_many :articles, :through => :article_authorizations
end

# The join model both associations go :through (implied by the question):
class ArticleUserAuthorization < ActiveRecord::Base
  belongs_to :user
  belongs_to :article
end
However, the above approach of storing all combinations will result in a big database table containing billions billions billions of rows!!! Furthermore, ideally speaking, I am planning to create all authorization records when a User or an Article object is created (that is, I am planning to create all previously mentioned "combinations" at once or, better, in "delayed" batches... either way, this process creates other billions billions of database table rows!!!) and to do the reverse when destroying (by deleting billions billions of database table rows!!!). Furthermore, I am planning to read and update those rows at once when a User or Article object is updated.
So, my doubts are:
Is this approach "advisable" / "desirable"? For example, what kind of performance problems may occur? Or is it a bad way to administer / manage databases with very large tables?
How may / could / should I proceed in my case (maybe by "re-thinking" entirely how to handle user authorizations in a better way)?
Note: I would use this approach because, in order to retrieve only "authorized objects" when retrieving User or Article objects, I think I need "atomic" user authorization rules (that is, one user authorization record for each user and article object), since the system is not based on user groups like "admin", "registered" and so on. So, I thought that the availability of an ArticleUserAuthorization table avoids running methods related to user authorizations (note: those methods involve some MySQL querying that could worsen performance - see my previous question for a sample "authorization" method implementation) on each retrieved object, by "simply" accessing / joining the ArticleUserAuthorization table so as to retrieve only "user authorized" objects.
The fact of the matter is that if you want article-level permissions per user, then you need a way to relate Users to the Articles they can access. At a minimum, this necessitates N*A records (where A is the number of uniquely permissioned articles).
The 3NF approach to this would be, as you suggested, to have a UsersArticles set... which would be a very large table (as you noted).
Consider that this table would be accessed a whole lot...
This seems to me like one of the situations in which a slightly denormalized approach (or even noSQL) is more appropriate.
Consider the model that Twitter uses for their user follower tables:
Jeff Atwood on the subject
And High Scalability Blog
A sample from those pieces is a lesson learned at Twitter: querying followers from a normalized table puts tremendous stress on the Users table. Their solution was to denormalize followers so that a user's followers are stored on their individual user settings.
Denormalize a lot. Single-handedly saved them. For example, they store all a user's friend IDs together, which prevented a lot of costly joins.
- Avoid complex joins.
- Avoid scanning large sets of data.
I imagine a similar approach could be used to serve article permissions and avoid a tremendously stressed UsersArticles single table.
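One way that idea could translate here, purely as a sketch (assumes a MySQL version with JSON columns, 5.7+; names are illustrative):

-- Keep each user's authorized article ids inline on the user row, so the
-- hot read path is a single primary-key lookup instead of a join over N*M rows.
ALTER TABLE users ADD COLUMN authorized_article_ids JSON;

SELECT JSON_CONTAINS(authorized_article_ids, '42') AS allowed
FROM users WHERE id = 7;  -- may user 7 read article 42?

The trade-off is the usual denormalization one: every grant or revoke has to rewrite that list, and consistency is now the application's job.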
You don't have to re-invent the wheel. ACL (Access Control List) frameworks have dealt with this same kind of problem for ages now, and most efficiently if you ask me. You have resources (Article), or even better, resource groups (Article Category/Tag/etc.). On the other hand you have users (User) and user groups. Then you would have a relatively small table which maps resource groups to user groups. And you would have another relatively small table which holds exceptions to this general mapping. Alternatively, you can have rule sets that must be satisfied for accessing an article. You can even have dynamic groups like authors_friends, depending on your user-user relations.
Just take a look at any decent ACL framework and you would have an idea how to handle this kind of problem.
If there really is the prospect of "a big database table containing billions billions billions of rows" then perhaps you should craft a solution for your specific needs around a (relatively) sparsely populated table.
Large database tables pose a significant performance challenge in how quickly the system can locate the relevant row or rows. Indexes and primary keys are really needed here; however, they add to the storage requirements and also require CPU cycles to be maintained as records are added, updated, and deleted. Even so, heavy-duty database systems also have partitioning features (see http://en.wikipedia.org/wiki/Partition_(database) ) that address such row-location performance issues.
A sparsely populated table can probably serve the purpose assuming some (computable or constant) default can be used whenever no rows are returned. Insert rows only wherever something other than the default is required. A sparsely populated table will require much less storage space and the system will be able to locate rows more quickly. (The use of user-defined functions or views may help keep the querying straightforward.)
If you really cannot make a sparsely populated table work for you, then you are quite stuck. Perhaps you can make that huge table into a collection of smaller tables, though I doubt that's of any help if your database system supports partitioning. Besides, a collection of smaller tables makes for messier querying.
So let's say you have millions or billions of Users who or may not have certain privileges regarding the millions or billions of Articles in your system. What, then, at the business level determines what a User is privileged to do with a given Article? Must the User be a (paying) subscriber? Or may he or she be a guest? Does the User apply (and pay) for a package of certain Articles? Might a User be accorded the privilege of editing certain Articles? And so on and so forth.
So let's say a certain User wants to do something with a certain Article. In the case of a sparsely populated table, a SELECT on that grand table UsersArticles will either return 1 row or none. If it returns a row, then one immediately knows the ArticleUserAuthorization, and can proceed with the rest of the operation.
If no row, then maybe it's enough to say the User cannot do anything with this Article. Or maybe the User is a member of some UserGroup that is entitled to certain privileges to any Article that has some ArticleAttribute (which this Article has or has not). Or maybe the Article has a default ArticleUserAuthorization (stored in some other table) for any User that does not have such a record already in UsersArticles. Or whatever...
The point is that many situations have a structure and a regularity that can be used to help reduce the resources needed by a system. Human beings, for instance, can add two numbers with up to 6 digits each without consulting a table of over half a trillion entries; that's taking advantage of structure. As for regularity, most folks have heard of the Pareto principle (the "80-20" rule - see http://en.wikipedia.org/wiki/Pareto_principle ). Do you really need to have "billions billions billions of rows"? Or would it be truer to say that about 80% of the Users will each only have (special) privileges for maybe hundreds or thousands of the Articles - in which case, why waste the other "billions billions billions" (rounded :-P).
You should look at hierarchical role-based access control (RBAC) solutions. You should also consider sensible defaults.
Are all users allowed to read an article by default? Then store the deny exceptions.
Are all users not allowed to read an article by default? Then store the allow exceptions.
Does it depend on the article whether the default is allow or deny? Then store that in the article, and store both allow and deny exceptions (see the sketch after this list).
Are articles put into issues, and issues collected into journals, and journals collected into fields of knowledge? Then store authorizations between users and those objects.
What if a User is allowed to read a Journal but is denied a specific Article? Then store User-Journal:allow, User-Article:deny and the most specific instruction (in this case the article) takes precedence over the more general (in this case the default, and the journal).
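A sketch of the exceptions-only storage (table and column names are illustrative):

-- Store only deviations from the default; an absent row *is* the default.
CREATE TABLE article_access_exceptions (
  user_id    INT NOT NULL,
  article_id INT NOT NULL,
  allowed    BOOLEAN NOT NULL,  -- TRUE = allow exception, FALSE = deny exception
  PRIMARY KEY (user_id, article_id)
);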
Shard the ArticleUserAuthorization table by user_id. The principle is to reduce the effective dataset size on the access path. Some data will be accessed more frequently than other data, and it will be accessed in a particular way; on that path, the size of the result set should be small. Here we achieve that by sharding. Also, optimize that path further: add an index if it is a read workload, cache it, etc.
This particular sharding scheme is useful if you want all the articles authorized for a given user.
If you want to query by article as well, then duplicate the table and shard it by article_id too. With this second sharding scheme we have denormalized the data: it is now duplicated, and the application needs to do extra work to maintain consistency. Writes will also be slower; use a queue for writes.
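A sketch of the doubled mapping (names illustrative; each table lives on its own shard cluster):

-- Sharded by user_id: answers "which articles may this user read?"
CREATE TABLE auth_by_user (
  user_id    INT NOT NULL,
  article_id INT NOT NULL,
  PRIMARY KEY (user_id, article_id)
);
-- Sharded by article_id: answers "which users may read this article?"
CREATE TABLE auth_by_article (
  article_id INT NOT NULL,
  user_id    INT NOT NULL,
  PRIMARY KEY (article_id, user_id)
);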
The problem with sharding is that queries across shards are ineffective; you will need a separate reporting database. Pick a sharding scheme, and think about how you would recompute shards later.
For truly massive databases, you would want to split the data across physical machines, e.g. one or more machines per user's articles.
Some NoSQL suggestions are:
Relationships are graphs, so look at graph databases, particularly
https://github.com/twitter/flockdb
Redis, by storing the relationship in a list.
A column-oriented database like HBase; you can treat it like a sparse nested hash.
All this depends on the size of your database and the types of queries.
EDIT: modified answer. The question previously had 'has_one' relationships. Also added NoSQL suggestions 1 & 2.
First of all, it is good to think about default values and behaviors and not store them in the database. For example, if by default a user cannot read an article unless specified, then that does not have to be stored as false in the database.
My second thought is that you could have a users_authorizations column in your articles table and an articles_authorizations column in your users table. Those 2 columns would store user ids and article ids in the form 3,7,65,78,29,78. For the articles table, for example, this would mean users with ids 3,7,65,78,29,78 can access the article. Then you would have to modify your queries to retrieve users that way:
@article = Article.find(34)
@users = User.find(@article.users_authorizations.split(','))
Each time an article and a user is saved or destroyed, you would have to create callbacks to update the authorization columns.
class User < ActiveRecord::Base
  after_save :update_articles_authorizations

  def update_articles_authorizations
    # ...
  end
end
Do the same for Article model.
Last thing: if you have different types of authorizations, don't hesitate to create more columns, like user_edit_authorization.
With these combined techniques, the quantity of data and hits to the DB are minimal.
Reading through all the comments and the question, I still doubt the validity of storing all the combinations. Think about the question in another way: who will populate that table? The author of the article, a moderator, or someone else? And based on what rule? You can imagine how difficult that is. It's impossible to populate all the combinations.
Facebook has a similar feature. When you write a post, you can choose who you want to share it with. You can select 'Friends', 'Friends of Friends', 'Everyone' or a custom list. The custom list allows you to define who will be included and excluded. So, same as that, you only need to store the special cases, like 'include' and 'exclude', and all the remaining combinations fall into the default case. By doing this, N*M can be reduced significantly.

Forum Schema: should the "Topics" table contain topic_starter_id? Or is it redundant information?

I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely, but possible) and I can't know which one was first. This is also normalized, but becomes expensive after the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will assign ids in insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post - would the database reuse its post_id later? This is MySQL I'm using.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
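A sketch of those denormalized columns (names illustrative), kept current by that single code path:

ALTER TABLE topics
  ADD COLUMN first_post_user_id   INT NOT NULL,
  ADD COLUMN first_post_user_name VARCHAR(50) NOT NULL,
  ADD COLUMN last_post_user_id    INT NOT NULL,
  ADD COLUMN last_post_user_name  VARCHAR(50) NOT NULL;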
I must somewhat disagree with Charles that the only way to save on performance is to de-normalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users, for the sake of argument - depends on your scale. Our apps use this approach with 10k+ mappings).
Namely, you have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g. one having data-expiration facilities). Then, when you need to print the first/last user's name, look it up in the cache on the server side.
This avoids an extra query for every page view, as you only need to retrieve the full user list ONCE per N page views: when the cache expires, or when user data is updated (which should cause cache expiration).
It adds a wee bit of CPU time and memory usage on web server, but in Yet Another Holy War (e.g. spend more resources on DB side or app server side) I'm firmly on the "don't waste DB resources" camp, seeing how scaling up DB is vastly harder than scaling up a web or app server.
And yes, if that (or other equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (less headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with what gives better marginal benefits (e.g. how much performance loss vs. how much cost/risk from de-normalization)